Rating the Ratings Systems

by Ben Vollmayr-Lee

Ratings systems are used to evaluate teams objectively, and have taken quite a hold in the college sports arena, from the BCS (Bowl Championship Series) method for selecting college football bowl berths, to the RPI (ratings percentage index) system used to some degree in NCAA tournament selection in college basketball. Ratings systems are motivated by the need to factor in schedule strength in cases where the league is large and the quality of teams' schedules varies considerably, that is, cases where the straight win-loss record is not informative enough. For example, the win-loss record in NCAA basketball is less informative than it is in the NBA, because in the latter case the variation in strength of schedule is much smaller.

Creating ratings systems has become something of a hobby for mathematically inclined sports fans, so a tour of the web will reveal a large number of them. Many of these systems may be roughly equivalent in quality, a few may be gems, and a few may be inferior, but there is little evidence available to guide the choice among them. This is a shame, because with all the effort going into ratings systems, we could have learned a lot more about what is important for a quality ratings system, and, equally importantly, how well these systems actually perform. It's my opinion that we have more ratings systems than we need, and we know less about all of them than we should.

I'm working on addressing that imbalance, so let's begin by figuring out how to evaluate a ratings system. While there are quite a few criteria one can use to evaluate ratings systems, they all basically boil down to two issues, which I list in order of importance:

1. Ability to predict outcomes of future games.
2. Ability to predict margin of victory or probability of outcome for future games.

I'll discuss the second category later. Let's talk about what the first category isn't. One way to test ratings systems is to take the end-of-season ratings and go back to see how many of that season's games they retrodictively explain. At face value this seems reasonable enough; most quality systems come in at around 75% or so, while some lesser-quality systems (like RPI) are noticeably lower. However, this is NOT the best criterion to use for testing a ratings system, and quite possibly not even a GOOD criterion.

What's wrong with it? First off, according to this criterion there is a clear answer as to what the best ratings system is. I just construct a prediction function for each game that looks at the two teams' ratings, possibly includes some home court factor, and then ticks off a '1' if the game is won by the predicted team, and a '0' if not. The sum total of these is the number of correctly predicted games, which is a function of the set of team ratings. Now I just fiddle around to find which set of team ratings maximizes this number and I've reached my theoretical perfection. (Note: there will actually be more than one set of team ratings that results in the maximum win prediction, but they will all share the same maximum.) That doesn't seem so bad, until you scratch a little below the surface: the maximum number of games predicted correctly can vary depending on whether you allow a home court factor to be part of the prediction or not. It can also vary ...
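To make the counting argument concrete, here is a minimal Python sketch of the prediction function and the "fiddle around" step. Everything in it is my own illustration rather than any particular system's method: the names count_correct and best_count are hypothetical, the home court factor is assumed to be a constant added to the home team's rating, an exact tie is arbitrarily scored as a pick for the away team, and the two-game season is made up.

```python
from itertools import product


def count_correct(ratings, games, home_edge=0.0):
    """Count games whose actual winner matches the ratings-based pick.

    Each game is a (home, away, home_won) tuple. ASSUMPTION: the home team
    is picked when its rating plus home_edge exceeds the away team's rating;
    an exact tie counts as a pick for the away team (an arbitrary choice).
    """
    correct = 0
    for home, away, home_won in games:
        picked_home = ratings[home] + home_edge > ratings[away]
        correct += int(picked_home == home_won)  # '1' if right, '0' if not
    return correct


def best_count(teams, games, grid, home_edges=(0.0,)):
    """Brute-force the largest number of games any ratings on the grid explain."""
    best = 0
    for values in product(grid, repeat=len(teams)):
        ratings = dict(zip(teams, values))
        for edge in home_edges:
            best = max(best, count_correct(ratings, games, edge))
    return best


# Made-up season: two teams split a home-and-home series, each winning at home.
games = [("A", "B", True), ("B", "A", True)]
grid = [0, 1, 2]

no_home = best_count(["A", "B"], games, grid)
with_home = best_count(["A", "B"], games, grid, home_edges=(0.0, 1.5))
print(no_home, with_home)  # prints: 1 2
```

In this toy season no set of ratings alone can explain both games, since whichever team is rated higher is picked to win twice. Allowing a home court factor lets equal ratings explain both, so the "theoretical perfection" is 1 game in one case and 2 in the other.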

This page maintained by Ben Vollmayr-Lee. Last updated November 29, 2001.