The "RPI" is the Rating Percentage Index concocted by the NCAA to "assist"
in determining which teams to let into the national basketball championship
tournaments and how to seed them. To calculate a team's RPI, you take
its winning percentage, twice the average winning percentage of the
team's opponents, and the average of the winning percentages of the opponents'
opponents, add them together and divide by four.
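As a rough illustration of the recipe just described, here is a minimal Python sketch. It follows the description in this text rather than any official NCAA procedure; the four teams, their results, and the function names are all made up for the example, and opponents are averaged once per game played.

```python
from collections import defaultdict

# Hypothetical results for four made-up teams, as (winner, loser) pairs.
GAMES = [("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("C", "D"), ("A", "D")]

def winning_percentages(games):
    """Wins divided by games played, for every team that appears."""
    wins, played = defaultdict(int), defaultdict(int)
    for winner, loser in games:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
    return {team: wins[team] / played[team] for team in played}

def opponents_average(values, games):
    """Average of `values` over a team's opponents, one term per game played."""
    total, played = defaultdict(float), defaultdict(int)
    for a, b in games:
        total[a] += values[b]
        total[b] += values[a]
        played[a] += 1
        played[b] += 1
    return {team: total[team] / played[team] for team in played}

def rpi(games):
    """(own + 2 * opponents' + opponents' opponents') / 4, as described above."""
    wp = winning_percentages(games)
    owp = opponents_average(wp, games)
    oowp = opponents_average(owp, games)
    return {team: (wp[team] + 2 * owp[team] + oowp[team]) / 4 for team in wp}

ratings = rpi(GAMES)
for team in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{team}: {ratings[team]:.4f}")
```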
Some observations, then, on the RPI:
- If you have a good RPI, and you beat a weak opponent, your RPI can
go down. Conversely, if you have a bad RPI, and you lose to a good
opponent, your RPI can go up. This is a shortcoming. As it happens, if
you play most of your games against teams that are about your strength,
the RPI yields a decent approximation of your strength; unfortunately, The
Committee uses it to try to select the top 50 or so teams in the
country, and while the "bubble teams" often have played a fair number of
games against teams at their level, its use for seeding purposes is
really dubious.
- If you subtract off .500 from the wins/games of each team, you get a
number whose average will be zero. While some teams have better records
than others, when you average over all the opponents you get a number that
is closer to zero, and, in fact, if you keep calculating not just your
opponents' opponents' average, but their opponents' opponents', and so on,
these numbers get very close to zero very quickly.
- I subtract off .500 because, in fact, once I have done this, I can keep
adding these averages over and over, and since they are going to zero instead
of to .500, the total actually plateaus. Not only does it change more and
more slowly, but it will, in fact, converge to a final answer.
- If I multiply by 2 after subtracting .500, I get a number that varies
from -1 to 1, and, in fact, is (W-L)/(W+L). I like this better; it will
simply result in a doubling of the RPI.
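The shrinking and the plateau can be watched numerically. The sketch below uses a made-up four-team schedule (all names and results are hypothetical), subtracts .500 from each record, and keeps averaging over opponents, adding each new layer to a running total. It is only a demonstration on one toy schedule, not a proof.

```python
from collections import defaultdict

# Hypothetical results for four made-up teams, as (winner, loser) pairs.
GAMES = [("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("C", "D"), ("A", "D")]

def centered_records(games):
    """Wins/games minus .500 for every team."""
    wins, played = defaultdict(int), defaultdict(int)
    for winner, loser in games:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
    return {team: wins[team] / played[team] - 0.5 for team in played}

def opponents_average(values, games):
    """Average of `values` over a team's opponents, one term per game played."""
    total, played = defaultdict(float), defaultdict(int)
    for a, b in games:
        total[a] += values[b]
        total[b] += values[a]
        played[a] += 1
        played[b] += 1
    return {team: total[team] / played[team] for team in played}

w = centered_records(GAMES)
term = dict(w)      # current layer: opponents' (opponents' ...) average
running = dict(w)   # sum of every layer computed so far
sizes = []          # largest magnitude found in each new layer
for _ in range(100):
    term = opponents_average(term, GAMES)
    for team in running:
        running[team] += term[team]
    sizes.append(max(abs(v) for v in term.values()))

print("largest term after 1, 10, 100 rounds:", sizes[0], sizes[9], sizes[99])
print("running totals:", {t: round(v, 4) for t, v in sorted(running.items())})
```

Doubling every value in w gives the (W-L)/(W+L) form, and every layer and total simply doubles along with it.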
I don't understand why the opponent's average is weighted twice as much as
the surrounding terms, but I would propose that, in fact, we go to a system
that just adds these percentages to infinity. This can be done by a computer
exactly and relatively quickly through the use of some linear algebra and a
cute stunt.
Mathematics
You probably don't want to proceed any further if you are afraid of
mathematics and/or don't know any linear algebra. What I'm going to do
is to arrange the winning percentages into a vector w; from the schedule
I can arrange a matrix C, each of whose rows totals 1, such that Cw
becomes the average winning percentage of the teams' opponents, again
arranged into a vector. (The entries of C, then, are simply the number of
games any two teams played against each other divided by the number of games
the team corresponding to the row played; for example, Duke plays North
Carolina twice in a season, so if Duke played 34 games and UNC played
32 games, the entry corresponding to Duke-UNC is 2/34 while the entry for
UNC-Duke is 2/32. The laws of matrix multiplication, then, mean that
multiplication by this matrix averages the values of one's opponents.)
CCw is then the average for the opponents' opponents, and so on; what I've
proposed, then, is r=w+Cw+C^2w+...=(1+C+C^2+...)w, where by 1 I mean the
identity matrix.
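To make the construction of C concrete, here is a small sketch of the Duke-UNC example using exact fractions. The two "rest of schedule" entries are placeholders that lump each team's remaining opponents into one pseudo-team, and the winning percentages in w are invented; neither comes from a real season.

```python
from fractions import Fraction

def schedule_matrix(meetings, teams):
    """Row i, column j: games between i and j divided by games i played."""
    index = {team: k for k, team in enumerate(teams)}
    counts = [[0] * len(teams) for _ in teams]
    for (a, b), n in meetings.items():
        counts[index[a]][index[b]] += n
        counts[index[b]][index[a]] += n
    return [[Fraction(x, sum(row)) for x in row] for row in counts]

def apply(matrix, vector):
    """Matrix-vector product: each entry is a row's weighted average."""
    return [sum(c * v for c, v in zip(row, vector)) for row in matrix]

# Duke plays 34 games and UNC 32, two of them against each other.
# The "rest" entries are hypothetical stand-ins for everyone else.
TEAMS = ["Duke", "UNC", "Duke rest", "UNC rest"]
MEETINGS = {
    ("Duke", "UNC"): 2,
    ("Duke", "Duke rest"): 32,
    ("UNC", "UNC rest"): 30,
}

C = schedule_matrix(MEETINGS, TEAMS)
print("Duke-UNC entry:", C[0][1])   # equals 2/34, shown in lowest terms
print("UNC-Duke entry:", C[1][0])   # equals 2/32, shown in lowest terms
print("row sums:", [sum(row) for row in C])

# Invented winning percentages, in the same order as TEAMS.
w = [Fraction(3, 4), Fraction(5, 8), Fraction(1, 2), Fraction(1, 2)]
print("opponents' average for each team:", apply(C, w))
```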
The mathematically inclined people who should be reading this will know
that, in some sense, (1-C)(1+C+C^2+...)=(1+C+C^2+... -C-C^2-...)=1 so long
as C^n gets small when n gets big; by multiplying both sides of my equation
above by (1-C), I get (1-C)r=w, or r=w+Cr, so that a team's RPI is its
winning percentage plus the average of its opponents' RPIs, which seems
itself to be a fine ab initio definition of our RPI.
Thus, if we can invert the matrix 1-C by our usual laws of matrix inversion,
we can just multiply that by w to get r.
Unfortunately, we can't.
Each row of C, you remember, adds up to 1, so that each row of 1-C will add
to zero. The columns are linearly dependent, and 1-C is therefore not
invertible. So we've reduced this problem to one of inverting a noninvertible
matrix.
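The singularity is easy to confirm on a toy case. This sketch builds C for a made-up four-team schedule (team names and game counts are hypothetical), forms 1-C, and checks both the row sums and the determinant in exact arithmetic.

```python
from fractions import Fraction

# Hypothetical schedule: number of games between each pair of made-up teams.
TEAMS = ["A", "B", "C", "D"]
MEETINGS = {("A", "B"): 2, ("A", "C"): 1, ("A", "D"): 1, ("B", "C"): 1, ("C", "D"): 1}

def schedule_matrix(meetings, teams):
    """Row i, column j: games between i and j divided by games i played."""
    index = {team: k for k, team in enumerate(teams)}
    counts = [[0] * len(teams) for _ in teams]
    for (a, b), n in meetings.items():
        counts[index[a]][index[b]] += n
        counts[index[b]][index[a]] += n
    return [[Fraction(x, sum(row)) for x in row] for row in counts]

def determinant(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    rows = [list(row) for row in matrix]
    n = len(rows)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if rows[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)   # no usable pivot: the matrix is singular
        if pivot != col:
            rows[col], rows[pivot] = rows[pivot], rows[col]
            det = -det
        det *= rows[col][col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return det

n = len(TEAMS)
C = schedule_matrix(MEETINGS, TEAMS)
one_minus_c = [[int(i == j) - C[i][j] for j in range(n)] for i in range(n)]
row_sums = [sum(row) for row in one_minus_c]

print("row sums of 1-C:", row_sums)
print("determinant of 1-C:", determinant(one_minus_c))
```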
Consider a matrix all of whose entries are the same, say E. Each entry of
Er, then, is proportional to the average of r over all teams. This is another
upside to subtracting off .500 to put the average at zero: Er=0. (Incidentally,
because different teams play different numbers of games, it may not work
out exactly to zero; let me then not actually subtract off .500 exactly,
but whatever I need to subtract to make Er=0. Not only is this necessary
to make the terms C^n w go to zero, but it also gives the closest approximation
in a least squares sense to w when we end up multiplying (1-C) by the
solution r that this will give us; remember, since 1-C is singular, no matter
what we choose for r, (1-C)r will be restricted to a subspace that doesn't
necessarily exactly include w, but it gets darn close, and this little
hack we're doing gives us the point of closest approach.)
Since Er=0, and (1-C)r=w, (1-C+E)r=w. So, by adding some number to each
entry of 1-C, we get a matrix 1-C+E whose inverse, M, can be determined by
standard methods.
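Here is a minimal sketch of that solve on a made-up four-team schedule, using exact fractions and plain Gauss-Jordan elimination. The teams, results, and the choice of 1/n for every entry of E are assumptions for illustration. In this toy case every game is between listed teams, and subtracting exactly .500 turns out to give a solution with Er=0; the code checks that rather than assuming it.

```python
from fractions import Fraction

# Hypothetical results for four made-up teams, as (winner, loser) pairs.
TEAMS = ["A", "B", "C", "D"]
GAMES = [("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("C", "D"), ("A", "D")]

n = len(TEAMS)
index = {team: k for k, team in enumerate(TEAMS)}
wins = [0] * n
counts = [[0] * n for _ in range(n)]
for winner, loser in GAMES:
    i, j = index[winner], index[loser]
    wins[i] += 1
    counts[i][j] += 1
    counts[j][i] += 1
played = [sum(row) for row in counts]

# w: winning percentage minus .500; C: schedule matrix; E: every entry 1/n.
w = [Fraction(wins[i], played[i]) - Fraction(1, 2) for i in range(n)]
C = [[Fraction(counts[i][j], played[i]) for j in range(n)] for i in range(n)]
e = Fraction(1, n)
shifted = [[int(i == j) - C[i][j] + e for j in range(n)] for i in range(n)]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination (exact with Fractions)."""
    size = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[size] for row in aug]

r = solve(shifted, w)
Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]

for team, rating in zip(TEAMS, r):
    print(team, rating, float(rating))
print("sum of ratings:", sum(r))
print("r == w + Cr:", all(r[i] == w[i] + Cr[i] for i in range(n)))
```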
In some sense we don't have to calculate M; all we really need to do is
solve (1-C)r=w for r. There is some value in calculating M, though:
- Modulo early season tournaments, we know the schedule before the
season, and can thus calculate M before the season begins, rather than
having to wait until we know w.
- Having an actual matrix M for which r=Mw can provide a certain amount
of insight into how closely a team is tied to another. Some of you probably
don't care, but I think this sort of thing is enlightening; a team might
think it's doing well because it has some good wins over teams with good
records, but its row in the matrix is dominated by entries against Towson St.
and Yale.
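Both points can be sketched in a few lines: M is computed from the schedule alone, and the records are only plugged in afterward. The four teams, their schedule, their records, and the choice of 1/n for the entries of E are all hypothetical.

```python
from fractions import Fraction

# Hypothetical schedule, known before any game is played.
TEAMS = ["A", "B", "C", "D"]
MEETINGS = {("A", "B"): 2, ("A", "C"): 1, ("A", "D"): 1, ("B", "C"): 1, ("C", "D"): 1}

def schedule_matrix(meetings, teams):
    """Row i, column j: games between i and j divided by games i played."""
    index = {team: k for k, team in enumerate(teams)}
    counts = [[0] * len(teams) for _ in teams]
    for (a, b), n in meetings.items():
        counts[index[a]][index[b]] += n
        counts[index[b]][index[a]] += n
    return [[Fraction(x, sum(row)) for x in row] for row in counts]

def inverse(matrix):
    """Matrix inverse by Gauss-Jordan elimination (exact with Fractions)."""
    size = len(matrix)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]

# Preseason: only the schedule is needed to build M.
n = len(TEAMS)
C = schedule_matrix(MEETINGS, TEAMS)
e = Fraction(1, n)
shifted = [[int(i == j) - C[i][j] + e for j in range(n)] for i in range(n)]
M = inverse(shifted)

# After the season: hypothetical records of 3-1, 2-1, 1-2 and 0-2, minus .500.
w = [Fraction(3, 4), Fraction(2, 3), Fraction(1, 3), Fraction(0, 2)]
w = [x - Fraction(1, 2) for x in w]
r = [sum(M[i][j] * w[j] for j in range(n)) for i in range(n)]

print("ratings:", dict(zip(TEAMS, map(float, r))))
# Row A of M shows how strongly each team's record feeds into A's rating.
print("row A of M:", dict(zip(TEAMS, map(float, M[0]))))
```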
Again, we have this kernel problem with (1-C), this singularity problem,
that the equation (1-C)r=w is not solvable. (1-C+E)r=w is the equation
you should actually try to solve.