Assessing Elo Rating systems for NFL teams and Chess masters

Competitor Ratings

Head-to-head matches characterize many sports, such as football and chess. The pattern of wins and losses indicates the relative strengths of the competitors, and should also indicate how a team will do when facing other opponents in the future. One frequently used rating method is the Elo rating system pioneered by Arpad Elo, a Hungarian-born American physicist.

The competitors’ ratings predict the outcome of a match. Starting from an initial value, each competitor’s rating is adjusted up after a victory and down after a loss, with victories over strong opponents increasing one’s rating the most. The amount by which a rating improves with a victory depends on the way the system is tuned.

In this blog I will discuss the tradeoffs of this tuning. A sensitive tuning means that one’s rating improves rapidly with each victory, but can also drop sharply after a loss. A sensitive tuning is therefore responsive to recent history, but introduces variability in one’s match-to-match ratings. I will close by discussing how this rating sensitivity should be a consideration in weighing recent performance against a long-term record, the length of the season, and differences in win percentages of stronger versus weaker teams.

Elo ratings

Many variations of the original Elo system have been adapted to the specific characteristics of different sports. For purposes of this blog, I will use a generic version with parameters that are used in many variations of the rating system. Ratings for teams are relative to each other, so the probability that team A wins a head-to-head competition with team B is

E_{a}=\frac{1}{1+10^{(R_{b}-R_{a})/400}}

and the probability that B beats A is

E_{b}=\frac{1}{1+10^{(R_{a}-R_{b})/400}}

where R_{a} is the rating for team A and R_{b} is the rating for team B.
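
As a quick illustration, here is a minimal Python sketch of the win probability. It assumes the standard logistic Elo curve with base 10 and a scale of 400 rating points, which is consistent with the "Win 1" figures in the summary table later in this post. The function name expected_score is my own label, not part of any library.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B, assuming the base-10 logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# A 100-point rating edge (the ratings here are illustrative values only).
e_a = expected_score(1500, 1400)
e_b = expected_score(1400, 1500)
print(round(e_a, 3), round(e_b, 3))
```

A 100-point edge corresponds to roughly a 64 percent chance of winning, and the two probabilities always sum to one.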

After a win the rating for team A is updated with

R_{a(n)}=R_{a(n-1)}+K(1- E_{a(n-1)})

and after a loss the rating is updated to

R_{a(n)}=R_{a(n-1)}+K(0- E_{a(n-1)})

where K is the sensitivity to the latest result. In many rating systems, K values range from 10 to 40 (see Wikipedia).
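
The two update rules can be folded into one function by writing the result as S = 1 for a win and S = 0 for a loss. The sketch below assumes the usual base-10, 400-point probability curve and assumes both competitors are updated after every match, so the loser gives up exactly what the winner gains. The names update and expected_score are my own.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B (assumed base-10 logistic curve)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update(rating_a, rating_b, a_won, k):
    """Return the new (rating_a, rating_b) after one match."""
    s_a = 1.0 if a_won else 0.0
    # K * (S - E): positive after a win, negative after a loss.
    delta = k * (s_a - expected_score(rating_a, rating_b))
    # B's change is the mirror image because E_b = 1 - E_a and S_b = 1 - S_a.
    return rating_a + delta, rating_b - delta


# Two evenly matched competitors, K = 20, A wins once.
r_a, r_b = update(1400.0, 1400.0, a_won=True, k=20)
print(r_a, r_b, round(expected_score(r_a, r_b), 3))
```

With K = 20 the winner moves to 1410 and the loser to 1390, which raises the winner's projected probability in a rematch to about 0.529.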

The sensitivity parameter

The remainder of this blog looks at how sensitive ratings are to recent wins for a given K, and at how much they vary over time as the competitor wins and loses according to its true probability. To assess the sensitivity for a given K, I examine the movement in ratings for two competitors that repeatedly play each other. Both start at 1400 and one team has a string of consecutive wins. For several values of K, the following graph shows how the rating for the winning team increases as wins accumulate.

[Figure: consecutive wins ratings]

and how the expected win probability versus the same opponent increases

[Figure: consecutive wins percentage]

The ratings and projected win probabilities increase rapidly at first, but the increases slow as the opponent’s weakness reveals itself. A higher K speeds the increase in ratings and win probability. From this point of view, a high K value allows a competitor to ascend quickly to the rating deserved in this competition. A high K value, however, also leads to swings in ratings once a best estimate of the win probability is reached.
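
The winning-streak experiment is easy to sketch in Python. This version assumes the base-10, 400-point probability curve and assumes both competitors' ratings are updated after each match. The function name winning_streak and the particular K values printed are my own choices for illustration.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B (assumed base-10 logistic curve)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def winning_streak(k, n_wins, start=1400.0):
    """Winner's projected win probability after each consecutive win."""
    r_a = r_b = start
    probs = []
    for _ in range(n_wins):
        # The winner gains K * (1 - E); the loser drops by the same amount.
        delta = k * (1.0 - expected_score(r_a, r_b))
        r_a, r_b = r_a + delta, r_b - delta
        probs.append(expected_score(r_a, r_b))
    return probs


for k in (10, 20, 40):
    print(k, [round(p, 3) for p in winning_streak(k, 5)])
```

Each additional win adds less than the one before, because the winner was already expected to win, and a larger K climbs the same curve in fewer matches.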

To see this, I created two competitors with ratings, R_{a} and R_{b}, matching their long-term win probabilities, E_{a} and E_{b}, and simulated 10,000 head-to-head competitions for different K values. This produces histories of ratings and win probabilities for various rating levels and K values, with the ratings variability summarized by the variance of the historical ratings. Ratings corresponded to probabilities from 0.05 to 0.95 in steps of 0.05. K values were from 0 to 51 in steps of 1. When plotted, the results were

[Figure: variance of ratings versus K]

When K=0, the ratings are not updated, so the variance is zero. The variance increases linearly with K.

variance = 46.2 K

with F_{(1,968)}=169300 for p < .001.
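
A rough version of this simulation can be written in a few lines of Python. It is a sketch under my own assumptions rather than the code behind the numbers above: the probability curve is the base-10, 400-point form, both ratings are updated after every match, the two ratings are centered on 1400, and the seed and the function name rating_variance are arbitrary.

```python
import math
import random
import statistics


def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B (assumed base-10 logistic curve)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def rating_variance(p_true, k, n_games=10_000, seed=0):
    """Variance of A's rating history when A truly wins with probability p_true."""
    rng = random.Random(seed)
    # Start at the rating gap whose projected probability equals p_true.
    gap = 400.0 * math.log10(p_true / (1.0 - p_true))
    r_a, r_b = 1400.0 + gap / 2.0, 1400.0 - gap / 2.0
    history = []
    for _ in range(n_games):
        s_a = 1.0 if rng.random() < p_true else 0.0
        delta = k * (s_a - expected_score(r_a, r_b))
        r_a, r_b = r_a + delta, r_b - delta
        history.append(r_a)
    return statistics.pvariance(history)


for k in (0, 10, 20, 40):
    print(k, round(rating_variance(0.6, k), 1))
```

The exact values depend on the seed and on the number of matches, so they should be read as noisy estimates of the trend rather than as a reproduction of the fitted slope.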

To summarize, the following table shows the trade-off between responsiveness and ratings noise. Its columns are: 1) the K-factor; 2) the number of consecutive wins needed to go from a 50 percent projected win probability to a 66.7 percent probability; 3) the standard deviation of the rating when the rated win probability matches the actual win probability (1400 is the mean rating); 4) the probability of winning again after a single win when both start with 1400 ratings; 5) the probability of winning after a single loss when both start with 1400 ratings.

K    Wins to 66.7%   Stdev   Win 1   Lose 1
5    29              15.2    0.507   0.493
10   14              21.5    0.514   0.486
15   9               26.3    0.522   0.478
20   7               30.4    0.529   0.471
25   5               34      0.536   0.464
30   4               37.2    0.543   0.457
35   3               40.2    0.55    0.45
40   3               43      0.557   0.443
45   3               45.6    0.564   0.436
50   2               48.1    0.571   0.429
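
The last three columns can be recomputed directly. The sketch below assumes the standard deviation is the square root of the fitted variance, 46.2 K, and that the win and loss columns come from one update of both ratings under the base-10, 400-point curve. The function name table_row is my own, and the wins-to-66.7% column is left out because it depends on a counting convention the table does not spell out.

```python
import math


def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B (assumed base-10 logistic curve)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def table_row(k, start=1400.0, variance_slope=46.2):
    """Return (stdev, P(win after one win), P(win after one loss)) for a given K."""
    stdev = math.sqrt(variance_slope * k)
    # One win between equals moves each rating by K * (1 - 0.5) = K / 2.
    shift = k * (1.0 - expected_score(start, start))
    win_1 = expected_score(start + shift, start - shift)
    # After a loss the roles are swapped, so the probability is the complement.
    return round(stdev, 1), round(win_1, 3), round(1.0 - win_1, 3)


for k in range(5, 55, 5):
    print(k, *table_row(k))
```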

Adjusting K

Some systems, such as that used by FIDE for chess ratings, set K according to experience, age and current rating. Generally their systems have lower K values for players with shorter histories. In some cases, this might not be the best policy for setting K. In an ideal world, matches involving newer players yield the most information. Since the results are on a blank canvas, the ratings should be quick to reward talented play. Over an extended time period, results should stabilize to a “true” rating, which should drift to reflect improvement but not be overly jumpy. In this scenario, a higher initial K would quickly adjust the rating to the “true” value, and lower subsequent values of K would minimize the noise in mature ratings. This jumpiness in predictions might be especially problematic for sports, such as baseball, where the winning percentages of the best teams are far below 100 percent.
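
One way to picture a K that starts high and settles lower is a simple schedule keyed to the number of matches played. The sketch below is purely hypothetical: the function name, the linear decline, and the 30-match settling period are my own choices for illustration, with the end points taken from the 10 to 40 range mentioned earlier. It is not the rule used by FIDE or any other body.

```python
def k_for_experience(games_played, k_new=40.0, k_mature=10.0, settle_games=30):
    """Hypothetical schedule: K falls linearly from k_new to k_mature."""
    if games_played >= settle_games:
        return k_mature
    fraction = games_played / settle_games
    return k_new + fraction * (k_mature - k_new)


for n in (0, 10, 20, 30, 100):
    print(n, k_for_experience(n))
```

Under this schedule a newcomer's early results move the rating quickly, while a veteran's rating changes by no more than 10 points per match.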

2 Responses to “Assessing Elo Rating systems for NFL teams and Chess masters”
  1. Vikrant says:

    As in the case of Roger Federer, a player’s rating should be affected more by recent performances, to give a better idea of his chances of winning, which is something Elo’s rating model lacks. I think K rightly balances the equation, and it can absorb many factors. The trade-off between variance and K is an extremely important consideration.
    I have only one concern: how to balance K with the situation. K is treated here as a response time and held constant, but I think it can also change over time, and the added variance would not be much of a problem compared with the better correction gained from a small increase in K. If a team is winning 9 matches out of 10, its recent performances should dominate and the value of K should increase.
    I totally agree that new players should be given higher ratings based on their recent performances. Initially, K will have a greater effect than their ratings. This will eliminate our bias towards experience and give them a proper chance. This model could really help in the betting industry.

