### Assessing Elo Rating systems for NFL teams and Chess masters

#### Competitor Ratings

Head-to-head matches characterize many sports, such as football and chess. The pattern of wins and losses indicates the relative strengths of the competitors, which in turn should indicate how a team will fare against other opponents in the future. One frequently used rating method is the Elo rating system, pioneered by Arpad Elo, a Hungarian-born American physicist.

Ratings for the competitors are a predictor of the outcome of a match. Following the match, the rating for each competitor is adjusted up or down depending on the outcome: from a starting value, each competitor’s rating moves up with victories and down with losses, with victories over strong opponents increasing one’s rating the most. The amount one’s rating improves with a victory depends on how the system is tuned.

In this blog I will discuss the tradeoffs of this tuning. A sensitive tuning means that one’s rating improves rapidly with each victory, but can also drop sharply after a loss. Sensitive tuning is thus responsive to recent history, but introduces variability in one’s match-to-match ratings. I will close by discussing how this rating sensitivity should be a consideration when weighing recent performance against a long-term record, the length of a season, and differences in the win percentages of stronger versus weaker teams.

#### Elo ratings

Many variations of the original Elo system have been adapted to the specific characteristics of different sports. For the purposes of this blog, I will use a generic version with parameters common to many variations of the rating system. Ratings for teams are relative to each other, so the probability that team A wins a head-to-head competition with team B is

$E_{a}=\frac{1}{1+10^{\frac{R_{b}-R_{a}}{400}}}$

and the probability that B beats A is

$E_{b}=\frac{1}{1+10^{\frac{R_{a}-R_{b}}{400}}}$

where $R_{a}$ is the rating for team A and $R_{b}$ is the rating for team B.
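These expected scores are straightforward to compute. A minimal Python sketch (function name is my own):

```python
def expected_score(r_a, r_b):
    """Probability that the competitor rated r_a beats the one rated r_b
    under the Elo model: 1 / (1 + 10^((r_b - r_a) / 400))."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Equal ratings give a 50/50 expectation; a 400-point edge gives about 0.91.
print(expected_score(1400, 1400))            # 0.5
print(round(expected_score(1800, 1400), 3))  # 0.909
```

Note that $E_{a} + E_{b} = 1$, so only the rating difference matters, not the absolute rating levels.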

After a win the rating for team A is updated with

$R_{a(n)}=R_{a(n-1)}+K(1- E_{a(n-1)})$

and after a loss the rating is updated to

$R_{a(n)}=R_{a(n-1)}+K(0- E_{a(n-1)})$

where $K$ is the sensitivity to the latest result. In many rating systems, $K$ values range from 10 to 40 (see Wikipedia).
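The two update rules collapse into one by letting the score be 1 for a win and 0 for a loss. A sketch of a single-match update (naming is my own; K = 20 is just an illustrative default):

```python
def update_ratings(r_a, r_b, score_a, k=20.0):
    """Return (new_r_a, new_r_b) after one match.
    score_a is 1.0 if A won, 0.0 if A lost."""
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    # The winner gains exactly what the loser drops, so the update is zero-sum.
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta

# Two 1400-rated teams with K = 20: the winner moves to 1410, the loser to 1390.
print(update_ratings(1400.0, 1400.0, 1.0))  # (1410.0, 1390.0)
```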

#### The sensitivity parameter

The remainder of this blog looks at how K controls both the sensitivity of ratings to recent wins and their variability over time as the competitor wins and loses according to the team’s true probability. To assess the sensitivity for a given K, I examine the movement in ratings for two competitors that repeatedly play each other. Both start at 1400, and one team has a string of consecutive wins. For several values of K, the following graph shows how the rating for the winning team increases as wins accumulate,

and how the expected win probability against the same opponent increases.

The ratings and projected win probabilities rapidly increase initially, but slow as the opponent’s weakness reveals itself. A higher K speeds the increase in ratings and win probability. From this point of view, a high K value allows a competitor to quickly ascend to the rating it deserves in this competition. A high K value, however, also leads to swings in ratings once a best estimate of the win probability is reached.
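The winning streak can be simulated by applying the update after each win and counting matches until the projected win probability crosses a target. A sketch, under my assumptions that both ratings are updated each match and that the count stops as soon as the target is reached (exact counts can differ by one depending on the stopping rule):

```python
def expected_score(r_a, r_b):
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def wins_until(target, k, start=1400.0):
    """Count consecutive wins by A (both ratings updated after each match)
    until A's expected score against B reaches `target`."""
    r_a = r_b = start
    wins = 0
    while expected_score(r_a, r_b) < target:
        delta = k * (1.0 - expected_score(r_a, r_b))
        r_a += delta
        r_b -= delta
        wins += 1
    return wins

# Higher K reaches the same projected win probability in fewer wins.
print(wins_until(2 / 3, k=10), wins_until(2 / 3, k=40))
```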

To see this, I created two competitors with ratings, $R_{a}$ and $R_{b}$, matching their long-term win probabilities, $E_{a}$ and $E_{b}$, and simulated 10,000 head-to-head competitions for different K values. This produces histories of ratings and win probabilities for various rating levels and K values, with the ratings variability summarized by the variance of the historical ratings. Ratings corresponded to probabilities from 0.05 to 0.95 in steps of 0.05, and K values ran from 0 to 51 in steps of 1. When plotted, the results were as follows.

When K = 0, the ratings are not updated, so the variance is zero. The variance increases linearly with K:

$\text{variance} = 46.2K$

with $F_{(1,968)} = 169{,}300$, $p < .001$.
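This simulation can be reproduced with a small Monte Carlo sketch. Here I fix B at 1400, give A the rating implied by its true win probability, and let both ratings update over 10,000 simulated games (names and the seed are my own choices; the fitted slope will vary with the random draws):

```python
import math
import random

def rating_variance(p_true, k, n_games=10_000, seed=1):
    """Variance of A's rating history when A's true win probability is p_true
    and A starts at the rating that makes its expected score equal p_true."""
    rng = random.Random(seed)
    r_b = 1400.0
    r_a = r_b + 400.0 * math.log10(p_true / (1.0 - p_true))  # makes E_a == p_true
    history = []
    for _ in range(n_games):
        e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        score = 1.0 if rng.random() < p_true else 0.0
        delta = k * (score - e_a)
        r_a += delta
        r_b -= delta
        history.append(r_a)
    mean = sum(history) / len(history)
    return sum((x - mean) ** 2 for x in history) / len(history)

# K = 0 leaves ratings frozen (zero variance); variance grows with K.
print(rating_variance(0.5, 0), rating_variance(0.5, 10), rating_variance(0.5, 20))
```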

To summarize, the following table shows the tradeoff between responsiveness and ratings noise. Its columns are:

1. the K-factor,
2. the number of consecutive wins needed to go from a 50 percent projected win probability to a 66.7 percent probability,
3. the standard deviation of the rating when the rated win probability matches the actual win probability (1400 is the mean rating),
4. the probability of winning again after a single win, where both competitors start with 1400 ratings, and
5. the probability of winning after a single loss, where both start with 1400 ratings.

| K  | Wins to 66.7% | Stdev | Win 1 | Lose 1 |
|----|---------------|-------|-------|--------|
| 5  | 29            | 15.2  | 0.507 | 0.493  |
| 10 | 14            | 21.5  | 0.514 | 0.486  |
| 15 | 9             | 26.3  | 0.522 | 0.478  |
| 20 | 7             | 30.4  | 0.529 | 0.471  |
| 25 | 5             | 34.0  | 0.536 | 0.464  |
| 30 | 4             | 37.2  | 0.543 | 0.457  |
| 35 | 3             | 40.2  | 0.550 | 0.450  |
| 40 | 3             | 43.0  | 0.557 | 0.443  |
| 45 | 3             | 45.6  | 0.564 | 0.436  |
| 50 | 2             | 48.1  | 0.571 | 0.429  |
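The last two columns have a closed form: after one game between two 1400-rated competitors, the winner sits K/2 points above 1400 and the loser K/2 below, so the rating gap is exactly K. A quick check (function name is mine):

```python
def win_prob_after_one_game(k, won=True):
    """Expected score against the same opponent after one game,
    when both competitors started at 1400 (rating gap is +/- K)."""
    gap = k if won else -k
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))

# Matches the "Win 1" and "Lose 1" columns for K = 20.
print(round(win_prob_after_one_game(20), 3))             # 0.529
print(round(win_prob_after_one_game(20, won=False), 3))  # 0.471
```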

Some systems, such as the one FIDE uses for chess ratings, set K according to experience, age, and current rating. Generally, these systems assign higher K values to players with shorter histories. There is a rationale for this schedule: in an ideal world, matches involving newer players yield the most information. Since the results arrive on a blank canvas, the ratings should be quick to reward talented play. Over an extended period, results should stabilize to a “true” rating, which should drift to indicate improvement but not be overly jumpy. In this scenario, a higher initial K quickly adjusts the rating toward its “true” value, while lower subsequent values of K minimize the noise in mature ratings. This jumpiness in predictions might be especially problematic for sports, such as baseball, where the winning percentages of the best teams are far below 100 percent.