Computer Ranking Analysis

All of these reports use the information published by MasseyRatings at College Football Ranking Composite with Dr. Massey's kind permission.

The general idea is to provide the ranking information in all formats that would be needed to count "votes" in a team-quality poll were the computer rating systems "voters" in such a poll. There are many ways to count ranked ballots to create a composite rank, and since no single way is best, I try to provide enough information to apply any of them.

In general I believe that any computer ranking, regardless of how bad, is likely better than any human's opinion, no matter how 'expert' the human. The computer rating takes into account every game that contributes to each of the 8,385 team-vs-team comparisons in a 130-team field, while the human is subjectively biased by having only a very small subset of the games played as direct influence.

Just as a human team-quality poll can result in a better ranking than that of any individual voter's subjective rankings, a "poll" of the computer rankings each of which is based upon different objective measurements can result in a measurably better list.

I only include computer ratings that rank all teams in the field, so my list will never exactly match Dr. Massey's, which includes human top 25's and a few computer "top n" where n is less than the number of teams in the field.

Top 25

Truncates every computer rating's ranking at 25 and then counts the ballots the same way media polls do, using a 25-24-23-... point assignment for teams ranked 1-2-3....

I only include this report to demonstrate how much information is left out of the usual media presentation of their poll results. In addition to the number of "points" I include the number of ballots that listed the team in the top 25 and the number of votes for each rank for which the team was voted in the top 25.
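As a rough illustration of the counting described above (a sketch, not the code behind the report), the following assumes each ballot is a list of team names ordered best first. The function name `count_top_n` and the sample ballots are made up for the example, which uses a "top 3" so the numbers are easy to check by hand.

```python
from collections import Counter, defaultdict

def count_top_n(ballots, depth=25):
    """Count truncated ballots the way media polls do.

    ballots: list of rankings, each a list of team names, best first.
    Returns (points, appearances, votes_by_rank).
    """
    points = Counter()                    # team -> total poll points
    appearances = Counter()               # team -> ballots listing it in the top `depth`
    votes_by_rank = defaultdict(Counter)  # team -> {rank: number of ballots}
    for ballot in ballots:
        for rank, team in enumerate(ballot[:depth], start=1):
            points[team] += depth + 1 - rank  # 25-24-23-... when depth is 25
            appearances[team] += 1
            votes_by_rank[team][rank] += 1
    return points, appearances, votes_by_rank

# Hypothetical ballots, truncated at 3 instead of 25.
ballots = [["A", "B", "C", "D"], ["B", "A", "D", "C"], ["A", "C", "B", "D"]]
points, appearances, votes_by_rank = count_top_n(ballots, depth=3)
for team, total in points.most_common():
    print(team, total, appearances[team], dict(votes_by_rank[team]))
```

The per-rank vote counts are the part that a points-only presentation hides: two teams with the same point total can have very different ballot profiles.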

Top 25 Correlations

See below for a general discussion of rank-correlations. The top 25 correlations are a bit different, since not every ranking includes the same teams. The reports are described in Top 25 Correlations.


Borda

The usual method of counting top 25 ballots is a variation on the Borda Count. In its basic form, teams get one point for each team they are ranked better than. In a 130-team field, a #1 vote is worth 129, #2 is worth 128, and so on down to #130, which is worth 0. When every rating ranks all teams, the resulting order is the same as ordering by average rank over all ratings.
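The equivalence with average rank can be checked on a small case. This is a minimal sketch assuming every ballot ranks every team; the function names and the four-team ballots are invented for illustration.

```python
def borda_scores(ballots):
    """ballots: list of complete rankings (lists of team names, best first)."""
    n = len(ballots[0])
    scores = {team: 0 for team in ballots[0]}
    for ballot in ballots:
        for rank, team in enumerate(ballot, start=1):
            scores[team] += n - rank  # one point per team ranked below this one
    return scores

def average_ranks(ballots):
    totals = {team: 0 for team in ballots[0]}
    for ballot in ballots:
        for rank, team in enumerate(ballot, start=1):
            totals[team] += rank
    return {team: total / len(ballots) for team, total in totals.items()}

# Hypothetical ballots over a four-team field.
ballots = [["A", "B", "C", "D"], ["B", "A", "D", "C"], ["A", "C", "B", "D"]]
scores = borda_scores(ballots)
averages = average_ranks(ballots)

by_borda = sorted(scores, key=lambda team: -scores[team])  # most points first
by_average = sorted(averages, key=averages.get)            # best average rank first
print(by_borda, by_average)
```

With k ballots and n teams, a team's Borda score is k × (n - average rank), so sorting by points (descending) and sorting by average rank (ascending) must agree.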

Majority Consensus

My consensus rank is based upon the Bucklin vote-counting method. For each team, find the best rank for which a majority of the ratings agree the team should be ranked at least that highly. I use a strict majority, namely 50% + one rating. When there is an odd number of ratings this is the same as the arithmetic median. For an even number of ratings it is the best rank worse than the median.

Unless otherwise stated this is the ranking used in reports that include a team's rank.
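One way to sketch the Majority Consensus rule for a single team is to sort the ranks it received and pick the one at the strict-majority position. The function name `majority_consensus` and the sample ranks are illustrative assumptions, not taken from the reports.

```python
def majority_consensus(ranks):
    """Best rank that a strict majority of ratings place the team at or above.

    ranks: the ranks one team received, one per rating (1 is best).
    """
    needed = len(ranks) // 2 + 1      # strict majority: more than half of the ratings
    return sorted(ranks)[needed - 1]  # the `needed`-th best rank received

# Odd number of ratings: same as the median.
print(majority_consensus([3, 1, 7, 5, 4]))  # -> 4
# Even number of ratings: the first received rank worse than the median (here 4).
print(majority_consensus([3, 1, 7, 5]))     # -> 5
```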

Pairwise Matrix

Even when the majority ranks team A better than team B, it is possible that team B is ranked better on more ballots than team A. In Condorcet voting, the ballots are translated into pairwise comparisons between alternatives. The method suffers from a lack of transitivity: team A > team B and team B > team C does not imply team A > team C!
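The lack of transitivity is easiest to see with three ballots that form a cycle. The sketch below builds a pairwise count from complete ranked ballots; the function name and the three-team ballots are made up for the example.

```python
from itertools import combinations

def pairwise_matrix(ballots):
    """wins[a][b] = number of ballots that rank team a better than team b."""
    teams = sorted(ballots[0])
    wins = {a: {b: 0 for b in teams if b != a} for a in teams}
    for ballot in ballots:
        position = {team: index for index, team in enumerate(ballot)}
        for a, b in combinations(teams, 2):
            if position[a] < position[b]:
                wins[a][b] += 1
            else:
                wins[b][a] += 1
    return wins

# Hypothetical ballots chosen to produce a cycle.
ballots = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
wins = pairwise_matrix(ballots)
print(wins["A"]["B"], wins["B"]["C"], wins["C"]["A"])  # each is 2 of 3 ballots
```

Here A beats B, B beats C, and C beats A, each by two ballots to one, so the pairwise results alone do not define an order.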

Approval voting does not require a ranked list. Instead the voter just lists the alternatives that meet or exceed their minimum criteria. For division 1A football the criterion for selecting the four playoff teams is "one of the best four teams in the field." As I suggested in Committee "consensus" this would be a much better way for the committee to amalgamate its collective thoughts than the "top-25" poll they chose. My report just counts the number of voters that rank the team in its top 4.

Correlations

The basis for measuring how alike two ordinal rankings are is the distance metric. This is the number of swaps required by a bubble sort to place one of the lists in the same order as the other. The distance varies from zero (the lists are identical) to the total number of team-pairs (the lists are reversals of each other; 8,128 for a 128-team field.) For each ranking I report the contribution to the distance function by each team.
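A small sketch shows that counting reversed pairs and counting bubble-sort swaps give the same distance. The function names are illustrative, and the per-team contribution shown here (each reversed pair is charged to both teams involved) is an assumption about how such a contribution could be defined, not necessarily how the reports compute it.

```python
from itertools import combinations

def kendall_distance(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping team -> rank, with no ties.

    Returns (distance, contribution), where contribution[team] counts the
    reversed pairs that team takes part in (assumed definition).
    """
    contribution = {team: 0 for team in rank_a}
    distance = 0
    for s, t in combinations(rank_a, 2):
        if (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t]) < 0:
            distance += 1
            contribution[s] += 1
            contribution[t] += 1
    return distance, contribution

def bubble_sort_swaps(order_a, order_b):
    """Adjacent swaps a bubble sort needs to turn order_a into order_b."""
    target = {team: index for index, team in enumerate(order_b)}
    seq = [target[team] for team in order_a]
    swaps = 0
    for end in range(len(seq) - 1, 0, -1):
        for i in range(end):
            if seq[i] > seq[i + 1]:
                seq[i], seq[i + 1] = seq[i + 1], seq[i]
                swaps += 1
    return swaps

# Hypothetical four-team lists.
order_a = ["A", "B", "C", "D"]
order_b = ["B", "A", "D", "C"]
rank_a = {team: i for i, team in enumerate(order_a, start=1)}
rank_b = {team: i for i, team in enumerate(order_b, start=1)}
print(kendall_distance(rank_a, rank_b)[0], bubble_sort_swaps(order_a, order_b))
```

Reversing a list of n teams gives the maximum distance of n × (n - 1) ÷ 2, which is 6 for this four-team example and 8,128 for 128 teams.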

The distance is the number of discordant pairs - the number of pairs where the teams' relative orders are reversed in the two rankings. When the teams are in the same relative order in both lists the pair is said to be concordant. When the teams have the same rank in either list the pair is ignored.
These can be turned into rank correlation coefficients in several ways. The two I calculate are:
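Kendall's tau:

$$\tau = \frac{\#\text{Concordant pairs} - \#\text{Discordant pairs}}{\tfrac{1}{2} \times \#\text{Ranked} \times (\#\text{Ranked} - 1)}$$

Goodman and Kruskal's gamma:

$$\gamma = \frac{\#\text{Concordant pairs} - \#\text{Discordant pairs}}{\#\text{Concordant pairs} + \#\text{Discordant pairs}}$$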

These give -1 ≤ { τ, γ } ≤ 1 with |τ| ≤ |γ|. Both will be -1 if the teams are in exactly reverse order, 0 if the relationship is perfectly random (whatever that means!) and +1 if the rankings are identical. The τ and γ are the same if there are no ties (but notice that ties in the Majority Consensus rank are to be expected, in which case τ will be closer to zero than γ.)
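The effect of ties can be seen in a short sketch that counts concordant and discordant pairs directly, ignoring any pair tied in either list. The function name, the sample ranks, and the choice to return NaN for γ when every pair is tied are assumptions made for the example.

```python
from itertools import combinations

def tau_and_gamma(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping team -> rank; ties are allowed."""
    concordant = discordant = 0
    for s, t in combinations(rank_a, 2):
        product = (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0: the pair is tied in at least one list and is ignored
    n = len(rank_a)
    tau = (concordant - discordant) / (n * (n - 1) / 2)
    compared = concordant + discordant
    gamma = (concordant - discordant) / compared if compared else float("nan")
    return tau, gamma

# No ties: tau and gamma agree.
print(tau_and_gamma({"A": 1, "B": 2, "C": 3, "D": 4},
                    {"A": 2, "B": 1, "C": 4, "D": 3}))
# B and C tied in the first list: tau is pulled toward zero, gamma is not.
print(tau_and_gamma({"A": 1, "B": 2, "C": 2, "D": 4},
                    {"A": 1, "B": 2, "C": 3, "D": 4}))
```

In the second call five of the six pairs are concordant and one is ignored, so τ is 5/6 while γ is 1.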

There are more ways to aggregate team ranks by conference than there are ratings, but I have chosen these:

Rank Distribution by Conference includes every rank for every team in the conference. In addition to the average rank there's a count of team ranks in the ranges 1-25, 26-50, and so on.

Team Consensus Ranks by Conference lists the consensus ranks of teams by conference.

Pairwise Comparison of Teamranks by Conference compares the consensus rank of each conference team to that of every member of other conferences. The entries represent the number of times out of a thousand that a team from the row conference would be expected to have a better rank than that of a team from the column conference.
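One plausible reading of the per-thousand entries is the share of cross-conference team pairs in which the row conference's team has the better consensus rank. The sketch below implements that reading only; the function name, the sample ranks, and the treatment of ties (a tie does not count as "better") are all assumptions rather than a description of the actual report.

```python
def conference_vs_conference(row_ranks, col_ranks):
    """Per-thousand rate at which a row-conference team outranks a column-conference team.

    row_ranks, col_ranks: consensus ranks of each conference's teams (1 is best).
    """
    better = sum(1 for r in row_ranks for c in col_ranks if r < c)
    pairs = len(row_ranks) * len(col_ranks)
    return round(1000 * better / pairs)

# Hypothetical three-team conferences.
row = [1, 5, 9]
col = [3, 4, 10]
print(conference_vs_conference(row, col), conference_vs_conference(col, row))
```

When no ranks are tied, the two directions add up to 1,000 (up to rounding), since every cross-conference pair is won by one side or the other.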

Weighted Violations

Roughly one in five games results in the worse-ranked team winning, no matter which rating produces the ranking. Were it not so, sport would not be interesting. One measure of how well a rating represents results-to-date is the count of Retrodictive Ranking Violations (RRVs): the number of games in which team A beat team B but the ranking, which already takes that result into account, still ranks team B better than team A.
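Counting violations is a one-line check per game. This sketch assumes games are given as (winner, loser) pairs and the ranking as a team-to-rank mapping; the names and the sample data are invented.

```python
def count_rrv(games, rank):
    """Number of games whose winner is ranked worse than the loser.

    games: iterable of (winner, loser) team names.
    rank: dict mapping team -> rank (1 is best).
    """
    return sum(1 for winner, loser in games if rank[winner] > rank[loser])

# Hypothetical results: C's win over A is the only violation.
games = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
rank = {"A": 1, "B": 2, "C": 3, "D": 4}
print(count_rrv(games, rank))  # -> 1
```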

Motivated by Potemkin's idea that instead of just counting the number of RRVs we should take into account the size of the violation (rank difference) and its importance (how highly the loser is ranked), I came up with a Weighted RRV value that combines the size of the upset (in scores and rank difference) with its importance (the loser's rank). The "importance" component also takes into account that violations later in the season (when the rating has more input) should count more.

WRRV = ⌈ (WS - LS)÷S ⌉ × (WR - LR) × ƒ(LR)

WS = winner's score
LS = loser's score
⌈ (WS - LS)÷S ⌉ = the margin of victory in number of scores (⌈x⌉ is the least integer ≥ x; S = 8 for football)
WR = winner's rank
LR = loser's rank. Note that (WR - LR) is positive for all ranking violations by definition.
ƒ(LR) = $\log_{\mathrm{LR}^{i} M^{j}} M^{i+j}$, where i and j are chosen to make ƒ(1) any predetermined value; ƒ(M) is exactly 1.
M = the number of ranked teams (130 in 2017 for Division 1A).

See WRRV 2.1 for graphs of ƒ(LR) for different choices of i and j.
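The formula can be sketched numerically by rewriting the logarithm with the change-of-base rule: ƒ(LR) = (i + j) × ln M ÷ (i × ln LR + j × ln M). From this, ƒ(M) = 1 and ƒ(1) = (i + j) ÷ j, which is how i and j set the weight of a violation involving the #1 team. In the sketch below the function names, the default i = j = 1, and returning 0 for a game that is not a violation are illustrative assumptions; the values of i and j actually used are not stated here.

```python
import math

def importance(loser_rank, num_teams, i, j):
    """f(LR): log of M^(i+j) to the base LR^i * M^j, via change of base.

    Requires j > 0 so the base is not 1 when loser_rank is 1.
    """
    numerator = (i + j) * math.log(num_teams)
    denominator = i * math.log(loser_rank) + j * math.log(num_teams)
    return numerator / denominator

def wrrv(winner_score, loser_score, winner_rank, loser_rank,
         num_teams, i=1, j=1, score_size=8):
    """Weighted RRV for one game; i, j defaults are placeholders."""
    if winner_rank <= loser_rank:
        return 0.0  # assumed: the winner is already ranked better, so no violation
    margin_in_scores = math.ceil((winner_score - loser_score) / score_size)
    return margin_in_scores * (winner_rank - loser_rank) * importance(
        loser_rank, num_teams, i, j)

# Hypothetical upset: the #40 team beats the #1 team 24-14 in a 130-team field.
print(importance(1, 130, 1, 1), importance(130, 130, 1, 1))
print(wrrv(24, 14, winner_rank=40, loser_rank=1, num_teams=130))
```

In this example the margin is 2 scores, the rank difference is 39, and ƒ(1) is 2 with i = j = 1, so the weighted value is 2 × 39 × 2 = 156.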