Calculating the Distance Matrix

© Copyright 2006, Paul Kislanko

I was asked recently to explain some of the terms I was using with regard to the "Distance Matrix" and "connectivity." The terms are borrowed from Graph Theory, and it is quite possible (likely?) I've been misusing them whilst confusing visitors to the reports. So I wrote this to provide precise definitions for the terms as I use them. Folks who are intimidated by mathematical-looking expressions can skip to the plain-language explanations.

The Incidence Matrix IM is defined by
im i , j = { 1 if team i and team j have played}
0 otherwise

IM is a very sparse matrix. Even at the end of the season only about 10 percent of all possible team pairs will have played each other. However, its properties make it very easy to calculate how closely related two teams are by opponents' opponent chains.

If we let A = IM2, then A i , j is nonzero if teams i and j are opponents' opponents, and the nonzero value is the number of opponents they have in common. (A i, i is the number of teams that have played team i.) In general, for A = IMd, A i , j gives the count of number of unique paths of length d that connect team i and team j.

Define P( i , j) to be the smallest value of d { d = 1, 2, 3, ...} such that IMd i , j ≠ 0 if there is one. If P is defined for all team pairs (i , j) the field is said to connected. When the field is connected, the largest value of P(i,j) for all teams i and j is said to be the diameter (D) of the schedule graph.

• For all teams t, D(t) = MAX( P(t, i) ) for all teams i ≠ t ≤ D. One measure of how "connected" a team is to the field is given by this "diameter with respect to team t."
• If there is a subset of teams such that all teams in the subset only play each other then the field is never connected. We can always redefine the field to be the complement of this subset, and consider the subset a different schedule graph.
• If the field will eventually be connected by scheduled but not-yet-played games, we can measure the progress towards connectivity by counting the number of teams for which D(t) is defined compared to the count of teams for which it is not.

From Incidence to Distance

The object I've referred to as the Distance Matrix is defined by:
dm i , j = P(i , j)
The DM is not as significant in computational terms as the IM, but it provides some useful insight into how connectivity affects rating algorithms that depend upon convergence of indeterminate processes, and provides some insight into the rate of convergence.

Define Nd(t) to be the number of teams i ≠ t for which [ dm t, i ] = d. By definition, if team t is connected to the field, and there are N teams in the field,
(N - 1) = D(t)
Nd(t)
d=1
So for each connected team, we can find the Average Path Length (APL) that connects team t to all other teams by
APL(t) = ( 1 / (N - 1) ) ×D(t)
d × Nd(t)
d = 1

Once all teams are connected to the field, we can find the overall APL by averaging all the team APLs. But we can find Nd for the field for all d ≤ D by summing over teams Nd(t) and dividing by 2 (since if team A is team B's opponent, team B is team A's so that pair is counted for each team) and use the same formula as above replacing ( 1 / ( N - 1 ) ) with ( 1 / ( N × ( N - 1 ) / 2 ).

Huh?

In the Distance Matrix reports (1A All D-1) the columns have the team, conference, and in the case of the "all D-1" report the (sub)division. The remaining fields are N1, N2 etc. and APL.

N1
is just the number of teams the given team has played (number of opponents).
N2
is the number of teams one or more of the team's opponents have played and have not played this team (number of opponents' opponents).
N3... and so on
The number of teams that are opponents of teams that are opponents of teams counted in the previous column but are not themselves included in any prior column (Opponents' opponents' opponents'...)
APL
is the average path length from the team to each of all the other teams.

There are extra lines at the bottom of the report summarize the data for all teams.

#Pairs
is the number of pairs of teams that are related to each other by the same number of links in the "opponents' ... opponent" chain, for N1 (number of pairs that are opponents), N2 (number of pairs that are opponents' opponents) and so on. The APL value for this line in the report is a measure of how well connected the field is overall, and can be compared to the team APLs to distinguish the teams that are more or less -connected.
%Total pairs
is just #pairs expressed as a percentage of all team pairs at each O-O "level."

Significance

The reason connectivity is important is that since every team can't play every other team, we have to infer relative team rankings from comparison to the field as a whole. No matter what measurement we're using, it will be more accurate for a specific team if that team's opponents and opponents' opponents represent a larger fraction of the total number of teams. This is why the computer rankings vary so widely during the first few weeks of the season.

The reports I produce (Division 1A only All of D1) summaries of the schedule graph, where the only thing that matters is whether a pair of teams have played. I also only bother with division 1, although usually there is enough inter-divisional play to eventually connect all college football teams. I also don't attempt to report the list of pairs that connect two teams due to the large amount of data involved.

Fun and Games with the Schedule Graph

If instead of "whether the teams play each other" you care about which team won the connections between teams become "one-way", and it is called a directed graph. If you want to find one of those "A beat B beat ... beat Z" chains that are often humorous, check out The College Football Victory Chain Linker. Also of interest is Kenneth Massey's Game Graph Connections page.