I was asked recently to explain some of the terms I was using with regard to the "Distance Matrix" and "connectivity." The terms are borrowed from Graph Theory, and it is quite possible (likely?) I've been misusing them whilst confusing visitors to the reports. So I wrote this to provide precise definitions for the terms as I use them. Folks who are intimidated by mathematical-looking expressions can skip to the plain-language explanations.
The Incidence Matrix IM is defined by
im i , j | = { | 1 if team i and team j have played | } |
0 otherwise |
IM is a very sparse matrix. Even at the end of the season only about 10 percent of all possible team pairs will have played each other. However, its properties make it very easy to calculate how closely related two teams are by opponents' opponent chains.
If we let A = IM2, then A i , j is nonzero if teams i and j are opponents' opponents, and the nonzero value is the number of opponents they have in common. (A i, i is the number of teams that have played team i.) In general, for A = IMd, A i , j gives the count of number of unique paths of length d that connect team i and team j.
Define P( i , j) to be the smallest value of d { d = 1, 2, 3, ...} such that IMd i , j ≠ 0 if there is one. If P is defined for all team pairs (i , j) the field is said to connected. When the field is connected, the largest value of P(i,j) for all teams i and j is said to be the diameter (D) of the schedule graph.
• For all teams t, D(t) = MAX( P(t, i) ) for all teams i ≠ t ≤ D. One measure of how "connected" a team is to the field is given by this "diameter with respect to team t."
• If there is a subset of teams such that all teams in the subset only play each other then the field is never connected. We can always redefine the field to be the complement of this subset, and consider the subset a different schedule graph.
• If the field will eventually be connected by scheduled but not-yet-played games, we can measure the progress towards connectivity by counting the number of teams for which D(t) is defined compared to the count of teams for which it is not.
dm i , j | = | P(i , j) |
Define Nd(t) to be the number of teams i ≠ t for which [ dm t, i ] = d. By definition, if team t is connected to the field, and there are N teams in the field,
(N - 1) = | D(t) |
∑ | Nd(t) |
d=1 |
APL(t) = ( 1 / (N - 1) ) × | D(t) |
∑ | d × Nd(t) |
d = 1 |
Once all teams are connected to the field, we can find the overall APL by averaging all the team APLs. But we can find Nd for the field for all d ≤ D by summing over teams Nd(t) and dividing by 2 (since if team A is team B's opponent, team B is team A's so that pair is counted for each team) and use the same formula as above replacing ( 1 / ( N - 1 ) ) with ( 1 / ( N × ( N - 1 ) / 2 ).
Huh?
In the Distance Matrix reports (1A All D-1) the columns have the team, conference, and in the case of the "all D-1" report the (sub)division. The remaining fields are N1, N2 etc. and APL.
There are extra lines at the bottom of the report summarize the data for all teams.
The reports I produce (Division 1A only All of D1) summaries of the schedule graph, where the only thing that matters is whether a pair of teams have played. I also only bother with division 1, although usually there is enough inter-divisional play to eventually connect all college football teams. I also don't attempt to report the list of pairs that connect two teams due to the large amount of data involved.