Fun&Games with Sets and Subsets

© Copyright 2009, Paul Kislanko

Up until now I've categorized a team's place in the schedule graph by how many opponents + opponents' opponents the team has as a percentage of the field. That counts vertices in the schedule graph no more than two edges apart, but there's another way to measure connectivty and teams' contributions to connectivity.

Boyd Nation suggested to me awhile back that we could understand more about how a division 1 schedule is connected by "network group analysis" - divide the field into disjoint groups that are each "more highly connected" than the field as a whole and study the connections between groups. I couldn't find a practical way to base a rating system on this analysis because there's no unambiguous way to partition the field. But if a team (vertex) can be in more than one subset there is an unambiguous way to define subsets that we can use to study connectivity.

Instead of counting vertices and edges, we can count the number and kind of subsets to which each team belongs.

a college football team
the Football Bowl Subdivsion of NCAA Division 1 football
{ t | t ∈ FBS }
U is the set ("{...}") of all college football teams ( t ) such that ( | ) the team is a member of ( ∈ ) FBS.
That's just a long-winded way of saying that when I use a lower-case italicized letter I am referring to an FBS team unless I say otherwise.

Sets of sets

The first step in the Network Group Analysis is to find all the subsets S of U in which every team in the subset plays every other team in the subset. There are 4160 such subsets, the largest consisting of the Pac 10 teams since they each play all the other Pac 10 teams.

1022 of the 4160 are just subsets of that largest group and there's a similar redundancy due to round-robins in other conferences or conference divisions. So we define

S′ = {S |S is not a proper subset of a larger S}
There are 278 elements in S′: 1 with 10 members, 3 with 9 (WAC, MWC, and Sun Belt), 1 with 8 (Big East), 9 with 6 (divisions in 12-team conferences) 17 with 5, 85 with 4, 61 with 3, and 101 with 2.

The 101 elements of S′ with two members are the the games between teams who have no common opponents. I suggested that those games contributed the most to connecting the field, because each of the teams' opponents would contribute the team's opponent's OO list.

Another Metric

Jeff Bihl observed on the FCStop25 message board:
How do you measure how much each game contributes in terms of connectivity? Not all games where the two teams have no common opponents are created equal because some of them have a lot more opponents that have common opponents than others (A lot more ways to link the team in 3 steps already exist even if there are no 2 step links)? How would you go about slapping a number on that?

Begin by noting that games are edges in the schedule graph, with teams as the vertices. For each pair of teams {a b} find the shortest connection from a to b - a plays x plays y plays... plays b and count the number of edges. The longest of these shortest paths is the diameter of the schedule graph.

From here on I'll use ↔ instead of "plays." Notice that in every case ab just means the teams play each other - it doesn't matter which is the home team. ab is the same as ba.

The FBS football schedule has diameter = 4, and there are 237 team-pairs {x,y} that require at least 4-step chain: xabcy. If we remove either "ab" or "bc" from the last such chain, then the diameter of the games graph would have to be at least 5, so we could use the number of pairs {x,y} (that are four edges apart) for which ab and bc are the "middle links" as a measure of ab's and bc's contributions to connectivity. Call this value the "Responsibility Count" (RC) for each game.

To find these values, I use the data from the distance matrix as follows:

O[n](t) = { x | distance(x,t) = n }
where by "distance()" I mean the number of edges required to connect the teams as used above: O[1](t) is the set of team t's opponents, O[2](t) is the list of t's opponents' opponents (that are not also in O[1](t)) and so on.

Note that for distance(x,y) to be 4, there must be a game between teams a and b such that
xO[1](a) and yO[2](b)
xO[2](a) and yO[1](b)
corresponding to xab?y or x?a↔by.

So, for each game ab we find all {x y} that are defined by ‡ whose distance is the same as the schedule graph diameter. The number of such pairs is the RC for game ab.

We tend to think of the FBS football schedule as not being as well-connected as basketball or baseball because fewer than half the team-pairs are opponents' opponents. However, it turns out that the connections that do exist are highly robust. A few years ago in baseball there was one A↔B↔C↔D↔E chain which if broken would cause the diameter of the schedule graph to grow from 4 to at least 5. If we count the number of games with a non-zero RC for each of the 237 FBS team-pairs that are 4 steps apart, we find that to increase the schedule graph's diameter by 1 we'd need to eliminate 69 games. That's over 10 percent of the schedule!

See game responsibility counts and strength of connections for the actual counts. I find it somewhat satisfying that the "most important" game by this metric is the last game of the season and is Army vs Navy.

Alas, I had an error in the counting program. The corrected results are here.

PS - you can explore the schedule graph using Kenneth Massey's Game Graph Connecting Paths.