{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "*Note: GitHub is having trouble rendering some of the LaTeX formulas and equation numbers. Please view in nbviewer.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Colley's Matrix Method\n", "\n", "## Introduction\n", "\n", "Welcome! This is the first in a two-part series on Colley's Matrix Method for creating a resume rating system. In this first part we will work through Colley's brilliantly clean way to rate the resume of every FBS team in college football. Any system that attempts to rank all 130 FBS teams will have its problems, but I believe that this is the best system that can be created under Colley's strict rules for keeping our resume ratings unbiased. \n", "\n", "In the second part, we'll have some fun and break most of Colley's rules. A resume rating simply attempts to measure what every team has *achieved* relative to the others. It is not concerned with being predictive. In the second part we will introduce some simple priors and hyperparameters to move the resume ratings in the direction of power ratings -- ones that better match common knowledge about the wide range of team ability in college football." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background\n", "\n", "The Colley Matrix Method is a resume rating system that was a part of the Official Bowl Championship Series Ranking from 2001 to 2013. 
For Colley, a good *resume rating system* does a few things, as he explains [here](https://www.colleyrankings.com/matrate.pdf):\n", "- eliminates any bias toward conference, history or tradition,\n", "- eliminates the need to invoke some ad hoc means of deflating runaway scores, and\n", "- eliminates any other ad hoc adjustments, such as home/away tweaks.\n", "\n", "\n", "Following these self-imposed restrictions, Colley begins by giving every FBS team a starting rating of 1/2 and arrives at each team's final rating by taking into account only wins and losses. What makes his method so powerful is that he uses a simple mathematical method to account for the fact that we don't just want to look at a team's win percentage to rate them. We want to account for *strength of schedule*, the ability of the teams played on a given schedule. \n", "\n", "Consider teams *A*, *B*, and *C*. If *A* beats *B*, and *B* beats *C*, then we would say that *A* has a transitive win over *C*. It is natural to want to consider transitive wins when ranking teams, because beating a team with a winning record is better than beating a team with no wins at all. In a system that only cares about wins and losses, strength of schedule is simply a proper valuation of transitive wins and losses. Colley found a way to account for strength of schedule by looking at the complete college football picture of who beat whom and who lost to whom. What's amazing is that what sounds like a complicated spider-web of tracing these transitive wins and losses can be completely encapsulated in a simple formula." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Little Math\n", "\n", "Colley does an excellent job describing his system and its motivation [here](https://www.colleyrankings.com/matrate.pdf), which I will now abbreviate. This derivation will be very important to us in the second part of this series. 
Let's consider the following simple rating for a team,\n", "\n", "\\begin{equation} r = \\frac{1 + n_w}{2+n_{tot}} \n", "\\end{equation}\n", "\n", "where $n_w$ is their number of wins, and $n_{tot}$ is the number of games they have played. Notice that a team's rating must be between 0 and 1. A team that has played no games begins with a rating of $ r = \\frac{1+0}{2+0} = \\frac{1}{2}$. If they play 10 games in a season and win 7 they will have a rating of $ r = \\frac{1 + 7}{2+10} = \\frac{2}{3}$. This seems reasonable. Now it just takes a moderate amount of algebra to account for strength of schedule. Let's multiply both sides by the denominator. \n", "\n", "\\begin{equation} (2+ n_{tot}) r = 1 + n_w \n", "\\end{equation}\n", "\n", "It wouldn't be algebra without a clever identity, so let's add one now. \n", "\n", "\\begin{equation} n_w = \\frac{n_w - n_\\ell}{2} + \\frac{n_w + n_\\ell}{2} \n", "\\end{equation}\n", "\n", "We've added a new symbol, $n_\\ell$, the number of losses. Let's now replace $n_w$ in equation (2) with what we have in equation (3). \n", "\n", "\\begin{equation} (2+ n_{tot}) r = 1 + \\frac{n_w - n_\\ell}{2} + \\frac{n_w + n_\\ell}{2} \n", "\\end{equation}\n", "\n", "Let's move some stuff to the other side.\n", "\n", "\\begin{equation} (2+ n_{tot}) r - \\frac{n_w + n_\\ell}{2} = 1 + \\frac{n_w - n_\\ell}{2} \n", "\\end{equation}\n", "\n", "Notice that every game is a win or a loss, so $n_{tot} = n_w + n_\\ell$.\n", "\n", "\\begin{equation} (2+ n_{tot}) r - \\frac{n_{tot}}{2} = 1 + \\frac{n_w - n_\\ell}{2} \n", "\\end{equation}\n", "\n", "Bring the $n_{tot}$ terms together on the left hand side.\n", "\n", "\\begin{equation} 2r + n_{tot}(r - \\frac{1}{2}) = 1 + \\frac{n_w - n_\\ell}{2} \n", "\\end{equation}\n", "\n", "Now remember that multiplication is simply repeated addition. 
\n", "\n", "\\begin{equation} 2r + \\displaystyle\\sum^{n_{tot}}(r - \\frac{1}{2}) = 1 + \\frac{n_w - n_\\ell}{2} \n", "\\end{equation}\n", "\n", "The $\\Sigma$ symbol means we will sum $n_{tot}$ times. \n", "\n", "*Whew!* That was a lot of small steps that really added up. Let's take a step back and interpret our equation. If zero games have been played, everything goes away except for $ 2r = 1$ which recovers a rating of $\\frac{1}{2}$. As more games are played, the right hand side either increases or decreases by a half depending on whether it is a win or a loss. In order to maintain equality, the rating *r* on the left hand side has to increase or decrease to match. \n", "\n", "Notice that the summation on the left hand side is over every game played. For every game we take the difference between the team's rating, *r*, and the average rating of an opponent, $\\frac{1}{2}$. Colley's insight was that instead of taking the difference from the *average* rating, we can actually take the difference from the rating of the teams they have played. In order to do this we need a little more notation. Adding a superscript $i$ will denote that a given symbol pertains to team *i*.\n", "\n", "\\begin{equation} 2r^i + \\displaystyle\\sum^{n^i_{tot}}(r^i - \\frac{1}{2}) = 1 + \\frac{n^i_w - n^i_\\ell}{2} \n", "\\end{equation}\n", "\n", "Let's use a subscript of *j* for each team played by team *i*. Then $r^i_j$ is the rating of the $j^{th}$ team played by team *i*. 
Let's replace the $\\frac{1}{2}$ term with these $r^i_j$.\n", "\n", "\\begin{equation} 2r^i + \\displaystyle\\sum_{j=1}^{n^i_{tot}}(r^i - r^i_j) = 1 + \\frac{n^i_w - n^i_\\ell}{2} \n", "\\end{equation}\n", "\n", "Expanding the sum, the left hand side is $(2+n^i_{tot})r^i$ minus the ratings of every opponent team *i* has played. Every team will have one of these equations, so we can package the whole system as a matrix equation.\n", "\n", "\\begin{gather}\n", " \\begin{bmatrix}\n", " 2+n^1_{tot} & -n^{1,2} & \\ldots & -n^{1,M} \\\\\n", " -n^{2,1} & 2+n^2_{tot} & \\ldots & -n^{2,M} \\\\ \n", " \\vdots & \\vdots & \\ddots & \\vdots \\\\\n", " -n^{M,1} & -n^{M,2} & \\ldots &2+n^M_{tot} \\\\ \n", " \\end{bmatrix}\n", " \\begin{bmatrix}\n", " r^1 \\\\\n", " r^2 \\\\ \n", " \\vdots \\\\\n", " r^M \\\\ \n", " \\end{bmatrix}=\n", " \\begin{bmatrix}\n", " 1 + \\frac{n^1_w - n^1_\\ell}{2} \\\\\n", " 1+ \\frac{n^2_w - n^2_\\ell}{2} \\\\ \n", " \\vdots \\\\\n", " 1+ \\frac{n^M_w - n^M_\\ell}{2} \\\\ \n", " \\end{bmatrix}\n", " \\end{gather}\n", " \n", " Here we assume that we have M teams. The diagonal entry in row *i* is 2 plus the number of games played by team *i*. The off-diagonal entry in row *i* and column *j* is the negative of $n^{i,j}$, the number of times team *i* has played team *j*. Note that $n^{i,j} = n^{j,i}$, so this matrix is symmetric. The *r* column vector has the ratings we want to calculate, and the column vector after the equals sign accounts for the total wins and losses. Now all we need to do is build this matrix and use a solver to get those ratings!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building the matrix\n", "Finally, some Python! collegefootballdata.com has an excellent API for getting all of the games we want." 
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>id</th>\n", " <th>season</th>\n", " <th>week</th>\n", " <th>season_type</th>\n", " <th>start_date</th>\n", " <th>start_time_tbd</th>\n", " <th>neutral_site</th>\n", " <th>conference_game</th>\n", " <th>attendance</th>\n", " <th>venue_id</th>\n", " <th>...</th>\n", " <th>home_points</th>\n", " <th>home_line_scores</th>\n", " <th>home_post_win_prob</th>\n", " <th>away_id</th>\n", " <th>away_team</th>\n", " <th>away_conference</th>\n", " <th>away_points</th>\n", " <th>away_line_scores</th>\n", " <th>away_post_win_prob</th>\n", " <th>excitement_index</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>401110723</td>\n", " <td>2019</td>\n", " <td>1</td>\n", " <td>regular</td>\n", " <td>2019-08-24T23:00:00.000Z</td>\n", " <td>NaN</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>66543.0</td>\n", " <td>4013</td>\n", " <td>...</td>\n", " <td>24</td>\n", " <td>[7, 0, 10, 7]</td>\n", " <td>0.905953</td>\n", " <td>2390</td>\n", " <td>Miami</td>\n", " <td>ACC</td>\n", " <td>20</td>\n", " <td>[3, 10, 0, 7]</td>\n", " <td>0.094047</td>\n", " <td>8.767910</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>401114164</td>\n", " <td>2019</td>\n", " <td>1</td>\n", " <td>regular</td>\n", " <td>2019-08-25T02:30:00.000Z</td>\n", " <td>NaN</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>22396.0</td>\n", " <td>3610</td>\n", " <td>...</td>\n", " <td>45</td>\n", " <td>[14, 14, 7, 10]</td>\n", " <td>0.688630</td>\n", " <td>12</td>\n", " 
<td>Arizona</td>\n", " <td>Pac-12</td>\n", " <td>38</td>\n", " <td>[0, 21, 14, 3]</td>\n", " <td>0.311370</td>\n", " <td>7.842417</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>401117855</td>\n", " <td>2019</td>\n", " <td>1</td>\n", " <td>regular</td>\n", " <td>2019-08-29T23:00:00.000Z</td>\n", " <td>NaN</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>19648.0</td>\n", " <td>3892</td>\n", " <td>...</td>\n", " <td>24</td>\n", " <td>[7, 3, 14, 0]</td>\n", " <td>0.728942</td>\n", " <td>2681</td>\n", " <td>Wagner</td>\n", " <td>None</td>\n", " <td>21</td>\n", " <td>[0, 0, 14, 7]</td>\n", " <td>0.271058</td>\n", " <td>1.834351</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>401119255</td>\n", " <td>2019</td>\n", " <td>1</td>\n", " <td>regular</td>\n", " <td>2019-08-29T23:00:00.000Z</td>\n", " <td>NaN</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>18412.0</td>\n", " <td>3965</td>\n", " <td>...</td>\n", " <td>38</td>\n", " <td>[21, 7, 10, 0]</td>\n", " <td>0.999788</td>\n", " <td>2523</td>\n", " <td>Robert Morris</td>\n", " <td>None</td>\n", " <td>10</td>\n", " <td>[7, 3, 0, 0]</td>\n", " <td>0.000212</td>\n", " <td>0.118588</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>401119254</td>\n", " <td>2019</td>\n", " <td>1</td>\n", " <td>regular</td>\n", " <td>2019-08-29T23:00:00.000Z</td>\n", " <td>NaN</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>17620.0</td>\n", " <td>3700</td>\n", " <td>...</td>\n", " <td>46</td>\n", " <td>[13, 17, 7, 9]</td>\n", " <td>0.999979</td>\n", " <td>2415</td>\n", " <td>Morgan State</td>\n", " <td>None</td>\n", " <td>3</td>\n", " <td>[0, 3, 0, 0]</td>\n", " <td>0.000021</td>\n", " <td>0.472968</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows × 24 columns</p>\n", "</div>" ], "text/plain": [ " id season week season_type start_date \\\n", "0 401110723 2019 1 regular 2019-08-24T23:00:00.000Z \n", "1 401114164 2019 1 regular 2019-08-25T02:30:00.000Z \n", "2 401117855 2019 1 regular 
2019-08-29T23:00:00.000Z \n", "3 401119255 2019 1 regular 2019-08-29T23:00:00.000Z \n", "4 401119254 2019 1 regular 2019-08-29T23:00:00.000Z \n", "\n", " start_time_tbd neutral_site conference_game attendance venue_id ... \\\n", "0 NaN True False 66543.0 4013 ... \n", "1 NaN False False 22396.0 3610 ... \n", "2 NaN False False 19648.0 3892 ... \n", "3 NaN False False 18412.0 3965 ... \n", "4 NaN False False 17620.0 3700 ... \n", "\n", " home_points home_line_scores home_post_win_prob away_id away_team \\\n", "0 24 [7, 0, 10, 7] 0.905953 2390 Miami \n", "1 45 [14, 14, 7, 10] 0.688630 12 Arizona \n", "2 24 [7, 3, 14, 0] 0.728942 2681 Wagner \n", "3 38 [21, 7, 10, 0] 0.999788 2523 Robert Morris \n", "4 46 [13, 17, 7, 9] 0.999979 2415 Morgan State \n", "\n", " away_conference away_points away_line_scores away_post_win_prob \\\n", "0 ACC 20 [3, 10, 0, 7] 0.094047 \n", "1 Pac-12 38 [0, 21, 14, 3] 0.311370 \n", "2 None 21 [0, 0, 14, 7] 0.271058 \n", "3 None 10 [7, 3, 0, 0] 0.000212 \n", "4 None 3 [0, 3, 0, 0] 0.000021 \n", "\n", " excitement_index \n", "0 8.767910 \n", "1 7.842417 \n", "2 1.834351 \n", "3 0.118588 \n", "4 0.472968 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import requests\n", "import numpy as np\n", "\n", "year = 2019\n", "\n", "response = requests.get(r'https://api.collegefootballdata.com/games?'\n", " 'year={year}&seasonType=both'.format(year = year))\n", "games = pd.read_json(response.text)\n", "\n", "games.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! Now, let's simplify. The next three lines do three things:\n", "1. Take just the FBS games (no FCS games)\n", "2. Drop any unplayed or canceled games\n", "3. 
Take just the columns we need" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>home_team</th>\n", " <th>home_points</th>\n", " <th>away_team</th>\n", " <th>away_points</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Florida</td>\n", " <td>24</td>\n", " <td>Miami</td>\n", " <td>20</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Hawai'i</td>\n", " <td>45</td>\n", " <td>Arizona</td>\n", " <td>38</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>Cincinnati</td>\n", " <td>24</td>\n", " <td>UCLA</td>\n", " <td>14</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>Clemson</td>\n", " <td>52</td>\n", " <td>Georgia Tech</td>\n", " <td>14</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>Tulane</td>\n", " <td>42</td>\n", " <td>Florida International</td>\n", " <td>14</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " home_team home_points away_team away_points\n", "0 Florida 24 Miami 20\n", "1 Hawai'i 45 Arizona 38\n", "5 Cincinnati 24 UCLA 14\n", "9 Clemson 52 Georgia Tech 14\n", "11 Tulane 42 Florida International 14" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "games = games[(~games['home_conference'].isnull()) & (~games['away_conference'].isnull())]\n", "games = games[(games['home_points'] > 0) | (games['away_points'] > 0)]\n", "games = games[['home_team','home_points','away_team','away_points']]\n", "\n", "games.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks better! 
Let's add a `home_win` column that is $\\pm1$ depending on whether the home or away team wins, and a column of ones." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>home_team</th>\n", " <th>home_points</th>\n", " <th>away_team</th>\n", " <th>away_points</th>\n", " <th>home_win</th>\n", " <th>ones</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Florida</td>\n", " <td>24</td>\n", " <td>Miami</td>\n", " <td>20</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Hawai'i</td>\n", " <td>45</td>\n", " <td>Arizona</td>\n", " <td>38</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>Cincinnati</td>\n", " <td>24</td>\n", " <td>UCLA</td>\n", " <td>14</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>Clemson</td>\n", " <td>52</td>\n", " <td>Georgia Tech</td>\n", " <td>14</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>Tulane</td>\n", " <td>42</td>\n", " <td>Florida International</td>\n", " <td>14</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " home_team home_points away_team away_points home_win \\\n", "0 Florida 24 Miami 20 1 \n", "1 Hawai'i 45 Arizona 38 1 \n", "5 Cincinnati 24 UCLA 14 1 \n", "9 Clemson 52 Georgia Tech 14 1 \n", "11 Tulane 42 Florida International 14 1 \n", "\n", " ones \n", "0 1 \n", "1 1 \n", "5 1 \n", "9 1 \n", "11 1 " ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } 
], "source": [ "# +1 if the home team won, -1 otherwise\n", "games['home_win'] = -1+ 2*(games['home_points'] > games['away_points']).astype(int)\n", "games['ones'] = 1\n", "\n", "games.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It will be useful to have a list of the teams, so let's get that now." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>team</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>Air Force</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Akron</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Alabama</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Appalachian State</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>Arizona</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " team\n", "0 Air Force\n", "1 Akron\n", "2 Alabama\n", "3 Appalachian State\n", "4 Arizona" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pd.concat replaces Series.append, which was removed in pandas 2.0\n", "teams = pd.DataFrame(pd.concat([games['home_team'], games['away_team']]).unique(),columns = ['team'])\n", "teams = teams.sort_values(by = ['team']).reset_index(drop = True)\n", "\n", "teams.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay! Now let's get the vector on the right hand side of the matrix equation." 
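, "\n",
"\n",
"For intuition, here is a minimal sketch of that right-hand-side vector on a made-up three-team schedule. The `toy_games` list and its team names are hypothetical and are not part of the data above; the sketch only illustrates the formula of one plus half of wins minus losses.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Hypothetical (winner, loser) pairs; not real results.\n",
"toy_games = [('A', 'B'), ('B', 'C'), ('A', 'C')]\n",
"\n",
"wins = Counter(w for w, _ in toy_games)\n",
"losses = Counter(l for _, l in toy_games)\n",
"toy_teams = sorted(set(wins) | set(losses))\n",
"\n",
"# Right-hand side: 1 + (wins - losses) / 2 for each team.\n",
"b = {t: 1 + (wins[t] - losses[t]) / 2 for t in toy_teams}\n",
"print(b)  # {'A': 2.0, 'B': 1.0, 'C': 0.0}\n",
"```"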
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>str_of_rec</th>\n", " </tr>\n", " <tr>\n", " <th>home_team</th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Air Force</th>\n", " <td>5.0</td>\n", " </tr>\n", " <tr>\n", " <th>Akron</th>\n", " <td>-5.0</td>\n", " </tr>\n", " <tr>\n", " <th>Alabama</th>\n", " <td>5.0</td>\n", " </tr>\n", " <tr>\n", " <th>Appalachian State</th>\n", " <td>6.5</td>\n", " </tr>\n", " <tr>\n", " <th>Arizona</th>\n", " <td>-1.5</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " str_of_rec\n", "home_team \n", "Air Force 5.0\n", "Akron -5.0\n", "Alabama 5.0\n", "Appalachian State 6.5\n", "Arizona -1.5" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# home_win summed by home team is (home wins - home losses); summed by away team\n", "# it is (away losses - away wins), so the difference is total wins minus total losses\n", "colley_vec = 1+(games[['home_team','home_win']].groupby('home_team').sum()\\\n", " -games[['away_team','home_win']].groupby('away_team').sum())/2\n", "colley_vec = colley_vec.rename(columns = {'home_win':'str_of_rec'})\n", "\n", "colley_vec.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating the matrix takes a couple of clever moves. First we will make a vector that counts games played and use that to create the diagonal of the Colley matrix. We'll only look at a few teams since this matrix is 130x130." 
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th>team</th>\n", " <th>Michigan</th>\n", " <th>Wisconsin</th>\n", " <th>Ohio State</th>\n", " </tr>\n", " <tr>\n", " <th>team</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Michigan</th>\n", " <td>15.0</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " </tr>\n", " <tr>\n", " <th>Wisconsin</th>\n", " <td>0.0</td>\n", " <td>16.0</td>\n", " <td>0.0</td>\n", " </tr>\n", " <tr>\n", " <th>Ohio State</th>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>16.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "team Michigan Wisconsin Ohio State\n", "team \n", "Michigan 15.0 0.0 0.0\n", "Wisconsin 0.0 16.0 0.0\n", "Ohio State 0.0 0.0 16.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "games_played = (games[['home_team','ones']].groupby('home_team').sum()+games[['away_team','ones']].groupby('away_team').sum())\n", "diag = pd.DataFrame(2*np.identity(len(colley_vec))+np.diag(games_played['ones']),teams['team'],teams['team'])\n", "\n", "diag.loc[['Michigan','Wisconsin','Ohio State'],['Michigan','Wisconsin','Ohio State']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to create the off-diagonal entries, we will pivot on our dataframe twice, once for counting games for the home team, and once more for the away team. Adding this to our diagonal gives the Colley Matrix." 
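, "\n",
"\n",
"The same matrix can also be assembled with an explicit loop over games, which may be easier to follow than the pivots. This is a minimal sketch on a hypothetical three-team schedule; `toy_games`, `toy_teams`, `index`, and `toy_mat` are illustrative names and are not used elsewhere in this notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical (winner, loser) pairs; not real results.\n",
"toy_games = [('A', 'B'), ('B', 'C'), ('A', 'C')]\n",
"toy_teams = sorted({team for game in toy_games for team in game})\n",
"index = {team: k for k, team in enumerate(toy_teams)}\n",
"\n",
"# Start with 2 on the diagonal, then for every game add 1 to each\n",
"# team's diagonal entry and subtract 1 from the two off-diagonal\n",
"# entries that pair the opponents.\n",
"toy_mat = 2.0 * np.eye(len(toy_teams))\n",
"for winner, loser in toy_games:\n",
"    i, j = index[winner], index[loser]\n",
"    toy_mat[i, i] += 1\n",
"    toy_mat[j, j] += 1\n",
"    toy_mat[i, j] -= 1\n",
"    toy_mat[j, i] -= 1\n",
"\n",
"print(toy_mat)\n",
"```\n",
"\n",
"Every row of a matrix built this way sums to 2, which makes for a quick sanity check."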
] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th>team</th>\n", " <th>Michigan</th>\n", " <th>Wisconsin</th>\n", " <th>Ohio State</th>\n", " </tr>\n", " <tr>\n", " <th>team</th>\n", " <th></th>\n", " <th></th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Michigan</th>\n", " <td>15.0</td>\n", " <td>-1.0</td>\n", " <td>-1.0</td>\n", " </tr>\n", " <tr>\n", " <th>Wisconsin</th>\n", " <td>-1.0</td>\n", " <td>16.0</td>\n", " <td>-2.0</td>\n", " </tr>\n", " <tr>\n", " <th>Ohio State</th>\n", " <td>-1.0</td>\n", " <td>-2.0</td>\n", " <td>16.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "team Michigan Wisconsin Ohio State\n", "team \n", "Michigan 15.0 -1.0 -1.0\n", "Wisconsin -1.0 16.0 -2.0\n", "Ohio State -1.0 -2.0 16.0" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "piv1 = pd.pivot_table(games,values = 'ones',index = 'home_team', \\\n", " columns = 'away_team', aggfunc = np.sum).fillna(0)\n", "\n", "piv2 = pd.pivot_table(games,values = 'ones',index = 'away_team', \\\n", " columns = 'home_team', aggfunc = np.sum).fillna(0)\n", " \n", "colley_mat = diag - piv1 - piv2\n", "\n", "colley_mat.loc[['Michigan','Wisconsin','Ohio State'],['Michigan','Wisconsin','Ohio State']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great! We can see that each team played one another at least once, and Wisconsin and Ohio State played each other twice.\n", "\n", "We just run a matrix solver at this point and we'll have our ratings!" 
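, "\n",
"\n",
"The cell below uses a pseudo-inverse. Because the Colley matrix is symmetric and strictly diagonally dominant, it is invertible, so a direct solve with `np.linalg.solve` is one alternative. A minimal sketch on a hypothetical three-team system (A beat B, B beat C, A beat C), with `toy_mat`, `toy_vec`, and `toy_ratings` as illustrative names:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical three-team system: A beat B, B beat C, and A beat C.\n",
"toy_mat = np.array([[ 4., -1., -1.],\n",
"                    [-1.,  4., -1.],\n",
"                    [-1., -1.,  4.]])\n",
"toy_vec = np.array([2., 1., 0.])  # 1 + (wins - losses) / 2\n",
"\n",
"# Solve toy_mat @ r = toy_vec directly instead of forming an inverse.\n",
"toy_ratings = np.linalg.solve(toy_mat, toy_vec)\n",
"print(toy_ratings)  # approximately [0.7 0.5 0.3]\n",
"```\n",
"\n",
"The toy ratings average to 1/2, matching the starting rating every team is given."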
] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>rating</th>\n", " </tr>\n", " <tr>\n", " <th>team</th>\n", " <th></th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>LSU</th>\n", " <td>1.064182</td>\n", " </tr>\n", " <tr>\n", " <th>Ohio State</th>\n", " <td>0.986428</td>\n", " </tr>\n", " <tr>\n", " <th>Clemson</th>\n", " <td>0.943394</td>\n", " </tr>\n", " <tr>\n", " <th>Georgia</th>\n", " <td>0.926277</td>\n", " </tr>\n", " <tr>\n", " <th>Penn State</th>\n", " <td>0.891403</td>\n", " </tr>\n", " <tr>\n", " <th>Florida</th>\n", " <td>0.876903</td>\n", " </tr>\n", " <tr>\n", " <th>Oregon</th>\n", " <td>0.869208</td>\n", " </tr>\n", " <tr>\n", " <th>Notre Dame</th>\n", " <td>0.850672</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " rating\n", "team \n", "LSU 1.064182\n", "Ohio State 0.986428\n", "Clemson 0.943394\n", "Georgia 0.926277\n", "Penn State 0.891403\n", "Florida 0.876903\n", "Oregon 0.869208\n", "Notre Dame 0.850672" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "colley_inv = pd.DataFrame(np.linalg.pinv(colley_mat.values),colley_mat.columns,colley_mat.index)\n", "ratings = colley_inv.dot(colley_vec)\n", "ratings.rename(columns={'str_of_rec':'rating'},inplace=True)\n", "\n", "ratings = ratings.sort_values(by = ['rating'], ascending = False)\n", "\n", "ratings.head(8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Awesome! It looks reasonable too! We can compare this to Colley's ratings to see if we're right. 
Starting in 2007, Colley added a roundabout way of including FCS teams, so ours will not match his exactly for later seasons, but they should be very close. We can check 2006 and see that they agree to four decimal places, which is good enough for me!\n", "\n", "Next time, we'll take this resume ranking system and see what we can do to make it more representative of a team's power. Colley's Matrix Method is a compelling way to account for strength of schedule. If we can find a way to add in more information than simply wins and losses, we may be able to create some pretty reliable power ratings!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }