{ "metadata": { "name": "", "signature": "sha256:d9b9b9f0c4fa0b1e72d0bf8ed79f3099870ca0cf1ccbbd4dfc4138b4f363c347" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "World Cup Learning\n", "------------------\n", "\n", "Here I try to predict FIFA World Cup match results, based on knowledge of previous matches from every cup since 1950.\n", "\n", "I'll use an MLP neural network classifier. My inputs will be the past matches (replacing each team name with a set of stats for both teams), and my output will be a number indicating the result (0 = tie, 1 = team1 wins, 2 = team2 wins).\n", "\n", "I'll be using pybrain for the classifier, pandas to hack my way through the data, and pygal for the graphs (far easier than matplotlib). A lot of extra useful things are implemented in the utils.py file, mostly to abstract the data processing I need before I feed the classifier." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from random import random\n", "\n", "from IPython.display import SVG\n", "import pygal\n", "\n", "from pybrain.structure import SigmoidLayer\n", "from pybrain.tools.shortcuts import buildNetwork\n", "from pybrain.supervised.trainers import BackpropTrainer\n", "from pybrain.datasets import ClassificationDataSet\n", "from pybrain.utilities import percentError\n", "\n", "from utils import get_matches, get_team_stats, extract_samples, normalize, split_samples, graph_teams_stat_bars, graph_matches_results_scatter\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Configs\n", "-------" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# the features I will feed to the classifier as input data.\n", "input_features = ['year',\n", " 'matches_won_percent',\n", " 'podium_score_yearly',\n", " 'matches_won_percent_2',\n", " 'podium_score_yearly_2',]\n", "\n", "# the feature giving the 
result the classifier must learn to predict (I recommend always using 'winner')\n", "output_feature = 'winner'\n", "\n", "# used to avoid including tied matches in the learning process. I found this greatly improves the classifier accuracy.\n", "# I know there will be some ties, but I'm willing to fail on those and have better accuracy with all the rest.\n", "# at this point, this code will break if you set it to False, because the network uses a sigmoid function with a \n", "# threshold for output, so it is able to distinguish only 2 kinds of results.\n", "exclude_ties = True\n", "\n", "# used to duplicate matches data, reversing the teams (team1->team2, and vice versa). \n", "# This helps with visualizations, and also improves the precision of the predictions by avoiding a dependence on the\n", "# order of the teams in the input.\n", "duplicate_with_reversed = True" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "def show(graph):\n", " '''Small utility to display pygal graphs'''\n", " return SVG(graph.render())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Team stats\n", "----------\n", "\n", "First we need the team stats. We can't feed the classifier inputs like ('Argentina', 'Brazil'), we need to give it numbers. And not just any numbers, not just ids, but numbers that could be somewhat related to the result of the matches.\n", "\n", "For example: the percentage of matches won by each team is something that could have an impact on the result, so that stat is a very good candidate.\n", "\n", "We just calculate a lot of stats per team, and later we will decide which ones to use." ] }, { "cell_type": "code", "collapsed": false, "input": [ "team_stats = get_team_stats()\n", "team_stats" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
| team | matches_played | matches_won | years_played | podium_score | cups_won | matches_won_percent | podium_score_yearly | cups_won_yearly |
|---|---|---|---|---|---|---|---|---|
| Brazil | 89 | 63 | 16 | 102 | 5 | 70.786517 | 6.375000 | 0.312500 |
| Canada | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Serbia and Montenegro | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Kuwait | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Scotland | 23 | 4 | 8 | 0 | 0 | 17.391304 | 0.000000 | 0.000000 |
| Costa Rica | 10 | 3 | 3 | 0 | 0 | 30.000000 | 0.000000 | 0.000000 |
| Ivory Coast | 6 | 2 | 2 | 0 | 0 | 33.333333 | 0.000000 | 0.000000 |
| Wales | 5 | 1 | 1 | 0 | 0 | 20.000000 | 0.000000 | 0.000000 |
| Argentina | 64 | 33 | 13 | 40 | 2 | 51.562500 | 3.076923 | 0.153846 |
| Bolivia | 4 | 0 | 2 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Cameroon | 20 | 4 | 6 | 0 | 0 | 20.000000 | 0.000000 | 0.000000 |
| Ecuador | 7 | 3 | 2 | 0 | 0 | 42.857143 | 0.000000 | 0.000000 |
| Ghana | 9 | 4 | 2 | 0 | 0 | 44.444444 | 0.000000 | 0.000000 |
| Saudi Arabia | 13 | 2 | 4 | 0 | 0 | 15.384615 | 0.000000 | 0.000000 |
| Australia | 10 | 2 | 3 | 0 | 0 | 20.000000 | 0.000000 | 0.000000 |
| Iran | 9 | 1 | 3 | 0 | 0 | 11.111111 | 0.000000 | 0.000000 |
| Algeria | 9 | 2 | 3 | 0 | 0 | 22.222222 | 0.000000 | 0.000000 |
| El Salvador | 6 | 0 | 2 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Republic of Ireland | 13 | 2 | 3 | 0 | 0 | 15.384615 | 0.000000 | 0.000000 |
| Slovenia | 6 | 1 | 2 | 0 | 0 | 16.666667 | 0.000000 | 0.000000 |
| Chile | 26 | 7 | 7 | 4 | 0 | 26.923077 | 0.571429 | 0.000000 |
| Belgium | 32 | 10 | 8 | 2 | 0 | 31.250000 | 0.250000 | 0.000000 |
| Haiti | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Iraq | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Spain | 53 | 27 | 12 | 18 | 1 | 50.943396 | 1.500000 | 0.083333 |
| China PR | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Netherlands | 41 | 22 | 7 | 26 | 0 | 53.658537 | 3.714286 | 0.000000 |
| Denmark | 16 | 8 | 4 | 0 | 0 | 50.000000 | 0.000000 | 0.000000 |
| Poland | 30 | 15 | 6 | 8 | 0 | 50.000000 | 1.333333 | 0.000000 |
| Morocco | 13 | 2 | 4 | 0 | 0 | 15.384615 | 0.000000 | 0.000000 |
| Croatia | 13 | 6 | 3 | 4 | 0 | 46.153846 | 1.333333 | 0.000000 |
| Switzerland | 24 | 7 | 7 | 0 | 0 | 29.166667 | 0.000000 | 0.000000 |
| Honduras | 6 | 0 | 2 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| New Zealand | 6 | 0 | 2 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Jamaica | 3 | 1 | 1 | 0 | 0 | 33.333333 | 0.000000 | 0.000000 |
| England | 59 | 26 | 13 | 18 | 1 | 44.067797 | 1.384615 | 0.076923 |
| Uruguay | 43 | 14 | 10 | 22 | 1 | 32.558140 | 2.200000 | 0.100000 |
| United Arab Emirates | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| South Africa | 9 | 2 | 3 | 0 | 0 | 22.222222 | 0.000000 | 0.000000 |
| Egypt | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Colombia | 13 | 3 | 4 | 0 | 0 | 23.076923 | 0.000000 | 0.000000 |
| South Korea | 28 | 5 | 8 | 2 | 0 | 17.857143 | 0.250000 | 0.000000 |
| Turkey | 10 | 5 | 2 | 4 | 0 | 50.000000 | 2.000000 | 0.000000 |
| Italy | 71 | 36 | 15 | 54 | 2 | 50.704225 | 3.600000 | 0.133333 |
| Czech Republic | 3 | 1 | 1 | 0 | 0 | 33.333333 | 0.000000 | 0.000000 |
| France | 48 | 23 | 10 | 34 | 1 | 47.916667 | 3.400000 | 0.100000 |
| Slovakia | 4 | 1 | 1 | 0 | 0 | 25.000000 | 0.000000 | 0.000000 |
| Peru | 13 | 4 | 3 | 0 | 0 | 30.769231 | 0.000000 | 0.000000 |
| Norway | 7 | 2 | 2 | 0 | 0 | 28.571429 | 0.000000 | 0.000000 |
| Nigeria | 14 | 4 | 4 | 0 | 0 | 28.571429 | 0.000000 | 0.000000 |
| Israel | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Zaire | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Czechoslovakia | 23 | 7 | 6 | 8 | 0 | 30.434783 | 1.333333 | 0.000000 |
| Austria | 25 | 10 | 6 | 4 | 0 | 40.000000 | 0.666667 | 0.000000 |
| Togo | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| Germany | 98 | 59 | 15 | 94 | 3 | 60.204082 | 6.266667 | 0.200000 |
| Ukraine | 5 | 2 | 1 | 0 | 0 | 40.000000 | 0.000000 | 0.000000 |
| Northern Ireland | 13 | 3 | 3 | 0 | 0 | 23.076923 | 0.000000 | 0.000000 |
| United States | 25 | 5 | 7 | 0 | 0 | 20.000000 | 0.000000 | 0.000000 |
| Trinidad and Tobago | 3 | 0 | 1 | 0 | 0 | 0.000000 | 0.000000 | 0.000000 |
| | ... | ... | ... | ... | ... | ... | ... | ... |
76 rows \u00d7 8 columns
\n", "