{ "metadata": { "name": "", "signature": "sha256:f281188c45cf5c657f84678f6e42bc9858ee67da9fb45806ebe711c299be4454" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Click-Through Rate Prediction - AVAZU Kaggle Competition" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This IPython notebook details our work building a classifier for the Avazu click-through rate prediction competition." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "We start by loading the usual libraries and defining helper functions for cleaner plots." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import brewer2mpl\n", "from matplotlib import rcParams\n", "\n", "#colorbrewer2 Dark2 qualitative color table\n", "dark2_cmap = brewer2mpl.get_map('Dark2', 'Qualitative', 7)\n", "dark2_colors = dark2_cmap.mpl_colors\n", "\n", "rcParams['figure.figsize'] = (10, 6)\n", "rcParams['figure.dpi'] = 150\n", "rcParams['axes.color_cycle'] = dark2_colors\n", "rcParams['lines.linewidth'] = 2\n", "rcParams['axes.facecolor'] = 'white'\n", "rcParams['font.size'] = 14\n", "rcParams['patch.edgecolor'] = 'white'\n", "rcParams['patch.facecolor'] = dark2_colors[0]\n", "rcParams['font.family'] = 'StixGeneral'\n", "\n", "\n", "def remove_border(axes=None, top=False, right=False, left=True, bottom=True):\n", "    \"\"\"\n", "    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks\n", "    \n", "    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn\n", "    \"\"\"\n", "    ax = axes or plt.gca()\n", "    ax.spines['top'].set_visible(top)\n", "    ax.spines['right'].set_visible(right)\n", "    ax.spines['left'].set_visible(left)\n", "    ax.spines['bottom'].set_visible(bottom)\n", "    \n", "    #turn off all ticks\n", "    ax.yaxis.set_ticks_position('none')\n", "    ax.xaxis.set_ticks_position('none')\n", "    \n", "    #now re-enable the ticks on the visible borders\n", "    if top:\n", "        ax.xaxis.tick_top()\n", "    if bottom:\n", "        ax.xaxis.tick_bottom()\n", "    if left:\n", "        ax.yaxis.tick_left()\n", "    if right:\n", "        ax.yaxis.tick_right()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 31 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "1. Load the training data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import csv\n", "# Read only the header row; \"rb\" is the mode the Python 2 csv module expects\n", "with open(r\"train_rev2\", \"rb\") as f:\n", "    csv_file_object = csv.reader(f)\n", "    header = next(csv_file_object)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this stage, we realised that the size of the dataset (9 GB) made it impractical to load and manipulate in an IPython notebook. We therefore decided to explore a small sample of the dataset to understand the feature space." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "We load consecutive blocks of 1000 rows from the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This allows us to subsequently plot the growth in feature space for each categorical variable, as we intend to use a one-hot (0/1) encoding. We can therefore observe whether the number of distinct values for a given category keeps growing with the number of examples or whether it tails off; this will help us determine whether that category is a useful feature to use in the model.
(See the Vowpal Wabbit section below.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "results = []\n", "results2 = []\n", "results3 = []\n", "results4 = []\n", "results5 = []\n", "counter = 0\n", "path = \"train_rev2\"\n", "with open(path, \"r\") as data:\n", "    # The full file has 47686352 lines (counted once with the loop below):\n", "    # for i, l in enumerate(data):\n", "    #     pass\n", "    # print i + 1\n", "    # Line 1 is the header, so results holds the header plus 999 data rows\n", "    for line in data:\n", "        counter += 1\n", "        line = line.strip()\n", "        if counter <= 1000:\n", "            results.append(line.split(\",\"))\n", "        elif counter <= 2000:\n", "            results2.append(line.split(\",\"))\n", "        elif counter <= 3000:\n", "            results3.append(line.split(\",\"))\n", "        elif counter <= 4000:\n", "            results4.append(line.split(\",\"))\n", "        elif counter <= 5000:\n", "            results5.append(line.split(\",\"))\n", "        else:\n", "            # Stop after 5000 lines instead of scanning the whole file\n", "            break" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We then load the first block (the header plus 999 data points) into a pandas DataFrame to use as a training set."
] }, { "cell_type": "code", "collapsed": false, "input": [ "# results[0] is the header line, so skip it when building the frame\n", "testing = pd.DataFrame(data=np.asarray(results[1:]), columns=header)\n", "testing.click = testing.click.astype(int)\n", "print(testing.iloc[0])\n", "testing.head()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "id 10000222510487979663\n", "click 0\n", "hour 14100100\n", "C1 1005\n", "banner_pos 0\n", "site_id d41d8cd9\n", "site_domain d41d8cd9\n", "site_category d41d8cd9\n", "app_id ee72efa5\n", "app_domain 85262c2b\n", "app_category 7e5068fc\n", "device_id d41d8cd9\n", "device_ip 22f9c6ba\n", "device_os c31b3236\n", "device_make 3d517f89\n", "device_model 3e238c9b\n", "device_type 1\n", "device_conn_type 0\n", "device_geo_country fc9fdf08\n", "C17 11999\n", "C18 320\n", "C19 50\n", "C20 1248\n", "C21 2\n", "C22 39\n", "C23 -1\n", "C24 13\n", "Name: 0, dtype: object\n" ] }, { "html": [ "

|  | id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_conn_type | device_geo_country | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000222510487979663 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ee72efa5 | 85262c2b | ... | 0 | fc9fdf08 | 11999 | 320 | 50 | 1248 | 2 | 39 | -1 | 13 |
| 1 | 10000335031004381249 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | 7ddd1e29 | 85262c2b | ... | 0 | e22428cc | 12026 | 320 | 50 | 1248 | 2 | 39 | -1 | 13 |
| 2 | 10000413097548171036 | 0 | 14100100 | 1010 | 1 | d41d8cd9 | d41d8cd9 | d41d8cd9 | 7dd0bcc4 | d41d8cd9 | ... | 2 | 5343b21a | 5470 | 320 | 50 | 394 | 2 | 303 | -1 | 15 |
| 3 | 10000436876114817886 | 0 | 14100100 | 1002 | 0 | d5589b4a | d41d8cd9 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ... | 0 | 0b3b97fa | 16723 | 320 | 50 | 1876 | 2 | 291 | -1 | 33 |
| 4 | 10000488446663934007 | 1 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | aa55fc10 | 85262c2b | ... | 2 | 75778bf8 | 17012 | 320 | 50 | 1871 | 3 | 35 | 100053 | 23 |

5 rows × 27 columns

|  | id | click | hour | C1 | banner_pos | site_id | site_domain | site_category | app_id | app_domain | ... | device_geo_country | C17 | C18 | C19 | C20 | C21 | C22 | C23 | C24 | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 994 | 10085753362715434105 | 0 | 14100100 | 1005 | 1 | b01fd8c0 | a56a5285 | 7e5068fc | d41d8cd9 | d41d8cd9 | ... | 959848ca | 14915 | 320 | 50 | 1623 | 3 | 175 | 100156 | 42 | 1412096400 |
| 995 | 10085762866609622098 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ca1502d1 | 96b73ddc | ... | 3d26b0b1 | 16966 | 320 | 50 | 1919 | 0 | 169 | 100108 | 17 | 1412096400 |
| 996 | 10085798392965113138 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ca1502d1 | 96b73ddc | ... | 3d26b0b1 | 16966 | 320 | 50 | 1919 | 0 | 169 | 100108 | 17 | 1412096400 |
| 997 | 10085816395998880696 | 1 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | 8374fbd9 | 4f04e5f8 | ... | e529a9ce | 16615 | 320 | 50 | 1863 | 3 | 39 | 100248 | 23 | 1412096400 |
| 998 | 10085854709161323225 | 0 | 14100100 | 1005 | 0 | d41d8cd9 | d41d8cd9 | d41d8cd9 | ca1502d1 | 96b73ddc | ... | 0d149b90 | 16839 | 320 | 50 | 1883 | 0 | 1451 | 100094 | 17 | 1412096400 |

5 rows × 28 columns

|  | C17_10198 | C17_10199 | C17_10200 | C17_10229 | C17_10289 | C17_1037 | C17_1039 | C17_10704 | C17_10901 | C17_10941 | ... | site_id_f3c495d0 | site_id_f440e761 | site_id_f4e01d44 | site_id_f701c177 | site_id_f91a85b6 | site_id_fb5ff023 | site_id_fe1972d4 | site_id_ff2e3304 | site_id_ffa60702 | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1412096400 |

5 rows × 2486 columns

|  | C1_1002 | C1_1005 | C1_1010 | banner_pos_0 | banner_pos_1 | site_id_02e31e62 | site_id_05222bb6 | site_id_060f567a | site_id_09ab1430 | site_id_0be51948 | ... | C24_32 | C24_33 | C24_42 | C24_46 | C24_48 | C24_52 | C24_62 | C24_79 | C24_82 | C24_95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 rows × 949 columns