{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Kaggle's Predicting Red Hat Business Value\n",
    "\n",
    "This is a first quick & dirty attempt at Kaggle's [Predicting Red Hat Business Value](https://www.kaggle.com/c/predicting-red-hat-business-value) competition.\n",
    "\n",
    "### Loading in the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>people_id</th>\n",
       "      <th>char_1</th>\n",
       "      <th>group_1</th>\n",
       "      <th>char_2</th>\n",
       "      <th>date</th>\n",
       "      <th>char_3</th>\n",
       "      <th>char_4</th>\n",
       "      <th>char_5</th>\n",
       "      <th>char_6</th>\n",
       "      <th>char_7</th>\n",
       "      <th>...</th>\n",
       "      <th>char_29</th>\n",
       "      <th>char_30</th>\n",
       "      <th>char_31</th>\n",
       "      <th>char_32</th>\n",
       "      <th>char_33</th>\n",
       "      <th>char_34</th>\n",
       "      <th>char_35</th>\n",
       "      <th>char_36</th>\n",
       "      <th>char_37</th>\n",
       "      <th>char_38</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ppl_100</td>\n",
       "      <td>type 2</td>\n",
       "      <td>group 17304</td>\n",
       "      <td>type 2</td>\n",
       "      <td>2021-06-29</td>\n",
       "      <td>type 5</td>\n",
       "      <td>type 5</td>\n",
       "      <td>type 5</td>\n",
       "      <td>type 3</td>\n",
       "      <td>type 11</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ppl_100002</td>\n",
       "      <td>type 2</td>\n",
       "      <td>group 8688</td>\n",
       "      <td>type 3</td>\n",
       "      <td>2021-01-06</td>\n",
       "      <td>type 28</td>\n",
       "      <td>type 9</td>\n",
       "      <td>type 5</td>\n",
       "      <td>type 3</td>\n",
       "      <td>type 11</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ppl_100003</td>\n",
       "      <td>type 2</td>\n",
       "      <td>group 33592</td>\n",
       "      <td>type 3</td>\n",
       "      <td>2022-06-10</td>\n",
       "      <td>type 4</td>\n",
       "      <td>type 8</td>\n",
       "      <td>type 5</td>\n",
       "      <td>type 2</td>\n",
       "      <td>type 5</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>99</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3 rows × 41 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    people_id  char_1      group_1  char_2        date   char_3  char_4  \\\n",
       "0     ppl_100  type 2  group 17304  type 2  2021-06-29   type 5  type 5   \n",
       "1  ppl_100002  type 2   group 8688  type 3  2021-01-06  type 28  type 9   \n",
       "2  ppl_100003  type 2  group 33592  type 3  2022-06-10   type 4  type 8   \n",
       "\n",
       "   char_5  char_6   char_7   ...   char_29 char_30 char_31 char_32 char_33  \\\n",
       "0  type 5  type 3  type 11   ...     False    True    True   False   False   \n",
       "1  type 5  type 3  type 11   ...     False    True    True    True    True   \n",
       "2  type 5  type 2   type 5   ...     False   False    True    True    True   \n",
       "\n",
       "  char_34 char_35 char_36 char_37 char_38  \n",
       "0    True    True    True   False      36  \n",
       "1    True    True    True   False      76  \n",
       "2    True   False    True    True      99  \n",
       "\n",
       "[3 rows x 41 columns]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "people = pd.read_csv('people.csv.zip')\n",
    "people.head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>people_id</th>\n",
       "      <th>activity_id</th>\n",
       "      <th>date</th>\n",
       "      <th>activity_category</th>\n",
       "      <th>char_1</th>\n",
       "      <th>char_2</th>\n",
       "      <th>char_3</th>\n",
       "      <th>char_4</th>\n",
       "      <th>char_5</th>\n",
       "      <th>char_6</th>\n",
       "      <th>char_7</th>\n",
       "      <th>char_8</th>\n",
       "      <th>char_9</th>\n",
       "      <th>char_10</th>\n",
       "      <th>outcome</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ppl_100</td>\n",
       "      <td>act2_1734928</td>\n",
       "      <td>2023-08-26</td>\n",
       "      <td>type 4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>type 76</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ppl_100</td>\n",
       "      <td>act2_2434093</td>\n",
       "      <td>2022-09-27</td>\n",
       "      <td>type 2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>type 1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ppl_100</td>\n",
       "      <td>act2_3404049</td>\n",
       "      <td>2022-09-27</td>\n",
       "      <td>type 2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>type 1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  people_id   activity_id        date activity_category char_1 char_2 char_3  \\\n",
       "0   ppl_100  act2_1734928  2023-08-26            type 4    NaN    NaN    NaN   \n",
       "1   ppl_100  act2_2434093  2022-09-27            type 2    NaN    NaN    NaN   \n",
       "2   ppl_100  act2_3404049  2022-09-27            type 2    NaN    NaN    NaN   \n",
       "\n",
       "  char_4 char_5 char_6 char_7 char_8 char_9  char_10  outcome  \n",
       "0    NaN    NaN    NaN    NaN    NaN    NaN  type 76        0  \n",
       "1    NaN    NaN    NaN    NaN    NaN    NaN   type 1        0  \n",
       "2    NaN    NaN    NaN    NaN    NaN    NaN   type 1        0  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "actions = pd.read_csv('act_train.csv.zip')\n",
    "actions.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Joining the two tables into one dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>people_id</th>\n",
       "      <th>activity_id</th>\n",
       "      <th>date_action</th>\n",
       "      <th>activity_category</th>\n",
       "      <th>char_1_action</th>\n",
       "      <th>char_2_action</th>\n",
       "      <th>char_3_action</th>\n",
       "      <th>char_4_action</th>\n",
       "      <th>char_5_action</th>\n",
       "      <th>char_6_action</th>\n",
       "      <th>...</th>\n",
       "      <th>char_29</th>\n",
       "      <th>char_30</th>\n",
       "      <th>char_31</th>\n",
       "      <th>char_32</th>\n",
       "      <th>char_33</th>\n",
       "      <th>char_34</th>\n",
       "      <th>char_35</th>\n",
       "      <th>char_36</th>\n",
       "      <th>char_37</th>\n",
       "      <th>char_38</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ppl_100</td>\n",
       "      <td>act2_1734928</td>\n",
       "      <td>2023-08-26</td>\n",
       "      <td>type 4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ppl_100</td>\n",
       "      <td>act2_2434093</td>\n",
       "      <td>2022-09-27</td>\n",
       "      <td>type 2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ppl_100</td>\n",
       "      <td>act2_3404049</td>\n",
       "      <td>2022-09-27</td>\n",
       "      <td>type 2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ppl_100</td>\n",
       "      <td>act2_3651215</td>\n",
       "      <td>2023-08-04</td>\n",
       "      <td>type 2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ppl_100</td>\n",
       "      <td>act2_4109017</td>\n",
       "      <td>2023-08-26</td>\n",
       "      <td>type 2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 55 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  people_id   activity_id date_action activity_category char_1_action  \\\n",
       "0   ppl_100  act2_1734928  2023-08-26            type 4           NaN   \n",
       "1   ppl_100  act2_2434093  2022-09-27            type 2           NaN   \n",
       "2   ppl_100  act2_3404049  2022-09-27            type 2           NaN   \n",
       "3   ppl_100  act2_3651215  2023-08-04            type 2           NaN   \n",
       "4   ppl_100  act2_4109017  2023-08-26            type 2           NaN   \n",
       "\n",
       "  char_2_action char_3_action char_4_action char_5_action char_6_action  \\\n",
       "0           NaN           NaN           NaN           NaN           NaN   \n",
       "1           NaN           NaN           NaN           NaN           NaN   \n",
       "2           NaN           NaN           NaN           NaN           NaN   \n",
       "3           NaN           NaN           NaN           NaN           NaN   \n",
       "4           NaN           NaN           NaN           NaN           NaN   \n",
       "\n",
       "    ...   char_29 char_30 char_31 char_32  char_33 char_34 char_35 char_36  \\\n",
       "0   ...     False    True    True   False    False    True    True    True   \n",
       "1   ...     False    True    True   False    False    True    True    True   \n",
       "2   ...     False    True    True   False    False    True    True    True   \n",
       "3   ...     False    True    True   False    False    True    True    True   \n",
       "4   ...     False    True    True   False    False    True    True    True   \n",
       "\n",
       "  char_37 char_38  \n",
       "0   False      36  \n",
       "1   False      36  \n",
       "2   False      36  \n",
       "3   False      36  \n",
       "4   False      36  \n",
       "\n",
       "[5 rows x 55 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "training_data_full = pd.merge(actions, people, how='inner', on='people_id', suffixes=['_action', '_person'], sort=False)\n",
    "training_data_full.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((2197291, 15), (189118, 41), (2197291, 55))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(actions.shape, people.shape, training_data_full.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building a preprocessing pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# %load \"preprocessing_transforms.py\"\n",
    "from sklearn.base import TransformerMixin, BaseEstimator\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "class BaseTransformer(BaseEstimator, TransformerMixin):\n",
    "    \"\"\"No-op base class: `fit` learns nothing and `transform` passes X through\"\"\"\n",
    "\n",
    "    def fit(self, X, y=None, **fit_params):\n",
    "        return self\n",
    "\n",
    "    def transform(self, X, **transform_params):\n",
    "        return X\n",
    "\n",
    "\n",
    "class ColumnSelector(BaseTransformer):\n",
    "    \"\"\"Selects columns from Pandas Dataframe\"\"\"\n",
    "\n",
    "    def __init__(self, columns, c_type=None):\n",
    "        self.columns = columns\n",
    "        self.c_type = c_type\n",
    "\n",
    "    def transform(self, X, **transform_params):\n",
    "        cs = X[self.columns]\n",
    "        if self.c_type is None:\n",
    "            return cs\n",
    "        else:\n",
    "            return cs.astype(self.c_type)\n",
    "\n",
    "\n",
    "class SpreadBinary(BaseTransformer):\n",
    "    \"\"\"Maps 1 to 1 and every other value to -1, spreading 0/1 indicators to -1/1\"\"\"\n",
    "\n",
    "    def transform(self, X, **transform_params):\n",
    "        return X.applymap(lambda x: 1 if x == 1 else -1)\n",
    "\n",
    "\n",
    "class DfTransformerAdapter(BaseTransformer):\n",
    "    \"\"\"Adapts a scikit-learn Transformer to return a pandas DataFrame\"\"\"\n",
    "\n",
    "    def __init__(self, transformer):\n",
    "        self.transformer = transformer\n",
    "\n",
    "    def fit(self, X, y=None, **fit_params):\n",
    "        self.transformer.fit(X, y=y, **fit_params)\n",
    "        return self\n",
    "\n",
    "    def transform(self, X, **transform_params):\n",
    "        raw_result = self.transformer.transform(X, **transform_params)\n",
    "        return pd.DataFrame(raw_result, columns=X.columns, index=X.index)\n",
    "\n",
    "\n",
    "class DfOneHot(BaseTransformer):\n",
    "    \"\"\"\n",
    "    Wraps helper method `get_dummies` making sure all columns get one-hot encoded.\n",
    "    \"\"\"\n",
    "    def __init__(self):\n",
    "        self.dummy_columns = []\n",
    "\n",
    "    def fit(self, X, y=None, **fit_params):\n",
    "        self.dummy_columns = pd.get_dummies(\n",
    "            X,\n",
    "            prefix=[c for c in X.columns],\n",
    "            columns=X.columns).columns\n",
    "        return self\n",
    "\n",
    "    def transform(self, X, **transform_params):\n",
    "        return pd.get_dummies(\n",
    "            X,\n",
    "            prefix=[c for c in X.columns],\n",
    "            columns=X.columns).reindex(columns=self.dummy_columns, fill_value=0)\n",
    "\n",
    "\n",
    "class DfFeatureUnion(BaseTransformer):\n",
    "    \"\"\"A dataframe friendly implementation of `FeatureUnion`\"\"\"\n",
    "\n",
    "    def __init__(self, transformers):\n",
    "        self.transformers = transformers\n",
    "\n",
    "    def fit(self, X, y=None, **fit_params):\n",
    "        for l, t in self.transformers:\n",
    "            t.fit(X, y=y, **fit_params)\n",
    "        return self\n",
    "\n",
    "    def transform(self, X, **transform_params):\n",
    "        transform_results = [t.transform(X, **transform_params) for l, t in self.transformers]\n",
    "        return pd.concat(transform_results, axis=1)\n"
   ]
  },
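  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of why `DfOneHot` stores the dummy columns it sees in `fit` and reindexes in `transform`, the sketch below encodes two toy frames whose categories differ. The frames and the `color` column are made up for this example and are not part of the competition data; the `get_dummies` and `reindex` calls mirror the ones in the class above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy frames (hypothetical values): 'purple' never appears in training,\n",
    "# and 'red' and 'blue' never appear in the test frame.\n",
    "train = pd.DataFrame({'color': ['red', 'green', 'blue']})\n",
    "test = pd.DataFrame({'color': ['green', 'purple']})\n",
    "\n",
    "# Remember the dummy columns seen in training, as DfOneHot.fit does.\n",
    "train_columns = pd.get_dummies(train, prefix=['color'], columns=['color']).columns\n",
    "\n",
    "# Encode the test frame, then align it to the training columns:\n",
    "# the unseen 'color_purple' is dropped, missing columns are filled with 0.\n",
    "aligned = pd.get_dummies(test, prefix=['color'], columns=['color']).reindex(\n",
    "    columns=train_columns, fill_value=0)\n",
    "\n",
    "print(list(aligned.columns))  # ['color_blue', 'color_green', 'color_red']"
   ]
  },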
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['people_id', 'activity_id', 'date_action', 'activity_category',\n",
       "       'char_1_action', 'char_2_action', 'char_3_action', 'char_4_action',\n",
       "       'char_5_action', 'char_6_action', 'char_7_action', 'char_8_action',\n",
       "       'char_9_action', 'char_10_action', 'outcome', 'char_1_person',\n",
       "       'group_1', 'char_2_person', 'date_person', 'char_3_person',\n",
       "       'char_4_person', 'char_5_person', 'char_6_person', 'char_7_person',\n",
       "       'char_8_person', 'char_9_person', 'char_10_person', 'char_11',\n",
       "       'char_12', 'char_13', 'char_14', 'char_15', 'char_16', 'char_17',\n",
       "       'char_18', 'char_19', 'char_20', 'char_21', 'char_22', 'char_23',\n",
       "       'char_24', 'char_25', 'char_26', 'char_27', 'char_28', 'char_29',\n",
       "       'char_30', 'char_31', 'char_32', 'char_33', 'char_34', 'char_35',\n",
       "       'char_36', 'char_37', 'char_38'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "training_data_full.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "in people_id there are 151295 unique values\n",
      "in activity_id there are 2197291 unique values\n",
      "in date_action there are 411 unique values\n",
      "in activity_category there are 7 unique values\n",
      "in char_1_action there are 52 unique values\n",
      "in char_2_action there are 33 unique values\n",
      "in char_3_action there are 12 unique values\n",
      "in char_4_action there are 8 unique values\n",
      "in char_5_action there are 8 unique values\n",
      "in char_6_action there are 6 unique values\n",
      "in char_7_action there are 9 unique values\n",
      "in char_8_action there are 19 unique values\n",
      "in char_9_action there are 20 unique values\n",
      "in char_10_action there are 6516 unique values\n",
      "in outcome there are 2 unique values\n",
      "in char_1_person there are 2 unique values\n",
      "in group_1 there are 29899 unique values\n",
      "in char_2_person there are 3 unique values\n",
      "in date_person there are 1196 unique values\n",
      "in char_3_person there are 43 unique values\n",
      "in char_4_person there are 25 unique values\n",
      "in char_5_person there are 9 unique values\n",
      "in char_6_person there are 7 unique values\n",
      "in char_7_person there are 25 unique values\n",
      "in char_8_person there are 8 unique values\n",
      "in char_9_person there are 9 unique values\n",
      "in char_10_person there are 2 unique values\n",
      "in char_11 there are 2 unique values\n",
      "in char_12 there are 2 unique values\n",
      "in char_13 there are 2 unique values\n",
      "in char_14 there are 2 unique values\n",
      "in char_15 there are 2 unique values\n",
      "in char_16 there are 2 unique values\n",
      "in char_17 there are 2 unique values\n",
      "in char_18 there are 2 unique values\n",
      "in char_19 there are 2 unique values\n",
      "in char_20 there are 2 unique values\n",
      "in char_21 there are 2 unique values\n",
      "in char_22 there are 2 unique values\n",
      "in char_23 there are 2 unique values\n",
      "in char_24 there are 2 unique values\n",
      "in char_25 there are 2 unique values\n",
      "in char_26 there are 2 unique values\n",
      "in char_27 there are 2 unique values\n",
      "in char_28 there are 2 unique values\n",
      "in char_29 there are 2 unique values\n",
      "in char_30 there are 2 unique values\n",
      "in char_31 there are 2 unique values\n",
      "in char_32 there are 2 unique values\n",
      "in char_33 there are 2 unique values\n",
      "in char_34 there are 2 unique values\n",
      "in char_35 there are 2 unique values\n",
      "in char_36 there are 2 unique values\n",
      "in char_37 there are 2 unique values\n",
      "in char_38 there are 101 unique values\n"
     ]
    }
   ],
   "source": [
    "for col in training_data_full.columns:\n",
    "    print(\"in {} there are {} unique values\".format(col, len(training_data_full[col].unique())))\n",
    "None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Potential trouble with high dimensionality\n",
    "\n",
    "Notice that `char_10_action` (6,516 unique values) and `group_1` (29,899) have very high cardinality; one-hot encoding them would produce a dataframe with tens of thousands of columns.\n",
    "\n",
    "To get to a first attempt as quickly as possible, let's skip those, along with the ID and date columns, and only one-hot encode the categorical variables with at most a few dozen unique values (the largest kept, `char_1_action`, has 52). A later attempt can deal with the high-cardinality variables more carefully and bring them back into the model."
   ]
  },
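  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A rough sketch of how the width of a one-hot encoding could be estimated before running it: sum the number of distinct non-missing values over the columns to be encoded. The helper name `onehot_width` and the toy frame are assumptions made for illustration; on the real data one would pass `training_data_full` and a list of column names instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "def onehot_width(df, columns):\n",
    "    \"\"\"Number of dummy columns pd.get_dummies would create for `columns`.\n",
    "\n",
    "    Missing values are not counted, matching the get_dummies default (dummy_na=False).\n",
    "    \"\"\"\n",
    "    return int(sum(df[c].nunique(dropna=True) for c in columns))\n",
    "\n",
    "\n",
    "# Toy frame standing in for training_data_full (hypothetical values).\n",
    "toy = pd.DataFrame({\n",
    "    'low_card': ['type 1', 'type 2', 'type 1', None],\n",
    "    'high_card': ['group 1', 'group 2', 'group 3', 'group 4'],\n",
    "})\n",
    "\n",
    "print(onehot_width(toy, ['low_card']))               # 2\n",
    "print(onehot_width(toy, ['low_card', 'high_card']))  # 6"
   ]
  },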
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "cat_columns = ['activity_category',\n",
    "       'char_1_action', 'char_2_action', 'char_3_action', 'char_4_action',\n",
    "       'char_5_action', 'char_6_action', 'char_7_action', 'char_8_action',\n",
    "       'char_9_action', 'char_1_person',\n",
    "       'char_2_person', 'char_3_person',\n",
    "       'char_4_person', 'char_5_person', 'char_6_person', 'char_7_person',\n",
    "       'char_8_person', 'char_9_person', 'char_10_person', 'char_11',\n",
    "       'char_12', 'char_13', 'char_14', 'char_15', 'char_16', 'char_17',\n",
    "       'char_18', 'char_19', 'char_20', 'char_21', 'char_22', 'char_23',\n",
    "       'char_24', 'char_25', 'char_26', 'char_27', 'char_28', 'char_29',\n",
    "       'char_30', 'char_31', 'char_32', 'char_33', 'char_34', 'char_35',\n",
    "       'char_36', 'char_37']\n",
    "\n",
    "q_columns = ['char_38']\n",
    "\n",
    "preprocessor = Pipeline([\n",
    "    ('features', DfFeatureUnion([\n",
    "        ('quantitative', Pipeline([\n",
    "            ('select-quantitative', ColumnSelector(q_columns, c_type='float')),\n",
    "            ('impute-missing', DfTransformerAdapter(SimpleImputer(strategy='median'))),\n",
    "            ('scale', DfTransformerAdapter(StandardScaler()))\n",
    "        ])),\n",
    "        ('categorical', Pipeline([\n",
    "            ('select-categorical', ColumnSelector(cat_columns)),\n",
    "            ('apply-onehot', DfOneHot()),\n",
    "            ('spread-binary', SpreadBinary())\n",
    "        ])),\n",
    "    ]))\n",
    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sampling to reduce memory use and runtime on the large dataset\n",
    "\n",
    "Training on the entire training dataset exhausts the memory on my laptop. Again, in the spirit of getting something quick and dirty working, we'll train on a small random sample (5% of the rows) and then evaluate the model's accuracy on a larger sample drawn from the remaining rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "training_frac = 0.05\n",
    "test_frac = 0.8\n",
    "\n",
    "training_data, the_rest = train_test_split(training_data_full, train_size=training_frac, random_state=0)\n",
    "test_data = the_rest.sample(frac=test_frac, random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(109864, 55)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "training_data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1669942, 55)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_data.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "wrangled = preprocessor.fit_transform(training_data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>char_38</th>\n",
       "      <th>activity_category_type 1</th>\n",
       "      <th>activity_category_type 2</th>\n",
       "      <th>activity_category_type 3</th>\n",
       "      <th>activity_category_type 4</th>\n",
       "      <th>activity_category_type 5</th>\n",
       "      <th>activity_category_type 6</th>\n",
       "      <th>activity_category_type 7</th>\n",
       "      <th>char_1_action_type 1</th>\n",
       "      <th>char_1_action_type 10</th>\n",
       "      <th>...</th>\n",
       "      <th>char_33_False</th>\n",
       "      <th>char_33_True</th>\n",
       "      <th>char_34_False</th>\n",
       "      <th>char_34_True</th>\n",
       "      <th>char_35_False</th>\n",
       "      <th>char_35_True</th>\n",
       "      <th>char_36_False</th>\n",
       "      <th>char_36_True</th>\n",
       "      <th>char_37_False</th>\n",
       "      <th>char_37_True</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>963496</th>\n",
       "      <td>-1.380347</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>874945</th>\n",
       "      <td>-0.910167</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>424945</th>\n",
       "      <td>-1.380347</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1478640</th>\n",
       "      <td>1.357758</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>723674</th>\n",
       "      <td>0.859921</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>-1</td>\n",
       "      <td>...</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "      <td>-1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 336 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "          char_38  activity_category_type 1  activity_category_type 2  \\\n",
       "963496  -1.380347                        -1                        -1   \n",
       "874945  -0.910167                        -1                         1   \n",
       "424945  -1.380347                        -1                        -1   \n",
       "1478640  1.357758                        -1                         1   \n",
       "723674   0.859921                        -1                        -1   \n",
       "\n",
       "         activity_category_type 3  activity_category_type 4  \\\n",
       "963496                         -1                        -1   \n",
       "874945                         -1                        -1   \n",
       "424945                         -1                        -1   \n",
       "1478640                        -1                        -1   \n",
       "723674                         -1                        -1   \n",
       "\n",
       "         activity_category_type 5  activity_category_type 6  \\\n",
       "963496                          1                        -1   \n",
       "874945                         -1                        -1   \n",
       "424945                          1                        -1   \n",
       "1478640                        -1                        -1   \n",
       "723674                          1                        -1   \n",
       "\n",
       "         activity_category_type 7  char_1_action_type 1  \\\n",
       "963496                         -1                    -1   \n",
       "874945                         -1                    -1   \n",
       "424945                         -1                    -1   \n",
       "1478640                        -1                    -1   \n",
       "723674                         -1                    -1   \n",
       "\n",
       "         char_1_action_type 10      ...       char_33_False  char_33_True  \\\n",
       "963496                      -1      ...                   1            -1   \n",
       "874945                      -1      ...                   1            -1   \n",
       "424945                      -1      ...                   1            -1   \n",
       "1478640                     -1      ...                  -1             1   \n",
       "723674                      -1      ...                  -1             1   \n",
       "\n",
       "         char_34_False  char_34_True  char_35_False  char_35_True  \\\n",
       "963496               1            -1              1            -1   \n",
       "874945               1            -1              1            -1   \n",
       "424945               1            -1              1            -1   \n",
       "1478640             -1             1             -1             1   \n",
       "723674              -1             1             -1             1   \n",
       "\n",
       "         char_36_False  char_36_True  char_37_False  char_37_True  \n",
       "963496               1            -1              1            -1  \n",
       "874945               1            -1              1            -1  \n",
       "424945               1            -1              1            -1  \n",
       "1478640             -1             1             -1             1  \n",
       "723674              -1             1             -1             1  \n",
       "\n",
       "[5 rows x 336 columns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wrangled.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Putting together classifiers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "pipe_lr = Pipeline([\n",
    "        ('wrangle', preprocessor),\n",
    "        ('lr', LogisticRegression(C=100.0, random_state=0))\n",
    "    ])\n",
    "\n",
    "pipe_rf = Pipeline([\n",
    "        ('wrangle', preprocessor),\n",
    "        ('rf', RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0))\n",
    "    ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "feature_columns = cat_columns + q_columns "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def extract_X_y(df):\n",
    "    return df[feature_columns], df['outcome']\n",
    "\n",
    "X_train, y_train = extract_X_y(training_data)\n",
    "X_test, y_test = extract_X_y(test_data)"
   ]
  },
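  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reporting utilities\n",
    "\n",
    "A small context manager that prints when a step starts and how long it took, and can optionally announce both out loud through the macOS `say` command."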
   ]
  },
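  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import subprocess\n",
    "import time\n",
    "\n",
    "class time_and_log():\n",
    "    # Context manager: prints (and optionally speaks) when a step starts\n",
    "    # and how long it took once it finishes.\n",
    "\n",
    "    def __init__(self, label, *, prefix='', say=False):\n",
    "        self.label = label\n",
    "        self.prefix = prefix\n",
    "        self.say = say\n",
    "\n",
    "    def _report(self, msg):\n",
    "        print('{}{}'.format(self.prefix, msg))\n",
    "        if self.say:\n",
    "            cmd_say(msg)\n",
    "\n",
    "    def __enter__(self):\n",
    "        self._report('Starting {}'.format(self.label))\n",
    "        # perf_counter measures wall-clock time; process_time would only count\n",
    "        # CPU time of this process and skip time spent waiting or in subprocesses.\n",
    "        self.start = time.perf_counter()\n",
    "        return self\n",
    "\n",
    "    def __exit__(self, *exc):\n",
    "        self.interval = time.perf_counter() - self.start\n",
    "        self._report('Finished {} in {:.2f} seconds'.format(self.label, self.interval))\n",
    "        return False  # never swallow exceptions raised inside the block\n",
    "\n",
    "def cmd_say(msg):\n",
    "    # `say` is the macOS text-to-speech command. Passing an argument list\n",
    "    # (no shell) keeps quotes in msg from breaking the command.\n",
    "    try:\n",
    "        subprocess.call(['say', msg])\n",
    "    except FileNotFoundError:\n",
    "        pass  # no `say` binary on this system; stay silent\n"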
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cross-validation and full test set accuracy\n",
    "\n",
    "The plan is to cross-validate within the training set, then train on the full training set and measure accuracy on the held-out test set. The cross-validation step is currently commented out in the cell below, so only the train-and-test step runs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Evaluating logistic regression\n",
      " _Starting fitting full training set\n",
      " _Finished fitting full training set in 121.04 seconds\n",
      " _Starting evaluating on full test set\n",
      "  Full test accuracy (0.80 of dataset): 0.861\n",
      " _Finished evaluating on full test set in 288.49 seconds\n",
      "Evaluating random forest\n",
      " _Starting fitting full training set\n",
      " _Finished fitting full training set in 21.32 seconds\n",
      " _Starting evaluating on full test set\n",
      "  Full test accuracy (0.80 of dataset): 0.923\n",
      " _Finished evaluating on full test set in 292.85 seconds\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20\n",
    "import numpy as np\n",
    "\n",
    "models = [\n",
    "    ('logistic regression', pipe_lr),\n",
    "    ('random forest', pipe_rf),\n",
    "]\n",
    "\n",
    "for label, model in models:\n",
    "    print('Evaluating {}'.format(label))\n",
    "    cmd_say('Evaluating {}'.format(label))\n",
    "#     with time_and_log('cross validating', say=True, prefix=\" _\"):\n",
    "#         scores = cross_val_score(estimator=model,\n",
    "#                              X=X_train,\n",
    "#                              y=y_train,\n",
    "#                              cv=5,\n",
    "#                              n_jobs=1)\n",
    "#         print('  CV accuracy: {:.3f} +/- {:.3f}'.format(np.mean(scores), np.std(scores)))\n",
    "    with time_and_log('fitting full training set', say=True, prefix=\" _\"):\n",
    "        model.fit(X_train, y_train)  \n",
    "    with time_and_log('evaluating on full test set', say=True, prefix=\" _\"):\n",
    "        print(\"  Full test accuracy ({:.2f} of dataset): {:.3f}\".format(\n",
    "                test_frac, \n",
    "                accuracy_score(y_test, model.predict(X_test)))) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing the submission\n",
    "\n",
    "Random forest beat logistic regression on the held-out test set (0.923 vs. 0.861 accuracy), so let's start with a submission using that.\n",
    "\n",
    "But first, let's see what the submission is supposed to look like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>activity_id</th>\n",
       "      <th>outcome</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>act1_1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>act1_100006</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>act1_100050</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>act1_100065</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>act1_100068</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   activity_id  outcome\n",
       "0       act1_1        0\n",
       "1  act1_100006        0\n",
       "2  act1_100050        0\n",
       "3  act1_100065        0\n",
       "4  act1_100068        0"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv('sample_submission.csv.zip').head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's prepare the submission: fit the random forest on the full provided training set, then use it to predict on the provided test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>people_id</th>\n",
       "      <th>activity_id</th>\n",
       "      <th>date_action</th>\n",
       "      <th>activity_category</th>\n",
       "      <th>char_1_action</th>\n",
       "      <th>char_2_action</th>\n",
       "      <th>char_3_action</th>\n",
       "      <th>char_4_action</th>\n",
       "      <th>char_5_action</th>\n",
       "      <th>char_6_action</th>\n",
       "      <th>...</th>\n",
       "      <th>char_29</th>\n",
       "      <th>char_30</th>\n",
       "      <th>char_31</th>\n",
       "      <th>char_32</th>\n",
       "      <th>char_33</th>\n",
       "      <th>char_34</th>\n",
       "      <th>char_35</th>\n",
       "      <th>char_36</th>\n",
       "      <th>char_37</th>\n",
       "      <th>char_38</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ppl_100004</td>\n",
       "      <td>act1_249281</td>\n",
       "      <td>2022-07-20</td>\n",
       "      <td>type 1</td>\n",
       "      <td>type 5</td>\n",
       "      <td>type 10</td>\n",
       "      <td>type 5</td>\n",
       "      <td>type 1</td>\n",
       "      <td>type 6</td>\n",
       "      <td>type 1</td>\n",
       "      <td>...</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ppl_100004</td>\n",
       "      <td>act2_230855</td>\n",
       "      <td>2022-07-20</td>\n",
       "      <td>type 5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>76</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 54 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    people_id  activity_id date_action activity_category char_1_action  \\\n",
       "0  ppl_100004  act1_249281  2022-07-20            type 1        type 5   \n",
       "1  ppl_100004  act2_230855  2022-07-20            type 5           NaN   \n",
       "\n",
       "  char_2_action char_3_action char_4_action char_5_action char_6_action  \\\n",
       "0       type 10        type 5        type 1        type 6        type 1   \n",
       "1           NaN           NaN           NaN           NaN           NaN   \n",
       "\n",
       "    ...   char_29 char_30 char_31 char_32 char_33 char_34 char_35 char_36  \\\n",
       "0   ...      True    True    True    True    True    True    True    True   \n",
       "1   ...      True    True    True    True    True    True    True    True   \n",
       "\n",
       "  char_37 char_38  \n",
       "0    True      76  \n",
       "1    True      76  \n",
       "\n",
       "[2 rows x 54 columns]"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kaggle_test_df = pd.merge(\n",
    "    pd.read_csv('act_test.csv.zip'), \n",
    "    people, \n",
    "    how='inner', on='people_id', suffixes=['_action', '_person'], sort=False)\n",
    "kaggle_test_df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(498687, 54)"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kaggle_test_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_kaggle_train, y_kaggle_train = extract_X_y(training_data_full)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Starting fitting rf on full kaggle training set\n",
      "Finished fitting rf on full kaggle training set in 548.33 seconds\n"
     ]
    }
   ],
   "source": [
    "with time_and_log('fitting rf on full kaggle training set', say=True): \n",
    "    pipe_rf.fit(X_kaggle_train, y_kaggle_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Starting preparing kaggle submission\n",
      "Finished preparing kaggle submission in 84.99 seconds\n"
     ]
    }
   ],
   "source": [
    "with time_and_log('preparing kaggle submission', say=True):\n",
    "    submission_df = kaggle_test_df[['activity_id']].copy()\n",
    "    # Pass the same feature columns the pipeline was fitted on.\n",
    "    submission_df['outcome'] = pipe_rf.predict(kaggle_test_df[feature_columns])\n",
    "    submission_df.to_csv(\"predicting-red-hat-business-value_1_rf.csv\", index=False)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This got me a score of about 0.85 on the submission, placing 1099 out of 1250 teams. Note that Kaggle scores this competition by area under the ROC curve rather than plain accuracy. There are 190 people at 0.99 or higher and 837 at 0.95, so this definitely qualifies as merely a quick and dirty submission :)\n",
    "\n",
    "It's also worth noting that apparently [people figured out](https://www.kaggle.com/c/predicting-red-hat-business-value/forums/t/22898/updated-competition-deadline) how to get 98% using only the date and group columns, two of the columns I ditched to make it easier to get started."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}