{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Kaggle's Predicting Red Hat Business Value\n", "\n", "This is a follow up attempt at Kaggle's [Predicting Red Hat Business Value](https://www.kaggle.com/c/predicting-red-hat-business-value) competition.\n", "\n", "See [my notebooks section](http://karlrosaen.com/ml/notebooks) for links to the first attempt and other kaggle competitions.\n", "\n", "The focus of this iteration is exploring whether we can bring back the previously ignored categorical columns that have hundreds if not thousands of unique values, making it impractical to use one-hot encoding. \n", "\n", "Two approaches are taken on categorical variables with a large amount of unique values:\n", "\n", "- encoding the values ordinally; sorting the values lexicographically and assigning a sequence of numbers, and then treating them quantitatively from there\n", "- encoding the most frequently occuring values using one-hot and then binary encoding the rest. As part of this I developed a new scikit-learn transformer\n", "\n", "The end results: reincluding the columns boosted performance on the training set by only 0.5%, and surprisingly the binary / one-hot combo did hardly any better than the ordinal encoding.\n", "\n", "### Loading in the data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
people_idchar_1group_1char_2datechar_3char_4char_5char_6char_7...char_29char_30char_31char_32char_33char_34char_35char_36char_37char_38
0ppl_100type 2group 17304type 22021-06-29type 5type 5type 5type 3type 11...FalseTrueTrueFalseFalseTrueTrueTrueFalse36
1ppl_100002type 2group 8688type 32021-01-06type 28type 9type 5type 3type 11...FalseTrueTrueTrueTrueTrueTrueTrueFalse76
2ppl_100003type 2group 33592type 32022-06-10type 4type 8type 5type 2type 5...FalseFalseTrueTrueTrueTrueFalseTrueTrue99
\n", "

3 rows × 41 columns

\n", "
" ], "text/plain": [ " people_id char_1 group_1 char_2 date char_3 char_4 \\\n", "0 ppl_100 type 2 group 17304 type 2 2021-06-29 type 5 type 5 \n", "1 ppl_100002 type 2 group 8688 type 3 2021-01-06 type 28 type 9 \n", "2 ppl_100003 type 2 group 33592 type 3 2022-06-10 type 4 type 8 \n", "\n", " char_5 char_6 char_7 ... char_29 char_30 char_31 char_32 char_33 \\\n", "0 type 5 type 3 type 11 ... False True True False False \n", "1 type 5 type 3 type 11 ... False True True True True \n", "2 type 5 type 2 type 5 ... False False True True True \n", "\n", " char_34 char_35 char_36 char_37 char_38 \n", "0 True True True False 36 \n", "1 True True True False 76 \n", "2 True False True True 99 \n", "\n", "[3 rows x 41 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "people = pd.read_csv('people.csv.zip')\n", "people.head(3)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
people_idactivity_iddateactivity_categorychar_1char_2char_3char_4char_5char_6char_7char_8char_9char_10outcome
0ppl_100act2_17349282023-08-26type 4NaNNaNNaNNaNNaNNaNNaNNaNNaNtype 760
1ppl_100act2_24340932022-09-27type 2NaNNaNNaNNaNNaNNaNNaNNaNNaNtype 10
2ppl_100act2_34040492022-09-27type 2NaNNaNNaNNaNNaNNaNNaNNaNNaNtype 10
\n", "
" ], "text/plain": [ " people_id activity_id date activity_category char_1 char_2 char_3 \\\n", "0 ppl_100 act2_1734928 2023-08-26 type 4 NaN NaN NaN \n", "1 ppl_100 act2_2434093 2022-09-27 type 2 NaN NaN NaN \n", "2 ppl_100 act2_3404049 2022-09-27 type 2 NaN NaN NaN \n", "\n", " char_4 char_5 char_6 char_7 char_8 char_9 char_10 outcome \n", "0 NaN NaN NaN NaN NaN NaN type 76 0 \n", "1 NaN NaN NaN NaN NaN NaN type 1 0 \n", "2 NaN NaN NaN NaN NaN NaN type 1 0 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "actions = pd.read_csv('act_train.csv.zip')\n", "actions.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joining together to get dataset" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
people_idactivity_iddate_actionactivity_categorychar_1_actionchar_2_actionchar_3_actionchar_4_actionchar_5_actionchar_6_action...char_29char_30char_31char_32char_33char_34char_35char_36char_37char_38
0ppl_100act2_17349282023-08-26type 4NaNNaNNaNNaNNaNNaN...FalseTrueTrueFalseFalseTrueTrueTrueFalse36
1ppl_100act2_24340932022-09-27type 2NaNNaNNaNNaNNaNNaN...FalseTrueTrueFalseFalseTrueTrueTrueFalse36
2ppl_100act2_34040492022-09-27type 2NaNNaNNaNNaNNaNNaN...FalseTrueTrueFalseFalseTrueTrueTrueFalse36
3ppl_100act2_36512152023-08-04type 2NaNNaNNaNNaNNaNNaN...FalseTrueTrueFalseFalseTrueTrueTrueFalse36
4ppl_100act2_41090172023-08-26type 2NaNNaNNaNNaNNaNNaN...FalseTrueTrueFalseFalseTrueTrueTrueFalse36
\n", "

5 rows × 55 columns

\n", "
" ], "text/plain": [ " people_id activity_id date_action activity_category char_1_action \\\n", "0 ppl_100 act2_1734928 2023-08-26 type 4 NaN \n", "1 ppl_100 act2_2434093 2022-09-27 type 2 NaN \n", "2 ppl_100 act2_3404049 2022-09-27 type 2 NaN \n", "3 ppl_100 act2_3651215 2023-08-04 type 2 NaN \n", "4 ppl_100 act2_4109017 2023-08-26 type 2 NaN \n", "\n", " char_2_action char_3_action char_4_action char_5_action char_6_action \\\n", "0 NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN NaN \n", "\n", " ... char_29 char_30 char_31 char_32 char_33 char_34 char_35 char_36 \\\n", "0 ... False True True False False True True True \n", "1 ... False True True False False True True True \n", "2 ... False True True False False True True True \n", "3 ... False True True False False True True True \n", "4 ... False True True False False True True True \n", "\n", " char_37 char_38 \n", "0 False 36 \n", "1 False 36 \n", "2 False 36 \n", "3 False 36 \n", "4 False 36 \n", "\n", "[5 rows x 55 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "training_data_full = pd.merge(actions, people, how='inner', on='people_id', suffixes=['_action', '_person'], sort=False)\n", "training_data_full.head(5)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "((2197291, 15), (189118, 41), (2197291, 55))" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(actions.shape, people.shape, training_data_full.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building a preprocessing pipeline\n", "\n", "Notice the new `OmniEncoder` transformer and read more about its development in [my learning log](http://karlrosaen.com/ml/learning-log/2016-08-26/)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# %load \"preprocessing_transforms.py\"\n", "from sklearn.base import TransformerMixin, BaseEstimator\n", "import pandas as pd\n", "import heapq\n", "import numpy as np\n", "\n", "class BaseTransformer(BaseEstimator, TransformerMixin):\n", " def fit(self, X, y=None, **fit_params):\n", " return self\n", "\n", " def transform(self, X, **transform_params):\n", " return self\n", "\n", "\n", "class ColumnSelector(BaseTransformer):\n", " \"\"\"Selects columns from Pandas Dataframe\"\"\"\n", "\n", " def __init__(self, columns, c_type=None):\n", " self.columns = columns\n", " self.c_type = c_type\n", "\n", " def transform(self, X, **transform_params):\n", " cs = X[self.columns]\n", " if self.c_type is None:\n", " return cs\n", " else:\n", " return cs.astype(self.c_type)\n", "\n", "\n", "class OmniEncoder(BaseTransformer):\n", " \"\"\"\n", " Encodes a categorical variable using no more than k columns. 
As many values as possible\n", " are one-hot encoded, the remaining are fit within a binary encoded set of columns.\n", " If necessary some are dropped (e.g if (#unique_values) > 2^k).\n", "\n", " In deciding which values to one-hot encode, those that appear more frequently are\n", " preferred.\n", " \"\"\"\n", " def __init__(self, max_cols=20):\n", " self.column_infos = {}\n", " self.max_cols = max_cols\n", " if max_cols < 3 or max_cols > 100:\n", " raise ValueError(\"max_cols {} not within range(3, 100)\".format(max_cols))\n", "\n", " def fit(self, X, y=None, **fit_params):\n", " self.column_infos = {col: self._column_info(X[col], self.max_cols) for col in X.columns}\n", " return self\n", "\n", " def transform(self, X, **transform_params):\n", " return pd.concat(\n", " [self._encode_column(X[col], self.max_cols, *self.column_infos[col]) for col in X.columns],\n", " axis=1\n", " )\n", "\n", " @staticmethod\n", " def _encode_column(col, max_cols, one_hot_vals, binary_encoded_vals):\n", " num_one_hot = len(one_hot_vals)\n", " num_bits = max_cols - num_one_hot if len(binary_encoded_vals) > 0 else 0\n", "\n", " # http://stackoverflow.com/a/29091970/231589\n", " zero_base = ord('0')\n", " def i_to_bit_array(i):\n", " return np.fromstring(\n", " np.binary_repr(i, width=num_bits),\n", " 'u1'\n", " ) - zero_base\n", "\n", " binary_val_to_bit_array = {val: i_to_bit_array(idx + 1) for idx, val in enumerate(binary_encoded_vals)}\n", "\n", " bit_cols = [np.binary_repr(2 ** i, width=num_bits) for i in reversed(range(num_bits))]\n", "\n", " col_names = [\"{}_{}\".format(col.name, val) for val in one_hot_vals] + [\"{}_{}\".format(col.name, bit_col) for bit_col in bit_cols]\n", "\n", " zero_bits = np.zeros(num_bits, dtype=np.int)\n", "\n", " def splat(v):\n", " v_one_hot = [1 if v == ohv else 0 for ohv in one_hot_vals]\n", " v_bits = binary_val_to_bit_array.get(v, zero_bits)\n", "\n", " return pd.Series(np.concatenate([v_one_hot, v_bits]))\n", "\n", " df = col.apply(splat)\n", " df.columns = col_names\n", "\n", " return df\n", "\n", " @staticmethod\n", " def _column_info(col, max_cols):\n", " \"\"\"\n", "\n", " :param col: pd.Series\n", " :return: {'val': 44, 'val2': 4, ...}\n", " \"\"\"\n", " val_counts = dict(col.value_counts())\n", " num_one_hot = OmniEncoder._num_onehot(len(val_counts), max_cols)\n", " return OmniEncoder._partition_one_hot(val_counts, num_one_hot)\n", "\n", " @staticmethod\n", " def _partition_one_hot(val_counts, num_one_hot):\n", " \"\"\"\n", " Paritions the values in val counts into a list of values that should be\n", " one-hot encoded and a list of values that should be binary encoded.\n", "\n", " The `num_one_hot` most popular values are chosen to be one-hot encoded.\n", "\n", " :param val_counts: {'val': 433}\n", " :param num_one_hot: the number of elements to be one-hot encoded\n", " :return: ['val1', 'val2'], ['val55', 'val59']\n", " \"\"\"\n", " one_hot_vals = [k for (k, count) in heapq.nlargest(num_one_hot, val_counts.items(), key=lambda t: t[1])]\n", " one_hot_vals_lookup = set(one_hot_vals)\n", "\n", " bin_encoded_vals = [val for val in val_counts if val not in one_hot_vals_lookup]\n", "\n", " return sorted(one_hot_vals), sorted(bin_encoded_vals)\n", "\n", "\n", " @staticmethod\n", " def _num_onehot(n, k):\n", " \"\"\"\n", " Determines the number of onehot columns we can have to encode n values\n", " in no more than k columns, assuming we will binary encode the rest.\n", "\n", " :param n: The number of unique values to encode\n", " :param k: The maximum number of columns 
we have\n", " :return: The number of one-hot columns to use\n", " \"\"\"\n", " num_one_hot = min(n, k)\n", "\n", " def num_bin_vals(num):\n", " if num == 0:\n", " return 0\n", " return 2 ** num - 1\n", "\n", " def capacity(oh):\n", " \"\"\"\n", " Capacity given we are using `oh` one hot columns.\n", " \"\"\"\n", " return oh + num_bin_vals(k - oh)\n", "\n", " while capacity(num_one_hot) < n and num_one_hot > 0:\n", " num_one_hot -= 1\n", "\n", " return num_one_hot\n", "\n", "\n", "class EncodeCategorical(BaseTransformer):\n", " def __init__(self):\n", " self.categorical_vals = {}\n", "\n", " def fit(self, X, y=None, **fit_params):\n", " self.categorical_vals = {col: {label: idx + 1 for idx, label in enumerate(sorted(X[col].dropna().unique()))} for\n", " col in X.columns}\n", " return self\n", "\n", " def transform(self, X, **transform_params):\n", " return pd.concat(\n", " [X[col].map(self.categorical_vals[col]) for col in X.columns],\n", " axis=1\n", " )\n", "\n", "\n", "class SpreadBinary(BaseTransformer):\n", "\n", " def transform(self, X, **transform_params):\n", " return X.applymap(lambda x: 1 if x == 1 else -1)\n", "\n", "\n", "class DfTransformerAdapter(BaseTransformer):\n", " \"\"\"Adapts a scikit-learn Transformer to return a pandas DataFrame\"\"\"\n", "\n", " def __init__(self, transformer):\n", " self.transformer = transformer\n", "\n", " def fit(self, X, y=None, **fit_params):\n", " self.transformer.fit(X, y=y, **fit_params)\n", " return self\n", "\n", " def transform(self, X, **transform_params):\n", " raw_result = self.transformer.transform(X, **transform_params)\n", " return pd.DataFrame(raw_result, columns=X.columns, index=X.index)\n", "\n", "\n", "class DfOneHot(BaseTransformer):\n", " \"\"\"\n", " Wraps helper method `get_dummies` making sure all columns get one-hot encoded.\n", " \"\"\"\n", " def __init__(self):\n", " self.dummy_columns = []\n", "\n", " def fit(self, X, y=None, **fit_params):\n", " self.dummy_columns = pd.get_dummies(\n", " X,\n", " prefix=[c for c in X.columns],\n", " columns=X.columns).columns\n", " return self\n", "\n", " def transform(self, X, **transform_params):\n", " return pd.get_dummies(\n", " X,\n", " prefix=[c for c in X.columns],\n", " columns=X.columns).reindex(columns=self.dummy_columns, fill_value=0)\n", "\n", "\n", "class DfFeatureUnion(BaseTransformer):\n", " \"\"\"A dataframe friendly implementation of `FeatureUnion`\"\"\"\n", "\n", " def __init__(self, transformers):\n", " self.transformers = transformers\n", "\n", " def fit(self, X, y=None, **fit_params):\n", " for l, t in self.transformers:\n", " t.fit(X, y=y, **fit_params)\n", " return self\n", "\n", " def transform(self, X, **transform_params):\n", " transform_results = [t.transform(X, **transform_params) for l, t in self.transformers]\n", " return pd.concat(transform_results, axis=1)\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "in people_id there are 151295 unique values\n", "in activity_id there are 2197291 unique values\n", "in date_action there are 411 unique values\n", "in activity_category there are 7 unique values\n", "in char_1_action there are 52 unique values\n", "in char_2_action there are 33 unique values\n", "in char_3_action there are 12 unique values\n", "in char_4_action there are 8 unique values\n", "in char_5_action there are 8 unique values\n", "in char_6_action there are 6 unique values\n", "in char_7_action there are 9 unique values\n", "in 
char_8_action there are 19 unique values\n", "in char_9_action there are 20 unique values\n", "in char_10_action there are 6516 unique values\n", "in outcome there are 2 unique values\n", "in char_1_person there are 2 unique values\n", "in group_1 there are 29899 unique values\n", "in char_2_person there are 3 unique values\n", "in date_person there are 1196 unique values\n", "in char_3_person there are 43 unique values\n", "in char_4_person there are 25 unique values\n", "in char_5_person there are 9 unique values\n", "in char_6_person there are 7 unique values\n", "in char_7_person there are 25 unique values\n", "in char_8_person there are 8 unique values\n", "in char_9_person there are 9 unique values\n", "in char_10_person there are 2 unique values\n", "in char_11 there are 2 unique values\n", "in char_12 there are 2 unique values\n", "in char_13 there are 2 unique values\n", "in char_14 there are 2 unique values\n", "in char_15 there are 2 unique values\n", "in char_16 there are 2 unique values\n", "in char_17 there are 2 unique values\n", "in char_18 there are 2 unique values\n", "in char_19 there are 2 unique values\n", "in char_20 there are 2 unique values\n", "in char_21 there are 2 unique values\n", "in char_22 there are 2 unique values\n", "in char_23 there are 2 unique values\n", "in char_24 there are 2 unique values\n", "in char_25 there are 2 unique values\n", "in char_26 there are 2 unique values\n", "in char_27 there are 2 unique values\n", "in char_28 there are 2 unique values\n", "in char_29 there are 2 unique values\n", "in char_30 there are 2 unique values\n", "in char_31 there are 2 unique values\n", "in char_32 there are 2 unique values\n", "in char_33 there are 2 unique values\n", "in char_34 there are 2 unique values\n", "in char_35 there are 2 unique values\n", "in char_36 there are 2 unique values\n", "in char_37 there are 2 unique values\n", "in char_38 there are 101 unique values\n" ] } ], "source": [ "for col in training_data_full.columns:\n", " print(\"in {} there are {} unique values\".format(col, len(training_data_full[col].unique())))\n", "None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Potential trouble with high dimensionality\n", "\n", "Notice that char_10_action, group_1 and others have a ton of unique values; one-hot encoding will result in a dataframe with thousands of columns. 
\n", "\n", "Let's explore 3 approaches to dealing with categorical columns with a lot of unique values and compare performance:\n", "\n", "- ignore them\n", "- encode them ordinally, mapping every unique value to a different integer (assuming some ordered value that probably doesn't exist, at least not by our default lexicographical sorting)\n", "- encode them with a combo of one-hot and binary\n", "\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "\n", "from sklearn.preprocessing import Imputer, StandardScaler\n", "\n", "cat_columns = ['activity_category',\n", " 'char_1_action', 'char_2_action', 'char_3_action', 'char_4_action',\n", " 'char_5_action', 'char_6_action', 'char_7_action', 'char_8_action',\n", " 'char_9_action', 'char_1_person',\n", " 'char_2_person', 'char_3_person',\n", " 'char_4_person', 'char_5_person', 'char_6_person', 'char_7_person',\n", " 'char_8_person', 'char_9_person', 'char_10_person', 'char_11',\n", " 'char_12', 'char_13', 'char_14', 'char_15', 'char_16', 'char_17',\n", " 'char_18', 'char_19', 'char_20', 'char_21', 'char_22', 'char_23',\n", " 'char_24', 'char_25', 'char_26', 'char_27', 'char_28', 'char_29',\n", " 'char_30', 'char_31', 'char_32', 'char_33', 'char_34', 'char_35',\n", " 'char_36', 'char_37']\n", "\n", "high_dim_cat_columns = ['date_action', 'char_10_action', 'group_1', 'date_person']\n", "\n", "q_columns = ['char_38']\n", "\n", "preprocessor_ignore = Pipeline([\n", " ('features', DfFeatureUnion([\n", " ('quantitative', Pipeline([\n", " ('select-quantitative', ColumnSelector(q_columns, c_type='float')),\n", " ('impute-missing', DfTransformerAdapter(Imputer(strategy='median'))),\n", " ('scale', DfTransformerAdapter(StandardScaler()))\n", " ])),\n", " ('categorical', Pipeline([\n", " ('select-categorical', ColumnSelector(cat_columns)),\n", " ('apply-onehot', DfOneHot()),\n", " ('spread-binary', SpreadBinary())\n", " ])),\n", " ]))\n", "])\n", "\n", "preprocessor_lexico = Pipeline([\n", " ('features', DfFeatureUnion([\n", " ('quantitative', Pipeline([\n", " ('combine-q', DfFeatureUnion([\n", " ('highd', Pipeline([\n", " ('select-highd', ColumnSelector(high_dim_cat_columns)),\n", " ('encode-highd', EncodeCategorical()) \n", " ])),\n", " ('select-quantitative', ColumnSelector(q_columns, c_type='float')),\n", " ])),\n", " ('impute-missing', DfTransformerAdapter(Imputer(strategy='median'))),\n", " ('scale', DfTransformerAdapter(StandardScaler()))\n", " ])),\n", " ('categorical', Pipeline([\n", " ('select-categorical', ColumnSelector(cat_columns)),\n", " ('apply-onehot', DfOneHot()),\n", " ('spread-binary', SpreadBinary())\n", " ])),\n", " ]))\n", "])\n", "\n", "preprocessor_omni_20 = Pipeline([\n", " ('features', DfFeatureUnion([\n", " ('quantitative', Pipeline([\n", " ('select-quantitative', ColumnSelector(q_columns, c_type='float')),\n", " ('impute-missing', DfTransformerAdapter(Imputer(strategy='median'))),\n", " ('scale', DfTransformerAdapter(StandardScaler()))\n", " ])),\n", " ('categorical', Pipeline([\n", " ('select-categorical', ColumnSelector(cat_columns + high_dim_cat_columns)),\n", " ('apply-onehot', OmniEncoder(max_cols=20)),\n", " ('spread-binary', SpreadBinary())\n", " ])),\n", " ]))\n", "])\n", "\n", "preprocessor_omni_50 = Pipeline([\n", " ('features', DfFeatureUnion([\n", " ('quantitative', Pipeline([\n", " ('select-quantitative', ColumnSelector(q_columns, c_type='float')),\n", " ('impute-missing', 
DfTransformerAdapter(Imputer(strategy='median'))),\n", " ('scale', DfTransformerAdapter(StandardScaler()))\n", " ])),\n", " ('categorical', Pipeline([\n", " ('select-categorical', ColumnSelector(cat_columns + high_dim_cat_columns)),\n", " ('apply-onehot', OmniEncoder(max_cols=50)),\n", " ('spread-binary', SpreadBinary())\n", " ])),\n", " ]))\n", "])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling to reduce runtime when training on a large dataset\n", "\n", "If we train models on the entire training dataset provided, it exhausts the memory on my laptop. Again, in the spirit of getting something quick and dirty working, we'll sample the dataset and train on that. We'll then evaluate our model by testing the accuracy on a larger sample." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "training_frac = 0.01\n", "test_frac = 0.05\n", "\n", "training_data, the_rest = train_test_split(training_data_full, train_size=training_frac, random_state=0)\n", "test_data = the_rest.sample(frac=test_frac / (1-training_frac))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(21972, 55)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "training_data.shape" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(109865, 55)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reporting utilities\n", "\n", "Some utilities to make reporting progress easier." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import time\n", "import subprocess\n", "\n", "class time_and_log():\n", " \n", " def __init__(self, label, *, prefix='', say=False):\n", " self.label = label\n", " self.prefix = prefix\n", " self.say = say\n", " \n", " def __enter__(self):\n", " msg = 'Starting {}'.format(self.label)\n", " print('{}{}'.format(self.prefix, msg))\n", " if self.say:\n", " cmd_say(msg)\n", " self.start = time.process_time()\n", " return self\n", "\n", " def __exit__(self, *exc):\n", " self.interval = time.process_time() - self.start\n", " msg = 'Finished {} in {:.2f} seconds'.format(self.label, self.interval)\n", " print('{}{}'.format(self.prefix, msg))\n", " if self.say:\n", " cmd_say(msg)\n", " return False\n", " \n", "def cmd_say(msg):\n", " # speak progress updates aloud via the macOS 'say' command\n", " subprocess.call(\"say '{}'\".format(msg), shell=True)\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " _Starting wrangling training data\n", " _Finished wrangling training data in 383.88 seconds\n" ] } ], "source": [ "with time_and_log('wrangling training data', say=True, prefix=\" _\"):\n", " wrangled = preprocessor_omni_20.fit_transform(training_data)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
char_38activity_category_type 1activity_category_type 2activity_category_type 3activity_category_type 4activity_category_type 5activity_category_type 6activity_category_type 7char_1_action_type 1char_1_action_type 10...date_person_01000000000date_person_00100000000date_person_00010000000date_person_00001000000date_person_00000100000date_person_00000010000date_person_00000001000date_person_00000000100date_person_00000000010date_person_00000000001
1119692-0.413876-1-1-1-11-1-1-1-1...-1-1-1-1-1-1-1-1-1-1
3311260.332410-11-1-1-1-1-1-1-1...1-1-11-1-1111-1
424011-0.192754-1-1-1-11-1-1-1-1...111-1-1-1-1-111
3417960.000727-11-1-1-1-1-1-1-1...-1-1-1-1-1-1-1-1-1-1
226920.885214-1-11-1-1-1-1-1-1...11-1-1-1111-1-1
\n", "

5 rows × 354 columns

\n", "
" ], "text/plain": [ " char_38 activity_category_type 1 activity_category_type 2 \\\n", "1119692 -0.413876 -1 -1 \n", "331126 0.332410 -1 1 \n", "424011 -0.192754 -1 -1 \n", "341796 0.000727 -1 1 \n", "22692 0.885214 -1 -1 \n", "\n", " activity_category_type 3 activity_category_type 4 \\\n", "1119692 -1 -1 \n", "331126 -1 -1 \n", "424011 -1 -1 \n", "341796 -1 -1 \n", "22692 1 -1 \n", "\n", " activity_category_type 5 activity_category_type 6 \\\n", "1119692 1 -1 \n", "331126 -1 -1 \n", "424011 1 -1 \n", "341796 -1 -1 \n", "22692 -1 -1 \n", "\n", " activity_category_type 7 char_1_action_type 1 \\\n", "1119692 -1 -1 \n", "331126 -1 -1 \n", "424011 -1 -1 \n", "341796 -1 -1 \n", "22692 -1 -1 \n", "\n", " char_1_action_type 10 ... \\\n", "1119692 -1 ... \n", "331126 -1 ... \n", "424011 -1 ... \n", "341796 -1 ... \n", "22692 -1 ... \n", "\n", " date_person_01000000000 date_person_00100000000 \\\n", "1119692 -1 -1 \n", "331126 1 -1 \n", "424011 1 1 \n", "341796 -1 -1 \n", "22692 1 1 \n", "\n", " date_person_00010000000 date_person_00001000000 \\\n", "1119692 -1 -1 \n", "331126 -1 1 \n", "424011 1 -1 \n", "341796 -1 -1 \n", "22692 -1 -1 \n", "\n", " date_person_00000100000 date_person_00000010000 \\\n", "1119692 -1 -1 \n", "331126 -1 -1 \n", "424011 -1 -1 \n", "341796 -1 -1 \n", "22692 -1 1 \n", "\n", " date_person_00000001000 date_person_00000000100 \\\n", "1119692 -1 -1 \n", "331126 1 1 \n", "424011 -1 -1 \n", "341796 -1 -1 \n", "22692 1 1 \n", "\n", " date_person_00000000010 date_person_00000000001 \n", "1119692 -1 -1 \n", "331126 1 -1 \n", "424011 1 1 \n", "341796 -1 -1 \n", "22692 -1 -1 \n", "\n", "[5 rows x 354 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wrangled.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Putting together classifiers" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "pipe_rf_ignore = Pipeline([\n", " ('wrangle', preprocessor_ignore),\n", " ('rf', RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0))\n", " ])\n", "\n", "pipe_rf_lexico = Pipeline([\n", " ('wrangle', preprocessor_lexico),\n", " ('rf', RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0))\n", " ])\n", "\n", "pipe_rf_omni_20 = Pipeline([\n", " ('wrangle', preprocessor_omni_20),\n", " ('rf', RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0))\n", " ])\n", "\n", "pipe_rf_omni_50 = Pipeline([\n", " ('wrangle', preprocessor_omni_50),\n", " ('rf', RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0))\n", " ])" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [ "feature_columns = cat_columns + q_columns + high_dim_cat_columns" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def extract_X_y(df):\n", " return df[feature_columns], df['outcome']\n", "\n", "X_train, y_train = extract_X_y(training_data)\n", "X_test, y_test = extract_X_y(test_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross validation and full test set accuracy\n", "\n", "We'll cross validate within the training set, and then train on the full training set and see how well it performs on the full test set." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Evaluating random forest ignore\n", " _Starting fitting full training set\n", " _Finished fitting full training set in 3.86 seconds\n", " _Starting evaluating on full test set\n", " Full test accuracy (0.05 of dataset): 0.880\n", " _Finished evaluating on full test set in 16.32 seconds\n", "Evaluating random forest ordinal\n", " _Starting fitting full training set\n", " _Finished fitting full training set in 4.26 seconds\n", " _Starting evaluating on full test set\n", " Full test accuracy (0.05 of dataset): 0.885\n", " _Finished evaluating on full test set in 16.10 seconds\n", "Evaluating random forest omni 20\n", " _Starting fitting full training set\n", " _Finished fitting full training set in 376.31 seconds\n", " _Starting evaluating on full test set\n", " Full test accuracy (0.05 of dataset): 0.885\n", " _Finished evaluating on full test set in 1050.23 seconds\n", "Evaluating random forest omni 50\n", " _Starting fitting full training set\n", " _Finished fitting full training set in 417.19 seconds\n", " _Starting evaluating on full test set\n", " Full test accuracy (0.05 of dataset): 0.886\n", " _Finished evaluating on full test set in 1102.41 seconds\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "from sklearn.cross_validation import cross_val_score\n", "import numpy as np\n", "\n", "models = [\n", " ('random forest ignore', pipe_rf_ignore), \n", " ('random forest ordinal', pipe_rf_lexico), \n", " ('random forest omni 20', pipe_rf_omni_20), \n", " ('random forest omni 50', pipe_rf_omni_50), \n", "]\n", "\n", "for label, model in models:\n", " print('Evaluating {}'.format(label))\n", " cmd_say('Evaluating {}'.format(label))\n", "# with time_and_log('cross validating', say=True, prefix=\" _\"):\n", "# scores = cross_val_score(estimator=model,\n", "# X=X_train,\n", "# y=y_train,\n", "# cv=5,\n", "# n_jobs=1)\n", "# print(' CV accuracy: {:.3f} +/- {:.3f}'.format(np.mean(scores), np.std(scores)))\n", " with time_and_log('fitting full training set', say=True, prefix=\" _\"):\n", " model.fit(X_train, y_train) \n", " with time_and_log('evaluating on full test set', say=True, prefix=\" _\"):\n", " print(\" Full test accuracy ({:.2f} of dataset): {:.3f}\".format(\n", " test_frac, \n", " accuracy_score(y_test, model.predict(X_test)))) " ] } ], "metadata": { "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }