{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Kaggle's Predicting Red Hat Business Value\n", "\n", "This is a follow-up attempt at Kaggle's [Predicting Red Hat Business Value](https://www.kaggle.com/c/predicting-red-hat-business-value) competition.\n", "\n", "See [my notebooks section](http://karlrosaen.com/ml/notebooks) for links to the first attempt and other Kaggle competitions.\n", "\n", "The focus of this iteration is exploring whether we can bring back the previously ignored categorical columns that have hundreds if not thousands of unique values, making it impractical to use one-hot encoding.\n", "\n", "Two approaches are taken on categorical variables with a large number of unique values:\n", "\n", "- encoding the values ordinally: sorting the values lexicographically, assigning a sequence of numbers, and then treating them quantitatively from there\n", "- encoding the most frequently occurring values using one-hot and then binary encoding the rest. As part of this I developed a new scikit-learn transformer\n", "\n", "The end result: re-including the columns boosted performance on the training set by only 0.5%, and surprisingly the binary / one-hot combo did hardly any better than the ordinal encoding.\n", "\n", "### Loading in the data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
| | people_id | char_1 | group_1 | char_2 | date | char_3 | char_4 | char_5 | char_6 | char_7 | ... | char_29 | char_30 | char_31 | char_32 | char_33 | char_34 | char_35 | char_36 | char_37 | char_38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ppl_100 | type 2 | group 17304 | type 2 | 2021-06-29 | type 5 | type 5 | type 5 | type 3 | type 11 | ... | False | True | True | False | False | True | True | True | False | 36 |
| 1 | ppl_100002 | type 2 | group 8688 | type 3 | 2021-01-06 | type 28 | type 9 | type 5 | type 3 | type 11 | ... | False | True | True | True | True | True | True | True | False | 76 |
| 2 | ppl_100003 | type 2 | group 33592 | type 3 | 2022-06-10 | type 4 | type 8 | type 5 | type 2 | type 5 | ... | False | False | True | True | True | True | False | True | True | 99 |

3 rows × 41 columns

| | people_id | activity_id | date | activity_category | char_1 | char_2 | char_3 | char_4 | char_5 | char_6 | char_7 | char_8 | char_9 | char_10 | outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ppl_100 | act2_1734928 | 2023-08-26 | type 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | type 76 | 0 |
| 1 | ppl_100 | act2_2434093 | 2022-09-27 | type 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | type 1 | 0 |
| 2 | ppl_100 | act2_3404049 | 2022-09-27 | type 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | type 1 | 0 |

| | people_id | activity_id | date_action | activity_category | char_1_action | char_2_action | char_3_action | char_4_action | char_5_action | char_6_action | ... | char_29 | char_30 | char_31 | char_32 | char_33 | char_34 | char_35 | char_36 | char_37 | char_38 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ppl_100 | act2_1734928 | 2023-08-26 | type 4 | NaN | NaN | NaN | NaN | NaN | NaN | ... | False | True | True | False | False | True | True | True | False | 36 |
| 1 | ppl_100 | act2_2434093 | 2022-09-27 | type 2 | NaN | NaN | NaN | NaN | NaN | NaN | ... | False | True | True | False | False | True | True | True | False | 36 |
| 2 | ppl_100 | act2_3404049 | 2022-09-27 | type 2 | NaN | NaN | NaN | NaN | NaN | NaN | ... | False | True | True | False | False | True | True | True | False | 36 |
| 3 | ppl_100 | act2_3651215 | 2023-08-04 | type 2 | NaN | NaN | NaN | NaN | NaN | NaN | ... | False | True | True | False | False | True | True | True | False | 36 |
| 4 | ppl_100 | act2_4109017 | 2023-08-26 | type 2 | NaN | NaN | NaN | NaN | NaN | NaN | ... | False | True | True | False | False | True | True | True | False | 36 |

5 rows × 55 columns

| | char_38 | activity_category_type 1 | activity_category_type 2 | activity_category_type 3 | activity_category_type 4 | activity_category_type 5 | activity_category_type 6 | activity_category_type 7 | char_1_action_type 1 | char_1_action_type 10 | ... | date_person_01000000000 | date_person_00100000000 | date_person_00010000000 | date_person_00001000000 | date_person_00000100000 | date_person_00000010000 | date_person_00000001000 | date_person_00000000100 | date_person_00000000010 | date_person_00000000001 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1119692 | -0.413876 | -1 | -1 | -1 | -1 | 1 | -1 | -1 | -1 | -1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 331126 | 0.332410 | -1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | 1 | -1 | -1 | 1 | -1 | -1 | 1 | 1 | 1 | -1 |
| 424011 | -0.192754 | -1 | -1 | -1 | -1 | 1 | -1 | -1 | -1 | -1 | ... | 1 | 1 | 1 | -1 | -1 | -1 | -1 | -1 | 1 | 1 |
| 341796 | 0.000727 | -1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 22692 | 0.885214 | -1 | -1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 | ... | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | -1 | -1 |

5 rows × 354 columns
\n", "