{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Handling Categorical Data in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What does Nominal, Ordinal and Continuous features mean?\n", "\n", "Categorical features can only take on a limited, and usually fixed, number of possible values. For example, if a dataset is about information related to users, then you will typically find features like country, gender, age group, etc. Alternatively, if the data you're working with is related to products, you will find features like product type, manufacturer, seller and so on.\n", "\n", "These are all categorical features in your dataset. These features are typically stored as text values which represent various traits of the observations. For example, gender is described as Male (M) or Female (F), product type could be described as electronics, apparels, food etc.\n", "\n", "**Note that these type of features where the categories are only labeled without any order of precedence are called nominal features. Features which have some order associated with them are called ordinal features.** \n", "\n", "For example, a feature like economic status, with three categories: low, medium and high, which have an order associated with them.\n", "\n", "There are also continuous features. These are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or a date/time.\n", "\n", "Regardless of what the value is used for, the challenge is determining how to use this data in the analysis because of the following constraints:\n", "\n", "- Categorical features may have a very large number of levels, known as high cardinality, (for example, cities or URLs), where most of the levels appear in a relatively small number of instances.\n", "\n", "- Many machine learning models, such as regression or SVM, are algebraic. This means that their input must be numerical. To use these models, categories must be transformed into numbers first, before you can apply the learning algorithm on them.\n", "\n", "- While some ML packages or libraries might transform categorical data to numeric automatically based on some default embedding method, many other ML packages don’t support such inputs.\n", "\n", "- For the machine, categorical data doesn’t contain the same context or information that humans can easily associate and understand. For example, when looking at a feature called ``City`` with three cities ``New York``, ``New Jersey`` and ``New Delhi``, humans can infer that ``New York`` is closely related to ``New Jersey`` as they are from same country, while ``New York`` and ``New Delhi`` are much different. But for the model, ``New York``, ``New Jersey`` and ``New Delhi``, are just three different levels (possible values) of the same feature ``City``. If you don’t specify the additional contextual information, it will be impossible for the model to differentiate between highly different levels.\n", "\n", "You therefore are faced with the challenge of figuring out how to turn these text values into numerical values for further processing and unmask lots of interesting information which these features might hide. Typically, any standard work-flow in feature engineering involves some form of transformation of these categorical values into numeric labels and then applying some encoding scheme on these values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encoding Categorical Data\n", "Existing Encoding Methods (all modules for which code is available, see: http://contrib.scikit-learn.org/categorical-encoding/_modules/index.html)\n", "- category_encoders.backward_difference\n", "- category_encoders.basen\n", "- category_encoders.binary\n", "- category_encoders.hashing\n", "- category_encoders.helmert\n", "- category_encoders.leave_one_out\n", "- category_encoders.one_hot\n", "- category_encoders.ordinal\n", "- category_encoders.polynomial\n", "- category_encoders.sum_coding\n", "- category_encoders.target_encoder\n", "\n", "\n", "The techniques that you'll cover are the following:\n", "\n", "* Replacing values\n", "* Encoding labels\n", "* One-Hot encoding\n", "* Binary encoding\n", "* Backward difference encoding\n", "* Polynomial encodings\n", "* Miscellaneous features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ## 1. Replace Values" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import copy\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>year</th>\n", " <th>month</th>\n", " <th>day</th>\n", " <th>dep_time</th>\n", " <th>dep_delay</th>\n", " <th>arr_time</th>\n", " <th>arr_delay</th>\n", " <th>carrier</th>\n", " <th>tailnum</th>\n", " <th>flight</th>\n", " <th>origin</th>\n", " <th>dest</th>\n", " <th>air_time</th>\n", " <th>distance</th>\n", " <th>hour</th>\n", " <th>minute</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>2014</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>1.0</td>\n", " <td>96.0</td>\n", " <td>235.0</td>\n", " <td>70.0</td>\n", " <td>AS</td>\n", " <td>N508AS</td>\n", " <td>145</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " <td>194.0</td>\n", " <td>1542</td>\n", " <td>0.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2014</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>4.0</td>\n", " <td>-6.0</td>\n", " <td>738.0</td>\n", " <td>-23.0</td>\n", " <td>US</td>\n", " <td>N195UW</td>\n", " <td>1830</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " <td>252.0</td>\n", " <td>2279</td>\n", " <td>0.0</td>\n", " <td>4.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>2014</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>8.0</td>\n", " <td>13.0</td>\n", " <td>548.0</td>\n", " <td>-4.0</td>\n", " <td>UA</td>\n", " <td>N37422</td>\n", " <td>1609</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " <td>201.0</td>\n", " <td>1825</td>\n", " <td>0.0</td>\n", " <td>8.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2014</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>28.0</td>\n", " <td>-2.0</td>\n", " <td>800.0</td>\n", " <td>-23.0</td>\n", " <td>US</td>\n", " <td>N547UW</td>\n", " <td>466</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " <td>251.0</td>\n", " <td>2282</td>\n", " <td>0.0</td>\n", " <td>28.0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>2014</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>34.0</td>\n", " <td>44.0</td>\n", " <td>325.0</td>\n", " <td>43.0</td>\n", " <td>AS</td>\n", " <td>N762AS</td>\n", " <td>121</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " <td>201.0</td>\n", " <td>1448</td>\n", " <td>0.0</td>\n", " <td>34.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " year month day dep_time dep_delay arr_time arr_delay carrier tailnum \\\n", "0 2014 1 1 1.0 96.0 235.0 70.0 AS N508AS \n", "1 2014 1 1 4.0 -6.0 738.0 -23.0 US N195UW \n", "2 2014 1 1 8.0 13.0 548.0 -4.0 UA N37422 \n", "3 2014 1 1 28.0 -2.0 800.0 -23.0 US N547UW \n", "4 2014 1 1 34.0 44.0 325.0 43.0 AS N762AS \n", "\n", " flight origin dest air_time distance hour minute \n", "0 145 PDX ANC 194.0 1542 0.0 1.0 \n", "1 1830 SEA CLT 252.0 2279 0.0 4.0 \n", "2 1609 PDX IAH 201.0 1825 0.0 8.0 \n", "3 466 PDX CLT 251.0 2282 0.0 28.0 \n", "4 121 SEA ANC 201.0 1448 0.0 34.0 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_flights = pd.read_csv('https://raw.githubusercontent.com/ismayc/pnwflights14/master/data/flights.csv')\n", "\n", "df_flights.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>carrier</th>\n", " <th>tailnum</th>\n", " <th>origin</th>\n", " <th>dest</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>AS</td>\n", " <td>N508AS</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>US</td>\n", " <td>N195UW</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>UA</td>\n", " <td>N37422</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>US</td>\n", " <td>N547UW</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>AS</td>\n", " <td>N762AS</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " carrier tailnum origin dest\n", "0 AS N508AS PDX ANC\n", "1 US N195UW SEA CLT\n", "2 UA N37422 PDX IAH\n", "3 US N547UW PDX CLT\n", "4 AS N762AS SEA ANC" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_df_flights = df_flights.select_dtypes(include=['object']).copy()\n", "cat_df_flights.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "248\n" ] } ], "source": [ "print(cat_df_flights.isnull().values.sum())" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n" ] } ], "source": [ "cat_df_flights = cat_df_flights.fillna(cat_df_flights['tailnum'].value_counts().index[0])\n", "print(cat_df_flights.isnull().values.sum())" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "replace_map = {'carrier': {'AA': 1, 'AS': 2, 'B6': 3, 'DL': 4, 'F9': 5, 'HA': 6, 'OO': 7 , 'UA': 8 , 'US': 9,'VX': 10,'WN': 11}}" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'carrier': {'AA': 1, 'AS': 2, 'B6': 3, 'DL': 4, 'F9': 5, 'HA': 6, 'OO': 7, 'UA': 8, 'US': 9, 'VX': 10, 'WN': 11}}\n" ] } ], "source": [ "labels = cat_df_flights['carrier'].astype('category').cat.categories.tolist()\n", "replace_map_comp = {'carrier' : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}\n", "\n", "print(replace_map_comp)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "cat_df_flights_replace = cat_df_flights.copy()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " carrier tailnum origin dest\n", "0 2 N508AS PDX ANC\n", "1 9 N195UW SEA CLT\n", "2 8 N37422 PDX IAH\n", "3 9 N547UW PDX CLT\n", "4 2 N762AS SEA ANC\n" ] } ], "source": [ "cat_df_flights_replace.replace(replace_map_comp, inplace=True)\n", "\n", "print(cat_df_flights_replace.head())" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "int64\n" ] } ], "source": [ "print(cat_df_flights_replace['carrier'].dtypes)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "carrier category\n", "tailnum object\n", "origin category\n", "dest object\n", "dtype: object\n" ] } ], "source": [ "cat_df_flights_lc = cat_df_flights.copy()\n", "cat_df_flights_lc['carrier'] = cat_df_flights_lc['carrier'].astype('category')\n", "cat_df_flights_lc['origin'] = cat_df_flights_lc['origin'].astype('category') \n", "\n", "print(cat_df_flights_lc.dtypes)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "31.3 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "import time\n", "%timeit cat_df_flights.groupby(['origin','carrier']).count() #DataFrame with object dtype columns" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "21.6 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "source": [ "%timeit cat_df_flights_lc.groupby(['origin','carrier']).count() #DataFrame with category dtype columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ## 2. Label Encoding" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>carrier</th>\n", " <th>tailnum</th>\n", " <th>origin</th>\n", " <th>dest</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>AS</td>\n", " <td>N508AS</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>US</td>\n", " <td>N195UW</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>UA</td>\n", " <td>N37422</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>US</td>\n", " <td>N547UW</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>AS</td>\n", " <td>N762AS</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " carrier tailnum origin dest\n", "0 AS N508AS PDX ANC\n", "1 US N195UW SEA CLT\n", "2 UA N37422 PDX IAH\n", "3 US N547UW PDX CLT\n", "4 AS N762AS SEA ANC" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_df_flights_lc.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "cat_df_flights_lc['carrier'] = cat_df_flights_lc['carrier'].cat.codes" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>carrier</th>\n", " <th>tailnum</th>\n", " <th>origin</th>\n", " <th>dest</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>N508AS</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>8</td>\n", " <td>N195UW</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>7</td>\n", " <td>N37422</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>8</td>\n", " <td>N547UW</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1</td>\n", " <td>N762AS</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " carrier tailnum origin dest\n", "0 1 N508AS PDX ANC\n", "1 8 N195UW SEA CLT\n", "2 7 N37422 PDX IAH\n", "3 8 N547UW PDX CLT\n", "4 1 N762AS SEA ANC" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_df_flights_lc.head() #alphabetically labeled from 0 to 10" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>carrier</th>\n", " <th>tailnum</th>\n", " <th>origin</th>\n", " <th>dest</th>\n", " <th>US_code</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>AS</td>\n", " <td>N508AS</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>US</td>\n", " <td>N195UW</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>UA</td>\n", " <td>N37422</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>US</td>\n", " <td>N547UW</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>AS</td>\n", " <td>N762AS</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " carrier tailnum origin dest US_code\n", "0 AS N508AS PDX ANC 0\n", "1 US N195UW SEA CLT 1\n", "2 UA N37422 PDX IAH 0\n", "3 US N547UW PDX CLT 1\n", "4 AS N762AS SEA ANC 0" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_df_flights_specific = cat_df_flights.copy()\n", "cat_df_flights_specific['US_code'] = np.where(cat_df_flights_specific['carrier'].str.contains('US'), 1, 0)\n", "\n", "cat_df_flights_specific.head()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>carrier</th>\n", " <th>tailnum</th>\n", " <th>origin</th>\n", " <th>dest</th>\n", " <th>carrier_code</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>AS</td>\n", " <td>N508AS</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>US</td>\n", " <td>N195UW</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>UA</td>\n", " <td>N37422</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " <td>7</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>US</td>\n", " <td>N547UW</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>AS</td>\n", " <td>N762AS</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " carrier tailnum origin dest carrier_code\n", "0 AS N508AS PDX ANC 1\n", "1 US N195UW SEA CLT 8\n", "2 UA N37422 PDX IAH 7\n", "3 US N547UW PDX CLT 8\n", "4 AS N762AS SEA ANC 1" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_df_flights_sklearn = cat_df_flights.copy()\n", "\n", "from sklearn.preprocessing import LabelEncoder\n", "\n", "lb_make = LabelEncoder()\n", "cat_df_flights_sklearn['carrier_code'] = lb_make.fit_transform(cat_df_flights['carrier'])\n", "\n", "cat_df_flights_sklearn.head() #Results in appending a new column to df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Label encoding is pretty much intuitive and straight-forward and may give you a good performance from your learning algorithm, but it has as disadvantage that the numerical values can be misinterpreted by the algorithm. Should the carrier US (encoded to 8) be given 8x more weight than the carrier AS (encoded to 1) ?**\n", "\n", "**To solve this issue there is another popular way to encode the categories via something called one-hot encoding.**\n", "\n", "> ## 3. One-Hot encoding" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>carrier</th>\n", " <th>tailnum</th>\n", " <th>origin</th>\n", " <th>dest</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>AS</td>\n", " <td>N508AS</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>US</td>\n", " <td>N195UW</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>UA</td>\n", " <td>N37422</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>US</td>\n", " <td>N547UW</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>AS</td>\n", " <td>N762AS</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " carrier tailnum origin dest\n", "0 AS N508AS PDX ANC\n", "1 US N195UW SEA CLT\n", "2 UA N37422 PDX IAH\n", "3 US N547UW PDX CLT\n", "4 AS N762AS SEA ANC" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_df_flights.head()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>tailnum</th>\n", " <th>origin</th>\n", " <th>dest</th>\n", " <th>carrier_AA</th>\n", " <th>carrier_AS</th>\n", " <th>carrier_B6</th>\n", " <th>carrier_DL</th>\n", " <th>carrier_F9</th>\n", " <th>carrier_HA</th>\n", " <th>carrier_OO</th>\n", " <th>carrier_UA</th>\n", " <th>carrier_US</th>\n", " <th>carrier_VX</th>\n", " <th>carrier_WN</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>N508AS</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>N195UW</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>N37422</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>N547UW</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>N762AS</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " tailnum origin dest carrier_AA carrier_AS carrier_B6 carrier_DL \\\n", "0 N508AS PDX ANC 0 1 0 0 \n", "1 N195UW SEA CLT 0 0 0 0 \n", "2 N37422 PDX IAH 0 0 0 0 \n", "3 N547UW PDX CLT 0 0 0 0 \n", "4 N762AS SEA ANC 0 1 0 0 \n", "\n", " carrier_F9 carrier_HA carrier_OO carrier_UA carrier_US carrier_VX \\\n", "0 0 0 0 0 0 0 \n", "1 0 0 0 0 1 0 \n", "2 0 0 0 1 0 0 \n", "3 0 0 0 0 1 0 \n", "4 0 0 0 0 0 0 \n", "\n", " carrier_WN \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_df_flights_onehot = cat_df_flights.copy()\n", "cat_df_flights_onehot = pd.get_dummies(cat_df_flights_onehot, columns=['carrier'], prefix = ['carrier'])\n", "\n", "cat_df_flights_onehot.head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>AA</th>\n", " <th>AS</th>\n", " <th>B6</th>\n", " <th>DL</th>\n", " <th>F9</th>\n", " <th>HA</th>\n", " <th>OO</th>\n", " <th>UA</th>\n", " <th>US</th>\n", " <th>VX</th>\n", " <th>WN</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " AA AS B6 DL F9 HA OO UA US VX WN\n", "0 0 1 0 0 0 0 0 0 0 0 0\n", "1 0 0 0 0 0 0 0 0 1 0 0\n", "2 0 0 0 0 0 0 0 1 0 0 0\n", "3 0 0 0 0 0 0 0 0 1 0 0\n", "4 0 1 0 0 0 0 0 0 0 0 0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_df_flights_onehot_sklearn = cat_df_flights.copy()\n", "\n", "from sklearn.preprocessing import LabelBinarizer\n", "\n", "lb = LabelBinarizer()\n", "lb_results = lb.fit_transform(cat_df_flights_onehot_sklearn['carrier'])\n", "lb_results_df = pd.DataFrame(lb_results, columns=lb.classes_)\n", "\n", "lb_results_df.head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>carrier</th>\n", " <th>tailnum</th>\n", " <th>origin</th>\n", " <th>dest</th>\n", " <th>AA</th>\n", " <th>AS</th>\n", " <th>B6</th>\n", " <th>DL</th>\n", " <th>F9</th>\n", " <th>HA</th>\n", " <th>OO</th>\n", " <th>UA</th>\n", " <th>US</th>\n", " <th>VX</th>\n", " <th>WN</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>AS</td>\n", " <td>N508AS</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>US</td>\n", " <td>N195UW</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>UA</td>\n", " <td>N37422</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>US</td>\n", " <td>N547UW</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>AS</td>\n", " <td>N762AS</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " carrier tailnum origin dest AA AS B6 DL F9 HA OO UA US VX WN\n", "0 AS N508AS PDX ANC 0 1 0 0 0 0 0 0 0 0 0\n", "1 US N195UW SEA CLT 0 0 0 0 0 0 0 0 1 0 0\n", "2 UA N37422 PDX IAH 0 0 0 0 0 0 0 1 0 0 0\n", "3 US N547UW PDX CLT 0 0 0 0 0 0 0 0 1 0 0\n", "4 AS N762AS SEA ANC 0 1 0 0 0 0 0 0 0 0 0" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result_df = pd.concat([cat_df_flights_onehot_sklearn, lb_results_df], axis=1)\n", "\n", "result_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ## 4. Binary Encoding" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>carrier_0</th>\n", " <th>carrier_1</th>\n", " <th>carrier_2</th>\n", " <th>carrier_3</th>\n", " <th>carrier_4</th>\n", " <th>tailnum</th>\n", " <th>origin</th>\n", " <th>dest</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>N508AS</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>N195UW</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>N37422</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>N547UW</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>N762AS</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " carrier_0 carrier_1 carrier_2 carrier_3 carrier_4 tailnum origin dest\n", "0 0 0 0 0 1 N508AS PDX ANC\n", "1 0 0 0 1 0 N195UW SEA CLT\n", "2 0 0 0 1 1 N37422 PDX IAH\n", "3 0 0 0 1 0 N547UW PDX CLT\n", "4 0 0 0 0 1 N762AS SEA ANC" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_df_flights_ce = cat_df_flights.copy()\n", "\n", "import category_encoders as ce\n", "\n", "encoder = ce.BinaryEncoder(cols=['carrier'])\n", "df_binary = encoder.fit_transform(cat_df_flights_ce)\n", "\n", "df_binary.head()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>carrier_0</th>\n", " <th>carrier_1</th>\n", " <th>carrier_2</th>\n", " <th>carrier_3</th>\n", " <th>carrier_4</th>\n", " <th>tailnum</th>\n", " <th>origin</th>\n", " <th>dest</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ "Empty DataFrame\n", "Columns: [carrier_0, carrier_1, carrier_2, carrier_3, carrier_4, tailnum, origin, dest]\n", "Index: []" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_binary[df_binary['carrier_0']==1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ## 5. Backward Difference Encoding" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>col_carrier_0</th>\n", " <th>col_carrier_1</th>\n", " <th>col_carrier_2</th>\n", " <th>col_carrier_3</th>\n", " <th>col_carrier_4</th>\n", " <th>col_carrier_5</th>\n", " <th>col_carrier_6</th>\n", " <th>col_carrier_7</th>\n", " <th>col_carrier_8</th>\n", " <th>col_carrier_9</th>\n", " <th>col_carrier_10</th>\n", " <th>col_tailnum</th>\n", " <th>col_origin</th>\n", " <th>col_dest</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>-0.909091</td>\n", " <td>-0.818182</td>\n", " <td>-0.727273</td>\n", " <td>-0.636364</td>\n", " <td>-0.545455</td>\n", " <td>-0.454545</td>\n", " <td>-0.363636</td>\n", " <td>-0.272727</td>\n", " <td>-0.181818</td>\n", " <td>-0.090909</td>\n", " <td>N508AS</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1.0</td>\n", " <td>0.090909</td>\n", " <td>-0.818182</td>\n", " <td>-0.727273</td>\n", " <td>-0.636364</td>\n", " <td>-0.545455</td>\n", " <td>-0.454545</td>\n", " <td>-0.363636</td>\n", " <td>-0.272727</td>\n", " <td>-0.181818</td>\n", " <td>-0.090909</td>\n", " <td>N195UW</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1.0</td>\n", " <td>0.090909</td>\n", " <td>0.181818</td>\n", " <td>-0.727273</td>\n", " <td>-0.636364</td>\n", " <td>-0.545455</td>\n", " <td>-0.454545</td>\n", " <td>-0.363636</td>\n", " <td>-0.272727</td>\n", " <td>-0.181818</td>\n", " <td>-0.090909</td>\n", " <td>N37422</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1.0</td>\n", " <td>0.090909</td>\n", " <td>-0.818182</td>\n", " <td>-0.727273</td>\n", " <td>-0.636364</td>\n", " <td>-0.545455</td>\n", " <td>-0.454545</td>\n", " <td>-0.363636</td>\n", " <td>-0.272727</td>\n", " <td>-0.181818</td>\n", " <td>-0.090909</td>\n", " <td>N547UW</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1.0</td>\n", " <td>-0.909091</td>\n", " <td>-0.818182</td>\n", " <td>-0.727273</td>\n", " <td>-0.636364</td>\n", " <td>-0.545455</td>\n", " <td>-0.454545</td>\n", " <td>-0.363636</td>\n", " <td>-0.272727</td>\n", " <td>-0.181818</td>\n", " <td>-0.090909</td>\n", " <td>N762AS</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " col_carrier_0 col_carrier_1 col_carrier_2 col_carrier_3 col_carrier_4 \\\n", "0 1.0 -0.909091 -0.818182 -0.727273 -0.636364 \n", "1 1.0 0.090909 -0.818182 -0.727273 -0.636364 \n", "2 1.0 0.090909 0.181818 -0.727273 -0.636364 \n", "3 1.0 0.090909 -0.818182 -0.727273 -0.636364 \n", "4 1.0 -0.909091 -0.818182 -0.727273 -0.636364 \n", "\n", " col_carrier_5 col_carrier_6 col_carrier_7 col_carrier_8 col_carrier_9 \\\n", "0 -0.545455 -0.454545 -0.363636 -0.272727 -0.181818 \n", "1 -0.545455 -0.454545 -0.363636 -0.272727 -0.181818 \n", "2 -0.545455 -0.454545 -0.363636 -0.272727 -0.181818 \n", "3 -0.545455 -0.454545 -0.363636 -0.272727 -0.181818 \n", "4 -0.545455 -0.454545 -0.363636 -0.272727 -0.181818 \n", "\n", " col_carrier_10 col_tailnum col_origin col_dest \n", "0 -0.090909 N508AS PDX ANC \n", "1 -0.090909 N195UW SEA CLT \n", "2 -0.090909 N37422 PDX IAH \n", "3 -0.090909 N547UW PDX CLT \n", "4 -0.090909 N762AS SEA ANC " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder = ce.BackwardDifferenceEncoder(cols=['carrier'])\n", "df_bd = encoder.fit_transform(cat_df_flights_ce)\n", "\n", "df_bd.head()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-0.90909091, 0.09090909])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.unique(df_bd['col_carrier_1'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ## 6. Polynomial Encoding" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>col_carrier_0</th>\n", " <th>col_carrier_1</th>\n", " <th>col_carrier_2</th>\n", " <th>col_carrier_3</th>\n", " <th>col_carrier_4</th>\n", " <th>col_carrier_5</th>\n", " <th>col_carrier_6</th>\n", " <th>col_carrier_7</th>\n", " <th>col_carrier_8</th>\n", " <th>col_carrier_9</th>\n", " <th>col_carrier_10</th>\n", " <th>col_tailnum</th>\n", " <th>col_origin</th>\n", " <th>col_dest</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>-0.476731</td>\n", " <td>0.512092</td>\n", " <td>-0.458029</td>\n", " <td>0.354787</td>\n", " <td>-0.240192</td>\n", " <td>0.141610</td>\n", " <td>-0.071707</td>\n", " <td>0.030334</td>\n", " <td>-0.010141</td>\n", " <td>0.002326</td>\n", " <td>N508AS</td>\n", " <td>PDX</td>\n", " <td>ANC</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1.0</td>\n", " <td>-0.381385</td>\n", " <td>0.204837</td>\n", " <td>0.091606</td>\n", " <td>-0.354787</td>\n", " <td>0.480384</td>\n", " <td>-0.453153</td>\n", " <td>0.329853</td>\n", " <td>-0.188069</td>\n", " <td>0.081127</td>\n", " <td>-0.023265</td>\n", " <td>N195UW</td>\n", " <td>SEA</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1.0</td>\n", " <td>-0.286039</td>\n", " <td>-0.034139</td>\n", " <td>0.335888</td>\n", " <td>-0.354787</td>\n", " <td>0.080064</td>\n", " <td>0.273780</td>\n", " <td>-0.473267</td>\n", " <td>0.442872</td>\n", " <td>-0.273805</td>\n", " <td>0.104692</td>\n", " <td>N37422</td>\n", " <td>PDX</td>\n", " <td>IAH</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1.0</td>\n", " <td>-0.381385</td>\n", " <td>0.204837</td>\n", " <td>0.091606</td>\n", " <td>-0.354787</td>\n", " <td>0.480384</td>\n", " <td>-0.453153</td>\n", " <td>0.329853</td>\n", " <td>-0.188069</td>\n", " <td>0.081127</td>\n", " <td>-0.023265</td>\n", " <td>N547UW</td>\n", " <td>PDX</td>\n", " <td>CLT</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1.0</td>\n", " <td>-0.476731</td>\n", " <td>0.512092</td>\n", " <td>-0.458029</td>\n", " <td>0.354787</td>\n", " <td>-0.240192</td>\n", " <td>0.141610</td>\n", " <td>-0.071707</td>\n", " <td>0.030334</td>\n", " <td>-0.010141</td>\n", " <td>0.002326</td>\n", " <td>N762AS</td>\n", " <td>SEA</td>\n", " <td>ANC</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " col_carrier_0 col_carrier_1 col_carrier_2 col_carrier_3 col_carrier_4 \\\n", "0 1.0 -0.476731 0.512092 -0.458029 0.354787 \n", "1 1.0 -0.381385 0.204837 0.091606 -0.354787 \n", "2 1.0 -0.286039 -0.034139 0.335888 -0.354787 \n", "3 1.0 -0.381385 0.204837 0.091606 -0.354787 \n", "4 1.0 -0.476731 0.512092 -0.458029 0.354787 \n", "\n", " col_carrier_5 col_carrier_6 col_carrier_7 col_carrier_8 col_carrier_9 \\\n", "0 -0.240192 0.141610 -0.071707 0.030334 -0.010141 \n", "1 0.480384 -0.453153 0.329853 -0.188069 0.081127 \n", "2 0.080064 0.273780 -0.473267 0.442872 -0.273805 \n", "3 0.480384 -0.453153 0.329853 -0.188069 0.081127 \n", "4 -0.240192 0.141610 -0.071707 0.030334 -0.010141 \n", "\n", " col_carrier_10 col_tailnum col_origin col_dest \n", "0 0.002326 N508AS PDX ANC \n", "1 -0.023265 N195UW SEA CLT \n", "2 0.104692 N37422 PDX IAH \n", "3 -0.023265 N547UW PDX CLT \n", "4 0.002326 N762AS SEA ANC " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder = ce.PolynomialEncoder(cols=['carrier'])\n", "df_bd = encoder.fit_transform(cat_df_flights_ce)\n", "\n", "df_bd.head()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([-4.76731295e-01, -3.81385036e-01, -2.86038777e-01, -1.90692518e-01,\n", " -9.53462589e-02, -1.16688970e-17, 9.53462589e-02, 1.90692518e-01,\n", " 2.86038777e-01, 3.81385036e-01, 4.76731295e-01])" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.unique(df_bd['col_carrier_1'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ## 7. Miscellaneous Features" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>age</th>\n", " <th>start</th>\n", " <th>end</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0-20</td>\n", " <td>0</td>\n", " <td>20</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>20-40</td>\n", " <td>20</td>\n", " <td>40</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>40-60</td>\n", " <td>40</td>\n", " <td>60</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>60-80</td>\n", " <td>60</td>\n", " <td>80</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " age start end\n", "0 0-20 0 20\n", "1 20-40 20 40\n", "2 40-60 40 60\n", "3 60-80 60 80" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60','60-80']})\n", "dummy_df_age['start'], dummy_df_age['end'] = zip(*dummy_df_age['age'].map(lambda x: x.split('-')))\n", "\n", "dummy_df_age.head()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>age</th>\n", " <th>age_mean</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0-20</td>\n", " <td>10.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>20-40</td>\n", " <td>30.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>40-60</td>\n", " <td>50.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>60-80</td>\n", " <td>70.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " age age_mean\n", "0 0-20 10.0\n", "1 20-40 30.0\n", "2 40-60 50.0\n", "3 60-80 70.0" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dummy_df_age = pd.DataFrame({'age': ['0-20', '20-40', '40-60','60-80']})\n", "\n", "def split_mean(x):\n", " split_list = x.split('-')\n", " mean = (float(split_list[0])+float(split_list[1]))/2\n", " return mean\n", "\n", "dummy_df_age['age_mean'] = dummy_df_age['age'].apply(lambda x: split_mean(x))\n", "\n", "dummy_df_age.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dealing with Categorical Features in Big Data with Spark" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The first step in Spark programming is to create a SparkContext. SparkContext is required when you want to execute operations in a cluster. SparkContext tells Spark how and where to access a cluster. You'll start by importing SparkContext.\n", "\n", "- To start working with Spark DataFrames, you first have to create a SparkSession object from your SparkContext." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> #### 1st way" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#import findspark\n", "\n", "#findspark.init()\n", "\n", "#import pyspark\n", "\n", "#confspark = pyspark.SparkConf().setMaster(\"local[*]\").set(\"spark.cores.max\", \"4\").set(\"spark.executor.memory\", \"2G\").setAppName(\"--test--\")\n", "\n", "#sc = pyspark.SparkContext(conf=confspark)\n", "\n", "#sc._conf.getAll()\n", "\n", "#from pyspark.sql import SparkSession \n", "\n", "#spark = SparkSession(sc) \n", "\n", "#sc.stop()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> #### 2nd way" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import findspark\n", "findspark.init()\n", "\n", "import pyspark\n", "from pyspark.sql import SparkSession \n", "\n", "#confspark = pyspark.SparkConf().setMaster(\"local[4]\").set(\"spark.cores.max\", \"4\").set(\"spark.executor.memory\", \"2G\").setAppName(\"--test--\")\n", "#spark = SparkSession.builder.config(conf=confspark).getOrCreate()\n", "\n", "spark = SparkSession.builder.master(\"local[*]\").appName(\"--test--\").config(\"spark.some.config.option\", \"some-value\").getOrCreate()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2.3.1'" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.version" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.catalog.listTables()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+\n", "|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|\n", "+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+\n", "|2014| 1| 1| 1| 96| 235| 70| AS| N508AS| 145| PDX| ANC| 194| 1542| 0| 1|\n", "|2014| 1| 1| 4| -6| 738| -23| US| N195UW| 1830| SEA| CLT| 252| 2279| 0| 4|\n", "|2014| 1| 1| 8| 13| 548| -4| UA| N37422| 1609| PDX| IAH| 201| 1825| 0| 8|\n", "+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+\n", "only showing top 3 rows\n", "\n" ] } ], "source": [ "spark_flights = spark.read.format(\"csv\").option('header',True).load('data/flights.csv',inferSchema=True)\n", "spark_flights.show(3)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- year: integer (nullable = true)\n", " |-- month: integer (nullable = true)\n", " |-- day: integer (nullable = true)\n", " |-- dep_time: string (nullable = true)\n", " |-- dep_delay: string (nullable = true)\n", " |-- arr_time: string (nullable = true)\n", " |-- arr_delay: string (nullable = true)\n", " |-- carrier: string (nullable = true)\n", " |-- tailnum: string (nullable = true)\n", " |-- flight: integer (nullable = true)\n", " |-- origin: string (nullable = true)\n", " |-- dest: string (nullable = true)\n", " |-- air_time: string (nullable = true)\n", " |-- distance: integer (nullable = true)\n", " |-- hour: string (nullable = true)\n", " |-- minute: string (nullable = true)\n", "\n" ] } ], "source": [ "spark_flights.printSchema()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.catalog.listTables()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "spark_flights.createOrReplaceTempView(\"flights_temp\")" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Table(name='flights_temp', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark.catalog.listTables()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+\n", "|carrier|\n", "+-------+\n", "| AS|\n", "| US|\n", "| UA|\n", "| US|\n", "| AS|\n", "+-------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "carrier_df = spark_flights.select(\"carrier\")\n", "carrier_df.show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ## StringIndexer" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+-------------+\n", "|carrier|carrier_index|\n", "+-------+-------------+\n", "| AS| 0.0|\n", "| US| 6.0|\n", "| UA| 4.0|\n", "| US| 6.0|\n", "| AS| 0.0|\n", "| DL| 3.0|\n", "| UA| 4.0|\n", "+-------+-------------+\n", "only showing top 7 rows\n", "\n" ] } ], "source": [ "from pyspark.ml.feature import StringIndexer\n", "\n", "carr_indexer = StringIndexer(inputCol=\"carrier\",outputCol=\"carrier_index\")\n", "carr_indexed = carr_indexer.fit(carrier_df).transform(carrier_df)\n", "\n", "carr_indexed.show(7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ## OneHotEncoder" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+-------------+--------------+\n", "|carrier|carrier_index| carrier_vec|\n", "+-------+-------------+--------------+\n", "| AS| 0.0|(11,[0],[1.0])|\n", "| US| 6.0|(11,[6],[1.0])|\n", "| UA| 4.0|(11,[4],[1.0])|\n", "| US| 6.0|(11,[6],[1.0])|\n", "| AS| 0.0|(11,[0],[1.0])|\n", "| DL| 3.0|(11,[3],[1.0])|\n", "| UA| 4.0|(11,[4],[1.0])|\n", "+-------+-------------+--------------+\n", "only showing top 7 rows\n", "\n" ] } ], "source": [ "carrier_df_onehot = spark_flights.select(\"carrier\")\n", "\n", "from pyspark.ml.feature import OneHotEncoder, StringIndexer\n", "\n", "stringIndexer = StringIndexer(inputCol=\"carrier\", outputCol=\"carrier_index\")\n", "model = stringIndexer.fit(carrier_df_onehot)\n", "indexed = model.transform(carrier_df_onehot)\n", "\n", "encoder = OneHotEncoder(dropLast=False, inputCol=\"carrier_index\", outputCol=\"carrier_vec\")\n", "encoded = encoder.transform(indexed)\n", "\n", "encoded.show(7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ## Example" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---+--------+-------------+-------------+\n", "| id|category|categoryIndex| categoryVec|\n", "+---+--------+-------------+-------------+\n", "| 0| a| 0.0|(2,[0],[1.0])|\n", "| 1| b| 2.0| (2,[],[])|\n", "| 2| c| 1.0|(2,[1],[1.0])|\n", "| 3| a| 0.0|(2,[0],[1.0])|\n", "| 4| a| 0.0|(2,[0],[1.0])|\n", "| 5| c| 1.0|(2,[1],[1.0])|\n", "+---+--------+-------------+-------------+\n", "\n" ] } ], "source": [ "from pyspark.ml.feature import OneHotEncoder, StringIndexer\n", "\n", "df1 = spark.createDataFrame([\n", " (0, \"a\"),\n", " (1, \"b\"),\n", " (2, \"c\"),\n", " (3, \"a\"),\n", " (4, \"a\"),\n", " (5, \"c\"),\n", " (6, \"a\"),\n", " (7, \"b\"),\n", " (8, \"d\"),\n", " (9, \"d\")\n", "], [\"id\", \"category\"])\n", "\n", "df2 = spark.createDataFrame([\n", " (0, \"a\"),\n", " (1, \"b\"),\n", " (2, \"c\"),\n", " (3, \"a\"),\n", " (4, \"a\"),\n", " (5, \"c\")\n", "], [\"id\", \"category\"])\n", "\n", "df = df2\n", "\n", "stringIndexer = StringIndexer(inputCol=\"category\", outputCol=\"categoryIndex\")\n", "model = stringIndexer.fit(df)\n", "indexed = model.transform(df)\n", "\n", "encoder = OneHotEncoder(dropLast=True, inputCol=\"categoryIndex\", outputCol=\"categoryVec\")\n", "encoded = encoder.transform(indexed)\n", "encoded.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ## VectorIndexer" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Chose 351 categorical features: 645, 69, 365, 138, 101, 479, 333, 249, 0, 555, 666, 88, 170, 115, 276, 308, 5, 449, 120, 247, 614, 677, 202, 10, 56, 533, 142, 500, 340, 670, 174, 42, 417, 24, 37, 25, 257, 389, 52, 14, 504, 110, 587, 619, 196, 559, 638, 20, 421, 46, 93, 284, 228, 448, 57, 78, 29, 475, 164, 591, 646, 253, 106, 121, 84, 480, 147, 280, 61, 221, 396, 89, 133, 116, 1, 507, 312, 74, 307, 452, 6, 248, 60, 117, 678, 529, 85, 201, 220, 366, 534, 102, 334, 28, 38, 561, 392, 70, 424, 192, 21, 137, 165, 33, 92, 229, 252, 197, 361, 65, 97, 665, 583, 285, 224, 650, 615, 9, 53, 169, 593, 141, 610, 420, 109, 256, 225, 339, 77, 193, 669, 476, 642, 637, 590, 679, 96, 393, 647, 173, 13, 41, 503, 134, 73, 105, 2, 508, 311, 558, 674, 530, 586, 618, 166, 32, 34, 148, 45, 161, 279, 64, 689, 17, 149, 584, 562, 176, 423, 191, 22, 44, 59, 118, 281, 27, 641, 71, 391, 12, 445, 54, 313, 611, 144, 49, 335, 86, 672, 172, 113, 681, 219, 419, 81, 230, 362, 451, 76, 7, 39, 649, 98, 616, 477, 367, 535, 103, 140, 621, 91, 66, 251, 668, 198, 108, 278, 223, 394, 306, 135, 563, 226, 3, 505, 80, 167, 35, 473, 675, 589, 162, 531, 680, 255, 648, 112, 617, 194, 145, 48, 557, 690, 63, 640, 18, 282, 95, 310, 50, 67, 199, 673, 16, 585, 502, 338, 643, 31, 336, 613, 11, 72, 175, 446, 612, 143, 43, 250, 231, 450, 99, 363, 556, 87, 203, 671, 688, 104, 368, 588, 40, 304, 26, 258, 390, 55, 114, 171, 139, 418, 23, 8, 75, 119, 58, 667, 478, 536, 82, 620, 447, 36, 168, 146, 30, 51, 190, 19, 422, 564, 305, 107, 4, 136, 506, 79, 195, 474, 664, 532, 94, 283, 395, 332, 528, 644, 47, 15, 163, 200, 68, 62, 277, 691, 501, 90, 111, 254, 227, 337, 122, 83, 309, 560, 639, 676, 222, 592, 364, 100\n", "+-----+--------------------+--------------------+\n", "|label| features| indexed|\n", "+-----+--------------------+--------------------+\n", "| 0.0|(692,[127,128,129...|(692,[127,128,129...|\n", "| 1.0|(692,[158,159,160...|(692,[158,159,160...|\n", "| 1.0|(692,[124,125,126...|(692,[124,125,126...|\n", "| 1.0|(692,[152,153,154...|(692,[152,153,154...|\n", "| 1.0|(692,[151,152,153...|(692,[151,152,153...|\n", "| 0.0|(692,[129,130,131...|(692,[129,130,131...|\n", "| 1.0|(692,[158,159,160...|(692,[158,159,160...|\n", "| 1.0|(692,[99,100,101,...|(692,[99,100,101,...|\n", "| 0.0|(692,[154,155,156...|(692,[154,155,156...|\n", "| 0.0|(692,[127,128,129...|(692,[127,128,129...|\n", "| 1.0|(692,[154,155,156...|(692,[154,155,156...|\n", "| 0.0|(692,[153,154,155...|(692,[153,154,155...|\n", "| 0.0|(692,[151,152,153...|(692,[151,152,153...|\n", "| 1.0|(692,[129,130,131...|(692,[129,130,131...|\n", "| 0.0|(692,[154,155,156...|(692,[154,155,156...|\n", "| 1.0|(692,[150,151,152...|(692,[150,151,152...|\n", "| 0.0|(692,[124,125,126...|(692,[124,125,126...|\n", "| 0.0|(692,[152,153,154...|(692,[152,153,154...|\n", "| 1.0|(692,[97,98,99,12...|(692,[97,98,99,12...|\n", "| 1.0|(692,[124,125,126...|(692,[124,125,126...|\n", "+-----+--------------------+--------------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "from pyspark.ml.feature import VectorIndexer\n", "\n", "data = spark.read.format(\"libsvm\").load(\"data/sample_libsvm_data.txt\")\n", "\n", "indexer = VectorIndexer(inputCol=\"features\", outputCol=\"indexed\", maxCategories=10)\n", "indexerModel = indexer.fit(data)\n", "\n", "categoricalFeatures = indexerModel.categoryMaps\n", "print(\"Chose %d categorical features: %s\" %\n", " (len(categoricalFeatures), \", \".join(str(k) for k in categoricalFeatures.keys())))\n", "\n", "# Create new column \"indexed\" with categorical values transformed to indices\n", "indexedData = indexerModel.transform(data)\n", "indexedData.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> https://www.datacamp.com/community/tutorials/categorical-data\n", "\n", "> http://contrib.scikit-learn.org/categorical-encoding/_modules/index.html" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }