{ "cells": [ { "cell_type": "markdown", "metadata": { "_uuid": "3d7b11cc286f4fd4845bc94715ce4f66584dc5c4" }, "source": [ "# Import Packages and Data " ] }, { "cell_type": "markdown", "metadata": { "_uuid": "881e5f8c8601294f08b1f2b5686455ce9e3050ed" }, "source": [ "* **Importing packages for preprocssing, visualization and modeling **" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5" }, "outputs": [], "source": [ "import numpy as np # linear algebra\n", "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", "\n", "#Import Packages for Modeling and Performance Evaluation\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "#Data Visualization Packages\n", "from matplotlib import style\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "%matplotlib inline\n", "from collections import Counter\n", "\n", "#Import Packages for Feature Selection\n", "##Univariate feature selection\n", "from sklearn.feature_selection import SelectKBest,f_classif\n", "from sklearn.feature_selection import chi2\n", "\n", "#Import package for encoding categorical features \n", "import category_encoders as ce\n", "\n", "# Import package calculate accuracy\n", "from sklearn import metrics\n", "\n", "#Import package for splitting the dataset\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "bb0dfe490776b3629626e59707b475fb08283665" }, "source": [ "* **Importing Training and Test Data**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since its an old competition, it is not possible to evaluate the prediction using the 'test.csv' dataset provided by amazon. Therefore, this problem is solved by only importing the train.csv file and then spltting the dataset 50/50 for training and testing." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0", "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a" }, "outputs": [], "source": [ "train = pd.read_csv('../input/train.csv')\n", "X1=train.drop([\"ACTION\"],axis = 1)\n", "y1=train['ACTION']\n", "train_X, test_X, train_Y, test_Y = train_test_split(X1, y1, test_size=0.5, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Inspection" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "* **Quick Inspection of Train and Test Data**\n", "\n", "A quick look at the datasets shows that all the features are categorical and none of them has any missing values." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "70db0bbb52fe877af6e719f631adc1908cbfb7be" }, "outputs": [], "source": [ "train_X.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_X.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "f318b6b43c4d65050d5819044c733538c6fca9b9" }, "outputs": [], "source": [ "train_X.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "cb046f3b5dacb9933c5bf034f8303191df7e6310" }, "outputs": [], "source": [ "test_X.info()" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "9c6ed4e0ae9f336fc93a66712650e8e5be6419cd" }, "source": [ "# Feature Selection\n", "\n", "* [Feature selection algorithms](https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/) help to determine which features significantly influence the output variable and then eliminate the remaining features from consideration. This step helps to reduce complexity of a model by reducing the data and enables machine learning algortihms to train faster. \n", "\n", "* The scikit module provides [several ways](https://scikit-learn.org/stable/modules/feature_selection.html) for identifying significant features either by evaluating the statistical correlation with the outcome variable (e.g. univariate feature selection method ) or by measuring the usefulness of a subset of feature by actually training a machine learning model on it (e.g. recursive feature elimination).\n", "\n", "* Considering the size of this particular dataset and the computational time usually required for recursive feature elimination, the universal feature seleciton method is adopted for the first trial. " ] }, { "cell_type": "markdown", "metadata": { "_uuid": "78178b29972cfdccafb1340e09e9fea5d568927b" }, "source": [ "* **Applying Univariate Feature Selection Methods to Train data**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "54dfd984852487b7b80edda8c323dd687bd06a55" }, "outputs": [], "source": [ "selector = SelectKBest(chi2, k=6)\n", "X_new=selector.fit_transform(train_X, train_Y)\n", "mask = selector.get_support(indices=True)\n", "colname = train_X.columns[mask]\n", "printf('the features to drop are -',colname)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "1a2820d77c2d973886752f02c120516a2492d971" }, "source": [ "Now that the top 6 significant features have been identified, the remaining features will be dropped from both the training and test dataset. To minimize coding efforts, lets concatenate train and test data to a single dataset prior preprocessing. " ] }, { "cell_type": "markdown", "metadata": { "_uuid": "dfc811c979f504b0b4899cac9e591e902279a8d6" }, "source": [ "& **Concatenating Train and Test Data into a Single Dataset**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "b9e526be8dd007d2e8d8c3bcd81667d4f3d5c386" }, "outputs": [], "source": [ "dataset = pd.concat([train_X, test_X])\n", "dataset.head()" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "d07a518cf923894c6b6e766fb2c3d319381e09c9" }, "source": [ "Now, lets drop the features 'RESOURCE', 'ROLE_ROLLUP_1',and 'ROLE_DEPTNAME' from the dataset. 
" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "49f35c2dd4dd3566f834b84fdd85c3d582cf60fb" }, "source": [ "**Dropping Features From Dataset**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "8b812a2e91d86091483f41cd1fdfc3ac0f406b6c", "scrolled": false }, "outputs": [], "source": [ "dataset = dataset.drop(columns=['RESOURCE','ROLE_ROLLUP_2','ROLE_DEPTNAME'],axis=1)\n", "dataset.head(5)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "2f5d46a22c1a64d3040e34b72b38b36d5f760aeb" }, "source": [ "# Feature Encoding\n", "\n", "* All the features listed in the dataset are categorical and therefore need to be encoded prior applying any machine learning model to the dataset. Categorical encoding transforms a categorical feature into one or multiple numeric features. [In traditional encoding approaches], this is accomplished by assigning an integar/binary number to each category. For example, one hot coding works by creating a sepearate column for every item (i.e. unique value or type) a categorical variable, and assigns either a '1' or '0' depending on whether the value of the category variable is that item or not.\n", "\n", "* However, this feature expansion during encoding can create serious memory problems if the data set has high cardinality features (i.e. categorical variable with large number of items/unique values) as one hot coding creates as many columns as the cardinalities (items/unique values) in the categorical variable. Feature hashing can be useful as an alterntative encoding approach under such circumstances. \n", "\n", "* The core idea behind feature hashing is to map data of arbitrary size to data of fixed size by using a hash function,  instead of maintaining a one-to-one mapping of categorical feature values to new individual columns in the dataset. A detailed discussion of the techqnues can be found [in this link]((https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159)) to get a better understanding of the encodning methods and the benifits of one over the other.\n", "\n", "* To decide what feature encoding to apply to process the categorical features for modeling, lets check the number of unique elements in the dataset. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "**Calculating the Number of Unique Elements in Each Column**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "253a60c6442f14cba1c7c0b8b508eea4e304d2c8" }, "outputs": [], "source": [ "dataset.nunique()\n", "#dict(Counter(train.RESOURCE))#count the frequancy of each category in RESOURCES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of unique values for each feature suggests that this is a dataset with high cardinality features. Therefore, feature hashing was chosen to process the categorical features prior modeling. 
" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "4fbde5dce58b9e2f98729de4f25604830395be70" }, "source": [ "* **Implementing Feature Hashing**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "485210431fec57389f52b40d665e197a5acf6730", "scrolled": false }, "outputs": [], "source": [ "train_len = len(train_X)\n", "train_X = dataset[:train_len]\n", "test_X = dataset[train_len:]\n", "\n", "he = ce.HashingEncoder(cols = ['MGR_ID',\"ROLE_CODE\",\"ROLE_FAMILY\",\"ROLE_FAMILY_DESC\",\"ROLE_ROLLUP_1\",\"ROLE_TITLE\"], n_components=8)\n", "train_he = he.fit_transform(train_X)\n", "test_he = he.transform(test_X)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "48b72fd9acad577a098cb10088c4934d2b698bb3" }, "source": [ "# 6. Modeling and Evaluation\n", "\n", "* Four different classification models (i.e. DecisionTree Classifier, Random Forest Classifier, Logical Regressor and KNN claissifier) are applied and then the best model was identified based on the cross validation score.\n", "\n", "* For each model, the cross validation was performed by splitting the data (80%/20% for train/test split), fitting a model and computing the score 5 consecutive times (with different splits each time).\n", "\n", "* The average of the accuracy for the 5 trials were then reported as the CV score for each model." ] }, { "cell_type": "markdown", "metadata": { "_uuid": "8aae0544068443f36ab0977e90826bee1f89909f" }, "source": [ "* **Defining the Input Features for Training and Prediction**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "f28cc5558b243dfc187374bf97340a5f0a6ae618" }, "outputs": [], "source": [ "train_X=train_he\n", "test_X=test_he\n", "train_X.head()" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "64867bb30789b9614bd0db9f878a13fab553f889" }, "source": [ "* **Defining Machine Learning Models**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "056f64bf71a88434f626878e570da5d699f76187" }, "outputs": [], "source": [ "# Establishing model using train data\n", "DTmodel=DecisionTreeClassifier(random_state=1)\n", "RFmodel=RandomForestClassifier(random_state=1)\n", "logreg = LogisticRegression()\n", "knn = KNeighborsClassifier(n_neighbors = 3) " ] }, { "cell_type": "markdown", "metadata": { "_uuid": "e1b5c63b30ef7693ab0d616fe3ebf43f09b24870" }, "source": [ "**Training the Machine Learning Models**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "9ff3f46990af396f1e28665c76fede806ad46ce4" }, "outputs": [], "source": [ "#Fitting models\n", "DTmodel.fit(train_X,train_Y)\n", "RFmodel.fit(train_X,train_Y)\n", "logreg.fit(train_X, train_Y)\n", "knn.fit(train_X, train_Y)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "ccf00a693d6fc2c9facc5c1073c356b0733e9296" }, "source": [ "**Function to Calculate Cross Validation Score **" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "cfa6665b2efa6dff301ff7ef245d69a92c35d4fc" }, "outputs": [], "source": [ "def cross_val(X,x,y):\n", "\n", " scores = cross_val_score(X, x, y, cv=10, scoring = \"roc_auc\")\n", " #print(\"Scores:\", scores)\n", " #print(\"Mean:\", scores.mean())\n", " #print(\"Standard Deviation:\", scores.std())\n", " return scores.mean()" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "d2219a352ba3261c6581d9698002b24d3dde230f" }, "source": [ "**Using the Models for Prediction on Test Data**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": 
"bf5e32f709dd31519ea783f7881fee23a3091a64" }, "outputs": [], "source": [ "DTpred=DTmodel.predict_proba(test_X).astype(int)\n", "RFpred=RFmodel.predict_proba(test_X).astype(int)\n", "logreg_pred = logreg.predict_proba(test_X).astype(int)\n", "knn_pred = knn.predict_proba(test_X).astype(int)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "5cd50a35206fee96aaab857ae1c823a03754f15a" }, "source": [ "**Calculate Cross Validation Scores Based on ROC_AUC ** " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "e245752d5e7d23928b3ad2e6f22a110f15dfd8c7" }, "outputs": [], "source": [ "DT_score =round(cross_val(DTmodel,train_X, train_Y) * 100, 2)\n", "RF_score = round(cross_val(RFmodel,train_X, train_Y) * 100, 2)\n", "logreg_score = round(cross_val(logreg,train_X, train_Y) * 100, 2)\n", "knn_score = round(cross_val(knn,train_X, train_Y) * 100, 2)" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "46a056e7ff032f6240ed987fac2559e6d438209e" }, "source": [ "**Tabulating Scores for Each Model **" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "de7b279f0b61cac0fc213c4d762e79d550cb6cc9" }, "outputs": [], "source": [ "results = pd.DataFrame({'Model': ['DecisionTreeeClaissifier', 'KNN', 'Logistic Regression', \n", " 'Random ForestClassifier'],'Score': [DT_score , knn_score, logreg_score, \n", " RF_score]})\n", "results = results.sort_values(['Score'], ascending=[False])\n", "results.head()" ] }, { "cell_type": "markdown", "metadata": { "_uuid": "cd4d7a7cc63812884b7ad62f150231e4f20e6ecc" }, "source": [ "# Compute ROC Curve with Cross Validation\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_uuid": "01c0c1fbe4bee1a8775977c4d247d201746b9331", "scrolled": true }, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.metrics import roc_curve, auc\n", "from sklearn.model_selection import train_test_split\n", "from scipy import interp\n", "x=test_X\n", "y=test_Y\n", "classifier =[]\n", "classifier.append(RFmodel)\n", "classifier.append(DTmodel)\n", "cv = StratifiedKFold(n_splits=5)\n", "tprs = []\n", "aucs = []\n", "mean_fpr = np.linspace(0,1,100)\n", "\n", "i = 0\n", "for train,test in cv.split(x,y):\n", " yscore = classifier[0].fit(x.iloc[train],y.iloc[train]).predict_proba(x.iloc[test])\n", " fpr, tpr, t = roc_curve(y.iloc[test], yscore[:, 1])\n", " tprs.append(interp(mean_fpr, fpr, tpr))\n", " roc_auc = auc(fpr, tpr)\n", " aucs.append(roc_auc)\n", " plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))\n", " i= i+1\n", "plt.plot([0,1],[0,1],linestyle = '--',lw = 2,color = 'black')\n", "mean_tpr = np.mean(tprs, axis=0)\n", "mean_auc = auc(mean_fpr, mean_tpr)\n", "plt.plot(mean_fpr, mean_tpr, color='blue',\n", " label=r'Mean ROC (AUC = %0.2f )' % (mean_auc),lw=2, alpha=1)\n", "\n", "plt.xlabel('False Positive Rate')\n", "plt.ylabel('True Positive Rate')\n", "plt.title('ROC')\n", "plt.legend(loc=\"lower right\")\n", "plt.text(0.32,0.7,'More accurate area',fontsize = 12)\n", "plt.text(0.63,0.4,'Less accurate area',fontsize = 12)\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 1 }