{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "\n", "#### Author: Pisarev Ivan, ODS Slack: pisarev_i\n", "\n", "## Predict attrition of employees" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Feature and data explanation\n", "\n", "> *People are definitely a company's greatest asset. \n", " It doesn't make any difference whether the product is cars or cosmetics. \n", " A company is only as good as the people it keeps.* \n", "> ***Mary Kay Ash*** \n", "\n", "People are the key intangible asset of any organization. In today’s dynamic and continuously changing business world, it is human assets, not fixed or tangible ones, that differentiate an organization from its competitors. In the knowledge economy, Human Resources (HR), or Human Assets, are the single most important factor distinguishing one organization from another.\n", "\n", "Employees who leave an organization can be replaced physically, but their skill sets and knowledge cannot be replaced exactly, because each individual has different skills and experience. Employee efficiency and talent determine the pace and growth of an organization.\n", "\n", "There are two important business issues:\n", " - Uncover the factors that lead to employee attrition\n", " - Predict which valuable employees are likely to leave\n", "\n", "To answer these questions, we will analyze the IBM HR Analytics Employee Attrition & Performance dataset.\n", "\n", "This is a fictional data set created by IBM data scientists. 
\n", "List of columns with their types:\n", " - **Age** - Numeric Discrete\n", " - **Attrition** - Categorical\n", " - **BusinessTravel** - Categorical\n", " - **DailyRate** - Numeric Discrete\n", " - **Department** - Categorical\n", " - **DistanceFromHome** - Numeric Discrete\n", " - **Education** - Categorical (1: 'Below College', 2: 'College', 3: 'Bachelor', 4: 'Master', 5: 'Doctor')\n", " - **EducationField** - Categorical\n", " - **EmployeeCount** - Numeric Discrete\n", " - **EmployeeNumber** - Numeric Discrete\n", " - **EnvironmentSatisfaction** - Categorical (1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High')\n", " - **Gender** - Categorical\n", " - **HourlyRate** - Numeric Discrete\n", " - **JobInvolvement** - Categorical (1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High')\n", " - **JobLevel** - Categorical\n", " - **JobRole** - Categorical\n", " - **JobSatisfaction** - Categorical (1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High')\n", " - **MaritalStatus** - Categorical\n", " - **MonthlyIncome** - Numeric Discrete\n", " - **MonthlyRate** - Numeric Discrete\n", " - **NumCompaniesWorked** - Numeric Discrete\n", " - **Over18** - Categorical\n", " - **OverTime** - Categorical\n", " - **PercentSalaryHike** - Numeric Discrete\n", " - **PerformanceRating** - Categorical (1: 'Low', 2: 'Good', 3: 'Excellent', 4: 'Outstanding')\n", " - **RelationshipSatisfaction** - Categorical (1: 'Low', 2: 'Medium', 3: 'High', 4: 'Very High')\n", " - **StandardHours** - Numeric Discrete\n", " - **StockOptionLevel** - Categorical\n", " - **TotalWorkingYears** - Numeric Discrete\n", " - **TrainingTimesLastYear** - Numeric Discrete\n", " - **WorkLifeBalance** - Categorical (1: 'Bad', 2: 'Good', 3: 'Better', 4: 'Best')\n", " - **YearsAtCompany** - Numeric Discrete\n", " - **YearsInCurrentRole** - Numeric Discrete\n", " - **YearsSinceLastPromotion** - Numeric Discrete\n", " - **YearsWithCurrManager** - Numeric Discrete\n", "\n", "The target feature **Attrition** has two possible values: 
'Yes' and 'No', so our task is binary classification. \n", "We also want to understand which features matter most. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Primary data analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "color = sns.color_palette('tab20')\n", "# sns.set_style applies the whitegrid look; the 'seaborn-whitegrid' matplotlib style name was renamed in matplotlib 3.6\n", "sns.set_style(\"whitegrid\")\n", "plt.rcParams['figure.figsize'] = (10,8)\n", "sns.palplot(color)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load the data, look at the first rows, and check types and missing values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfIBM = pd.read_csv('./data/WA_Fn-UseC_-HR-Employee-Attrition.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfIBM.head().T" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfIBM.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are **no missing values** in the data. \n", "Let's check the distribution of feature values." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfIBM.describe(include=['int64']).T" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfIBM.describe(include=['object']).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the number of unique values for each feature." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat([pd.DataFrame({'Unique Values': dfIBM.nunique().sort_values()}),\n", "           pd.DataFrame({'Type': dfIBM.dtypes})], axis=1, sort=False).sort_values(by='Unique Values')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three columns with **constant** values. These columns carry no information, so we can **remove them**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfIBM.drop(columns=['EmployeeCount', 'StandardHours', 'Over18'], inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the class balance of the target feature." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "round(dfIBM['Attrition'].value_counts(normalize=True)*100, 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The target classes are **imbalanced**: there are many more 'No' values than 'Yes'. \n", "Let's convert the target feature to numeric." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfIBM.Attrition = dfIBM.Attrition.map({'Yes': 1, 'No': 0})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Column `EmployeeNumber` has all unique values (1470). It is presumably an employee identification number. Let's check that it is unrelated to the target feature." 
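, "\n", "
A minimal numeric complement to a visual check, assuming a hypothetical frame `demo` that stands in for `dfIBM` and uses random labels: bin the identifier into equal-sized groups, compare the target rate per bin, and compute the Pearson correlation between identifier and target. Similar rates across bins and a correlation near zero would suggest that the identifier carries no signal.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfIBM: an id-like column and a random binary target
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'EmployeeNumber': np.arange(1, 1001),
    'Attrition': rng.integers(0, 2, size=1000),
})

# Split the id range into 5 equal-sized bins and compare the target rate per bin
id_bin = pd.qcut(demo['EmployeeNumber'], q=5, labels=False)
rate_by_bin = demo.groupby(id_bin)['Attrition'].mean()
print(rate_by_bin)

# A value near 0 suggests no linear relation between the id and the target
id_target_corr = demo['EmployeeNumber'].corr(demo['Attrition'])
print('Pearson r:', round(id_target_corr, 3))
```

On the real data the same steps would be applied to `dfIBM['EmployeeNumber']` and `dfIBM['Attrition']` once the target has been mapped to 0/1.
"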
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.rcParams['figure.figsize'] = (14,3)\n", "plt.plot(dfIBM.EmployeeNumber, dfIBM.Attrition, 'ro', alpha=0.2);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, there is no leak in the data and `Attrition` is not sorted by `EmployeeNumber`. We can **remove this column** from the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfIBM.drop(columns=['EmployeeNumber'], inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, looking at the variable names and their values, we can classify all variables into 3 types.\n", "\n", "| Name | Unique Values | Type |\n", "|---|---|---|\n", "| **Categorical, order is not meaningful** | | |\n", "| BusinessTravel | 3 | object |\n", "| Department | 3 | object |\n", "| EducationField | 6 | object |\n", "| EmployeeNumber | 1470 | int64 |\n", "| Gender | 2 | object |\n", "| JobRole | 9 | object |\n", "| MaritalStatus | 3 | object |\n", "| OverTime | 2 | object |\n", "| **Categorical, order is meaningful but the distance between values is not** | | |\n", "| Education | 5 | int64 |\n", "| EnvironmentSatisfaction | 4 | int64 |\n", "| JobInvolvement | 4 | int64 |\n", "| JobLevel | 5 | int64 |\n", "| JobSatisfaction | 4 | int64 |\n", "| PerformanceRating | 2 | int64 |\n", "| RelationshipSatisfaction | 4 | int64 |\n", "| StockOptionLevel | 4 | int64 |\n", "| WorkLifeBalance | 4 | int64 |\n", "| **Numeric, discrete** | | |\n", "| Age | 43 | int64 |\n", "| DailyRate | 886 | int64 |\n", "| DistanceFromHome | 29 | int64 |\n", "| HourlyRate | 71 | int64 |\n", "| MonthlyIncome | 1349 | int64 |\n", "| MonthlyRate | 1427 | int64 |\n", "| NumCompaniesWorked | 10 | int64 |\n", "| PercentSalaryHike | 15 | int64 |\n", "| TotalWorkingYears | 40 | int64 |\n", "| TrainingTimesLastYear | 7 | int64 |\n", "| YearsAtCompany | 37 | int64 |\n", "| YearsInCurrentRole | 19 | int64 |\n", "| YearsSinceLastPromotion | 16 | int64 |\n", "| YearsWithCurrManager | 18 | int64 |\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Categorical_without_order = ['BusinessTravel', 'Department', 'EducationField',\n", "                             'Gender', 'JobRole', 'MaritalStatus', 'OverTime']\n", "\n", "Categorical_with_order = ['Education', 'EnvironmentSatisfaction', 'JobInvolvement',\n", "                          'JobLevel', 'JobSatisfaction', 'PerformanceRating',\n", "                          'RelationshipSatisfaction', 'StockOptionLevel', 'WorkLifeBalance']\n", "\n", "Numeric = ['Age', 'DailyRate', 'DistanceFromHome', 'HourlyRate',\n", "           'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',\n", "           'PercentSalaryHike', 'TotalWorkingYears', 'TrainingTimesLastYear',\n", "           'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',\n", "           'YearsWithCurrManager']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Primary visual data analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the distribution of each feature and how the target variable depends on it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dictCatNames = {'Education': ['Below College','College','Bachelor','Master','Doctor'],\n", "                'EnvironmentSatisfaction': ['Low','Medium','High','Very High'],\n", "                'JobInvolvement': ['Low','Medium','High','Very High'],\n", "                'JobSatisfaction': ['Low','Medium','High','Very High'],\n", "                'PerformanceRating': ['Low','Good','Excellent','Outstanding'],\n", "                'RelationshipSatisfaction':['Low','Medium','High','Very High'],\n", "                'WorkLifeBalance': ['Bad','Good','Better','Best']}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cat_distribution_target_proportion(column):\n", "    fig, axes = plt.subplots(1, 2, figsize=(15, 6))\n", "    fig.suptitle(column, fontsize=16)\n", "\n", "    # Keyword arguments are required by recent seaborn versions\n", "    sns.countplot(x=column, data=dfIBM, ax=axes[0])\n", "    axes[0].set_title(column + ' distribution')\n", "\n", "    sns.barplot(x=column, y='Attrition', data=dfIBM, ax=axes[1])\n", "    axes[1].set_title('Attrition rate by ' + column)\n", "\n", "    for ax in axes:\n", "        if column in dictCatNames:\n", "            # Label only the levels present in the data (PerformanceRating has just 2 distinct values)\n", "            levels = sorted(dfIBM[column].unique())\n", "            ax.set_xticklabels([dictCatNames[column][v - 1] for v in levels])\n", "        plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right', rotation_mode='anchor')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for col in (Categorical_without_order + Categorical_with_order):\n", "    cat_distribution_target_proportion(col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we can see:\n", " - `Attrition` is higher when `BusinessTravel` is frequent\n", " - `Department`, `Gender`, `Education` and `PerformanceRating` have little effect on `Attrition`\n", " - `Attrition` is higher when `MaritalStatus` is Single\n", " - Some `JobRole` values (Sales Representative, Human Resources, Laboratory Technician) have a high level of `Attrition`\n", " - `Attrition` is higher if an employee has `OverTime`\n", " - The lower `EnvironmentSatisfaction`, `JobInvolvement`, `JobLevel`, `JobSatisfaction`, `RelationshipSatisfaction` and `WorkLifeBalance` are, the higher `Attrition` is\n", "\n", "What about the distribution of numeric features and their relationship with the target?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def num_distribution_target_impact(column):\n", "    # One row with two panels, so axes[0] and axes[1] are single Axes objects\n", "    fig, axes = plt.subplots(1, 2, figsize=(15, 6))\n", "    fig.suptitle(column, fontsize=16)\n", "\n", "    # histplot replaces the deprecated distplot (seaborn >= 0.11)\n", "    sns.histplot(dfIBM[column], kde=False, ax=axes[0])\n", "    axes[0].set_title(column + ' distribution')\n", "\n", "    sns.boxplot(x='Attrition', y=column, data=dfIBM, ax=axes[1])\n", "    axes[1].set_title('Relationship of Attrition with ' + column)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for n in Numeric:\n", "    num_distribution_target_impact(n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Employees who left tend to have:\n", " - Lower `Age`\n", " - Lower `MonthlyIncome`\n", " - Lower `TotalWorkingYears`\n", " - Lower `YearsAtCompany`\n", " - Lower `YearsInCurrentRole`\n", " - Lower `YearsWithCurrManager` \n", "\n", "`DailyRate`, `DistanceFromHome`, `HourlyRate`, `MonthlyRate`, `NumCompaniesWorked`, `PercentSalaryHike`, `TrainingTimesLastYear` and `YearsSinceLastPromotion` show only a weak relationship with `Attrition`. \n", "\n", "Let's look at the Pearson correlation matrix for the ordinal and numeric columns." 
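, "\n", "
To read the strongest relationships off a correlation matrix without scanning a heatmap by eye, one option is to list the feature pairs above a cutoff. The sketch below runs on a hypothetical toy frame; the helper name `top_correlated_pairs` and the 0.7 threshold are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.7):
    '''Return feature pairs whose absolute Pearson correlation is at least threshold.'''
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Comparisons with NaN are False, so masked cells never pass the filter
    return pairs[pairs >= threshold].sort_values(ascending=False)

# Hypothetical toy frame: 'b' is almost a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.1, size=200),
                    'c': rng.normal(size=200)})
high_pairs = top_correlated_pairs(toy)
print(high_pairs)
```

In the notebook the same helper could be called on `dfIBM[Categorical_with_order + Numeric]`; the cutoff is a judgment call and should be tuned to how aggressively correlated features are to be flagged.
"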
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.rcParams['figure.figsize'] = (15, 10)\n", "sns.heatmap(data=dfIBM[Categorical_with_order+Numeric].corr(),\n", "            annot=True,fmt='.2f',linewidths=.5,cmap='RdGy_r');\n", "plt.title('Pearson correlation for numerical features', fontsize=16);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see:\n", " - High correlation between `JobLevel`, `MonthlyIncome`, and `TotalWorkingYears`\n", " - High correlation between `YearsAtCompany`, `YearsInCurrentRole`, `YearsSinceLastPromotion`, `YearsWithCurrManager`\n", " - `PerformanceRating` correlates with `PercentSalaryHike`\n", " - `Age` correlates with `TotalWorkingYears`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For categorical features, we want to check whether `MaritalStatus` changes the way `BusinessTravel`, `OverTime` and `WorkLifeBalance` relate to `Attrition`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feat = ['BusinessTravel', 'OverTime', 'WorkLifeBalance']\n", "fig , axes = plt.subplots(3,1,figsize = (15,18))\n", "for i in range(len(feat)):\n", "    sns.barplot(x=feat[i], y='Attrition', hue='MaritalStatus', data=dfIBM, ax=axes[i]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, our assumption that `MaritalStatus` strengthens the influence of `BusinessTravel`, `OverTime` and `WorkLifeBalance` on `Attrition` is not confirmed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Insights and found dependencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are all our observations about the influence of features on the target. \n", "Attrition occurs more often when:\n", " - `BusinessTravel` is frequent\n", " - `MaritalStatus` is Single\n", " - `JobRole` is Sales Representative, Human Resources or Laboratory Technician\n", " - `OverTime` is Yes\n", " - `EnvironmentSatisfaction`, `JobInvolvement`, `JobLevel`, `JobSatisfaction`, `RelationshipSatisfaction`, `WorkLifeBalance` are lower\n", " - `Age`, `MonthlyIncome`, `TotalWorkingYears`, `YearsAtCompany`, `YearsInCurrentRole`, `YearsWithCurrManager` are lower \n", "\n", "We will check this when we build our model. We will also verify that the other features have a low impact on the target. \n", "\n", "Some features are highly correlated with each other:\n", " - `JobLevel`, `MonthlyIncome`, `TotalWorkingYears` and `Age`\n", " - `PerformanceRating` and `PercentSalaryHike`\n", " - `YearsAtCompany`, `YearsInCurrentRole`, `YearsSinceLastPromotion`, `YearsWithCurrManager` \n", "\n", "We will either exclude some of them from the training set (in the case of LogisticRegression) or compensate for the correlation with regularization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Metrics selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a binary classification task with imbalanced classes, so accuracy is not a suitable metric. \n", "We can assume that in this task Recall is more important than Precision (we need to find all valuable employees who want to leave), but many false positives are harmful too (we also need to uncover the factors that lead to attrition, and many false positives would mislead us when choosing these factors). \n", "We will use ROC-AUC, since it does not depend on a decision threshold and behaves well under class imbalance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Model selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will try the following models:\n", " - LogisticRegression\n", " - RandomForestClassifier\n", "\n", "Both models address our two goals: **binary classification** and identifying the **significance of features**. \n", "We will use LogisticRegression as a baseline model. \n", "We expect better results from Random Forest, given the correlated features in our data and the possibly nonlinear dependence of the target on the features. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. Data preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use Logistic Regression we need to convert our categorical features to dummy variables. First, we map the features that have only 2 unique values to 0/1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfIBM.Gender = dfIBM.Gender.map({'Male': 1, 'Female': 0})\n", "dfIBM.OverTime = dfIBM.OverTime.map({'Yes': 1, 'No': 0})\n", "Categorical_binary = ['Gender', 'OverTime']\n", "Categorical_without_order = ['BusinessTravel', 'Department', 'EducationField',\n", "                             'JobRole', 'MaritalStatus']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# dtype=int keeps the dummy columns numeric regardless of the pandas version\n", "dfOHE = pd.get_dummies(dfIBM[Categorical_without_order], dtype=int)\n", "Categorical_OHE = dfOHE.columns\n", "dfIBMFull = pd.concat([dfIBM, dfOHE], axis=1)\n", "# create target\n", "y = dfIBMFull['Attrition']\n", "dfIBMFull.drop(columns=['Attrition'], inplace=True)\n", "dfIBMFull.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's divide the data into training and hold-out sets. The target classes are imbalanced, so we need to stratify the split by the target. We will use only 15% of the data for the hold-out set because the dataset is very small." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y.value_counts(normalize=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "X_train, X_holdout, y_train, y_holdout = train_test_split(dfIBMFull, y,\n", "                                                          test_size=0.15,\n", "                                                          random_state=2018,\n", "                                                          shuffle=True,\n", "                                                          stratify=y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_train.value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can scale all numerical features for use with Logistic Regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "sc = StandardScaler()\n", "X_train_sc = pd.DataFrame(sc.fit_transform(X_train[Categorical_with_order+Numeric]),\n", "                          columns=Categorical_with_order+Numeric, index=X_train.index)\n", "X_holdout_sc = pd.DataFrame(sc.transform(X_holdout[Categorical_with_order+Numeric]),\n", "                            columns=Categorical_with_order+Numeric, index=X_holdout.index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8. Cross-validation and adjustment of model hyperparameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can split the training set for cross-validation. As with the train/hold-out split, the folds need to be stratified. We will use 5 folds (the dataset is still very small)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 8.1 Logistic Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_LR = pd.concat([X_train_sc, X_train[list(Categorical_OHE)+list(Categorical_binary)]], axis=1)\n", "X_train_LR.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_holdout_LR = pd.concat([X_holdout_sc, X_holdout[list(Categorical_OHE)+list(Categorical_binary)]], axis=1)\n", "X_holdout_LR.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "skf = StratifiedKFold(n_splits=5, random_state=2018, shuffle=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's run Logistic Regression with default regularization (using the liblinear solver, which supports both penalties searched below)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# liblinear supports both the l1 and l2 penalties used in the searches below\n", "lr = LogisticRegression(solver='liblinear', random_state=2018)\n", "\n", "cv_scores = cross_val_score(lr, X_train_LR, y_train, cv=skf, scoring='roc_auc', n_jobs=-1)\n", "\n", "# Let's check variation in Folds\n", "plt.rcParams['figure.figsize'] = (10,5)\n", "plt.axhline(y=cv_scores.mean(), linewidth=2, color='b', linestyle='dashed');\n", "plt.text(x=0, y=cv_scores.mean()+0.01, s='mean score ='+str(round(cv_scores.mean(),6)));\n", "plt.scatter(range(5), cv_scores, s=100, c=(cv_scores>=cv_scores.mean()),\n", "            edgecolor='k', cmap='autumn', linewidth=1.5);\n", "print('Mean score =', cv_scores.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to find a better regularization strength (it can be useful given the multicollinearity). 
We will search over both the L2 (squared) and L1 (absolute) penalties." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "penalty = ['l1', 'l2']\n", "C = np.logspace(-1, 1, 10)\n", "params = {'C': C, 'penalty': penalty}\n", "cv_lr = GridSearchCV(lr, param_grid=params, cv=skf, scoring='roc_auc', n_jobs=-1)\n", "cv_lr.fit(X_train_LR, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Best parameters:', cv_lr.best_params_)\n", "print('Best score:', cv_lr.best_score_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the results on a heatmap." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.rcParams['figure.figsize'] = (12,5)\n", "sns.heatmap(pd.DataFrame(cv_lr.cv_results_['mean_test_score'].reshape(len(C),\n", "                                                                      len(penalty)),\n", "                         index=np.round(C, 4),\n", "                         columns=penalty).sort_index(ascending=False),\n", "            annot=True, fmt='.4f', cmap='RdGy_r');\n", "plt.yticks(rotation=0);\n", "plt.xlabel('penalty', fontsize=18);\n", "plt.ylabel('C', fontsize=18);\n", "plt.title('Mean validation ROC-AUC score', fontsize=20);\n", "plt.tick_params(axis='both', length=6, width=0, labelsize=12);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, let's search for the best regularization strength on a finer grid for both l2 and l1." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "C = np.arange(0.1, 2.1, 0.1)\n", "cv_scores = []\n", "for c in C:\n", "    lr = LogisticRegression(C=c, solver='liblinear', random_state=2018, penalty='l2')\n", "    cv_scores.append(cross_val_score(lr, X_train_LR, y_train, cv=skf, scoring='roc_auc', n_jobs=-1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.plot(C, np.mean(cv_scores, axis=1), 'o-', color=color[6]);\n", "plt.xticks(C);\n", "plt.title('Mean validation scores');\n", "plt.ylabel('ROC-AUC');\n", "plt.xlabel('C');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Report the C that actually maximizes the mean validation score\n", "best_idx = np.argmax(np.mean(cv_scores, axis=1))\n", "print('Best C for l2 =', round(C[best_idx], 1), ', Best score =', np.mean(cv_scores, axis=1)[best_idx])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "C = np.arange(0.1, 2.1, 0.1)\n", "cv_scores = []\n", "for c in C:\n", "    lr = LogisticRegression(C=c, solver='liblinear', random_state=2018, penalty='l1')\n", "    cv_scores.append(cross_val_score(lr, X_train_LR, y_train, cv=skf, scoring='roc_auc', n_jobs=-1))\n", "\n", "plt.plot(C, np.mean(cv_scores, axis=1), 'o-', color=color[6]);\n", "plt.xticks(C);\n", "plt.title('Mean validation scores');\n", "plt.ylabel('ROC-AUC');\n", "plt.xlabel('C');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Report the C that actually maximizes the mean validation score\n", "best_idx = np.argmax(np.mean(cv_scores, axis=1))\n", "print('Best C for l1 =', round(C[best_idx], 1), ', Best score =', np.mean(cv_scores, axis=1)[best_idx])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get the result for the hold-out set." 
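, "\n", "
As a reminder of what the ROC-AUC number means, here is a small sketch in plain Python on hypothetical toy labels: the metric equals the share of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The helper `roc_auc` and the toy scores are illustrative assumptions; the notebook itself relies on scikit-learn for the actual evaluation.

```python
def roc_auc(y_true, y_score):
    '''ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count as half.'''
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical imbalanced toy labels: 2 positives out of 10
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A model that gives everyone the same score is 80% accurate when it predicts 0 for all,
# yet it cannot rank a single positive above a negative
constant_scores = [0.1] * 10
accuracy_all_no = sum(1 for y in y_true if y == 0) / len(y_true)
auc_constant = roc_auc(y_true, constant_scores)

# Scores that rank one positive above all negatives and the other above all but one
ranked_scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.25, 0.6, 0.12, 0.9, 0.5]
auc_ranked = roc_auc(y_true, ranked_scores)

print(accuracy_all_no, auc_constant, auc_ranked)
```

The contrast between the first two numbers is the reason accuracy is avoided on imbalanced targets: a useless constant predictor looks good on accuracy but sits at chance level on ROC-AUC.
"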
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_LR_validation = np.max(np.mean(cv_scores, axis=1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_auc_score\n", "\n", "lr = LogisticRegression(C=1, solver='liblinear', random_state=2018, penalty='l1')\n", "lr.fit(X_train_LR, y_train)\n", "y_pred = lr.predict_proba(X_holdout_LR)[:, 1]\n", "LR_holdout_score = roc_auc_score(y_holdout, y_pred)\n", "print('Hold-out score = ', LR_holdout_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, the hold-out score shows no sign of overfitting. \n", "Now let's look at the model coefficients to understand the importance of features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({'Name': X_train_LR.columns.values,\n", "              'Coefficient': lr.coef_.flatten(),\n", "              'Abs. Coefficient': np.abs(lr.coef_).flatten()}).sort_values(by='Abs. Coefficient', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some features are significant as expected (`OverTime`, `BusinessTravel`, `YearsAtCompany` etc.), but others are less significant than expected (`MonthlyIncome`, `Age`, etc.). This is probably due to multicollinearity in the data (there are many features with zero coefficients). \n", "Let's continue with Random Forest." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 8.2 Random Forest" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_RF = pd.concat([X_train_sc, X_train[list(Categorical_OHE)+list(Categorical_binary)]], axis=1)\n", "X_train_RF.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_holdout_RF = pd.concat([X_holdout_sc, X_holdout[list(Categorical_OHE)+list(Categorical_binary)]], axis=1)\n", "X_holdout_RF.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try Random Forest with default parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf = RandomForestClassifier(random_state=2018)\n", "\n", "cv_scores = cross_val_score(rf, X_train_RF, y_train, cv=skf, scoring='roc_auc', n_jobs=-1)\n", "\n", "plt.rcParams['figure.figsize'] = (10,5)\n", "plt.axhline(y=cv_scores.mean(), linewidth=2, color='b', linestyle='dashed');\n", "plt.text(x=0, y=cv_scores.mean()+0.01, s='mean score ='+str(round(cv_scores.mean(),6)));\n", "plt.scatter(range(5), cv_scores, s=100, c=(cv_scores>=cv_scores.mean()),\n", "            edgecolor='k', cmap='autumn', linewidth=1.5);\n", "print('Mean score =', cv_scores.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's find better parameters. 
We will check:\n", " - n_estimators = 100-500, the number of trees in the forest\n", " - max_depth = 2-18, the maximum depth of a tree\n", " - max_features = 10-40, the number of features considered at each split" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "n_estimators = np.arange(100, 600, 100)\n", "max_depth = np.arange(2, 22, 4)\n", "max_features = np.arange(10, 50, 10)\n", "params = {'n_estimators': n_estimators,\n", "          'max_depth': max_depth,\n", "          'max_features': max_features}\n", "cv_rf = GridSearchCV(rf, param_grid=params, cv=skf, scoring='roc_auc', n_jobs=-1)\n", "cv_rf.fit(X_train_RF, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Best parameters:', cv_rf.best_params_)\n", "print('Best score:', cv_rf.best_score_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can refine the parameters on narrower grids." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "n_estimators = np.arange(250, 400, 50)\n", "max_depth = np.arange(15, 26, 3)\n", "max_features = np.arange(5, 21, 5)\n", "params = {'n_estimators': n_estimators,\n", "          'max_depth': max_depth,\n", "          'max_features': max_features}\n", "cv_rf = GridSearchCV(rf, param_grid=params, cv=skf, scoring='roc_auc', n_jobs=-1)\n", "cv_rf.fit(X_train_RF, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Best parameters:', cv_rf.best_params_)\n", "print('Best score:', cv_rf.best_score_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "n_estimators = np.arange(280, 330, 10)\n", "max_depth = np.arange(16, 25, 2)\n", "max_features = np.arange(6, 14, 2)\n", "params = {'n_estimators': n_estimators,\n", "          'max_depth': max_depth,\n", "          'max_features': max_features}\n", "cv_rf = GridSearchCV(rf, param_grid=params, cv=skf, scoring='roc_auc', n_jobs=-1)\n", "cv_rf.fit(X_train_RF, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Best parameters:', cv_rf.best_params_)\n", "print('Best score:', cv_rf.best_score_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_RF_validation = cv_rf.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The low value of the parameter max_features suggests the presence of redundant features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get the result for the hold-out set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf = RandomForestClassifier(random_state=2018, max_depth=20, max_features=6, n_estimators=280)\n", "# Fit on the Random Forest training frame, matching the hold-out frame used below\n", "rf.fit(X_train_RF, y_train)\n", "y_pred = rf.predict_proba(X_holdout_RF)[:, 1]\n", "RF_holdout_score = roc_auc_score(y_holdout, y_pred)\n", "print('Hold-out score = ', RF_holdout_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the feature importances." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({'Name': X_train_RF.columns.values,\n", "              'Importance': rf.feature_importances_}).sort_values(by='Importance',\n", "                                                                  ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have `MonthlyIncome`, `Age` and `OverTime` at the top! \n", "But `BusinessTravel` has dropped unexpectedly low." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9. Creation of new features and description of this process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to select the important features and **remove unusable** ones. The dataset is small, so we can afford a wrapper-based search that refits the model for every candidate subset. Let's use `SequentialFeatureSelector`." 
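, "\n", "
Sequential forward selection is a greedy procedure rather than an exhaustive one: at each round it adds the single feature that improves the score most, and the best subset seen along the way is kept. The plain-Python sketch below illustrates that idea; `forward_select`, `toy_score` and the score values are hypothetical stand-ins (a real run would score each subset with cross-validated ROC-AUC), and the details of the library implementation may differ.

```python
def forward_select(candidates, score_fn):
    '''Greedy forward selection: add the single best feature each round, keep the best subset seen.'''
    selected, remaining = [], list(candidates)
    best_subset, best_score = [], float('-inf')
    while remaining:
        # Score every one-feature extension of the current subset
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        round_score, round_feature = max(scored)
        selected.append(round_feature)
        remaining.remove(round_feature)
        if round_score > best_score:
            best_subset, best_score = list(selected), round_score
    return best_subset, best_score

# Hypothetical scoring function standing in for cross-validated ROC-AUC:
# two useful features raise the score, every extra feature costs a small penalty
useful = {'OverTime': 0.15, 'MonthlyIncome': 0.10}

def toy_score(features):
    return 0.5 + sum(useful.get(f, 0.0) for f in features) - 0.01 * len(features)

subset, score = forward_select(['Age', 'OverTime', 'HourlyRate', 'MonthlyIncome'], toy_score)
print(subset, round(score, 2))
```

With n candidate features this evaluates on the order of n squared subsets instead of all 2 to the power n, which is what makes the search feasible; the price is that a good combination can be missed if neither of its members looks useful on its own.
"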
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlxtend.feature_selection import SequentialFeatureSelector as SFS" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_LR = pd.concat([X_train_sc, X_train[list(Categorical_OHE)+list(Categorical_binary)]], axis=1)\n", "X_train_LR.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_holdout_LR = pd.concat([X_holdout_sc, X_holdout[list(Categorical_OHE)+list(Categorical_binary)]], axis=1)\n", "X_holdout_LR.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lrs = LogisticRegression(random_state=2018)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "sfs1 = SFS(lrs, k_features='best', forward=True, floating=False, verbose=2,\n", " scoring='roc_auc', cv=5, n_jobs=-1)\n", "\n", "sfs1 = sfs1.fit(X_train_LR, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Best score: ', sfs1.k_score_)\n", "print('Best features names: ', sfs1.k_feature_names_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see all our \"favorite\" columns in the final set! The search seems to have gone well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Best_features = list(sfs1.k_feature_names_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_B = X_train_LR[Best_features].copy()\n", "X_holdout_B = X_holdout_LR[Best_features].copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check results for our models with these features."
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "lr = LogisticRegression(random_state=2018, penalty='l1', solver='liblinear')\n", "cv_scores = cross_val_score(lr, X_train_B, y_train, cv=skf, scoring='roc_auc', n_jobs=-1)\n", "print('Mean score LR =', cv_scores.mean())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "rf = RandomForestClassifier(random_state=2018, max_depth=20, max_features=6, n_estimators=280)\n", "cv_scores = cross_val_score(rf, X_train_B, y_train, cv=skf, scoring='roc_auc', n_jobs=-1)\n", "print('Mean score RF =', cv_scores.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scores increased for both models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to create some new features: a combination of the important satisfaction features (overall satisfaction) and binary indicators for ranges of numeric features that are significant for the target (see the visualisations in section 3):\n", " - Age <= ~33\n", " - MonthlyIncome <= ~2600\n", " - YearsWithCurrManager <= ~2\n", " - YearsInCurrentRole <= ~2\n", " - YearsAtCompany <= ~2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# All Level Satisfaction\n", "X_train_B['FullSatisfaction'] = (X_train_B['EnvironmentSatisfaction']+\n", " X_train_B['JobInvolvement']+\n", " X_train_B['JobSatisfaction']+\n", " X_train_B['RelationshipSatisfaction']+\n", " X_train_B['WorkLifeBalance'])\n", "\n", "X_holdout_B['FullSatisfaction'] = (X_holdout_B['EnvironmentSatisfaction']+\n", " X_holdout_B['JobInvolvement']+\n", " X_holdout_B['JobSatisfaction']+\n", " X_holdout_B['RelationshipSatisfaction']+\n", " X_holdout_B['WorkLifeBalance'])\n", "sc = StandardScaler()\n", "X_train_B['FullSatisfaction'] = sc.fit_transform(X_train_B[['FullSatisfaction']])\n", "X_holdout_B['FullSatisfaction'] = sc.transform(X_holdout_B[['FullSatisfaction']])" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": {}, "outputs": [], "source": [ "# Age: Young <= 33 (-0.431717 after scaling)\n", "X_train_B['Young'] = [1 if a<=-0.431717 else 0 for a in X_train_B['Age']]\n", "X_holdout_B['Young'] = [1 if a<=-0.431717 else 0 for a in X_holdout_B['Age']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Income: Low <= 2650 (~-0.185819 after scaling)\n", "X_train_B['LowIncome'] = [1 if a<=-0.185819 else 0 for a in X_train_B['MonthlyIncome']]\n", "X_holdout_B['LowIncome'] = [1 if a<=-0.185819 else 0 for a in X_holdout_B['MonthlyIncome']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YearsWithCurrManager <= 2 (~-0.599076 after scaling)\n", "X_train_B['LowYearsWithCurrManager'] = [1 if a<=-0.599076 else 0 for a in X_train_B['YearsWithCurrManager']]\n", "X_holdout_B['LowYearsWithCurrManager'] = [1 if a<=-0.599076 else 0 for a in X_holdout_B['YearsWithCurrManager']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YearsInCurrentRole <= 2 (~-0.616905 after scaling)\n", "X_train_B['LowYearsInCurrentRole'] = [1 if a<=-0.616905 else 0 for a in X_train_B['YearsInCurrentRole']]\n", "X_holdout_B['LowYearsInCurrentRole'] = [1 if a<=-0.616905 else 0 for a in X_holdout_B['YearsInCurrentRole']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# YearsAtCompany <= 2 (~-0.818114 after scaling)\n", "X_train_B['LowYearsAtCompany'] = [1 if a<=-0.818114 else 0 for a in X_train_B['YearsAtCompany']]\n", "X_holdout_B['LowYearsAtCompany'] = [1 if a<=-0.818114 else 0 for a in X_holdout_B['YearsAtCompany']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try our model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "lr = LogisticRegression(random_state=2018, penalty='l1', solver='liblinear')\n", "cv_scores = cross_val_score(lr, 
X_train_B, y_train, cv=skf, scoring='roc_auc', n_jobs=-1)\n", "print('Mean score LR =', cv_scores.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result has improved!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 10. Plotting training and validation curves" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import validation_curve\n", "from sklearn.model_selection import learning_curve" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a validation curve for the Logistic Regression model, repeating the selection of the regularization coefficient." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.rcParams['figure.figsize'] = (15,6)\n", "C = np.logspace(-5, 4, 10)\n", "train_sc, valid_sc = validation_curve(lr, X_train_B, y_train,\n", " param_name='C', param_range=C, cv=skf, scoring='roc_auc')\n", "\n", "plt.plot(C, np.mean(train_sc, axis=1), 'o-', color=color[0], label='Training scores');\n", "plt.fill_between(C, np.max(train_sc, axis=1),\n", " np.min(train_sc, axis=1), alpha=0.3, color=color[1]);\n", "plt.plot(C, np.mean(valid_sc, axis=1), 'o-', color=color[4], label='Validation scores');\n", "plt.fill_between(C, np.max(valid_sc, axis=1),\n", " np.min(valid_sc, axis=1), alpha=0.3, color=color[5]);\n", "plt.xscale('log')\n", "plt.xlabel('C')\n", "plt.ylabel('ROC-AUC')\n", "plt.title('Validation curve')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Best C = ', C[(np.mean(valid_sc, axis=1)).argmax()])\n", "print('Best score = ', np.max((np.mean(valid_sc, axis=1))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training and validation curves are close to each other, so the model is underfitting: it is not complex enough. \n", "Let's create a learning curve using C=100."
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lr = LogisticRegression(C=100, random_state=2018, penalty='l1', solver='liblinear')\n", "\n", "train_sizes, train_sc, valid_sc = learning_curve(lr, X_train_B, y_train,\n", " train_sizes=np.arange(50, 998, 50),\n", " scoring='roc_auc', cv=skf)\n", "\n", "plt.plot(train_sizes, np.mean(train_sc, axis=1), 'o-', color=color[0], label='Training scores');\n", "plt.fill_between(train_sizes, np.max(train_sc, axis=1),\n", " np.min(train_sc, axis=1), alpha=0.3, color=color[1]);\n", "plt.plot(train_sizes, np.mean(valid_sc, axis=1), 'o-', color=color[4], label='Validation scores');\n", "plt.fill_between(train_sizes, np.max(valid_sc, axis=1),\n", " np.min(valid_sc, axis=1), alpha=0.3, color=color[5]);\n", "plt.xlabel('Train Size')\n", "plt.ylabel('ROC-AUC')\n", "plt.title('Learning curve')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the curves stop approaching each other after about 400 rows of data. More data will not make our model better; only changing parameters and adding new features can." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will create curves for Random Forest. We will search for a value of the parameter `max_features`."
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.rcParams['figure.figsize'] = (15,6)\n", "rf = RandomForestClassifier(random_state=2018, max_depth=20, n_estimators=300)\n", "max_features = np.arange(4, 26, 2)\n", "train_sc, valid_sc = validation_curve(rf, X_train_B, y_train,\n", " param_name='max_features', param_range=max_features, cv=skf, scoring='roc_auc')\n", "\n", "plt.plot(max_features, np.mean(train_sc, axis=1), 'o-', color=color[0], label='Training scores');\n", "plt.fill_between(max_features, np.max(train_sc, axis=1),\n", " np.min(train_sc, axis=1), alpha=0.3, color=color[1]);\n", "plt.plot(max_features, np.mean(valid_sc, axis=1), 'o-', color=color[4], label='Validation scores');\n", "plt.fill_between(max_features, np.max(valid_sc, axis=1),\n", " np.min(valid_sc, axis=1), alpha=0.3, color=color[5]);\n", "plt.xlabel('max_features')\n", "plt.ylabel('ROC-AUC')\n", "plt.title('Validation curve')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Best max_features = ', max_features[(np.mean(valid_sc, axis=1)).argmax()])\n", "print('Best score = ', np.max((np.mean(valid_sc, axis=1))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The validation score is almost identical for all values of max_features. We need to change other parameters to get a significant improvement. Because of the new set of features, we can search for the parameters `max_depth` and `n_estimators` again. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "rf = RandomForestClassifier(random_state=2018)\n", "n_estimators = np.arange(100, 600, 100)\n", "max_depth = np.arange(4, 25, 4)\n", "max_features = np.arange(8, 30, 4)\n", "params = {'n_estimators': n_estimators,\n", " 'max_depth': max_depth,\n", " 'max_features': max_features}\n", "cv_rf = GridSearchCV(rf, param_grid=params, cv=skf, scoring='roc_auc', n_jobs=-1)\n", "cv_rf.fit(X_train_RF, y_train)\n", "print('Best parameters:', cv_rf.best_params_)\n", "print('Best score:', cv_rf.best_score_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the values found to build the learning curve." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf = RandomForestClassifier(random_state=2018, max_depth=8, max_features=8, n_estimators=200)\n", "\n", "train_sizes, train_sc, valid_sc = learning_curve(rf, X_train_B, y_train,\n", " train_sizes=np.arange(50, 998, 50),\n", " scoring='roc_auc', cv=skf)\n", "\n", "plt.plot(train_sizes, np.mean(train_sc, axis=1), 'o-', color=color[0], label='Training scores');\n", "plt.fill_between(train_sizes, np.max(train_sc, axis=1),\n", " np.min(train_sc, axis=1), alpha=0.3, color=color[1]);\n", "plt.plot(train_sizes, np.mean(valid_sc, axis=1), 'o-', color=color[4], label='Validation scores');\n", "plt.fill_between(train_sizes, np.max(valid_sc, axis=1),\n", " np.min(valid_sc, axis=1), alpha=0.3, color=color[5]);\n", "plt.xlabel('Train Size')\n", "plt.ylabel('ROC-AUC')\n", "plt.title('Learning curve')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These curves have not yet converged, so a larger dataset may help." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 11. 
Prediction for test or hold-out samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a prediction on the hold-out set for the Logistic Regression model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lr = LogisticRegression(C=100, random_state=2018, penalty='l1', solver='liblinear')\n", "lr.fit(X_train_B, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred_lr = lr.predict_proba(X_holdout_B)[:, 1]\n", "LR_holdout_score = roc_auc_score(y_holdout, y_pred_lr)\n", "print('Hold-out score = ', LR_holdout_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is comparable to the cross-validation score and is about *0.848*. \n", "\n", "Let's make a prediction on the hold-out set for the Random Forest model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf = RandomForestClassifier(random_state=2018, max_depth=8, max_features=8, n_estimators=200)\n", "rf.fit(X_train_B, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred_rf = rf.predict_proba(X_holdout_B)[:, 1]\n", "RF_holdout_score = roc_auc_score(y_holdout, y_pred_rf)\n", "print('Hold-out score = ', RF_holdout_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the result of the Random Forest model on the hold-out set is slightly better than the Logistic Regression result! \n", "Let's look at feature importance in both models." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({'Name': X_train_B.columns.values,\n", " 'Coefficient': lr.coef_.flatten(),\n", " 'Abs. Coefficient': np.abs(lr.coef_).\n", " flatten()}).sort_values(by='Abs. 
Coefficient', ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({'Name': X_train_B.columns.values,\n", " 'Coefficient': rf.feature_importances_}).sort_values(by='Coefficient',\n", " ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some features are near the top in both models (`OverTime`, `MaritalStatus_Single`), but for most features the importance estimates differ strongly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 12. Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Exploratory Data Analysis we examined the structure of the data: the number and types of features, the target feature and its relationship with other variables. During a brief review of the relationships between variables, we found some insights into how they relate to the target. Some of our suspicions were confirmed while working with prediction models. \n", "We get the result of the model: ROC AUC about 0.84. \n", "HR can use our model to predict a possible outflow of valuable employees. At the moment, though, this matters little in practice. \n", "We found a few features (such as `OverTime`, `BusinessTravel`, `MonthlyIncome`) that have a significant impact on employee attrition. If companies pay attention to these, they will more reliably protect their most important asset: people. 
\n", "To improve the quality of the model:\n", " - We need more real (non-fictional) data\n", " - Review other models such as SVM and neural networks\n", " - Find other patterns in the data " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" } }, "nbformat": 4, "nbformat_minor": 2 }