{"cells":[{"cell_type":"markdown","metadata":{},"source":["# Random Forest Classifier with Feature Importance"]},{"cell_type":"markdown","metadata":{},"source":["## The problem statement\n","The task is to predict whether a person makes over 50K a year. To answer this question, we build a Random Forest classifier with Python and scikit-learn, and then examine which features matter most to its predictions.\n"]},{"cell_type":"markdown","metadata":{},"source":["## Import libraries\n","\n","\n","\n"]},{"cell_type":"code","execution_count":1,"metadata":{"trusted":true},"outputs":[],"source":["import numpy as np # linear algebra\n","import pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n"]},{"cell_type":"code","execution_count":2,"metadata":{"trusted":true},"outputs":[],"source":["import seaborn as sns\n","import matplotlib.pyplot as plt\n","%matplotlib inline\n","\n","sns.set(style=\"whitegrid\")"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["import warnings\n","\n","warnings.filterwarnings('ignore')"]},{"cell_type":"markdown","metadata":{},"source":["## Import dataset\n","\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["data = 'https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/random-forest-income_evaluation.csv'\n","\n","df = pd.read_csv(data)"]},{"cell_type":"markdown","metadata":{},"source":["## Exploratory data analysis"]},{"cell_type":"markdown","metadata":{},"source":["### View dimensions of dataset"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# print the shape\n","print('The shape of the dataset : ', df.shape)"]},{"cell_type":"markdown","metadata":{},"source":["We can see that there are 32561 instances and 15 attributes in the data set."]},{"cell_type":"markdown","metadata":{},"source":["### Preview the dataset "]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.head()"]},{"cell_type":"markdown","metadata":{},"source":["### Rename column 
names"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',\n","             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']\n","\n","df.columns = col_names\n","\n","df.columns"]},{"cell_type":"markdown","metadata":{},"source":["### View summary of dataset"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.info()"]},{"cell_type":"markdown","metadata":{},"source":["### Check the data types of columns\n","\n","The above df.info() command gives us the number of non-null values along with the data types of columns.\n","\n","If we simply want to check the data type of a particular column, we can use the following command."]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.dtypes"]},{"cell_type":"markdown","metadata":{},"source":["### View statistical properties of dataset"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.describe()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.describe().T"]},{"cell_type":"markdown","metadata":{},"source":["We can see that the above df.describe().T command presents the same statistical properties in transposed (horizontal) form."]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.describe(include='all')"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check for missing values\n","\n","df.isnull().sum()"]},{"cell_type":"markdown","metadata":{},"source":["### Check with ASSERT statement"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["assert 
pd.notnull(df).all().all()"]},{"cell_type":"markdown","metadata":{},"source":["### Functional approach to EDA"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["def initial_eda(df):\n","    if isinstance(df, pd.DataFrame):\n","        total_na = df.isna().sum().sum()\n","        print(\"Dimensions : %d rows, %d columns\" % (df.shape[0], df.shape[1]))\n","        print(\"Total NA Values : %d \" % (total_na))\n","        print(\"%38s %10s %10s %10s\" % (\"Column Name\", \"Data Type\", \"#Distinct\", \"NA Values\"))\n","        col_name = df.columns\n","        dtyp = df.dtypes\n","        uniq = df.nunique()\n","        na_val = df.isna().sum()\n","        for i in range(len(df.columns)):\n","            # use positional access (.iloc) so this works regardless of the column labels\n","            print(\"%38s %10s %10s %10s\" % (col_name[i], dtyp.iloc[i], uniq.iloc[i], na_val.iloc[i]))\n","    else:\n","        print(\"Expected a DataFrame but got a %15s\" % (type(df)))\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["initial_eda(df)"]},{"cell_type":"markdown","metadata":{},"source":["## Explore Categorical Variables"]},{"cell_type":"markdown","metadata":{},"source":["### Find categorical variables "]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["categorical = [var for var in df.columns if df[var].dtype=='O']\n","\n","print('There are {} categorical variables\\n'.format(len(categorical)))\n","\n","print('The categorical variables are :\\n\\n', categorical)"]},{"cell_type":"markdown","metadata":{"trusted":true},"source":["### Preview categorical variables "]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df[categorical].head()"]},{"cell_type":"markdown","metadata":{},"source":["### Summary of categorical variables "]},{"cell_type":"markdown","metadata":{},"source":["### Frequency distribution of categorical variables"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["for var in categorical: \n"," \n","    
print(df[var].value_counts())"]},{"cell_type":"markdown","metadata":{},"source":["### Percentage of frequency distribution of values"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["for var in categorical:\n"," print(df[var].value_counts()/float(len(df)))"]},{"cell_type":"markdown","metadata":{},"source":["### Explore the variables "]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check for missing values\n","\n","df['income'].isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view number of unique values\n","\n","df['income'].nunique()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view the unique values\n","\n","df['income'].unique()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view the frequency distribution of values\n","\n","df['income'].value_counts()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view percentage of frequency distribution of values\n","\n","df['income'].value_counts()/len(df)"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# visualize frequency distribution of income variable\n","\n","f,ax=plt.subplots(1,2,figsize=(18,8))\n","\n","ax[0] = df['income'].value_counts().plot.pie(explode=[0,0],autopct='%1.1f%%',ax=ax[0],shadow=True)\n","ax[0].set_title('Income Share')\n","\n","\n","#f, ax = plt.subplots(figsize=(6, 8))\n","ax[1] = sns.countplot(x=\"income\", data=df, palette=\"Set1\")\n","ax[1].set_title(\"Frequency distribution of income variable\")\n","\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["f, ax = plt.subplots(figsize=(8, 6))\n","ax = sns.countplot(y=\"income\", data=df, 
palette=\"Set1\")\n","ax.set_title(\"Frequency distribution of income variable\")\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["f, ax = plt.subplots(figsize=(10, 8))\n","ax = sns.countplot(x=\"income\", hue=\"sex\", data=df, palette=\"Set1\")\n","ax.set_title(\"Frequency distribution of income variable wrt sex\")\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["f, ax = plt.subplots(figsize=(10, 8))\n","ax = sns.countplot(x=\"income\", hue=\"race\", data=df, palette=\"Set1\")\n","ax.set_title(\"Frequency distribution of income variable wrt race\")\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check number of unique labels \n","\n","df.workclass.nunique()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view the unique labels\n","\n","df.workclass.unique()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view frequency distribution of values\n","\n","df.workclass.value_counts()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# replace '?' 
values in workclass variable with `NaN`\n","\n","df['workclass'].replace(' ?', np.nan, inplace=True)"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# again check the frequency distribution of values in workclass variable\n","\n","df.workclass.value_counts()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["f, ax = plt.subplots(figsize=(10, 6))\n","ax = df.workclass.value_counts().plot(kind=\"bar\", color=\"green\")\n","ax.set_title(\"Frequency distribution of workclass variable\")\n","ax.set_xticklabels(df.workclass.value_counts().index, rotation=30)\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["f, ax = plt.subplots(figsize=(12, 8))\n","ax = sns.countplot(x=\"workclass\", hue=\"income\", data=df, palette=\"Set1\")\n","ax.set_title(\"Frequency distribution of workclass variable wrt income\")\n","ax.legend(loc='upper right')\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["f, ax = plt.subplots(figsize=(12, 8))\n","ax = sns.countplot(x=\"workclass\", hue=\"sex\", data=df, palette=\"Set1\")\n","ax.set_title(\"Frequency distribution of workclass variable wrt sex\")\n","ax.legend(loc='upper right')\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check number of unique labels\n","\n","df.occupation.nunique()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view unique labels\n","\n","df.occupation.unique()\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view frequency distribution of values\n","\n","df.occupation.value_counts()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# replace '?' 
values in occupation variable with `NaN`\n","\n","df['occupation'].replace(' ?', np.nan, inplace=True)\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# again check the frequency distribution of values\n","\n","df.occupation.value_counts()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# visualize frequency distribution of `occupation` variable\n","\n","f, ax = plt.subplots(figsize=(12, 8))\n","ax = sns.countplot(x=\"occupation\", data=df, palette=\"Set1\")\n","ax.set_title(\"Frequency distribution of occupation variable\")\n","# rotate the tick labels; relabeling them from value_counts() would mismatch the bar order\n","plt.xticks(rotation=30)\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check number of unique labels\n","\n","df.native_country.nunique()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view unique labels \n","\n","df.native_country.unique()\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check frequency distribution of values\n","\n","df.native_country.value_counts()\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# replace '?' 
values in native_country variable with `NaN`\n","\n","df['native_country'].replace(' ?', np.nan, inplace=True)"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# visualize frequency distribution of `native_country` variable\n","\n","f, ax = plt.subplots(figsize=(16, 12))\n","ax = sns.countplot(x=\"native_country\", data=df, palette=\"Set1\")\n","ax.set_title(\"Frequency distribution of native_country variable\")\n","# rotate the tick labels; relabeling them from value_counts() would mismatch the bar order\n","plt.xticks(rotation=90)\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df[categorical].isnull().sum()"]},{"cell_type":"markdown","metadata":{},"source":["### Number of labels: Cardinality \n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check for cardinality in categorical variables\n","\n","for var in categorical:\n"," \n","    print(var, ' contains ', len(df[var].unique()), ' labels')"]},{"cell_type":"markdown","metadata":{},"source":["We can see that the native_country column contains a relatively large number of labels compared to the other columns. 
I will check for cardinality after train-test split."]},{"cell_type":"markdown","metadata":{},"source":["## Declare feature vector and target variable"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["X = df.drop(['income'], axis=1)\n","\n","y = df['income']"]},{"cell_type":"markdown","metadata":{},"source":["## Split data into separate training and test set"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["from sklearn.model_selection import train_test_split\n","\n","X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check the shape of X_train and X_test\n","\n","X_train.shape, X_test.shape"]},{"cell_type":"markdown","metadata":{},"source":["## Feature Engineering"]},{"cell_type":"markdown","metadata":{},"source":["### Display categorical variables in training set\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']\n","\n","categorical"]},{"cell_type":"markdown","metadata":{},"source":["### Display numerical variables in training set\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']\n","\n","numerical"]},{"cell_type":"markdown","metadata":{},"source":["### Engineering missing values in categorical variables"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# print percentage of missing values in the categorical variables in training set\n","\n","X_train[categorical].isnull().mean()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# print categorical variables with missing 
data\n","\n","for col in categorical:\n","    if X_train[col].isnull().mean()>0:\n","        print(col, (X_train[col].isnull().mean()))"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# impute missing categorical variables with the most frequent value from the training set\n","# (using the training-set mode for both sets avoids leaking test-set information)\n","\n","for df2 in [X_train, X_test]:\n","    df2['workclass'] = df2['workclass'].fillna(X_train['workclass'].mode()[0])\n","    df2['occupation'] = df2['occupation'].fillna(X_train['occupation'].mode()[0])\n","    df2['native_country'] = df2['native_country'].fillna(X_train['native_country'].mode()[0])"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check missing values in categorical variables in X_train\n","\n","X_train[categorical].isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check missing values in categorical variables in X_test\n","\n","X_test[categorical].isnull().sum()"]},{"cell_type":"markdown","metadata":{},"source":["As a final check, we look for missing values in X_train and X_test."]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check missing values in X_train\n","\n","X_train.isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# check missing values in X_test\n","\n","X_test.isnull().sum()"]},{"cell_type":"markdown","metadata":{},"source":["We can see that there are no missing values in X_train and X_test."]},{"cell_type":"markdown","metadata":{},"source":["### Encode categorical variables\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# preview categorical variables in X_train\n","\n","X_train[categorical].head()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# import category encoders\n","\n","import category_encoders as 
ce"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# encode categorical variables with one-hot encoding\n","\n","encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital_status', 'occupation', 'relationship', \n","                                 'race', 'sex', 'native_country'])\n","\n","X_train = encoder.fit_transform(X_train)\n","\n","X_test = encoder.transform(X_test)"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["X_train.head()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["X_train.shape"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["X_test.head()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["X_test.shape"]},{"cell_type":"markdown","metadata":{},"source":["* We now have the training and test sets ready for model building. Before that, we should map all the feature variables onto the same scale; this is called **feature scaling**. We will do it as follows."]},{"cell_type":"markdown","metadata":{},"source":["## Feature Scaling"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["cols = X_train.columns\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["from sklearn.preprocessing import RobustScaler\n","\n","scaler = RobustScaler()\n","\n","X_train = scaler.fit_transform(X_train)\n","\n","X_test = scaler.transform(X_test)\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# the scaler returns a NumPy array, so rebuild the DataFrame with its column labels\n","\n","X_train = pd.DataFrame(X_train, columns=cols)"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["X_test = pd.DataFrame(X_test, columns=cols)"]},{"cell_type":"markdown","metadata":{},"source":["We now have the X_train dataset ready to be fed into the Random Forest classifier. 
We will do it as follows."]},{"cell_type":"markdown","metadata":{},"source":["## Random Forest Classifier model with default parameters"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# import Random Forest classifier\n","\n","from sklearn.ensemble import RandomForestClassifier\n","\n","\n","\n","# instantiate the classifier \n","\n","rfc = RandomForestClassifier(random_state=0)\n","\n","\n","\n","# fit the model\n","\n","rfc.fit(X_train, y_train)\n","\n","\n","\n","# predict on the test set\n","\n","y_pred = rfc.predict(X_test)\n","\n","\n","\n","# check accuracy score\n","\n","from sklearn.metrics import accuracy_score\n","\n","print('Model accuracy score with default parameters : {0:0.4f}'.format(accuracy_score(y_test, y_pred)))"]},{"cell_type":"markdown","metadata":{},"source":["## Random Forest Classifier model with 100 Decision Trees"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# instantiate the classifier with n_estimators = 100\n","\n","rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)\n","\n","\n","\n","# fit the model to the training set\n","\n","rfc_100.fit(X_train, y_train)\n","\n","\n","\n","# predict on the test set\n","\n","y_pred_100 = rfc_100.predict(X_test)\n","\n","\n","\n","# check accuracy score\n","\n","print('Model accuracy score with 100 decision-trees : {0:0.4f}'. 
format(accuracy_score(y_test, y_pred_100)))"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# create the classifier with n_estimators = 100\n","\n","clf = RandomForestClassifier(n_estimators=100, random_state=0)\n","\n","\n","\n","# fit the model to the training set\n","\n","clf.fit(X_train, y_train)\n"]},{"cell_type":"code","execution_count":null,"metadata":{"scrolled":true,"trusted":true},"outputs":[],"source":["# view the feature scores\n","\n","feature_scores = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)\n","\n","feature_scores"]},{"cell_type":"markdown","metadata":{},"source":["## Build the Random Forest model on selected features"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# drop the least important feature from X_train and X_test\n","\n","X_train = X_train.drop(['native_country_41'], axis=1)\n","\n","X_test = X_test.drop(['native_country_41'], axis=1)\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# instantiate the classifier with n_estimators = 100\n","\n","clf = RandomForestClassifier(n_estimators=100, random_state=0)\n","\n","\n","\n","# fit the model to the training set\n","\n","clf.fit(X_train, y_train)\n","\n","\n","# predict on the test set\n","\n","y_pred = clf.predict(X_test)\n","\n","\n","\n","# check accuracy score\n","\n","print('Model accuracy score with native_country_41 variable removed : {0:0.4f}'.format(accuracy_score(y_test, y_pred)))\n"]},{"cell_type":"markdown","metadata":{},"source":["## Confusion matrix\n","\n","\n","A confusion matrix is a tool for summarizing the performance of a classification algorithm. It gives a clear picture of model performance by breaking down correct and incorrect predictions for each category, along with the types of errors the model produces. 
The summary is represented in tabular form.\n","\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# print the confusion matrix\n","\n","from sklearn.metrics import confusion_matrix\n","\n","cm = confusion_matrix(y_test, y_pred)\n","\n","print('Confusion matrix\\n\\n', cm)\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# visualize confusion matrix with seaborn heatmap\n","# rows of confusion_matrix are the actual classes, columns are the predicted classes;\n","# with sorted class labels, <=50K comes first\n","\n","cm_matrix = pd.DataFrame(data=cm, index=['Actual <=50K', 'Actual >50K'], \n","                         columns=['Predicted <=50K', 'Predicted >50K'])\n","\n","sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')"]},{"cell_type":"markdown","metadata":{},"source":["## Classification Report"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["from sklearn.metrics import classification_report\n","\n","print(classification_report(y_test, y_pred))"]},{"cell_type":"markdown","metadata":{},"source":["# Acknowledgments\n","\n","Thanks to [prashant111](https://www.kaggle.com/prashant111) for creating [random-forest-classifier-feature-importance](https://www.kaggle.com/code/prashant111/random-forest-classifier-feature-importance), which inspired much of the content in this chapter."]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.9.16"}},"nbformat":4,"nbformat_minor":4}