{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "source": [ "# Work Visa: Ensemble Methods" ], "metadata": { "id": "5S8NuBE3nvlD" }, "id": "5S8NuBE3nvlD" }, { "cell_type": "markdown", "source": [ "The number of applications for US work visas is growing. To assist with the review process, we will develop an ML model to screen candidates. The model will be trained on a data set of previous applications and their resulting case statuses.\n", "\n", "This data has features describing both the applicant and their sponsoring employer:\n", "\n", "* case_id: unique identifier for each application\n", "\n", "* continent: employee continent of origin\n", "\n", "* education_of_employee: level of education\n", "\n", "* has_job_experience: binary flag (Y/N)\n", "\n", "* requires_job_training: binary flag (Y/N)\n", "\n", "* no_of_employees: number of employees at the sponsoring company\n", "\n", "* yr_of_estab: sponsoring company's year of establishment\n", "\n", "* region_of_employment: applicant's intended region of employment in the US\n", "\n", "* prevailing_wage: regional average wage for a given occupation; used to keep the job market competitive without underpaying foreign workers\n", "\n", "* unit_of_wage: frequency of payment (Hour, Week, Month, or Year)\n", "\n", "* full_time_position: binary flag; Y: full time\n", "\n", "* case_status: binary target; Certified or Denied" ], "metadata": { "id": "sCbkoxqqnLPK" }, "id": "sCbkoxqqnLPK" }, { "cell_type": "markdown", "id": "dirty-island", "metadata": { "id": "dirty-island" }, "source": [ "## Importing necessary libraries and data" ] }, { "cell_type": "code", "source": [ "!pip uninstall scikit-learn -y\n", "\n", "!pip install -U scikit-learn" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vSdfBXwUfVm_", "outputId": "d0d4ac89-81dc-40a4-e776-7d43aae22378" }, "id": "vSdfBXwUfVm_", "execution_count": null, "outputs": [ { "output_type": "stream", 
"name": "stdout", "text": [ "Found existing installation: scikit-learn 1.0.2\n", "Uninstalling scikit-learn-1.0.2:\n", " Successfully uninstalled scikit-learn-1.0.2\n", "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", "Collecting scikit-learn\n", " Downloading scikit_learn-1.1.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (31.2 MB)\n", "\u001b[K |████████████████████████████████| 31.2 MB 1.1 MB/s \n", "\u001b[?25hRequirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (1.21.6)\n", "Requirement already satisfied: joblib>=1.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (1.2.0)\n", "Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (1.7.3)\n", "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (3.1.0)\n", "Installing collected packages: scikit-learn\n", "Successfully installed scikit-learn-1.1.3\n" ] } ] }, { "cell_type": "code", "execution_count": null, "id": "statewide-still", "metadata": { "id": "statewide-still" }, "outputs": [], "source": [ "# math and data\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# plotting\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set_theme()\n", "\n", "# model building\n", "from sklearn import metrics\n", "from sklearn.model_selection import train_test_split, GridSearchCV\n", "\n", "# models\n", "from sklearn import tree\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.ensemble import BaggingClassifier, RandomForestClassifier\n", "from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier\n", "from sklearn.ensemble import StackingClassifier\n", "from xgboost import XGBClassifier" ] }, { "cell_type": "code", "source": [ "visa=pd.read_csv('dataset.csv')" ], "metadata": { "id": "QTUEqr4ReaRG" }, "id": 
"QTUEqr4ReaRG", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "visa_copy=visa.copy()" ], "metadata": { "id": "1etBGBuhffbJ" }, "id": "1etBGBuhffbJ", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "id": "desperate-infection", "metadata": { "id": "desperate-infection" }, "source": [ "## Data Overview\n" ] }, { "cell_type": "code", "source": [ "visa.sample(10)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 479 }, "id": "_9Sk7d9TflmP", "outputId": "9bd17233-17a0-4c34-ca0a-03dd47d52c06" }, "id": "_9Sk7d9TflmP", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " case_id continent education_of_employee has_job_experience \\\n", "24759 EZYV24760 Europe Bachelor's Y \n", "17290 EZYV17291 Asia Master's Y \n", "11584 EZYV11585 Asia Master's Y \n", "5717 EZYV5718 Europe Master's N \n", "10098 EZYV10099 North America Doctorate Y \n", "18484 EZYV18485 Asia Master's Y \n", "8757 EZYV8758 North America Bachelor's Y \n", "6137 EZYV6138 Asia Master's N \n", "1178 EZYV1179 North America Master's Y \n", "12869 EZYV12870 Asia High School Y \n", "\n", " requires_job_training no_of_employees yr_of_estab \\\n", "24759 N 955 2001 \n", "17290 N 1305 2011 \n", "11584 N 645 2007 \n", "5717 Y 3343 1911 \n", "10098 Y 1307 2001 \n", "18484 N 1851 2007 \n", "8757 N 7778 1968 \n", "6137 N 3406 2010 \n", "1178 N 1409 2001 \n", "12869 N 2981 2006 \n", "\n", " region_of_employment prevailing_wage unit_of_wage full_time_position \\\n", "24759 Northeast 136814.79 Year Y \n", "17290 Midwest 221669.11 Year N \n", "11584 Northeast 57470.10 Year N \n", "5717 West 147919.62 Year Y \n", "10098 South 25838.77 Year Y \n", "18484 Northeast 58934.94 Year Y \n", "8757 Northeast 115183.30 Year Y \n", "6137 South 161136.84 Year N \n", "1178 South 124682.10 Year Y \n", "12869 Northeast 110582.48 Year Y \n", "\n", " case_status \n", "24759 Denied \n", "17290 Denied \n", "11584 
Certified \n", "5717 Certified \n", "10098 Certified \n", "18484 Certified \n", "8757 Certified \n", "6137 Certified \n", "1178 Certified \n", "12869 Denied " ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 5 } ] }, { "cell_type": "code", "source": [ "visa.shape" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "45jNV4adf-ho", "outputId": "e864df18-b27c-458c-fe84-bbf7490737d9" }, "id": "45jNV4adf-ho", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(25480, 12)" ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "code", "execution_count": null, "id": "persistent-juice", "metadata": { "id": "persistent-juice", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "d67d2622-64f1-4dfa-91ef-1a35e35af3b1" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\n", "RangeIndex: 25480 entries, 0 to 25479\n", "Data columns (total 12 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 case_id 25480 non-null object \n", " 1 continent 25480 non-null object \n", " 2 education_of_employee 25480 non-null object \n", " 3 has_job_experience 25480 non-null object \n", " 4 requires_job_training 25480 non-null object \n", " 5 no_of_employees 25480 non-null int64 \n", " 6 yr_of_estab 25480 non-null int64 \n", " 7 region_of_employment 25480 non-null object \n", " 8 prevailing_wage 25480 non-null float64\n", " 9 unit_of_wage 25480 non-null object \n", " 10 full_time_position 25480 non-null object \n", " 11 case_status 25480 non-null object \n", "dtypes: float64(1), int64(2), object(9)\n", "memory usage: 2.3+ MB\n" ] } ], "source": [ "visa.info()" ] }, { "cell_type": "markdown", "source": [ "There are twelve columns with 25480 records each: the identifier ```case_id```, ten predictors, and the target ```case_status```. Although ```visa.info()``` reports no null values, there may still be missing or erroneous data encoded in other ways. Nine of the columns are object type, which we can convert to categorical during pre-processing later." 
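, "\n",
"\n",
"A minimal sketch of both points, not part of the original analysis: it assumes a tiny made-up frame (```demo```, a hypothetical name with invented rows) in place of ```visa```, casts its object columns to ```category```, and counts values that are non-null yet impossible, such as a negative head count.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# hypothetical toy rows standing in for the real visa data\n",
"demo = pd.DataFrame({\n",
"    'continent': ['Asia', 'Europe', 'Asia'],\n",
"    'no_of_employees': [955, -26, 1305],\n",
"})\n",
"\n",
"# cast each object column to the category dtype\n",
"for col in demo.select_dtypes('object').columns:\n",
"    demo[col] = demo[col].astype('category')\n",
"\n",
"# total count of null cells across the frame\n",
"print(demo.isna().sum().sum())\n",
"\n",
"# count of non-positive head counts, which isna() cannot detect\n",
"print(demo['no_of_employees'].le(0).sum())\n",
"```\n",
"\n",
"The same two checks could be run on ```visa``` itself once the data is loaded; the threshold of zero is an assumption about what counts as an invalid company size."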
], "metadata": { "id": "8pWt6N2YxUbA" }, "id": "8pWt6N2YxUbA" }, { "cell_type": "markdown", "id": "seasonal-calibration", "metadata": { "id": "seasonal-calibration" }, "source": [ "## Exploratory Data Analysis (EDA)" ] }, { "cell_type": "code", "source": [ "visa.describe().T" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "eT5a4t1cgDME", "outputId": "205ff54e-0c61-4df6-dbe6-21c4a7bcfff4" }, "id": "eT5a4t1cgDME", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " count mean std min 25% \\\n", "no_of_employees 25480.0 5667.043210 22877.928848 -26.0000 1022.00 \n", "yr_of_estab 25480.0 1979.409929 42.366929 1800.0000 1976.00 \n", "prevailing_wage 25480.0 74455.814592 52815.942327 2.1367 34015.48 \n", "\n", " 50% 75% max \n", "no_of_employees 2109.00 3504.0000 602069.00 \n", "yr_of_estab 1997.00 2005.0000 2016.00 \n", "prevailing_wage 70308.21 107735.5125 319210.27 " ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 8 } ] }, { "cell_type": "markdown", "source": [ "* The minimum number of employees is -26, which is impossible, so the data contains at least one apparent error.\n", "\n", "* The median number of employees is about 2,100, typical of a midsize company, while the largest employer on record has over 600,000 employees.\n", "\n", "* The oldest company on record was founded in 1800; the mean founding year is about 1979 and the median is 1997.\n", "\n", "* The median prevailing wage is \\$70,308.21 and the mean is about \\$4,100 higher (\\$74,455.81), indicating a right-skewed distribution. The minimum is just \\$2.14, which is shockingly low. It may be an hourly wage, since ```unit_of_wage``` varies by record, though even then about \\$2/hr is quite low. We should examine outliers carefully for potential errors." ], "metadata": { "id": "FVQNxg_GgZxg" }, "id": "FVQNxg_GgZxg" }, { "cell_type": "code", "source": [ "visa.describe(include='object').T" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 332 }, "id": "_iQsP9-ggHe0", "outputId": "4f808fad-59b0-49ef-f149-6948496e5bfb" }, "id": "_iQsP9-ggHe0", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " count unique top freq\n", "case_id 25480 25480 EZYV01 1\n", "continent 25480 6 Asia 16861\n", "education_of_employee 25480 4 Bachelor's 10234\n", "has_job_experience 25480 2 Y 14802\n", "requires_job_training 25480 2 N 22525\n", "region_of_employment 25480 5 Northeast 7195\n", "unit_of_wage 25480 4 Year 22962\n", "full_time_position 25480 2 Y 22773\n", "case_status 25480 2 Certified 17018" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 9 } ] }, { "cell_type": "markdown", "source": [ "* With 25480 unique case IDs across 25480 rows, no application is recorded twice under the same ID.\n", "\n", "* The data covers six continents of origin, with Asia being the most frequent.\n", "\n", "* A plurality of applicants (about 40%) have a Bachelor's level of education. We will explore the other levels on record shortly.\n", "\n", "* About 58% of applicants have some job experience, and about 88% of jobs do not require training.\n", "\n", "* About 90% of wages are recorded as yearly salaries, though the data dictionary indicates that hourly, weekly, and monthly are options too." ], "metadata": { "id": "Fng3k8BkzIhL" }, "id": "Fng3k8BkzIhL" }, { "cell_type": "code", "execution_count": null, "id": "right-permit", "metadata": { "id": "right-permit", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "c6c62a2b-f397-40ed-d203-32615509c19f" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Values in continent feature\n", "Asia 16861\n", "Europe 3732\n", "North America 3292\n", "South America 852\n", "Africa 551\n", "Oceania 192\n", "Name: continent, dtype: int64\n", "---------------------------------------------\n", "Values in education_of_employee feature\n", "Bachelor's 10234\n", "Master's 9634\n", "High School 3420\n", "Doctorate 2192\n", "Name: education_of_employee, dtype: int64\n", "---------------------------------------------\n", "Values in has_job_experience feature\n", "Y 14802\n", "N 10678\n", "Name: has_job_experience, dtype: int64\n", "---------------------------------------------\n", "Values in requires_job_training feature\n", "N 22525\n", "Y 2955\n", "Name: requires_job_training, dtype: int64\n", "---------------------------------------------\n", "Values in region_of_employment feature\n", "Northeast 7195\n", "South 7017\n", "West 6586\n", "Midwest 4307\n", "Island 375\n", "Name: region_of_employment, dtype: int64\n", 
"---------------------------------------------\n", "Values in unit_of_wage feature\n", "Year 22962\n", "Hour 2157\n", "Week 272\n", "Month 89\n", "Name: unit_of_wage, dtype: int64\n", "---------------------------------------------\n", "Values in full_time_position feature\n", "Y 22773\n", "N 2707\n", "Name: full_time_position, dtype: int64\n", "---------------------------------------------\n", "Values in case_status feature\n", "Certified 17018\n", "Denied 8462\n", "Name: case_status, dtype: int64\n", "---------------------------------------------\n" ] } ], "source": [ "for col in visa.drop('case_id',axis=1).select_dtypes('object').columns:\n", " print('Values in',col,'feature')\n", " print(visa[col].value_counts())\n", " print('-'*45)" ] }, { "cell_type": "markdown", "source": [ "* The only continent of origin not represented is (predictably) Antarctica. After Asia, the most common continents are Europe and North America (likely Canada and Mexico).\n", "\n", "* Other than a Bachelor's degree, many applicants have a Master's. Some also have a high school diploma or a doctorate, but these are less common.\n", "\n", "* There are five options for region of employment: Northeast, South, West, Midwest, and Island (in decreasing level of frequency). Northeast and South have nearly the same number of records.\n", "\n", "* The dependent variable in this study is ```case_status```, with outcomes Certified or Denied. About two-thirds of cases are Certified." 
], "metadata": { "id": "EjVl26e93LEX" }, "id": "EjVl26e93LEX" }, { "cell_type": "code", "source": [ "# quick plotting function\n", "def plott(col=None):\n", "    '''Plot a countplot for categorical data\n", "    or a histogram for numeric data.'''\n", "    # nothing to plot without a column name\n", "    if col is None:\n", "        return\n", "    plt.figure(figsize=(8,5))\n", "    if visa[col].dtype=='object':\n", "        # order bars from most to least frequent\n", "        plt.title('Countplot of '+col,fontsize=14)\n", "        sns.countplot(data=visa,x=col,\n", "                      order=visa[col].value_counts().index.tolist())\n", "    else:\n", "        plt.title('Histogram of '+col,fontsize=14)\n", "        sns.histplot(data=visa,x=col)" ], "metadata": { "id": "OhuPoWJX4zrh" }, "id": "OhuPoWJX4zrh", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "plott('continent')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "Iv4GGconqc-R", "outputId": "10695d26-f495-4cfb-983d-0a6e0b1ab0da" }, "id": "Iv4GGconqc-R", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAgoAAAFTCAYAAABYqCT3AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3dfVhUdf7/8dcwCN4vgqCIlUnfEL/eUSSZWSu6oix3aqaSWnlTrZm2aUq4AZlWgNm2mmvbWtbmzVWpmZpird3Z2o2ZeY8tGt4R6CCKd6DM+f3h1/lJeHS4HcTn47q6cj6fOee8z5lh5nXO58w5FsMwDAEAAFyGm6sLAAAAtRdBAQAAmCIoAAAAUwQFAABgiqAAAABMERQAAIApggIAUwcPHlRQUJC2bdvmkuX/8MMPio6OVocOHTR8+HCX1GDG1dsGqCkEBaCaHD16VNOnT1fv3r3VoUMH9ejRQ6NHj9YXX3xR47UkJCTo0UcfrZFlDR8+XNOmTauSec2YMUPt2rXTJ598otmzZ1fJPCvicuvk7++vDRs2KDg4uMbrCQ8P1/z582t8ubg+ubu6AKAuOnjwoIYOHapGjRrpqaeeUrt27WQYhjZu3Kjk5GR9/vnnri7xmrB//3498MAD8vf3d3UpZVitVvn6+rq6DKDacUQBqAbPPfecJGnp0qWKjIxU27ZtFRgYqGHDhumjjz5yPO/w4cN6/PHHFRISopCQEI0bN06//vqro3/27NmKiooqNe9ly5YpJCSkzHNWr16t3r17KyQkRGPHjlV+fr6jf/ny5fr8888VFBSkoKAgffvtt45D5ytXrtTQoUPVsWNH9e3bVxs2bLjiun3//fcaNGiQOnbsqLvuuksvvPCCiouLJV04cvHdd99p4cKFjmUdPHjwsvMpLi7WjBkzdNddd6ljx466//77tWnTJkn//7B+YWGhEhMTFRQUpGXLll12PoZh6M0331SfPn3UoUMH3XPPPXr55Zcd/ZmZmXrooYfUqVMnde3aVQkJCSosLHT0Xzza8vbbb6tHjx6644479Mwzz+jMmTNXXKffDj18++23CgoK0saNGzVo0CB17txZAwYM0I4dO0rVu3nzZg0bNkydO3dWjx49lJycrJMnTzr6hw8frpSUFM2aNUthYWHq1q2bUlNTZbfbHf2HDh1SWlqaox6gOhEUgCpWUFCgr776Sg888IAaNWpUpr9p06aSJLvdrrFjx8pms+mdd97RO++8o7y8PI0dO1blvbL6oUOH9PHHH2vOnDl68803tWvXLv31r3+VJI0cOVL9+vXTXXfdpQ0bNmjDhg2lgkZ6erqGDx+uDz/8UN27d9fYsWOVm5t72eXk5uZqzJgxCg4O1ocffqgZM2Zo9erVmjVrliRp6tSpCgkJ0YABAxzLMjsakJaWpjVr1uiFF17Qhx9+qFtvvVVjxoxRXl6e47B+gwYNlJiYqA0bNigyMvKy85k1a5bmzp2rRx55RKtXr9arr76qli1bSpJOnz6tUaNGqWHDhnr//fc1Z84c/fjjj0pMTCw1j02bNunnn3/WggUL9Morr+iTTz7RO++8U+51kqSXX35ZEydO1LJly9SsWTNNmjTJ8XpmZmZq1KhRCg8P14oVKzRnzhzt3r27TD0rV66U1WrVkiVL9Oyzz+rtt9/Wxx9/LOlC8GvZsqUef/xxRz1AdSIoAFVs//79MgxDgYGBV3zexo0blZmZqZkzZ6pjx47q2LGjXn75Ze3cuVMbN24s1zLPnz+vl156Se3atVNISIjuv/9+xzwaNWqk+vXry8PDQ76+vvL19ZWHh4dj2qFDhyoyMlKBgYGaOnWq/P39tWjRossuZ9GiRfLz81NKSooCAwPVs2dPTZw4Ue+++67OnDmjJk2aqF69emrQoIFjWVartcx8Tp8+rSVLlmjSpEn6/e9/r8DAQD333HPy8fHRwoULHYf1LRaLm
jRpIl9fX9WvX7/MfE6dOqUFCxZo4sSJuu+++3TTTTcpJCREDzzwgCRp1apVOnPmjGPvu2vXrpo2bZrWrVun7Oxsx3waN26s5557ToGBgbr77rvVt29fx/Zzdp0umjBhgu68804FBgZq7Nix2rt3ryN4zZ8/X/369dPIkSPVpk0bde7cWSkpKcrIyJDNZnPM45ZbbtGECRN08803KzIyUmFhYY56vLy8ZLVa1ahRI0c9QHXiHAWgijl7NCArK0t+fn5q3bq1o+2GG26Qn5+f/vvf/+quu+5yepmtWrVSkyZNHI/9/PxKffFcSZcuXRz/dnNzU6dOnZSVlWVac+fOneXm9v/3MW6//XadO3dO2dnZateunVPL3L9/v86dO6fbbrvN0Wa1WtWlSxfTZZvVU1xcrG7dupn2BwUFqXHjxo62kJAQubm56b///a9uuukmSRe+mC/98vfz89NPP/3kdB2XunQowM/PT5Jks9nUsmVL7dixQ9nZ2VqzZo3jORffL/v375ePj0+ZeVycj7OvJ1DVCApAFbvppptksViUlZWlP/zhDxWah8Vicfz/t8Hj/PnzZZ5fr169MtPX9I1hL9ZcW+ZTnuW4u7uX6avo9rt0XheXcfH8ArvdrkGDBumhhx4qM12LFi2qpR6gshh6AKqYl5eX7r77br377rs6depUmf4TJ05IkgIDA5WXl1fqZL8DBw4oLy9Pt9xyiyTJ29tbR48eLfUlsWvXrnLXVK9ePZWUlFy279I9Z8MwtHXrVtNhk8DAQP3000+OLz7pwrUO6tWrpxtvvPGqy7roxhtvVL169bR582ZHW0lJibZs2XLVIZtLtW3bVh4eHqZDNYGBgdqzZ0+pkwV//PFH2e32ci3HmXVyRvv27R1HMn773+WGVqq7HsAZBAWgGiQnJ0uSBg4cqDVr1mjv3r3KysrSokWLFBMTI0m66667FBQUpEmTJmnbtm3atm2bJk2apPbt2+vOO++UJIWFhen48eOaN2+e9u/fr/fff18ZGRnlricgIEA///yz9u7dq/z8fJ07d87Rt3jxYq1du1Z79+7VjBkzdPjwYQ0dOvSy84mPj1deXp5SUlKUlZWlzz//XC+//LKGDRumBg0aOJa1bds2HTx4UPn5+aVCxUUNGzbU0KFDNXPmTH3xxRfKyspSSkqKbDab4uPjnV6vxo0ba8SIEZo1a5aWLl2q/fv3a+vWrY5zLKKjo1W/fn1NmTJFmZmZ+v7775WUlKQ+ffo4hh2c3X5XWydnjBkzRlu3blVSUpJ27typ7OxsffbZZ0pKSirXfAICAvTDDz8oNzfX8esWoLoQFIBqcMMNN2jZsmXq3r27Zs6cqZiYGD344INav36948I9FotFc+fOlbe3t0aMGKERI0aoefPmmjt3ruOQdWBgoFJSUvTee+8pJiZG//nPfyp04aT7779fgYGBGjhwoLp161ZqT37ixIlasGCBYmNj9dVXX2nOnDmOXw38VosWLfTGG29o165dio2NVWJiov74xz/qqaeecjxn5MiRqlevnv74xz+qW7duOnz48GXn9fTTT6tfv3565plnFBsbq8zMTL3xxhuOcX1nTZw4UWPGjNHcuXMVGRmpJ554wnHyYIMGDTR//nydPHlSgwYN0tixYxUSEqIXXnihXMtwdp2upl27dnr33Xd16NAhDRs2TLGxsZo1a5bj3ARnjR8/Xjk5Oerdu7fp+RlAVbEYDHwB16WDBw+qV69e+uCDD9SxY0dXlwOgluKIAgAAMEVQAAAAphh6AAAApjiiAAAATBEUAACAKYICAAAwxSWcTRw7dkp2O6dvAADqNjc3i5o1K3un24sICibsdoOgAAC47jH0AAAATBEUAACAKYICAAAwRVAAAACmCAoAAMAUQQEAAJgiKAAAAFMEBQAAYIqgAAAATHFlxnJq0rS+6nvWc3UZNeZs0TkVnjjr6jIAAC5CUCin+p71FD95oavLqDGL0h5QoQgKAHC9YugBA
ACYIigAAABTBAUAAGCKoAAAAEwRFAAAgCmCAgAAMEVQAAAApggKAADAVI0EhdTUVIWHhysoKEh79uxxtBcVFSk5OVl9+vRRdHS0nn32WUffvn37NHjwYEVERGjw4MH65ZdfKt0HAADKp0aCQq9evbRw4UIFBASUak9PT5enp6cyMjK0cuVKTZgwwdGXnJys+Ph4ZWRkKD4+XklJSZXuAwAA5VMjQSE0NFT+/v6l2k6dOqUPP/xQEyZMkMVikSQ1b95ckmSz2bRz505FRUVJkqKiorRz507l5+dXuA8AAJSfy+71cODAAXl5eWnOnDn69ttv1ahRI02YMEGhoaHKyclRixYtZLVaJUlWq1V+fn7KycmRYRgV6vP29nbVqgIAcM1yWVAoKSnRgQMH1L59e02ZMkU//fSTHnvsMX3yySeuKqkUH5/Gri6h1vD1beLqEgAALuKyoODv7y93d3fHMEHnzp3VrFkz7du3T61atVJubq5KSkpktVpVUlKivLw8+fv7yzCMCvWVl812Una7Uab9evzSPHKk0NUlAACqiZub5Yo7xy77eaS3t7fCwsL09ddfS7rwawWbzaabbrpJPj4+Cg4O1qpVqyRJq1atUnBwsLy9vSvcBwAAys9iGEbZ3eYqNn36dK1bt05Hjx5Vs2bN5OXlpdWrV+vAgQNKTExUQUGB3N3d9eSTT+ree++VJGVlZSkhIUEnTpxQ06ZNlZqaqrZt21aqrzyudEQhfvLCSmyNa8uitAc4ogAAddjVjijUSFC4FhEULiAoAEDdVmuHHgAAQO1HUAAAAKYICgAAwBRBAQAAmCIoAAAAUwQFAABgiqAAAABMERQAAIApggIAADBFUAAAAKYICgAAwBRBAQAAmCIoAAAAUwQFAABgiqAAAABMERQAAIApggIAADBFUAAAAKYICgAAwBRBAQAAmCIoAAAAUwQFAABgqsaCQmpqqsLDwxUUFKQ9e/aU6Z8zZ06Zvi1btigmJkYREREaOXKkbDZbpfsAAIDzaiwo9OrVSwsXLlRAQECZvh07dmjLli2l+ux2u55++mklJSUpIyNDoaGhmjlzZqX6AABA+dRYUAgNDZW/v3+Z9uLiYk2bNk0pKSml2rdv3y5PT0+FhoZKkoYMGaK1a9dWqg8AAJSPu6sLePXVVxUTE6PWrVuXas/JyVGrVq0cj729vWW321VQUFDhPi8vL6fr8vFpXIm1qlt8fZu4ugQAgIu4NCj8+OOP2r59uyZNmuTKMi7LZjspu90o0349fmkeOVLo6hIAANXEzc1yxZ1jlwaF77//XllZWerVq5ck6ddff9WoUaP04osvyt/fX4cPH3Y8Nz8/X25ubvLy8qpwHwAAKB+X/jzykUce0YYNG7R+/XqtX79eLVu21Pz583X33XerQ4cOOnv2rDZt2iRJWrJkifr27StJFe4DAADlU2NHFKZPn65169bp6NGjevjhh+Xl5aXVq1ebPt/NzU1paWlKTk5WUVGRAgIClJ6eXqk+AABQPhbDMMoOxOOK5yjET17ogopcY1HaA5yjAAB12NXOUeDKjAAAwBRBAQAAmCIoAAAAUwQFAABgiqAAAABMERQAAIApggIAADBFUAAAAKYICgAAwBRBAQAAmCIoAAAAUwQFAABgiqAAAABMERQAAIApggIAADBFUAAAAKYICgAAwBRBAQAAmCIoAAAAUwQFAABgiqAAAABMERQAAICpGgsKqampCg8PV1BQkPbs2SNJOnbsmMaMGaOIiAhFR0dr3Lhxys/Pd0yzZcsWxcTEKCIiQiNHjpTNZqt0HwAAcF6NBYVevXpp4cKFCggIcLRZLBaNHj1aGRkZWrlypW644QbNnDlTkmS32/X0008rKSlJGRkZCg0NrXQfAAAonxoLCqGhofL39y/V5uXlpbCwMMfjLl266PDhw5Kk7du3y9PTU6GhoZKkIUOGaO3atZXqAwAA5ePu6gIustvtWrx4scLDwyVJOTk5atWqlaPf2
9tbdrtdBQUFFe7z8vJyuh4fn8ZVsFZ1g69vE1eXAABwkVoTFJ5//nk1bNhQw4YNc3UpkiSb7aTsdqNM+/X4pXnkSKGrSwAAVBM3N8sVd45rRVBITU1Vdna25s2bJze3C6Mh/v7+jmEIScrPz5ebm5u8vLwq3AcAAMrH5T+PnDVrlrZv367XXntNHh4ejvYOHTro7Nmz2rRpkyRpyZIl6tu3b6X6AABA+VgMwyh7fL0aTJ8+XevWrdPRo0fVrFkzeXl56a9//auioqLUpk0b1a9fX5LUunVrvfbaa5KkzZs3Kzk5WUVFRQoICFB6erqaN29eqT5nXWnoIX7ywspsimvKorQHGHoAgDrsakMPNRYUrjUEhQsICgBQt10tKLh86AEAANReBAUAAGCKoAAAAEwRFAAAgCmCAgAAMEVQAAAApggKAADAFEEBAACYIigAAABTBAUAAGCKoAAAAEwRFAAAgCmCAgAAMEVQAAAApggKAADAFEEBAACYIigAAABTBAUAAGCKoAAAAEwRFAAAgCmCAgAAMFUjQSE1NVXh4eEKCgrSnj17HO379u3T4MGDFRERocGDB+uXX36p1j4AAFA+NRIUevXqpYULFyogIKBUe3JysuLj45WRkaH4+HglJSVVax8AACifGgkKoaGh8vf3L9Vms9m0c+dORUVFSZKioqK0c+dO5efnV0sfAAAoP3dXLTgnJ0ctWrSQ1WqVJFmtVvn5+SknJ0eGYVR5n7e3t2tWFACAa5jLgkJt5+PT2NUl1Bq+vk1cXQIAwEVcFhT8/f2Vm5urkpISWa1WlZSUKC8vT/7+/jIMo8r7ystmOym73SjTfj1+aR45UujqEgAA1cTNzXLFnWOX/TzSx8dHwcHBWrVqlSRp1apVCg4Olre3d7X0AQCA8rMYhlF2t/ky5s+fr1GjRpVpf+utt/Twww9fcdrp06dr3bp1Onr0qJo1ayYvLy+tXr1aWVlZSkhI0IkTJ9S0aVOlpqaqbdu2klQtfeVxpSMK8ZMXlnt+16pFaQ9wRAEA6rCrHVFwOijcdttt2rx5c5n2rl276rvvvqt4hbUUQeECggIA1G1XCwpXPUdh48aNkiS73a5vvvlGl+aKgwcPqlGjRlVQJgAAqI2uGhSmTp0qSSoqKlJiYqKj3WKxyNfXV3/5y1+qrzoAAOBSVw0K69evlyRNnjxZaWlp1V4QAACoPZz+eeSlIcFut5fqc3Pj3lIAANRFTgeFHTt2aNq0acrMzFRRUZEkyTAMWSwW7dq1q9oKBAAAruN0UEhISFDPnj31wgsvqH79+tVZEwAAqCWcDgqHDh3Sn//8Z1ksluqsBwAA1CJOn1zwhz/8QRs2bKjOWgAAQC3j9BGFoqIijRs3TrfffruaN29eqo9fQwAAUDc5HRRuueUW3XLLLdVZCwAAqGWcDgrjxo2rzjoAAEAt5HRQuHgp58vp1q1blRQDAABqF6eDwsVLOV907NgxnTt3Ti1atNC///3vKi8MAAC4ntNB4eKlnC8qKSnR3//+d24KBQBAHVbhay9brVY99thj+uc//1mV9QAAgFqkUjdp+Prrr7kAEwAAdZjTQw/33ntvqVBw5swZFRcXKzk5uVoKAwAArud0UEhPTy/1uEGDBrr55pvVuHHjKi8KAADUDk4Hha5du0q6cIvpo0ePqnnz5txeGgCAOs7pb/qTJ09q8uTJ6tSpk+655x516tRJU6ZMUWFhYXXWBwAAXMjpoDB9+nSdOXNGK1eu1NatW7Vy5UqdOXNG06dPr876AACACzk99PDVV1/p008/VYMGDSRJN998s1588UX94Q9/qLbiAACAazl9RMHT01P5+fml2o4dOyYPD49KF/HZZ58pLi5OsbGxiomJ0bp16yRJ+/bt0+DBgxUREaHBgwfrl19+cUxT0T4AAOA8p4PCfffdp5EjR2rx4sX64osvtHjxYo0aNUqDBg2qVAGGYWjy5MlKS0vTihUrlJaWpilTpshutys5O
Vnx8fHKyMhQfHy8kpKSHNNVtA8AADjP6aDwpz/9SY888ogyMjKUmpqqjIwMjR49Wo8//njli3Bzc5wUWVhYKD8/Px07dkw7d+5UVFSUJCkqKko7d+5Ufn6+bDZbhfoAAED5OH2OwowZMxQZGakFCxY42jZv3qwZM2aUuWFUeVgsFv31r3/V2LFj1bBhQ506dUr/+Mc/lJOToxYtWshqtUq6cMloPz8/5eTkyDCMCvV5e3s7XZePD9eHuMjXt4mrSwAAuIjTQWHVqlWaPHlyqbYOHTro8ccfr1RQOH/+vF5//XXNnTtXt99+u3744Qc9+eSTSktLq/A8q4LNdlJ2u1Gm/Xr80jxyhJ/AAkBd5eZmueLOsdNBwWKxyG63l2orKSkp01Zeu3btUl5enm6//XZJ0u23364GDRrI09NTubm5KikpkdVqVUlJifLy8uTv7y/DMCrUBwAAysfpcxRCQ0P16quvOoKB3W7X7NmzFRoaWqkCWrZsqV9//VV79+6VJGVlZclms+mmm25ScHCwVq1aJenCEY3g4GB5e3vLx8enQn0AAKB8LIZhlD2+fhm//vqrHn30UR05ckStWrVSTk6OfH19NW/ePLVs2bJSRXz00Ud64403HDedGj9+vHr37q2srCwlJCToxIkTatq0qVJTU9W2bVtJqnCfs6409BA/eWGl1vdasijtAYYeAKAOu9rQg9NBQbpwFGHr1q3KycmRv7+/OnXqVGfv90BQuICgAAB1W5Wdo3BhZm7q0qWLunTpUunCAABA7Vc3DwcAAIAqQVAAAACmCAoAAMAUQQEAAJgiKAAAAFMEBQAAYIqgAAAATBEUAACAKYICAAAwRVAAAACmCAoAAMAUQQEAAJgiKAAAAFMEBQAAYIqgAAAATBEUAACAKYICAAAwRVAAAACmCAoAAMAUQQEAAJgiKAAAAFO1IigUFRUpOTlZffr0UXR0tJ599llJ0r59+zR48GBFRERo8ODB+uWXXxzTVLQPAAA4r1YEhfT0dHl6eiojI0MrV67UhAkTJEnJycmKj49XRkaG4uPjlZSU5Jimon0AAMB5Lg8Kp06d0ocffqgJEybIYrFIkpo3by6bzaadO3cqKipKkhQVFaWdO3cqPz+/wn0AAKB83F1dwIEDB+Tl5aU5c+bo22+/VaNGjTRhwgTVr19fLVq0kNVqlSRZrVb5+fkpJydHhmFUqM/b29vpunx8Glf9yl6jfH2buLoEAICLuDwolJSU6MCBA2rfvr2mTJmin376SY899pheffVVl9Zls52U3W6Uab8evzSPHCl0dQkAgGri5ma54s6xy4OCv7+/3N3dHUMFnTt3VrNmzVS/fn3l5uaqpKREVqtVJSUlysvLk7+/vwzDqFAfAAAoH5efo+Dt7a2wsDB9/fXXki78YsFms6lNmzYKDg7WqlWrJEmrVq1ScHCwvL295ePjU6E+AABQPhbDMMoeX69hBw4cUGJiogoKCuTu7q4nn3xS9957r7KyspSQkKATJ06oadOmSk1NVdu2bSWpwn3OutLQQ/zkhZVf6WvEorQHGHoAgDrsakMPtSIo1EYEhQsICgBQt10tKLh86AEAANReBAUAAGCKoAAAAEwRFAAAgCmCAgAAMEVQAAAApggKAADAFEEBAACYIigAAABTBAUAAGCKoAAAAEwRFAAAgCmCAgAAMEVQAAAApggKAADAFEEBAACYIigAAABTBAUAAGCKoAAAAEwRFAAAgCmCAgAAMEVQAAAApmpVUJgzZ46CgoK0Z88eSdKWLVsUExOjiIgIjRw5UjabzfHcivYBAADn1ZqgsGPHDm3ZskUBAQGSJLvdrqefflpJSUnKyMhQaGioZs6cWak+AABQPrUiKBQXF2vatGlKSUlxtG3fvl2enp4KDQ2VJA0ZMkRr166tVB8AACgfd1cXIEmvvvqqYmJi1Lp1a0dbTk6OWrVq5Xjs7e0tu92ugoKCCvd5eXk5XZOPT+NKrlXd4evbxNUlA
ABcxOVB4ccff9T27ds1adIkV5dSis12Una7Uab9evzSPHKk0NUlAACqiZub5Yo7xy4PCt9//72ysrLUq1cvSdKvv/6qUaNGafjw4Tp8+LDjefn5+XJzc5OXl5f8/f0r1AcAAMrH5ecoPPLII9qwYYPWr1+v9evXq2XLlpo/f75Gjx6ts2fPatOmTZKkJUuWqG/fvpKkDh06VKgPAACUj8uPKJhxc3NTWlqakpOTVVRUpICAAKWnp1eqDzWr2e885O7h6eoyasT54iIdO17s6jIAoMpZDMMoOxCPK56jED95oQsqco1FaQ9U+BwFX98m+iFtdBVXVDvdPvmfnMsB4Jp0tXMUXD70AAAAai+CAgAAMEVQAAAApggKAADAFEEBAACYqrU/jwSuF01/5ylPDw9Xl1EjioqLdeJ4kavLAFAOBAXAxTw9PPTQWxNcXUaNWPDwq5IICsC1hKEHAABgiqAAAABMERQAAIApggIAADBFUAAAAKYICgAAwBRBAQAAmCIoAAAAUwQFAABgiqAAAABMERQAAIApggIAADBFUAAAAKYICgAAwJTLg8KxY8c0ZswYRUREKDo6WuPGjVN+fr4kacuWLYqJiVFERIRGjhwpm83mmK6ifQAAwHkuDwoWi0WjR49WRkaGVq5cqRtuuEEzZ86U3W7X008/raSkJGVkZCg0NFQzZ86UpAr3AQCA8nF5UPDy8lJYWJjjcZcuXXT48GFt375dnp6eCg0NlSQNGTJEa9eulaQK9wEAgPJxeVC4lN1u1+LFixUeHq6cnBy1atXK0eft7S273a6CgoIK9wEAgPJxd3UBl3r++efVsGFDDRs2TJ988olLa/HxaezS5dcmvr5NXF3CNYHt5By2E3BtqTVBITU1VdnZ2Zo3b57c3Nzk7++vw4cPO/rz8/Pl5uYmLy+vCveVh812Una7Uab9evyQO3KksELTXW/biu3knIpuJwDVw83NcsWd41ox9DBr1ixt375dr732mjw8PCRJHTp00NmzZ7Vp0yZJ0pIlS9S3b99K9QEAgPJx+RGFn3/+Wa+//rratGmjIUOGSJJat26t1157TWlpaUpOTlZRUZECAgKUnp4uSXJzc6tQHwAAKB+XB4X/+Z//UWZm5mX7brTUtpkAABL/SURBVLvtNq1cubJK+wAAgPNqxdADAAConQgKAADAFEEBAACYIigAAABTBAUAAGCKoAAAAEwRFAAAgCmCAgAAMEVQAAAApggKAADAFEEBAACYIigAAABTLr8pFAA4w6uJh+rV93R1GTXm3NkiFRQWu7oMgKAA4NpQr76nPh7xsKvLqDGR77wlERRQCzD0AAAATBEUAACAKYICAAAwxTkKAFDH/K5pA3l4Xh8f78VF53X8xBlXl1GnXR/vJAC4jnh4uuuFqR+4uowakTjjPleXUOcx9AAAAEwRFAAAgCmCAgAAMFVnz1HYt2+fEhISVFBQIC8vL6WmpqpNmzauLgsAUEv8rqmHPDyvn6t9FhcV6fiJ8l/Eq84GheTkZMXHxys2NlYrVqxQUlKS3nnnHVeXBQCoJTw8PTXrmUddXUaNeerF1yURFCRJNptNO3fu1FtvvSVJioqK0vPPP6/8/Hx5e3s7NQ83N4tpX/NmjaqkzmvFlbbF1Xg09anCSmq3ymyn5o2de1/WBZXZTg2aXz/vJ6ly2+p3Xg2rsJLarTLbqakX76mrbT+LYRhGdRXkKtu3b9eUKVO0evVqR1tkZKTS09P1v//7vy6sDACAawsnMwIAAFN1Mij4+/srNzdXJSUlkqSSkhLl5eXJ39/fxZUBAHBtqZNBwcfHR8HBwVq1apUkadWqVQoODnb6/AQAAHBBnTxHQZKysrKUkJCgEydOqGnTpkpNTVXbtm1dXRYAANeUOhsUAABA5dXJoQcAAFA1CAoAAMAUQQEAAJgiKAAAAFMEBRc7fvy4OnXqpOnTp1/1uVOnTtWmTZtqoKqaFx4err59+yo2Ntbx38GDB11dV
qWFh4crKipKdru9VNuePXvKPa9du3bp448/LtUWFBSkU6dOOT2P9PR0dejQQTabrdzLrwhn3rNr1qxRXFycYmNj1bdvX02cOLFSy6yr26kmXe5zafPmzYqKilJcXJy++eabMtNs27at0q9dbVRcXKyXXnpJvXv3Vt++fRUXF6dPP/20xut49dVXy7yva4wBl3r33XeNYcOGGXfeeadRVFTk6nJcpmfPnkZmZmaFpz937lwVVlN1evbsafTs2dNYtmxZqbbyruu5c+eMpUuXGk888USp9ltvvdU4efKkU/M4f/680b17d2P48OHG/Pnzy7X8ijh//vxVn5Obm2uEhYUZhw8fNgzDMOx2u7Fjx45KLbcubqeadrnPpaSkJOONN9647PNr699fVXjmmWeMCRMmGGfPnjUMwzAyMzONHj16GN99952LK6s5HFFwsaVLl2rs2LEKCgrSv//9b0nSp59+qujoaMXGxioqKkrffvutJGn48OH67LPPJEkrV67UoEGDFBcXp7i4OG3cuNFl61BdDh48qLCwsMs+vvjv1NRU9e/fX++//76ys7P14IMPKjo6Wv3799eXX37pmDYoKEh/+9vfFBsbq4iICGVkZDj6fvrpJw0fPlwDBgzQgAED9Pnnn1fpeowbN05z5sxRcXHZu7ZdrebZs2dr4MCBevHFF/W3v/1N//nPfxQbG1tqT+9f//qXBg4cqF69epVar9/64osvdOONN2r8+PFatmxZqb6goCD9/e9/d8xn48aNevnllxUXF6eoqChlZWU5nrt8+XINGjRIAwYM0IgRI7R3715J0rJly/TQQw/p8ccfV1RUlPbs2VPqPVtYWKhnnnlG0dHRiomJ0bRp03T06FEZhqHx48crLi5OMTEx2rdvn2NZX375peLi4hQdHa0HH3xQ2dnZjmWNHz/e8byLj48dO1Ynt5Mkbdy4UYMHD3Zsj0vvZVPVfvu59M9//lNr1qzRO++8o9jYWJ09e1bh4eGaOXOm7rvvPiUlJenbb7/VgAEDHPP47LPPNGDAAMXExCguLk67d++WJE2cOFEDBgxQdHS0Hn/8cR0/frza1qOyDh06pDVr1iglJUWe/3c76ltvvVWPPfaY5syZI0l6/fXXHa/VkCFDHEcPzV7/zMxMxcfHq3///oqMjNSCBQscy0tISFBSUpJGjBihPn36aPLkyTL+7woGCQkJevfddyXV7HtBEkcUXGnXrl1Gz549DbvdbqxYscIYNWqUYRiGER0dbWzevNkwjAt7G4WFhYZhGMawYcOM9evXG4ZhGPn5+YbdbjcMwzCysrKMHj16uGANqk7Pnj2NiIgIIyYmxoiJiTH69+9vHDhwwOjatavjOZc+PnDggHHrrbcaq1evdvTfd999xnvvvWcYhmH8/PPPRteuXQ2bzWYYxoU9ytmzZxuGcWF7de3a1Th69Khx/PhxIzY21sjNzTUM48Iebo8ePYzjx49X2XplZmYaTzzxhLFgwYJSbc7U/PrrrzvmZban/K9//cswDMPYtGmTcffdd5vWMnbsWOP99983DMMw+vTpY2zZsqXUfN59913DMAzj448/Nrp06eJ4r/3jH/8wJk6caBiGYXz//ffGmDFjHHuZn3/+uTF48GBHfV26dDGys7Md8730PZuQkGBMmzbNKCkpMQzDMGw2m1FSUmKMHj3a6Nq1q/HEE08Yc+bMMbp3724UFBQYR48eNcLCwoyff/7ZMAzDeO+994z77rvvstvi0sd1cTsZhmEUFBQ4jj4cOXLE6NGjh1FQUGC6HhVl9rk0ZcoUxzY0jAvv4+TkZMfjb775xujfv79hGIaxd+9e46677jL27dtnGIZhFBUVOT7HLq6PYRjGrFmzjPT09Cpfh6qyfv16IyYmpkz7jh07jK5duxrLli0z7r//fse65efnG4Zx5de/sLDQ0X7y5EmjX79+xn//+1/DMC5s4yFDhhhnz541ioqKjMjISGPDhg2Ovovbv6beCxfVy
dtMXys++OADxcbGymKxqE+fPpo+fbpyc3N155136sUXX1SfPn10zz336NZbby0z7YEDBzRx4kTl5ubK3d1dR48e1ZEjR+Tr6+uCNakaf/vb30qt69XOUfD09FS/fv0kSSdPntSuXbs0cOBASdItt9yi4OBgbdmyReHh4ZKkQYMGSZLatm2r9u3ba8uWLXJ3d9fBgwc1ZswYx3wtFouys7PVsWPHKlu3J598UiNGjNB9993naHOm5v79+1913pGRkZKkLl26KC8vT0VFRY69n4tsNpu+++47paamSpLi4uK0dOlSde7c2fGci9vy4h1We/bsKUnq0KGDPvnkE0nS+vXrtXv3bse2NAxDJ06ccMzjtttu04033njZOj/77DMtW7ZMbm4XDmRevKR6YmKiUlJStHnzZn311Vc6ffq0tm3bpuLiYrVr10633HKLJGngwIF67rnndPLkyatuk7q4nfLz85WYmKjs7GxZrVYdP35c+/btU5cuXSq0PcyYfS5dTlxc3GXb//Of/+iee+5RmzZtJEkeHh7y8PCQJK1YsUIrV67UuXPndPr0acdzaiPjKtcj/OyzzzR06FA1btxYktSsWTNJV379z549q5SUFGVmZspisSgvL0+7d+9WYGCgJKl3796O92X79u21f/9+de/evdRya+q9cBFBwUWKi4u1atUqeXh4aMWKFZKkc+fOadmyZUpMTFRmZqa++eYbTZgwQQ8//LDuv//+UtM/9dRTSkhIUO/evWW329W5c2cVFRW5YlWqjbu7e6k/1N+uX4MGDWSxVPw+9NKFP+CgoCAtXLiwUvO5mrZt2+ree+/VW2+9Va7pGjZseNXnXPxQsVqtkqTz58+X+QJcsWKFzp8/r5iYGMdzzpw5o8TERNWvX7/UfNzc3Bwf6hcfnz9/XtKF7TVw4EBNmDDhsrU0atSoPKsnSUpJSVF4eLgWLFggi8WiDh06aMuWLWrfvr3pNFartdQJos689+vKdpozZ44sFosiIiKq/G/+Sp9Ll+PM+/NSmzZt0uLFi7VkyRJ5e3tr5cqVeu+99ypdd3W59dZbtX//fhUUFMjLy8vRvmXLFgUFBZlOd6XXf9asWfL19dVLL70kd3d3jRw5stTreOl70mq1Om5ueKmaeC9cinMUXOTf//63br75Zn355Zdav3691q9frzfffFPLly/X3r17FRQUpAcffFAxMTHatm1bmekLCwvVunVrSRfGEy83/n2ta968uc6dO+cYl754k6/Lady4sYKDg7V8+XJJF+71sXv37lIJe+nSpZKkX375RTt37lSXLl0UEhKi7OzsUmdxb9269ap7EhXxxBNPaNGiRY6z752p+bfrWFhYWKFlL1u2TK+99prjvfbll1+qU6dOWrt2bbnmEx4erhUrVujXX3+VdOHOrNu3b3dq2p49e2r+/PmObZufn6/c3Fzl5uYqICBAFovFsafp6+urLl26aPfu3Y5x/+XLl6t9+/Zq3LixbrrpJmVmZqq4uFjFxcWlzjmoi9tJuvA3f3E7ff31146/i6p0pc+l8ujevbu+/PJL/fLLL5IuBJCTJ0/qxIkTaty4sby8vFRcXOz4m6ytWrdurb59+yolJcXxRbxnzx7NmzdP48aNU8+ePbV48WLHUa5jx45JuvLrX1hYqJYtW8rd3V179uyp0K9dauK9cCmOKLjI0qVLFR0dXaotJCREdrtdycnJOnbsmKxWq5o2baoZM2aUmf6ZZ57R2LFj9bvf/U49evQolXavVePHjy+VpqdPn66pU6fq4Ycflre3t37/+99fcfqZM2cqKSlJCxYskLu7u9LS0krdMbSkpERxcXE6c+aMpk2bJh8fH0nS3LlzlZ6erhdeeEHnzp3TDTfcoHnz5lX6aMVvtWzZUrGxsXrzzTedrvlS3bp105tvvqmYmBh17dpVf/nLX5xa7k8//aSCggLdeeedpdqjo6O1dOlS08PHl3PHHXfoySef1J/+9CeVlJTo3Llz6
tu3rzp06HDVaZ955hm98MILioqKktVqVdeuXfXwww+rYcOGmjBhgqxWq+rXr68WLVqoTZs28vb2VlpamiZNmqTz58/L29tb6enpki4MH3Tr1k1//OMf5efnp3bt2unIkSN1djv95S9/0cSJE/Xcc89p9uzZ6tix4xX3aCvqSp9Lhw4dcqp+SWrTpo2ef/55/fnPf1ZJSYmsVqteeukl9ejRQx999JEiIiLUrFkzhYaGXnZHqDZJTk7WrFmzFBkZqXr16snT01NTp05V165dZRiGcnNzNXjwYLm7u6thw4ZauHDhFV//P/3pT5o8ebI++OAD3XzzzbrjjjvKXVNNvBcuxU2hcF0ICgrS5s2bK3TIFwCuZww9AAAAUxxRAAAApjiiAAAATBEUAACAKYICAAAwRVAA4HKjR48u92/1AdQMTmYEUKNmz56t7OxszZw5s8aXPXz4cMXExDgurQvg6jiiAAAATBEUAFxRTk6Oxo0bpzvvvFNhYWGaNm2a7Ha75s6dq549e6pbt26aPHmy47LJBw8eVFBQkJYvX67f//73CgsL09///ndJF24b/frrr2vNmjUKCQlx3FNh+PDhev/99yVduIzy0KFDlZqaqjvuuEPh4eH64osvHPUUFhYqMTFRd999t3r06KFXXnnFcT38K037yiuvaNOmTZo2bZpCQkIct28GcGUEBQCmSkpK9Oijj6pVq1aOex9ERkZq2bJlWr58ud555x19+umnOn36dJkv3h9++EFr167V22+/rddee01ZWVm655579Oijj6pfv3768ccf9dFHH112uVu3btXNN9+sb775RqNHj9bUqVMd9z5ISEiQu7u71q1bpw8//FBff/21I2Rcado///nPCg0NVVJSkn788UclJSVV34YD6hCCAgBTW7duVV5eniZPnqyGDRvK09NToaGhWrlypR566CHdcMMNatSokZ566il9/PHHjrsnStK4ceNUv359tWvXTu3atdPu3budXm6rVq10//33y2q1qn///jpy5IiOHj2qo0eP6osvvlBiYqIaNmwoHx8fPfTQQ1q9evVVpwVQMdwUCoCpnJwctWrVSu7upT8q8vLyFBAQ4HgcEBCg8+fPy2azOdqaN2/u+HeDBg10+vRpp5f722kl6fTp0zp+/LjOnz+vu+++29Fvt9vl7+9/1WkBVAxBAYApf39/5eTk6Pz586XCgp+fnw4dOuR4fPjwYbm7u8vHx8dxa10zlbkrZ8uWLeXh4aFvvvmmTHgBUD0YegBgqlOnTvL19dXLL7+s06dPq6ioSD/88IOioqL09ttv68CBAzp16pReeeUV9evXz6kvbx8fHx06dEh2u73c9fj5+al79+566aWXdPLkSdntdu3fv1/fffedU9M3b95cBw4cKPdygesZQQGAKavVqnnz5ik7O1s9e/bUPffcozVr1mjgwIGKiYnRsGHD1KtXL3l4eOjZZ591ap59+/aVJIWFhal///7lriktLU3nzp1TZGSk7rjjDo0fP15HjhxxatoRI0YoIyNDd9xxh6ZPn17uZQPXIy64BAAATHFEAQAAmCIoAAAAUwQFAABgiqAAAABMERQAAIApggIAADBFUAAAAKYICgAAwBRBAQAAmPp/6zHDCFEmCGUAAAAASUVORK5CYII=\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "We find that Asia is by far the most frequent continent of origin, while Oceania is the least." 
], "metadata": { "id": "Q6Auwmhp8-Pf" }, "id": "Q6Auwmhp8-Pf" }, { "cell_type": "code", "source": [ "plott('education_of_employee')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "Ii-IjHtVqc6M", "outputId": "040546c6-c8f3-4e3d-a2e2-ff66aaf3bb64" }, "id": "Ii-IjHtVqc6M", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Bachelor's and Master's degrees are almost equally common, at around 10,000 records each." ], "metadata": { "id": "sL111zEg9FO8" }, "id": "sL111zEg9FO8" }, { "cell_type": "code", "source": [ "plott('has_job_experience')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "I0XeSsPFqc29", "outputId": "ecac6e2d-1442-408a-bf4c-e9cfebfd484c" }, "id": "I0XeSsPFqc29", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "visa['has_job_experience'].value_counts(normalize=True)" ], "metadata": { "id": "Jv_wkqWa92wL", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "3c233b56-4963-4fbc-aa6c-0368b0b05b64" }, "id": "Jv_wkqWa92wL", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Y 0.580926\n", "N 0.419074\n", "Name: has_job_experience, dtype: float64" ] }, "metadata": {}, "execution_count": 15 } ] }, { "cell_type": "markdown", "source": [ "About 58% of applicants have some job experience, but over 10,000 (about 42%) have none." ], "metadata": { "id": "rAcy6OtY9fuC" }, "id": "rAcy6OtY9fuC" }, { "cell_type": "code", "source": [ "plott('requires_job_training')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "2xRPgoaBqcz0", "outputId": "e12d110d-01b3-4b7a-ae08-151a2efbc3c2" }, "id": "2xRPgoaBqcz0", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Unlike the previous feature, which had fairly balanced classes, this one is imbalanced: most applicants do NOT require job training." ], "metadata": { "id": "YTL-Inno9vo4" }, "id": "YTL-Inno9vo4" }, { "cell_type": "code", "source": [ "plott('no_of_employees')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "nY8LC0PArFkp", "outputId": "7a79b2c5-2e33-4ca6-9d0c-b941d27e6a51" }, "id": "nY8LC0PArFkp", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "plt.figure(figsize=(16,5))\n", "plt.suptitle('Feature: no_of_employees',fontsize=18)\n", "\n", "plt.subplot(1,2,1)\n", "plt.title('Boxplot (with fliers)',fontsize=14)\n", "sns.boxplot(data=visa,x='no_of_employees')\n", "\n", "plt.subplot(1,2,2)\n", "plt.title('Boxplot (fliers removed)',fontsize=14)\n", "sns.boxplot(data=visa,x='no_of_employees',showfliers=False)\n", "\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 375 }, "id": "iE_XQm8DrRHY", "outputId": "1018bb78-e224-4d3b-9ca3-3ae81d45961e" }, "id": "iE_XQm8DrRHY", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAA4sAAAFmCAYAAADEVg8WAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nOzde3zO9eP/8edmYxi2OZND1JazOYycN0J9bCY5hEkkCSU/OdQnFZWQSiSVQykUZn2RfBI5DI1yKOrjsGJjQmbLzDbb3r8/1nV9du19bTa2a2yP++3mdrP34XV6X9vrel7v9/t6OxmGYQgAAAAAgEycC7sBAAAAAIDbD2ERAAAAAGBCWAQAAAAAmBAWAQAAAAAmhEUAAAAAgAlhEQAAAABgQlgEABQrK1asUM+ePdW4cWP5+PjozJkzhd2kfBERESEfHx+tW7eusJsCACgiXAq7AQBQ2CIiIjR06NBs13/55Zdq3rx5gdX/ySefqHz58nr44YcLrA5k+OGHHzR9+nR17dpVI0eOlIuLi7y8vAq7WQAA3JYIiwDwj169eqlTp06m5bVr1y7QepcvX66aNWsSFh1gz549kqQ33nhDHh4ehdwaAABub4RFAPhHw4YN1bt378JuRr66fv260tPTVapUqcJuym3h4sWLkkRQBAAgF7hnEQDyYNOmTXr00Ufl6+urZs2aqV+/ftq8ebPd7Z566il16dJFjRs3Vps2bfT000/rv//9r812Pj4+Onv2rPbt2ycfHx/rP8t9dD4+PpoyZYqp/HXr1snHx0cRERHWZfPnz5ePj49OnDihmTNnqlOnTmratKkOHTokSUpJSdGiRYv0r3/9S02aNFGrVq301FNP6ddffzWVHxMTo8jISF2/fv2GY2Jpy969e7VkyRJ169ZNjRs3Vo8ePRQWFmZ3nzVr1qhPnz5q2rSpWrZsqeHDh+vHH3+8YV3Z+e677zRw4EA1b95cvr6+GjhwoL777jvr+jNnztjcz2cZ55CQkDzVc+XKFc2ZM0cPPPCAGjdurLZt22rChAmKjo622S7zmCxYsED+/v5q2rSp+vXrZz0e+/bt06OPPqrmzZurQ4cOev/99031BQQEKCQkREePHtXQoUPl6+srPz8/TZ48WZcuXcpVmxMTEzV37lzrcWnfvr0mTZqks2fPWrf59ddf5ePjo3feecduGU8++aRatGihxMRE67ILFy7o5Zdftr7GO3TooJdeesluu3I7bsnJyZo/f7569OihZs2aqVWrVgoMDNSsWbNy1VcAQP7izCIA/OPatWuKjY21WVayZEm5u7tLkt555x0tWrRIHTt21LPPPitnZ2dt2bJFzz77rKZNm6bBgwdb9/v888/l4eGh/v37q3LlyoqKitLq1av16KOPKiwsTHXr1pUkzZ49WzNnzpSnp6eeeuop6/63ch/dxIkT5ebmpuHDh0uSKleurOvXr2vEiBE6ePCgevfurcGDByshIcHaps8//1xNmjSxljF58mTt27dPW7du1V133ZWret955x0lJSVpwIABKlmypFatWqUpU6aodu3aatmypXW7OXPmaPHixWratKkmTJhgbcdjjz2mhQsXqnPnznnq74oVKzR9+nTVq1dPTz/9tCQpLCxMY8aM0fTp0zVgwAB5eXlp9uzZWr16tX788UfNnj1bklSpUqVc13PlyhUNHDhQMTEx6tu3r+69915dvHhRK1euVL9+/RQaGqqaNWva7PPWW28pPT1dQ4cO1fXr17V06VINHz5cs2fP1osvvqj+/fsrMDBQ33zzjd577z3dddddprPbf/75p4YNG6bu3burR48e+vXXXxUaGqojR45o7dq1Kl26dLZtthz3AwcOqEePHnr88cd1+vRprVq1Srt371ZoaKiqVaumhg0bqlGjRgoLC9MzzzyjEiVKWMs4f/68wsPD1bdvX5UpU0ZSxocJAwYM0PXr1
/XII4+odu3a1nIjIiIUGhqqcuXK5XncXn31VYWGhio4OFi+vr5KS0vTqVOnbD4UAQA4kAEAxdwPP/xgeHt72/03fvx4wzAM48iRI4a3t7cxd+5c0/6jR482fH19jStXrliXXb161bTdyZMnjUaNGhkvv/yyzXJ/f39jyJAhdtvm7e1tTJ482bQ8NDTU8Pb2Nn744Qfrsvfee8/w9vY2hgwZYly/ft1m+2XLlhne3t7Gzp07bZZfuXLF6Ny5s6n+IUOGGN7e3kZ0dLTddtlrS+/evY3k5GTr8j///NNo1KiR8dxzz1mXRUZGGj4+PsbAgQNN27Zs2dLw9/c3UlNTb1inRVxcnNG8eXOjW7duNuN/5coVo2vXrkbz5s2N+Ph46/LJkycb3t7euS4/sxkzZhhNmjQxfvvtN5vlZ86cMXx9fW2Ok2VMgoODbfr53XffGd7e3kbDhg2Nn3/+2bo8OTnZaN++vdG/f3+bsv39/Q1vb29j2bJlNsstx/PDDz+0LrO8jkNDQ63LvvzyS8Pb29uYNWuWzf7ff/+94e3tbUycONG67IsvvjC8vb2N7du322y7cOFCw9vb2zh8+LB12VNPPWW0bdvWOHfunM22P//8s9GgQQPjvffeu6lxa926tfHEE08YAIDbA5ehAsA/BgwYoGXLltn8Gz16tCRpw4YNcnJyUnBwsGJjY23+BQQE6OrVq9bLCyVZz8AYhqGEhATFxsbK09NTd999t37++ecC7cdjjz0mFxfbC0fWr1+vevXqqVGjRjZtT0lJUbt27fTTTz8pKSnJuv1nn32mY8eO5fqsoiQNGjRIJUuWtP5ctWpV3X333Tp16pR12datW2UYhp544gnTtg8//LDOnj1r97LY7OzevVuJiYkKCQmxngGWJHd3d4WEhCgxMdH6pTa3wjAMbdiwQa1bt1aVKlVsxrB06dJq3ry5wsPDTfs9+uijNv1s1aqVJKlp06Y2Z3JLliypJk2a2IxV5r4MGjTIZtmgQYPk7u6uLVu25NjuLVu2yNnZWaNGjbJZ3qVLFzVo0EBbt25Venq6pIwveCpTpozWrl1r0+/Q0FB5e3uradOmkjLOFG7fvl0BAQEqWbKkzVjUrFlTtWvX1u7du29q3Nzd3XXy5EkdP348x34BAByDy1AB4B916tRRu3bt7K6LjIyUYRh68MEHs93/r7/+sv7/119/1bx587Rv3z6b+7wk5SmA3QzLJa6ZRUZGKikpSffff3+2+12+fFnVq1e/6Xpr1aplWubh4WFzb5zlXsx7773XtK1lWXR0tE2Qykluy7tVsbGxiouLU3h4eLZj6Oxs/vw165hUqFBBkv3XQIUKFRQXF2e3jMyBU8oIl7Vq1bph386cOaMqVapY683snnvu0W+//abLly+rYsWKKlu2rHr16qWwsDDFxsbKy8tLERERio6O1gsvvGDd748//lB6errWrl1rEyzt9Tuv4/bCCy9o0qRJCgwMVK1atdSmTRv5+/srICDA7vgCAAoWYREAcsEwDDk5Oenjjz+2uZ8rs3vuuUdSxv1cgwcPlru7u0aPHq169eqpdOnScnJy0htvvGEKjzcjLS0t23Vubm522+/t7a2pU6dmu9+tPm+wKL+ZNwxDktSuXTuNHDky1/tlNybZvYYKW//+/bV69Wp99dVXGj58uNauXauSJUva3EdpGYugoCD16dPHbjmWb9/N67h169ZN27Zt044dO7R//37t2bNHa9euVatWrbRs2TJTaAYAFCzCIgDkQt26dbVr1y7VqFFD9evXz3HbLVu2KDExUR988IHatm1rsy4uLi5Pb3g9PDzsnm3K69myOnXq6PLly2rbtm2hhjrLGacTJ06Ynl958uRJm23yWl7WM1c3U152vLy8VL58eSUkJGR79rmgREdHKyUlxeZ1k5KSoujoaNWrVy/HfWvVqqVdu3bp77//Vvny5W3WRUZGyt3dXZ6entZlTZo0UcOGDbV27Vo98sgj+vbbb
9WtWzebR43Url1bTk5Oun79+g3H4mbGzcPDQ71791bv3r1lGIbeeustLV68WFu3bs3xzD4AIP8V3Y+BASAfBQUFSZLefvttu2f1Ml+CajlrZDmrYrF69Wrrc/4yK1u2rN1AKGWE1EOHDunatWvWZfHx8dZHQORWcHCwLl68qGXLltldn7n9Ut4enZEXAQEBcnJy0pIlS2zKvnDhgtatW6eaNWuqYcOGuS6vffv2KlOmjD7//HMlJCRYlyckJOjzzz9XmTJl1L59+1tut7OzswIDA/Xzzz/bfVSKpFw/yiKvEhIStHLlSptlK1euVEJCgrp165bjvt26dVN6ero++ugjm+U7duzQr7/+avfyzn79+ikyMlIzZsxQcnKy+vXrZ7Pe09NTnTt31pYtW2zu07UwDMP6rcJ5Gbe0tDT9/fffNuucnJysr4f4+Pgc+woAyH+cWQSAXGjatKnGjRun+fPnKzg4WD169FDVqlV14cIFHT16VDt37tSRI0ckSZ06dVLp0qU1adIkDRkyROXLl9eBAwe0c+dO1a5d2xQ2mzVrprVr1+rdd99V/fr15ezsLH9/f5UpU0aDBw/W888/r8cee0y9e/fW33//rTVr1qhGjRp2g2d2hg4dqj179mj27Nn64Ycf1LZtW7m7uysmJkY//PCDSpYsqc8++8y6/c08OiM36tWrpxEjRmjx4sUaMmSIHnzwQV29elWrV69WYmKi3nrrrTxdolm+fHlNnDhR06dPV//+/a2XRYaFhen06dOaPn269REOt+q5557TgQMHNH78eD344INq1qyZXF1dFRMTo507d6pRo0Z6880386WuzGrXrq33339fJ06cUKNGjXT06FGFhoaqXr16N3xOZJ8+fRQWFqaPP/5YZ8+eVatWrRQVFaWVK1eqUqVKmjBhgmmfoKAgzZkzR+vXr9ddd91l917DV155RYMGDdKQIUPUu3dvNWzYUOnp6YqOjtbWrVsVHByscePGScr9uF29elUdOnRQQECAGjZsKC8vL505c0arVq1ShQoV5O/vnz8DCgDINcIiAOTS2LFj1bhxY3322Wdavny5EhMTVbFiRd1777168cUXrdvVrl1bH3/8sd5++20tWrRIJUqUUIsWLfTZZ59pxowZNl/4ImW8mY6Pj9fKlSv1999/yzAMbd26VWXKlFFQUJAuXLigFStWaObMmapVq5aefvppOTs76/Dhw7luu6urqz788EOtXLlS//d//6f58+dLkqpUqaImTZpke+9ZQXj++edVp04drVy5UnPnzpWrq6uaNWumuXPnWr8tNC8GDx6sKlWqaMmSJdYH29933316//33b3jmLS/KlSunVatWaenSpdq8ebO2bt2qEiVKqFq1amrZsqXpDFx+qVatmt59913NmjVLX3/9tVxdXRUYGKjJkydbv3U3O66urlqyZIk++OADbdq0SVu2bFG5cuXUs2dPjR8/3u4XGrm7u+vBBx9UaGioHn74YTk5OZm2qV69ukJDQ/Xxxx9r27ZtWr9+vUqVKqXq1avL39/f5nLR3I6bm5ubHnvsMe3du1d79+7V1atXVaVKFQUEBGjUqFGqWrXqLY4kACCvnIys10kBAIDbQkBAgGrWrGlz1tcRXnnlFa1evVrbtm1TtWrVHFo3AOD2wT2LAADA6sqVK1q/fr06depEUASAYo7LUAEAt524uLgbfrmOm5vbLd+PmJSUpCtXrtxwu8qVK99SPXeC48eP69dff9VXX32lxMREjRo1qrCbBAAoZIRFAMBtZ9y4cdq3b1+O2/Tp0+eWv1Bm06ZNOT570uLYsWO3VM+d4D//+Y8WLFigqlWr6uWXX5avr29hNwkAUMi4ZxEAcNs5cuSI6TEKWVWpUkX33HPPLdVz4cIF6/MYc+LoZysCAHA7ICzijhcSEqJ7771X06ZNK5Dyp06dqpo1a2rs2LF53jciIkJDhw7V3r175eXlle12AQEBGjx4sEaMGJGn8r/88kt98
MEH+vPPPzVmzBjVrFlTM2bM0MGDByVJ69ats/m5oDzzzDNq3ry5hg8fXqD1AAAcw9Fz67Vr1zR58mTt3r1bCQkJ2rp1q6ZOnWrThoJuU3G3ZMkSrVixQtu2bZMkzZo1SykpKXrppZcKuWUoTHzBDW7KlClT5OPjY/3Xpk0bjRo1SpGRkYXdtBtat25dri+vOnbsmL777jsNGzbspury9fVVeHi4PD0981z3jcTHx2v69OkaMWKEdu7caTeoPfTQQ/ruu+/ypb6cjBkzRosWLcrVvV8AAPuK89waGhqq/fv3a+XKlQoPD7f7WJf58+fbfTYoCsbIkSMVFham6Ojowm4KChFhETetXbt2Cg8PV3h4uJYuXaqkpKSbOvt2O/v888/VvXt3ubu739T+JUuWVOXKle0+p+xWxcTEKDU1VV26dFGVKlVUtmxZ0zZubm6qWLHiLdWTkpJyw218fHx01113af369bdUFwAUd8V1bj19+rTq168vHx8fVa5cWSVKlDDt5+HhcdPzscWNvjgrN3IzLxYFXl5e6tChg1auXFnYTUEhIiziplmCUOXKldWoUSMNGzZMv//+u5KSkqzbHDt2TMOGDVPTpk3l5+enKVOmWM8+7du3T40aNVJERIR1+y+++EItWrSwfooVEhKiadOm6bXXXlPr1q3VunVrzZo1S+np6dm2Kz4+XpMnT1br1q3VtGlTDRs2TCdOnJCUcVno1KlTlZiYaP3k1vJw8qzS0tL0zTffyN/f37ps1apV6tmzp/XnPXv2yMfHRx999JF12cSJE60PaI+IiJCPj49iY2NvWHdycrKmTZumFi1aqFOnTlq8eHG2fVy3bp2Cg4MlSd26dZOPj4/OnDljd7usn/Ru27ZNDz/8sJo0aaKAgAC98847NhNfQECA5s+fr6lTp6pVq1aaOHGiJGnBggXy9/dX48aN1b59e02aNMmm3ICAAG3cuDHbNgMAbqw4zq0hISFavny59u/fLx8fH4WEhNjdNyQkRNOnT7f+nJKSojlz5qhTp05q1qyZ+vbtq127dlnXW+bgHTt26JFHHlHjxo0VHh6uc+fOafTo0fLz81OzZs3Us2dPff3119n2fcqUKRo1apQ++ugjderUSZ07d5YknT9/Xs8995x1DJ988kmdOnXKut/8+fPVq1cvhYWFKSAgQM2bN9fUqVOVkpKiFStWqHPnzmrTpo1mzpxpM/Y5jXVCQoKaNm1qvVTUIjw8XI0aNdKlS5dy1TZJ+vjjj9W+fXv5+vpq0qRJSkxMNPU9ICAgx7FB0UdYRL5ISEjQpk2b5O3tLTc3N0lSYmKiRowYoTJlymjNmjVasGCBDh48qBdeeEGS5OfnpxEjRmjSpEmKj49XZGSk3nzzTb300kuqVauWtewNGzbIMAx98cUXevXVV7V69Wp9+umn2bZlypQpOnz4sBYuXKg1a9bIzc1NTzzxhJKSkuTr66sXXnhBpUuXtn5ym919dseOHdOVK1fUpEkT6zI/Pz/98ccfunjxoqSMicjT09NmUt6/f7/8/PxM5d2o7k8//VTe3t4KCwvTyJEjNWfOnGzvNXzooYesYXLNmjXZXrKT1a5duzRx4kQNHjxYX3/9td544w1t3rxZ77zzjs12y5YtU7169RQaGqoJEyboP//5j5YuXaqXX35Z3377rRYtWqSmTZva7NO0aVP98ssvNm9oAAA3r7jMrfPnz9fDDz9svXUju6CZ1dSpU7V//37NnTtXGzduVJ8+fTR69Gj997//tdnurbfe0vjx4/XNN9+oWbNmevXVV5WUlKTly5dr48aNeuGFF274GJ59+/bp2LFjWrx4sT755BNdu3ZNQ4cOValSpfTZZ5/piy++UOXKlfX444/r2rVr1v3Onj2rrVu3atGiRZo/f742b96s0aNH68iRI1q6dKlee+01ff7559qyZUuuxtrd3V3+/v7asGGDTfs2bNigdu3aqWLFirlq26ZNmzRv3jyNGzdO69at0
913361ly5aZ+t2kSROdP39eUVFRuTomKHp4dAZu2q5du6xnrRITE1W9enWbM2wbN27UtWvXNHv2bOtlI9OnT9fQoUN1+vRp1alTR+PGjdPu3bv14osv6uzZs+rSpYv69OljU0+VKlX073//W05OTqpfv75OnTqlZcuW6fHHHze16dSpU9q2bZs+//xztW7dWpI0Z84cdenSRRs2bFC/fv1Urlw5OTk53fC5aTExMabt6tevr8qVKysiIkK9evXSvn37NHz4cH3wwQdKTU3V2bNn9eeff6pNmzam8kqWLJlj3e3bt9eQIUMkZXx6+tlnn2nv3r127wFxc3OTh4eHpIzLRHL7DLhFixZpxIgR6tu3rySpdu3aev755/X8889r0qRJ1stl/fz8NHLkSOt+33//vSpXrqz27dvL1dVVNWrUsJnopYzjdP36dV24cEG1a9fOVXsAALaK49zq4eGh0qVLy9XVNdfzWVRUlL7++mtt27ZNNWrUkCQNGTJEe/bs0RdffKFXXnnFuu3YsWPVoUMH689nz55Vjx49dN9990mSTYjOTqlSpTRz5kyVLFlSkrR27VoZhqGZM2da587p06erXbt2+v777/XQQw9JyjiTOnPmTJUrV07e3t7q2LGj9u3bpw8++EAlS5ZU/fr11aJFC0VERKhHjx65GuugoCBNmDBBCQkJcnd3V1JSkrZs2aJXX31VkvT111/fsG3Lly9XcHCwBg4cKEkaPXq0IiIiTKGwatWq1jFjbi+eCIu4aa1atdKMGTMkZVwysWrVKg0fPlxr1qxR9erVFRkZKR8fH5v7C3x9feXs7KyTJ0+qTp06cnV11dy5c9WrVy95eXnZ/VSzWbNmNvf8+fr6at68edY/kplFRkbK2dlZzZs3ty6z/IHOzdfjZ5aUlCQXFxc5O9uegG/durX27dunrl276pdfftH8+fP1xRdf6JdfftHJkydVu3ZtVatWLU91SRn3/WVWpUoVxcbG5rmcnBw9elQ///yzzSWu6enpSkpK0sWLF1WlShVJUuPGjW3269mzp5YvX66uXbuqQ4cO6tixo7p27WqdNCVZP/XmzCIA3LziOrfm1dGjR2UYhv71r3/ZLE9JSVHbtm1tlmWd04YOHapXXnlFu3btUtu2bfXAAw+Ytsnq3nvvtZnzjh49qjNnzqhFixY22127ds3mC2GqV69uc9ayYsWKqlu3rk1ZFStWtF4+mpux7tSpk9zc3PTdd98pODhY27Ztk2EY6tatW67bFhkZqUceecRmffPmzU1hsVSpUpKY24szwiJuWunSpVWnTh3rz40aNVKrVq305Zdfavz48Tnum3mCOnTokNLT03XlyhXFxsaqfPnyBdLevH7JjKenp65fv65r166pdOnS1uV+fn765JNPdPDgQdWpU0eVKlWSn5+fIiIidPLkSbuXoOaGi4vtr6OTk1OO94/cjPT0dI0dO9bmvkuLzI/2yNxfKWOy27x5s/bu3as9e/Zo1qxZev/997V69WqVKVNGUsabmqzlAADyprjOrXllGIacnJy0du1a0/xp+fDSIms9/fr1U8eOHbVjxw7t2bNHAwcO1KhRozRu3Lhs67PMdRbp6em67777TLdxSFKFChWs/3d1dbVZ5+TkZHdZbuZ7y1i7urrqwQcf1IYNGxQcHKz169frgQcesPYzt23LDeZ2cM8i8o2Tk5OcnJysnz7Vr19fx48fV0JCgnWbgwcPKj09XfXr15ckRUdHa8aMGZo2bZratWun559/XqmpqTblHj58WJkfB3ro0CFVqVLF7jei1a9fX+np6Tp06JB1WUJCgo4fP26t09XVVWlpaTfsT4MGDSTJ9Kmpn5+fTp06pQ0bNliDoSUsZne/okVu6y4oDRs21O+//646deqY/mWdbLMqVaqUunTpohdeeEFr167ViRMndODAAev648ePq2rVqqpUqVJBdwMAio3iMrfmVYMGDWQYhi5evGiazyyXTuakWrVqGjBgg
ObNm6dnnnlGX375ZZ7qb9SokaKiouTp6Wmq33KbyM3IzVhLUlBQkPbu3auTJ08qPDxcQUFBeWpb/fr1dfjwYZu6s/4sSSdOnJCrq6u8vb1vuk+4sxEWcdNSUlJ08eJFXbx4UZGRkZoxY4YSExOt33AWGBgoNzc3TZ48WceOHdP+/fs1bdo0de/eXXXq1FFaWpomTZqk1q1ba+DAgXrttdd07tw5LViwwKaeCxcu6PXXX9fvv/+uzZs3a8mSJdk+97Bu3brq2rWrpk2bph9//FHHjh3TxIkT5e7ursDAQElSzZo1lZycrN27dys2NtbmRvTMvLy81KhRI/300082yy33La5fv956b6Kfn5/27duX7f2KFrmtu6CMGTNGGzdu1Lx583T8+HFFRkZq8+bNmj17do77rVu3TmvWrNGxY8cUHR2tdevWydXV1ebT759++snmnhAAQN4V17k1r+6++24FBgZq6tSp2rx5s6Kjo/XLL79oyZIl+vbbb3Pc97XXXtPOnTsVHR2t3377Tbt27dI999yTp/oDAwNVsWJFPf3009q3b5+io6O1f/9+vfnmm6ZvHc2L3Iy1JLVo0UI1atTQ//t//08eHh66//7789S2oUOHKiwsTKtXr9apU6f04Ycf2g2LP/74o1q2bHlLZ4FxZ+MyVNy0PXv2WMNB2bJlVa9ePc2bN88alkqXLq0lS5bojTfeUL9+/VSqVCl17drV+liJRYsWKSoqyvqNXp6enpo1a5aefPJJdejQQa1atZKU8UcvPT1d/fv3l5OTkx555JFsJzRJmjlzpt544w2NHj1aycnJatGihRYvXmy9LKVFixYaOHCgJkyYoLi4OI0dOzbbS0/69++vNWvWmOpr3bq1vvnmG+tZxLvuuktVq1ZViRIlcrxfMS91F4SOHTvqww8/1MKFC7V06VKVKFFCdevW1cMPP5zjfuXLl9fHH3+sWbNmKTU1VfXr19f8+fOtXwqQnJysLVu2aMmSJY7oBgAUWcV5bs2rmTNnatGiRZozZ47Onz+vChUqqEmTJjl+aCtlXMJqCdFly5bV/fffrylTpuSp7tKlS2vFihWaO3eunn32WV25ckVVqlRRmzZtbvmS3xuNtUVgYKAWLlyoYcOG2afwNvcAACAASURBVDyXMjdte+ihhxQdHa133nlHSUlJCggI0OOPP66wsDCbOjZu3KhnnnnmlvqDO5uTkfkaBOA2ExISonvvvVfTpk0rlPqTk5P14IMPavbs2dYJFmYrVqzQ1q1btXTp0sJuCgDgBphbkRvbt2/X7NmztX79+hveqoKii8tQgRyUKlVKs2bNUlxcXGE35bbm4uKif//734XdDADAHYC59c6QmJiomTNnEhSLOY4+cAOW5xwhewMGDCjsJgAA7iDMrbc/y7MiUbxxGSoAAAAAwITLUAEAAAAAJoRFAAAAAIAJYREAAAAAYHLDL7i5fPmq0tNv/bbGihXddelSwi2Xcyehz8VDceyzVDz7TZ9vjbOzkzw9y+ZLWcVdfs3NUtF6XdOX2xN9uT0Vpb5IRas/juzLjebmG4bF9HQj3yak/CrnTkKfi4fi2GepePabPuN2kJ9zs6W8ooK+3J7oy+2pKPVFKlr9uV36wmWoAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADAhLAIAAAAATAiLAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADBxcUQlK1cu12+//aK0tHQ1btxMgwYNdUS1AACgGPn444917NiJwm5GvnB1LaHr19NslsXHx0mSKlTwKIwm3TR7fSkItWrV4T0mkM8cEhajo0/r7NkYSXfeHzgAAHBn+P3333Xsx
EmVcCua7zXSkjLC4sW/Uwu5Jbcfy9gAyF8OCYuSJGfHVQUAAIqnEm4eKlOna2E3o0Aknt4qSUW2f7fCMjYA8hf3LAIAAAAATAiLAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADAhLAIAAAAATAiLAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADAhLAIAAAAATAiLAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADAhLAIAAAAATAiLAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADAhLAIAAAAATAiLAAAAAAATwiIAAAAAwMQhYTE+Pk4y0q0/7969U7t373RE1QAAwA7mYgC4cznqb7iDwmK8TVgMD9+h8PAdjqgaAADYwVwMAHcuR/0N5zJUAAAAAIAJYREAAAAAYEJYBAAAAACYEBYBAAAAACaERQAAAACACWERAAAAAGBCWAQAAAAAmBAWAQAAAAAmhEUAAAAAgAlhEQAAAABgQlgEAAAAAJgQFgEAAAAAJoRFAAAAAIAJYREAAAAAYEJYBAAAAACYEBYBAAAAACaERQAAAACACWERAAAAAGBCWAQAAAAAmBAWAQAAAAAmhEUAAAAAgAlhEQAAAABgQlgEAAAAAJgQFgEAAAAAJoRFAAAAAIAJYREAAAAAYEJYBAAAAACYEBYBAAAAACaERQAAAACACWERAAAAAGBCWAQAAAAAmBAWAQAAAAAmhEUAAAAAgAlhEQAAAABgQlgEAAAAAJgQFgEAAAAAJoRFAAAAAIAJYREAAAAAYEJYBAAAAACYEBYBAAAAACaERQAAAACACWERAAAAAGBCWAQAAAAAmBAWAQAAAAAmhEUAAAAAgAlhEQAAAABgQlgEAAAAAJgQFgEAAAAAJoRFAAAAAICJS2FUeuzYb5Kk4cMHFUb1RYK7u7sSEhLk7Oys9PR0m3WBgUHauHGDhgwZrjVrVsjDw1NxcZc1derLMgxDs2bN0GOPPaFPP12sKVOmqVatOpKkqKhTmjVrhnVZXNxlLVo0X6NHP6MKFTwkSXFxl7VgwTtycpJCQoZr+fKlSklJ1l9/XdTUqS+rcuXG1nZY9h806DEtX75EaWmpSktL019/XdTYsc9p/fowDRo0VMuXL5WTkzR27ARrPZnLWLDgHaWlpalEiRIaOnSEVq78VKNHPyPDMPTuu3N04cKfGjNmglau/ETnzsWoWrXqGjx4mBYufFdTpkxTuXLl/2nHUK1cuVyBgcFauHCeTd8zt9dStr3/Zx2H1NRUlS5dSk899ayp7bkZz6z12ut/5rZnbk/mZRUqeFjbJEnjxk2wWZaamipXVxeNHTtB8fFxNu3Kyl6dmcuyHKvKlctlW2du+pnd6yu7sbDXxpz6nt/i4i5r7tzX9cQTY25Yfm77kZs686OczGUNGvSY9XfIXpm3clwAAEDRUuKVV155JacNrl1LkWHcWiWbNq1Xamqq5OSsShW9dOnSX7dWIJSSkiJJMuwcnOPHj0mSfv75kFJTU5WQkKDU1FQdP/5f/fTTfl2+HKuDBw8oJSVFx4//VwEB3SVJc+a8rsuXY63L1qxZpQMH9is5OVnNmvlKktasWaWDB3+0bhcVdVp///23tfzAwF5KTEyxbnvgwH4dP/6boqJOKz4+zrrtoUMHdf78n9YyLl+OVUrK/+qxsNQXHx9nrfP06VNKTk7W8
eP/1eHDB5SamqrDhw/o8uVYSVJCQoIOHz6oa9cSdfz4f3Xp0l//tCNj30OH/rfO0vfM7bWUbe//WcchPj5Oly5dstv23Ixn1nrt9T9z2zO3J/OyZs18bY6NpT1Zxy8lJVnffrvJpl1Z2avTXvn3399Wn376id06c9PP7F5f2Y2FvTbm1Pf8tmbNKu3fv++GbbPXvlupMz/KyVzW8eO/2RzXG9VZtmwpffrpJ/nSDicnJ5UpU/JWuoF/5MfcvHv3TklS9+4PWP9u3+n27t2lvy5flatHvcJuSoG4Hv+HJBXZ/t2K6/F/yKtCGXXo0LlA6ylbtlSR+X0pSn2RilZ/ctMXy9/wW33N32hudvhlqJazinAE23cSMTFnFRNzVpKUlpZqXRYdfVpRUaes62Jizuro0V8UHr5DhmEoPHyn4uPjFBd3Wbt2bbcpL2v5f/yRMZHFxV227p91O0lKTLxqWhcevkPx8XHWn7PWZ6kjo007tGPH9zblZS3fsv2uXdutdRmGYbMuOvq0qb3h4Tus++zatUO7dtkbhx029WVte27G01zvTlP/M49h1vb8byx2KirqtE2bdu3aYVomSTt3fm/TLkv/c6ozo/xTNsciPHyH/vjjDzt1nspVP+1tl9NYZNfG7Pqe3b43K7dty+u2+VVnXsrKfFyzlmmvztjY2HxrBwAAuLM49jJUI/3G28DhPvxwgWnZBx+8p/T0jLCZnp6u9evXSZJSU9NyLOutt97SK6+8qQ0bwqz751ZqaqrWr1+nkJDhkqQNG8KyrS81NdXuWdXsts3Ohx8u0GuvzbFpb0bZ/9vXySnj/5nHwRK2s2v7Rx+9b7Pe3niGhAy3qTfzckv/s45h5vZYpKen66OPFti0KTU11bQso92242npv4W9OjPKf9/mWKSmpuqtt96yU+f7ueqnve0y/m9/LDIzl2Xue3b73qycjtOtbJtfdealLAt7Zdqr083NNd/agdtLfHyc4uPjNXXqVF2/nvPf9jtFdPRppaeWKOxmoBCkpyYpKuq0Zs2aUaD1uLqWKDK/L0WpL1LR6k9u+hIVdVoVKlQo8LbwBTewOeNokZh41foGPC0tVXv37tbevbuV9WxlVlFRUZKkvXt3m4LKjRiG8U8dspaRXX25DYo3Yul35vZmlG0p37DWlXkcstafte25Gc+s9WZennVdpppMdaelpVrPFmXezrws+/7nVKel/MzHwjAMRUVF2a0zN/20t11OY5FdG7Pre3b73qzcti2v2+ZXnXkpy8Jemfbq3L59e761AwAA3Fkce2bRyVkyikbiL0pq1KgpyTY4lClTVsnJyUpLS1WJEi66//72kqTvv9+qnAJj7dq1JUn3399eO3dutxN2sufk5GStx1JGdvU5OTnlS2C09D1zezPK1j/1OsnJKSMcZR6H7du32tSfte01atTM1Xhmrjfz8qzrMvXc2h6LEiVcVLVqVZ07F5NpuZNq1KiRZVn2/c+pTkv5MTExshwLJycn1apVS9HR0aY6z58/f8N+Vq1a1e522Y1Fdm3Mru/Z7XuzcjpOt7JtftWZl7Is7JVpr043N1d9++2WfGkHbi8VKnioQgUPzZw5UxcvXins5uSLt99+Q7/9fr6wm4FC4Oziptq1Kmny5JcKtJ7KlcsVmd+XotQXqWj1Jzd9Keiz6BacWYRGjRqrJ58cY7Ns9Ohn5Oyccb2js7OzgoIeVmBgH7m45Hx5z8SJEyVJgYF9rPvnlouLi4KCHrb+nFN9Li4ucnbO3aVGLi7ZfyYyatRYa12W9rq4uFjrdXFxUYkSGftnHgfLsuzanpvxzFpv5uVZ12WuJ2vdzs7OevLJsTbLXVxcTMskqUQJ2zGz9D+nOjPKH2NzLFxcXDRx4kQ7dY7JVT/tbZfTWGTXxuz6nt2+Nyu3bcvrtvlVZ17KsrBXpr06Bw4cmG/tAAAAdxaHh0UfnwaOrrIYs31zWKNGTeuZJMub6xo1aqpWrTqqXbuudV2NGjXVq
FETdejQWU5OTurQoZMqVPCQh4enOnbsYlNe1vLvvvtuSZKHh6d1/6zbSRln2rKu69Chs83X8metz1JHRps6q3Nnf5vyspZv2b5jxy7WujK+8el/6yyPjsjc3g4dOlv36dixszp2tDcOtt88lbXtuRlPc72dTP3PPIZZ2/O/seik2rXr2LSpY8fOpmWS1KmTv027sj46w16dGeXXtTkWHTp01t13322nzrq56qe97XIai+zamF3f8/vxDrltW163za8681JW5uOatUx7dXp5eeVbOwAAwJ2FR2fcodzd3ZWSkiJnZ2fTZYaBgUE6ceK4QkKG6/jx31SpUmWlpqZq/PhJatmytSIi9mj48FE6evQXjR//vPXN3z33eCsiYo91WZ06dfXHH5EaMmSY3NzcJEl16tTViRPH5eXlpeHDRykq6rTKli2rlJQUjR8/SdWrV7F+1a9l/2HDnlRU1GmVL19e5cqVU0pKisaMGa+//rqoYcNGKirqtLy8vBQSMtxaj4WlvvLlK6hixUoaPvwpxcSc0ZAhw+TtfZ9+++1XJScn6emnn9OpU5FKSLiiatWqa/jwUTp06CeNH/+8mjRp/k87Riom5qwGD37Mui7zG9/M/fX2vs/u/7OOQ7ly5VWtWlUNHvy4qe25Gc+s9drrf+a2Z25P5mVubm7WNnl6emno0OE2y8qVK69KlSopJGS4GjduatOurOzVmbksy7Hy8qqgSpVq2K0zN/3M7vWV3VjYa2NOfc9vderU1Zkzp/Too4/dsPzc9iM3deZHOZnLGjbsSevvkL0ys9ZZtmwpVapUI1/awaMz8g+PzrCPR2cUXzw6I++KUl+kotWf2+nRGU7GDW78unQpIc/fapnVmDFP6Nq1a5Kzi3zuvce6vKCvKy9sRena6dyiz8VHcew3fb41zs5OqljRPV/KKu7yY2623O/y1luzi8zr2nLPYpk6XQu7KQUi8fRWSSqy/bsViae36h7uWcyTotQXqWj1Jy/3LN7qa/5GczP3LAIAAAAATAiLAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADAhLAIAAAAATAiLAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADAhLAIAAAAATAiLAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADAhLAIAAAAATAiLAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADAhLAIAAAAATFwcUUmFChV0LSnZ+nOHDp0dUS0AAMgGczEA3Lkc9TfcQWHRQ39e+Mv6c/v2nRxRLQAAyAZzMQDcuRz1N5zLUAEAAAAAJoRFAAAAAIAJYREAAAAAYEJYBAAAAACYEBYBAAAAACaERQAAAACACWERAAAAAGBCWAQAAAAAmBAWAQAAAAAmhEUAAAAAgAlhEQAAAABgQlgEAAAAAJgQFgEAAAAAJoRFAAAAAIAJYREAAAAAYEJYBAAAAACYEBYBAAAAACaERQAAAACACWERAAAAAGBCWAQAAAAAmBAWAQAAAAAmhEUAAAAAgAlhEQAAAABgQlgEAAAAAJgQFgEAAAAAJoRFAAAAAIAJYREAAAAAYEJYBAAAAACYEBYBAAAAACaERQAAAACACWERAAAAAGBCWAQAAAAAmBAWAQAAAAAmhEUAAAAAgAlhEQAAAABgQlgEAAAAAJgQFgEAAAAAJoRFAAAAAIAJYREAA
AAAYEJYBAAAAACYEBYBAAAAACaERQAAAACACWERAAAAAGBCWAQAAAAAmBAWAQAAAAAmhEUAAAAAgAlhEQAAAABgQlgEAAAAAJgQFgEAAAAAJoRFAAAAAIAJYREAAAAAYOLisJrSUx1WFQAAKJ7SkuKUeHprYTejQKQlxUlSke3frcgYm0qF3QygyHFIWKxVq44SEv5WWlq6atWq44gqAQBAMVOvXj1dv55W2M3IF66uJUx9iY/PeNtWoYJHYTTpptnrS/6rxHtMoAA4JCwOGjRUlSuX08WLVxxRHQAAKIZGjhxZZN5rFKX3TUWpL0Bxwz2LAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADAhLAIAAAAATAiLAAAAAAATwiIAAAAAwISwCAAAAAAwISwCAAAAAEwIiwAAAAAAE8IiAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLAIAAAAADAhLAIAAAAATFxutIGzs1O+VZafZd0p6HPxUBz7LBXPftPnwi8H+T+WRenY0JfbE325PRWlvkhFqz+O6suN6nEyDMNwSEsAAAAAAHcMLkMFAAAAAJgQFgEAAAAAJoRFAAAAAIAJYREAAAAAYEJYBAAAAACYEBYBAAAAACaERQAAAACACWERAAAAAGBCWAQAAAAAmBR4WPzjjz80YMAA9ejRQwMGDNCpU6cKusqbNmvWLAUEBMjHx0fHjx+3Ls+pD45el98uX76skSNHqkePHgoMDNTYsWMVGxsrSTp06JCCgoLUo0cPDR8+XJcuXbLu5+h1+e3pp59WUFCQgoODNWjQIP3222+SivaxtliwYIHNa7woH2dJCggIUM+ePdW7d2/17t1bu3btKvL9Tk5O1ssvv6zu3bsrMDBQL730kqTi8fpG7t3ux8eRc3JBc/RcW9AcOYc6iiPmxoLmyPmuoDlyHitIZ86csR6P3r17KyAgQH5+fndWX4wCFhISYnz11VeGYRjGV199ZYSEhBR0lTdt//79RkxMjOHv728cO3bMujynPjh6XX67fPmy8cMPP1h/fvPNN42pU6caaWlpRrdu3Yz9+/cbhmEY77//vjFlyhTDMAyHrysIf//9t/X/W7ZsMYKDgw3DKNrH2jAM48iRI8aIESOsr/GifpwNwzD9PhdG3xzd7xkzZhivv/66kZ6ebhiGYVy8eNEwjKL/+kbe3O7Hx5FzckFz5FzrCI6cQx3BEXOjIzhqvnMER85jjvTaa68Zr7766g3bdDv1pUDD4l9//WW0bNnSSE1NNQzDMFJTU42WLVsaly5dKshqb1nmX7ac+uDodY6wefNm47HHHjMOHz5s/Otf/7Iuv3TpktG8eXPDMAyHrytoYWFhRp8+fYr8sU5OTjb69+9vREdHW1/jxeE425s8i3K/ExISjJYtWxoJCQk2y4v66xt5cycdn4KekwtDQc61jlaQc6gjOGpudARHzXcFzZHzmCMlJycbbdq0MY4cOXJH9cWlIM9anjt3TlWrVlWJEiUkSSVKlFCVKlV07tw5eXl5FWTV+SanPhiG4dB1BT1m6enpWrVqlQICAnTu3DnVqFHDus7Ly0vp6emKi4tz+DoPD48C6e+LL76o3bt3yzAMLV68uMgf63nz5ikoKEh33XWXdVlxOM6SNHHiRBmGoZYtW2rChAlFut/R0dHy8PDQggULFBERobJly+rZZ5+Vm5tbkX59I2/u1Pm5IP5OO7q/BT3XFuTf0swcMYc64tg4am501HFxxHxX0H1x5DzmyN//bdu2qWrVqmrUqJGOHDlyx/SFL7iB1YwZM1SmTBkNGTKksJviEK+//rq2b9+u5557TrNnzy7s5hSogwcP6siRIxo0aFBhN8XhVqxYofXr1ys0NFSGYWj69OmF3aQClZaWpujoaDVs2FDr1q3Tx
IkTNW7cOCUmJhZ20wCo6My1RWEOLWpzY1GZ74rqPBYaGqq+ffsWdjPyrEDDYvXq1XX+/HmlpaVJyjj4Fy5cUPXq1Quy2nyVUx8cva4gzZo1S6dPn9a7774rZ2dnVa9eXTExMdb1sbGxcnZ2loeHh8PXFbTg4GBFRESoWrVqRfZY79+/X5GRkeratasCAgL0559/asSIETp9+nSRP86W8SxZsqQGDRqkAwcOFOnXd/Xq1eXi4qJevXpJkpo1ayZPT0+5ubkV2dc38u5OPT53+uvREXOtoxXkHFrQHDk3OoKj5jtH9MNR85ijnD9/Xvv371dgYKC1j3dKXwo0LFasWFENGjTQxo0bJUkbN25UgwYNbutLXLLKqQ+OXldQ3n77bR05ckTvv/++SpYsKUlq3LixkpKS9OOPP0qSvvjiC/Xs2bNQ1uW3q1ev6ty5c9aft23bpgoVKhTpY/3kk08qPDxc27Zt07Zt21StWjUtWbJETzzxRJE9zpKUmJioK1euSJIMw9CmTZvUoEGDIv369vLyUps2bbR7925JGd+adunSJdWtW7fIvr6Rd3fq8bmTX4+OmmsLmiPn0ILmyLmxoDlyvitojpzHHCUsLEydO3eWp6enpDvsb1mB3Q35j5MnTxqPPPKI0b17d+ORRx4xIiMjC7rKmzZjxgyjY8eORoMGDYx27doZDz30kGEYOffB0evy2/Hjxw1vb2+je/fuRlBQkBEUFGQ8/fTThmEYxk8//WT06tXLeOCBB4xhw4ZZv4mqMNblp4sXLxr9+vUzevXqZQQFBRkhISHGkSNHDMMo2sc6s8w3wRfV42wYhhEVFWX07t3b6NWrl/HQQw8Z48aNM86fP18s+j1kyBCjV69eRnBwsLF9+3bDMIrP6xu5c7sfH0fOyQXN0XNtQXL0HOpIBT03FiRHz3eO6I+j5jFH6N69u7Fjxw6bZXdKX5wMwzAKLooCAAAAAO5EfMENAAAAAMCEsAgAAAAAMCEsAgAAAABMCIsAAAAAABPCIgAAAADAhLCIYuGvv/7S4MGD5evrqzfffLOwmyNJCgkJ0Zo1awq7GQAAFArmZuD251LYDQAc4csvv5Snp6cOHDggJyenwm4OAADFHnMzcPvjzCKKhZiYGNWvX5/JCACA2wRzM3D7Iyyi0AQEBGjJkiUKDAxUy5YtNX78eCUnJ0uSVq9erQceeEB+fn566qmndP78+RuWd+DAAfXt21ctW7ZU3759deDAAUnSlClT9NVXX2nJkiXy9fXVnj17si0jPT1dH330kbp166Y2bdro2WefVVxcnCTpzJkz8vHxUWhoqDp37qzWrVtr1apV+vnnnxUYGKhWrVpp+vTp1rLWrVungQMHavr06WrZsqV69uypvXv3ZlvvwoUL5e/vr/vvv1+TJk3SlStXJElPPvmkPvvsM5vtAwMDtWXLFklSZGSkHn/8cfn5+alHjx7atGmTdbuUlBTNmjVLXbp0Ubt27TRt2jQlJSVJkmJjYzVq1Ci1atVKfn5+GjRokNLT0284zgCAoou52bZe5mYUewZQSPz9/Y2+ffsaf/75p3H58mWjZ8+exsqVK409e/YYfn5+xpEjR4zk5GRj+vTpxqBBg3Is6/Lly0arVq2MsLAw4/r168aGDRuMVq1aGbGxsYZhGMbk/9/O3YY0vcVxAP/eNTVzhQ/VbCukXhRlCaaJhqaVEMqwBzMRLMMeVkwJpboWDIJo7IVoSYVvwupFD1pi2awgUPBVVBRCCZWwZc62ckql5rb83Rf38r+3O+tm3Itd/X5e7Zz/+Z//ORvsyzk77Ndfpaqq6h/HdP78ecnLy5Pe3l4ZGRkRs9ksZWVlIiLS3d0tixcvFrPZLJ8+fZL29nZZvny57N+/X969eydv3ryR5ORkuX//voiIXL9+XZYuXSp1dXXi9XrFZrPJypUrpb+/X0RECgsLpb6+XkREGhoaJDMzU169eiUfP34Uk8kkBw8eFBERm80mW7duVcbY2
dkpSUlJMjIyIoODg7JmzRq5du2a+Hw+efr0qSQlJcmLFy9EROTEiRNiNBqlv79fPnz4IEajUSor4oVHywAABSFJREFUK0VEpLKyUsxms3i9XvF6vfLgwQMZHR397s+PiIgmH2Yzs5nor/jLIk2o7du3Q6vVIjw8HGvXrkVnZyeam5uRm5uL2NhYBAcHo7y8HE+ePMHr16+/2k9bWxtiYmKwadMmqNVqGAwGLFq0CK2treMaz5UrV1BWVobo6GgEBwejpKQEd+/ehd/vV9qYTCaEhIQgNTUVM2bMgMFgQFRUFLRaLRITE/Hs2TOlbWRkJIqKihAUFITs7GwsXLgQbW1tAc9tbm7Gzp07sWDBAoSFhaG8vBwtLS3w+/1Yv3497HY77HY7AODGjRvIyspCcHAw2traoNfrkZubC7VajWXLlmHDhg24c+cORAT19fU4evQowsPDodFoYDQaYbPZAABqtRpv376F0+lEUFAQEhMTeRSIiIiYzX9gNhPxD25ogs2ZM0d5HRoaCrfbjYGBAcTGxir1YWFhCA8Ph8vlwvz588fsx+12Q6fTfVGn0+m+64jMXzmdTphMJqhUf+6jqFQq9PX1KeWoqCjldUhISEB5aGhIKWu12i++5HU6Hdxu95jj1+v1Slmv18Pv96Ovrw9arRZZWVm4efMmSkpKcOvWLdTU1AAAenp60NHRgcTEROXez58/IycnBx6PB8PDw9iyZYtyTUSU4yy7du3C6dOnUVxcDADIz8/H3r17x/FuERHRZMRs/nP8zGaa6rhYpJ/O3Llz0dPTo5SHhoYwMDAArVb7zXucTucXdb29vUhLSxvXs6Ojo2GxWJCQkBBw7Vu7p1/jcrkgIkoo9fb2Yt26dQHt/j5np9MJtVqthN3mzZtx+PBhJCQkIDQ0FPHx8QCAefPmYdWqVairqwvoc3R0FNOnT4fNZhvzvdNoNKioqEBFRQWeP3+OoqIirFixAikpKeOeJxERTW7MZmYzTU08hko/HYPBgMbGRnR2dsLr9aKqqgpxcXFf3bkEgPT0dNjtdjQ3N8Pv96OlpQUvX75ERkbGuJ5dUFCAkydPKuHg8Xhw7969H56Lx+PBxYsX4fP5cPv2bXR1dSE9PT2gncFgwIULF9Dd3Y3BwUFUV1cjKysLavXv+znx8fFQqVSwWq3IyclR7svIyIDdbkdTUxN8Ph98Ph86OjrQ1dUFlUqFvLw8WCwWZffV5XKhvb0dANDa2gqHwwERwcyZMzFt2jQedSEiojExm5nNNDXxl0X66axevRoHDhxAaWkp3r9/j/j4eFRXV3/znoiICNTW1sJiseDYsWOIiYlBbW0tIiMjx/XsHTt2QERQXFwMt9uNqKgoZGdnIzMz84fmEhcXB4fDgeTkZMyePRs1NTWIiIgIaJebmwuXy4XCwkKMjIwgNTUVZrP5izYbN27EqVOncPbsWaVOo9Hg3LlzsFqtsFqtEBEsWbIER44cAQAcOnQIZ86cwbZt29Df3w+tVouCggKkpaXB4XDg+PHj8Hg8mDVrFgoKCpCcnPxD8yQiosmN2cxspqnpFxGRiR4E0WTU2NiIhoYGXL58+V/pr6mpCVevXv3X+iMiIppqmM1E48NjqET/A8PDw7h06RLy8/MneihEREQEZjNNDTyGSv8bDx8+xJ49e8a89vjx4+/uZ/fu3Xj06FFAvdFoxL59+354fP+V9vZ2lJaWIiUlBQaDYaKHQ0REpGA2M5tpcuMxVCIiIiIiIgrAY6hEREREREQUgItFIiIiIiIiCsDFIhEREREREQXgYpGIiIiIiIgCcLFIREREREREAbhYJCIiIiIiogC/AQe/x3tfRO5uAAAAAElFTkSuQmCC\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "With such skewed data, the histogram lends little insight. 
Boxplots give a better idea of where the data is concentrated and how it is spread. Most records fall below roughly 7,000 employees, but there are numerous extreme values on the high end.\n", "\n", "We will not treat these extreme values, though: it is entirely plausible that many applications come from large companies." ], "metadata": { "id": "u28-xVje_08o" }, "id": "u28-xVje_08o" }, { "cell_type": "code", "source": [ "plott('yr_of_estab')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "oBesojUXraRQ", "outputId": "f5667e22-af0a-4255-d2a0-0687624f4585" }, "id": "oBesojUXraRQ", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "visa['yr_of_estab'].describe()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "sOiEzS5uFAn1", "outputId": "dbdab4c0-23f6-49a3-ebd7-a79b06bb6515" }, "id": "sOiEzS5uFAn1", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "count 25480.000000\n", "mean 1979.409929\n", "std 42.366929\n", "min 1800.000000\n", "25% 1976.000000\n", "50% 1997.000000\n", "75% 2005.000000\n", "max 2016.000000\n", "Name: yr_of_estab, dtype: float64" ] }, "metadata": {}, "execution_count": 20 } ] }, { "cell_type": "markdown", "source": [ "We find that 75% of companies in our records were founded in 1976 or later. Like the previous feature, we have many extreme values, this time on the lower end. As before, all these values are entirely sensible, so we will not alter them.\n", "\n", "However, we will convert this feature to 'years since founded', which will flip the distribution from left-skewed to right-skewed but otherwise leave the data unchanged." ], "metadata": { "id": "Lb65bfu8A2nt" }, "id": "Lb65bfu8A2nt" }, { "cell_type": "code", "source": [ "plott('region_of_employment')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "8ji8Ea-IGFLJ", "outputId": "7f657e9e-67e1-416b-fa69-0c4ea4c07960" }, "id": "8ji8Ea-IGFLJ", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Northeast, South, and West are all common regions. Island is decidedly uncommon, with fewer than 500 records." ], "metadata": { "id": "01h5otEPH_nT" }, "id": "01h5otEPH_nT" }, { "cell_type": "code", "source": [ "plott('prevailing_wage')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "4ayouQkMpW7B", "outputId": "32cf16b3-91ea-4f4b-e79d-c54de42f776c" }, "id": "4ayouQkMpW7B", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Aside from the spike around \\$0, prevailing wage follows a right-skewed normal distribution. It may well be that the spike is due to other wage units, such as hourly." ], "metadata": { "id": "E9uKFaQSIJKB" }, "id": "E9uKFaQSIJKB" }, { "cell_type": "code", "source": [ "plott()\n", "plt.title('Prevailing Yearly Wage',fontsize=14)\n", "sns.histplot(data=visa.loc[visa['unit_of_wage']=='Year'],\n", " x='prevailing_wage');" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "MIe8av9nR-S3", "outputId": "77d7f8d4-295c-46e8-ad0b-ec057d800437" }, "id": "MIe8av9nR-S3", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Indeed, the distribution looks far less erratic when just focused on yearly prevailing wage." ], "metadata": { "id": "NDlJOKkgSjhy" }, "id": "NDlJOKkgSjhy" }, { "cell_type": "code", "source": [ "plott('unit_of_wage')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "fX4y3E7WGI-l", "outputId": "cff43504-1993-48a8-acf0-40032afa0153" }, "id": "fX4y3E7WGI-l", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Most of our records consider cases with yearly wages. Weekly and montly are vastly less common, as was shown earlier in the value counts table." ], "metadata": { "id": "CKdkEZRiU4qO" }, "id": "CKdkEZRiU4qO" }, { "cell_type": "code", "source": [ "plott('full_time_position')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "5bFcm8W7GLSA", "outputId": "f0cb169b-0fd5-4d07-bc55-6bfe47b69a2b" }, "id": "5bFcm8W7GLSA", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Most positions are full time." ], "metadata": { "id": "TdOu9YRmVcuV" }, "id": "TdOu9YRmVcuV" }, { "cell_type": "code", "execution_count": null, "id": "mechanical-interference", "metadata": { "id": "mechanical-interference", "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "outputId": "0c51cac3-1f70-40ff-b8db-26e8547f3e93" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ], "source": [ "plott('case_status')" ] }, { "cell_type": "markdown", "source": [ "Only one-thirds of cases are Denied." ], "metadata": { "id": "sHqQ22OjWR_n" }, "id": "sHqQ22OjWR_n" }, { "cell_type": "code", "source": [ "visa['case_status'].value_counts(normalize=True)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "zWoyNkQnVm6t", "outputId": "fe9bb334-0813-4676-9445-4bb2fe1ba886" }, "id": "zWoyNkQnVm6t", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Certified 0.667896\n", "Denied 0.332104\n", "Name: case_status, dtype: float64" ] }, "metadata": {}, "execution_count": 27 } ] }, { "cell_type": "markdown", "source": [ "Let's dig deeper into prevailing wage." ], "metadata": { "id": "RRx3brEwVsQm" }, "id": "RRx3brEwVsQm" }, { "cell_type": "code", "source": [ "visa.groupby(by='unit_of_wage')['prevailing_wage'].describe().T" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 300 }, "id": "89uvnFlrQ735", "outputId": "3c7aadd0-455f-4e2c-dc37-1543c3454b4f" }, "id": "89uvnFlrQ735", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "unit_of_wage Hour Month Week Year\n", "count 2157.000000 89.000000 272.000000 22962.000000\n", "mean 414.570513 87592.864045 85606.820515 81228.077133\n", "std 275.015000 59525.124924 44802.704810 49951.473223\n", "min 2.136700 1599.280000 2183.230000 100.000000\n", "25% 152.700300 44986.240000 51408.277500 43715.955000\n", "50% 372.652300 81826.010000 85075.820000 76174.500000\n", "75% 637.311100 121629.600000 111331.910000 111341.960000\n", "max 999.919500 264362.950000 280175.950000 319210.270000" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
unit_of_wageHourMonthWeekYear
count2157.00000089.000000272.00000022962.000000
mean414.57051387592.86404585606.82051581228.077133
std275.01500059525.12492444802.70481049951.473223
min2.1367001599.2800002183.230000100.000000
25%152.70030044986.24000051408.27750043715.955000
50%372.65230081826.01000085075.82000076174.500000
75%637.311100121629.600000111331.910000111341.960000
max999.919500264362.950000280175.950000319210.270000
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 28 } ] }, { "cell_type": "markdown", "source": [ "I am skeptical about the records tagged with 'Week' for ```unit_of_wage```. According to our records, 75% of jobs with a weekly wage pay at least \\$51,000 per WEEK! As we have limited visibility into the source of these data, we will preserve these records, but ideally I would like to do further research to justify these entries." ], "metadata": { "id": "4L2xsRJsX6hf" }, "id": "4L2xsRJsX6hf" }, { "cell_type": "code", "source": [ "plott()\n", "plt.title('Prevailing Wage by Unit',fontsize=14)\n", "sns.boxplot(data=visa,\n", " x='prevailing_wage',\n", " y='unit_of_wage');" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "9NV89sp8S46m", "outputId": "a08cb3fc-1045-41a2-9605-fb56cf871bf6" }, "id": "9NV89sp8S46m", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "The boxplot above is further evidence that weekly—and even monthly—data looks erroneous. Many of the weekly wages would result in a yearly earning well over a million USD! Perhaps this is personal bias, but it seems far fetched to have this many exceptionally high-paying jobs in this data set.\n", "\n", "Without any evidence to the contrary, however, we will leave these records intact." ], "metadata": { "id": "qzBqzFuYb08e" }, "id": "qzBqzFuYb08e" }, { "cell_type": "code", "source": [ "plott()\n", "plt.title('Prevailing Hourly Wage',fontsize=14)\n", "sns.boxplot(data=visa.loc[visa['unit_of_wage']=='Hour'],\n", " x='prevailing_wage');" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "HA_jRjSyTXN3", "outputId": "f27bc85c-6482-454f-ccf6-f3523a9d9507" }, "id": "HA_jRjSyTXN3", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "b=visa.loc[visa['unit_of_wage']=='Hour']['prevailing_wage'].argmax()\n", "visa.iloc[b]['full_time_position']" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "-SCMc2pco29j", "outputId": "91af4e73-1761-45d4-9921-b69e022087b9" }, "id": "-SCMc2pco29j", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'Y'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 31 } ] }, { "cell_type": "code", "source": [ "a=visa.loc[visa['unit_of_wage']=='Hour']['prevailing_wage'].max()\n", "print('The maximum hourly wage is ${} for a full time position, equivalently ${} per year!'.format(a,2080*a))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MIlbtpGhoGHa", "outputId": "701a1f3c-dfc3-4730-faef-47f4d03236ad" }, "id": "MIlbtpGhoGHa", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "The maximum hourly wage is $999.9195 for a full time position, equivalently $2079832.56 per year!\n" ] } ] }, { "cell_type": "markdown", "source": [ "For the record, some hourly data points seem unreasonable too: The maximum wage is about \\$1000 per hour, or around \\$2 million per year. Again, there is no evidence that this data is necessarily erroneous; it simply stands out." 
], "metadata": { "id": "QxwGcOX1n4Xu" }, "id": "QxwGcOX1n4Xu" }, { "cell_type": "code", "source": [ "plott()\n", "plt.title('Prevailing Wage by Region',fontsize=14)\n", "sns.boxplot(data=visa,\n", " x='prevailing_wage',\n", " y='region_of_employment');" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "DSlArTDVEgBb", "outputId": "b3a7779d-9b17-41d4-bd3e-cbd075e57939" }, "id": "DSlArTDVEgBb", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "The middle 50% of the data for prevailing wage is higher in the midwest and island regions. The northeast has the lowest first quartile. Every region has many extreme values on the high end." ], "metadata": { "id": "2T1jV3NL7oxk" }, "id": "2T1jV3NL7oxk" }, { "cell_type": "code", "source": [ "sns.catplot(data=visa,\n", " x='case_status',\n", " col='education_of_employee',\n", " kind='count',\n", " col_wrap=2);" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 725 }, "id": "stcW_7X1FIig", "outputId": "fa9c9c00-f108-424a-8fb5-9eadff2aad51" }, "id": "stcW_7X1FIig", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Looking toward our dependent variable, it appears that ```case_status``` is influenced by education. The trouble here is that it is difficult to compare the regions, as the scale is different for each. Is the _percentage_ Certified for Doctorate any greater or less than that for, say, Bachelor's?\n", "\n", "Let's instead examine the percentage Certified and Denied." ], "metadata": { "id": "YofS02QG8SIF" }, "id": "YofS02QG8SIF" }, { "cell_type": "code", "source": [ "# dataframe of percentages\n", "a=visa.groupby('education_of_employee')['case_status'].value_counts(normalize=True)\n", "b=pd.DataFrame(index=['High School',\"Bachelor's\",\"Master's\",'Doctorate'],\n", " columns=['Certified','Denied'])\n", "for (c,d) in a.index:\n", " b.loc[c,d]=a[(c,d)]\n", "b" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "omF0GmgRF4MK", "outputId": "6cefc292-cfac-417a-f9a6-f90d9e455eb7" }, "id": "omF0GmgRF4MK", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Certified Denied\n", "High School 0.340351 0.659649\n", "Bachelor's 0.622142 0.377858\n", "Master's 0.786278 0.213722\n", "Doctorate 0.872263 0.127737" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CertifiedDenied
High School0.3403510.659649
Bachelor's0.6221420.377858
Master's0.7862780.213722
Doctorate0.8722630.127737
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 35 } ] }, { "cell_type": "code", "source": [ "# barplot of percentages\n", "plott()\n", "plt.title('Percent Certified by Education Level',fontsize=14)\n", "plt.bar(b.index,\n", " b['Certified'],\n", " label='Certified')\n", "plt.bar(b.index,\n", " b['Denied'],\n", " bottom=b['Certified'],\n", " label='Denied')\n", "plt.legend(loc='lower right')\n", "plt.xlabel('Education Level')\n", "plt.ylabel('Percent Certified/Denied')\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "8ySgUscIJ58y", "outputId": "fa94bd63-9e74-4898-8f0a-51f546501454" }, "id": "8ySgUscIJ58y", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Indeed, here we can compare ```case_status``` on like scales. We find that applicants with a Doctorate are most often certified (over 87% of the time), while employees with only a High School education are certified around a third of the time. We find a clear trend: Visa status is directly connected with level of education. Higher levels of education lead to a greater percentage of visas certifed.\n", "\n", "To plot the rest of our features as percentages, we make the above code into a function." ], "metadata": { "id": "I3RqCDa89j9i" }, "id": "I3RqCDa89j9i" }, { "cell_type": "code", "source": [ "def percent_status(col):\n", " '''Plot percent Certified/Denied\n", " for classes of a categorical variable.'''\n", "\n", " # generate dataframe of percentages\n", " a=visa.groupby(col)['case_status'].value_counts(normalize=True)\n", " #compute ascending order of classes\n", " ser=pd.Series(dtype='float')\n", " for name in visa[col].unique():\n", " ser[name]=a[(name,'Certified')]\n", " # dataframe\n", " b=pd.DataFrame(index=ser.sort_values().index.tolist(),\n", " columns=['Certified','Denied'])\n", " for (c,d) in a.index:\n", " b.loc[c,d]=a[(c,d)]\n", " \n", " # plot percentages\n", " plott()\n", " plt.title('Percent Certified by '+col,fontsize=14)\n", " plt.bar(b.index,b['Certified'],label='Certified')\n", " plt.bar(b.index,b['Denied'],bottom=b['Certified'],label='Denied')\n", " plt.legend(loc='lower right')\n", " plt.xlabel(col)\n", " plt.ylabel('Percent Certified/Denied')\n", " plt.show()" ], "metadata": { "id": "Mf3NfypwtYps" }, "id": "Mf3NfypwtYps", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "percent_status('continent')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "ie73TsCbw4ky", "outputId": "508fa520-1786-4355-b453-5cd5dabb39ea" }, "id": "ie73TsCbw4ky", "execution_count": null, "outputs": [ { 
"output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "* Europe, Africa, and Asia see the greatest percentage of certified visas.\n", "* South America sees the least.\n", "* Every region has over 50% certification rate." ], "metadata": { "id": "oLMPOdP9-bXd" }, "id": "oLMPOdP9-bXd" }, { "cell_type": "code", "source": [ "percent_status('has_job_experience')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "15xyJn4S9Vcm", "outputId": "946b4c76-6581-4ace-af7a-6aca265f6555" }, "id": "15xyJn4S9Vcm", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Having job experince certainly helps, as the proportion of Certified visas is higher for applicants with job experience. Let's test whether this difference is significant, assuming a level of siginificance of 5%. Please note that our populations are binomially distributed, independent, and simply randomly sampled; they furthermore satisfy the sample size inequalities. The assumptions for a two proportion z-test are therefore met. Our null hypothesis is that our sample proportions come from populations with the same ratios of Certification to Denial; the alternative is that the ratios are truly different." ], "metadata": { "id": "32RM_732-x7-" }, "id": "32RM_732-x7-" }, { "cell_type": "code", "source": [ "from statsmodels.stats.proportion import proportions_ztest\n", "\n", "def cert_ztest(col):\n", " '''Run a two proportions independent\n", " z-test for case_status.'''\n", " # collect data\n", " size=visa[col].value_counts()\n", " a=visa.groupby(col)['case_status'].value_counts()\n", " cert=[]\n", " for idx in size.index.tolist():\n", " cert.append(a[idx,'Certified'])\n", " # run test\n", " t,p_val=proportions_ztest(cert,size)\n", " print('The p-value is',p_val)" ], "metadata": { "id": "KTyGNBDBCQPj" }, "id": "KTyGNBDBCQPj", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "cert_ztest('has_job_experience')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "N8kIy3TIComH", "outputId": "6566e90d-b603-4e5f-e423-9b5a91c4508c" }, "id": "N8kIy3TIComH", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "The p-value is 1.2710489965841227e-206\n" ] } ] }, { "cell_type": "markdown", "source": [ "With an astoundingly low p-value, we can confidently conclude that these sample proportions reflect a real-world difference for visa applications: We find that a greater proportion of applications are 
certified when the applicant has previous job experience." ], "metadata": { "id": "5zpQvyoil34u" }, "id": "5zpQvyoil34u" }, { "cell_type": "code", "source": [ "percent_status('requires_job_training')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "5TBKUwD19a4A", "outputId": "a4e17603-7d53-4318-844b-deefcb64cee5" }, "id": "5TBKUwD19a4A", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "visa.groupby('requires_job_training')['case_status'].value_counts(normalize=True)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "odC2i4ZsDgVI", "outputId": "fe6d20bc-ebdb-4971-927d-1a5454ae2474" }, "id": "odC2i4ZsDgVI", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "requires_job_training case_status\n", "N Certified 0.666459\n", " Denied 0.333541\n", "Y Certified 0.678849\n", " Denied 0.321151\n", "Name: case_status, dtype: float64" ] }, "metadata": {}, "execution_count": 43 } ] }, { "cell_type": "markdown", "source": [ "Unlike the previous feature, the requirement of job training has very close proportions: 66.6% Certified (N) versus 67.9% (Y). There is apparently no difference in ```case_status``` based on the job training requirement. To confirm, we can run another two proportion z-test. (As before, the assumptions for the test are met.) We set the level of siginificance at 0.05. Our null hypothesis is that our sample proportions come from populations with the same ratios of Certification to Denial; the alternative is that the ratios are truly different." ], "metadata": { "id": "VTl-G-WyZvG5" }, "id": "VTl-G-WyZvG5" }, { "cell_type": "code", "source": [ "cert_ztest('requires_job_training')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "GpPEclpDDRu6", "outputId": "98d68242-15c8-46ee-fd65-df8b24e7c51b" }, "id": "GpPEclpDDRu6", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "The p-value is 0.1787590242870024\n" ] } ] }, { "cell_type": "markdown", "source": [ "Our p-value exceeds 0.05, so we fail to reject the null hypothesis. We must conclude that there is no significant difference between the two proportions. Put another way, ```requires_job_training``` does not impact ```case_status``` on its own." 
], "metadata": { "id": "UFd8cU8FarTH" }, "id": "UFd8cU8FarTH" }, { "cell_type": "code", "source": [ "percent_status('region_of_employment')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "Pne0v1rv9egO", "outputId": "525506ee-cc9e-426a-bc70-f25bf72ad619" }, "id": "Pne0v1rv9egO", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "There seems to be little difference in visa certification rate among island, western, and northeastern jobs. The south and midwest look to have appreciably higher rates." ], "metadata": { "id": "rRfhzqqXbwlQ" }, "id": "rRfhzqqXbwlQ" }, { "cell_type": "code", "source": [ "percent_status('unit_of_wage')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "du7GY_gx9mdA", "outputId": "6cda8957-28f8-46fc-a5e0-e6c4685e8c6e" }, "id": "du7GY_gx9mdA", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Applicants to hourly jobs see far less than 50% certification. This is well below the rest, with yearly salaried jobs being most likely to lead to visa certification." ], "metadata": { "id": "wsCZ8xqFcfVq" }, "id": "wsCZ8xqFcfVq" }, { "cell_type": "code", "source": [ "percent_status('full_time_position')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "G-JVGarR9r0y", "outputId": "8cbc7f5b-5cf5-4dcd-bae3-79aaea5e0989" }, "id": "G-JVGarR9r0y", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "visa.groupby('full_time_position')['case_status'].value_counts(normalize=True)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0wr9nW0NDnP2", "outputId": "399cee15-5503-4286-cd4c-56900e625593" }, "id": "0wr9nW0NDnP2", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "full_time_position case_status\n", "N Certified 0.685260\n", " Denied 0.314740\n", "Y Certified 0.665832\n", " Denied 0.334168\n", "Name: case_status, dtype: float64" ] }, "metadata": {}, "execution_count": 48 } ] }, { "cell_type": "markdown", "source": [ "Similar to ```requires_job_training```, it appears here that there is not much difference in the resulting certification based on whether the position is full time. We find that 68.5% are Certified for part time work and 66.6% are Certified for full time work. Our assumptions being once more satisfied, we can run a two proportion z-test. We set the level of siginificance at 0.05. Our null hypothesis is that our sample proportions come from populations with the same ratios of Certification to Denial; the alternative is that the ratios are truly different." ], "metadata": { "id": "E0RvP4UAebK0" }, "id": "E0RvP4UAebK0" }, { "cell_type": "code", "source": [ "cert_ztest('full_time_position')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "W5QUkuBsDXVG", "outputId": "c56c2f61-d214-4578-d116-ea0d262e2011" }, "id": "W5QUkuBsDXVG", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "The p-value is 0.042452929825717224\n" ] } ] }, { "cell_type": "markdown", "source": [ "With a p-value under 0.05, we can reject the null hypothesis. In this case, there is a statistically significant difference between these two proportions. That means that an applicant is more likely to be certified for part time work than full time work." 
], "metadata": { "id": "boxWk9Nze8PA" }, "id": "boxWk9Nze8PA" }, { "cell_type": "code", "source": [ "plott()\n", "plt.title('Prevailing Wage and Case Status',fontsize=14)\n", "sns.boxplot(data=visa,x='prevailing_wage',\n", " y='case_status',\n", " order=['Certified','Denied']);" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "mJVJIYA5-jQV", "outputId": "4265f790-7481-4a84-f9f2-b1d9306474f8" }, "id": "mJVJIYA5-jQV", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAhYAAAFTCAYAAABlOkWRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3deVhUZf8G8HtmAFFQAUUE3LISzRUVkdgKMY0EcSlzL1+DH+6ZhVtpiSZar+ZSLqX5pmju4pKaogimiCtmLmEKKC7sssPMnN8fvJzXkUVGzzCM3J/r6pJznnOe832YQ3PPc87MyARBEEBEREQkAbm+CyAiIqIXB4MFERERSYbBgoiIiCTDYEFERESSYbAgIiIiyTBYEBERkWQYLIhqmJ07d8LR0bHKy6Rb6enpcHBwQExMjL5LITIIDBZUK02fPh0ODg5wcHBA+/bt0atXL4SGhiIvL0/fpcHHxwdHjhx55napuLm54fvvv9dY9/3338PBwQEHDx7UWD9t2jQMGzZM5zXVZKmpqQgJCYG3tzc6dOgAd3d3jB07FpGRkXqtKykpCdOmTYOHhwc6dOgANzc3BAQE4K+//hK3Ke8xrQovLy/89NNPUpZLLwAjfRdApC+vv/46Fi1aBKVSibNnz2L27NnIy8vDl19+WWZbpVIJhUIBmUym87pMTU1hamr6zO1ScXZ2xpkzZzBu3DhxXUxMDGxtbXHmzBn07dtXY/27776r85pqqjt37mDo0KEwMzPD1KlT0bZtWwiCgFOnTmHOnDk4fvy4XuoqLi7GmDFj0KJFCyxduhQ2NjZ4+PAhTp48iaysLL3URC8+zlhQrWViYgJra2vY2trC19cXvr6+OHr0KABg+fLl6NevH3bu3Alvb2907NgReXl5yM7Oxueffw4XFxc4OjpixIgRuHz5MgAgJycHnTp1QkREhMZxoqOj0b59e6SlpQEAvvnmG/Tp0wedOnWCl5cXFi1ahMLCQnH7p13qeLK9tNb9+/fD29sbjo6OGDduHNLT08VtlEolFixYACcnJzg5OWHBggWYM2cORo4cWeFxnJ2dceHCBRQVFQEAioqKcOHCBQQEBGhcFrh16xYePnyInj17QqVSYebMmfDy8kKnTp3w1ltvYe3atVCr1VrVIggC1q5dC29vb3Tq1Am+vr7Ys2dPhbUCQFxcHMaMGQNnZ2d07doVQ4cOxYULFzS2cXBwwK+//opJkyahS5cu6NWrV5l+4+LiMHDgQHTs2BH+/v6Ii4ur9LgAxDC6Y8cO+Pj4oHXr1nj55ZcxYsQIhIeHi9utX78evr6+6NKlC9zd3TFr1iw8evRIbM/Ozsann34KFxcXdOzYEb169cLPP/+s0V7R+Vee+Ph4JCYm4osvvkDXrl1hb28PR0dHTJgwAS4uLgBKZh0AYPLkyXBwcBCXExMTERQUBFdXV3Tp0gUDBgzAsWPHxL5HjhyJu3fvYtGiReLsH1D++RsTEwMHBwfxnHzaOMmwMVgQ/ZepqSmKi4vF5Tt37mDfvn347rvvsGfPHpiYmCAgIAAPHjzA6tWrsXv3bnTv3h2jR4/Gw4cPYW5ujjfffBN79+7V6Hfv3r14/fXX0ahRIwBA3bp1sWDBAhw4cABz5szBgQMH8MMPPzxX7Xfv3sWBAwewYsUKrFu3DlevXsXSpUvF9nXr1mHXrl0ICQnBr7/+CrVajX379lXap7OzMwoKCnDp0iUAwMWLF2FpaQl/f38kJCQgNTUVQMmThqmpKbp06QK1Wg0bGxssXboUBw4cwJQpU7B69Wrs2LFDq1qWLl2K7du344svvsD+/fsREBDw1Ff+ubm58PPzQ1hYGLZt24Z27dohICAAGRkZGtutXLlSDBQ+Pj6YNWsWkpOTxT4CAwPRrFkz7NixA5988glCQ0Mr/T1lZmYiKioKw4cPh5mZWZn2Bg0aiD/LZDLMnDkT+/btw7fff
ou4uDjMmzdPY9w3btzA6tWrcfDgQSxYsAA2NjYASsJWZedfeaysrCCXy3Ho0CEolcpyt9m+fTsAICQkBNHR0eJyXl4ePDw8sG7dOuzZswdvvfUWJk6ciJs3bwIoCbRNmzbF+PHjER0djejo6Ep/T4+rbJz0AhCIaqHg4GAhICBAXL506ZLQo0cPYfLkyYIgCMKyZcuE1157TUhJSRG3+eOPP4QuXboI+fn5Gn35+fkJa9asEQRBEI4cOSJ06tRJyM7OFgRBEPLz8wVHR0chPDy8wlrCwsIEb29vcXnHjh1Cly5dqry8bNkyoUOHDsKjR4/Edd9//71Gn66ursLq1avFZbVaLbz11lvCiBEjKqxLEATB09NTWL58uXicTz75RBAEQRgyZIiwf/9+QRAEYcqUKcLo0aMr7GPx4sUa7U+rJTc3V+jYsaMQGxur0U9ISIgwduzYSut9nFqtFlxdXYXdu3eL69q0aSN888034nJxcbHQqVMncZstW7YI3bp1E3JycsRtdu/eLbRp00Y4ffp0uce5dOmS0KZNG+Hw4cNVrq1UZGSk0L59e0GlUgmCIAiBgYHC9OnTy922KudfeTZu3Ch07txZ6NKlizB8+HBhyZIlwo0bNzS2adOmjfDbb789td53331XWLlypbj85ptvCj/++KPGNk+en4IgCKdPnxbatGkjpKWlPXWcZPh4jwXVWlFRUXB0dIRSqYRSqUSvXr3w+eefi+02NjZo3LixuHzlyhXk5+eLU8ilCgsLkZSUBADw8PCAqakpjhw5An9/f0REREAQBHh7e4vbHzx4EBs2bEBiYiLy8vKgUqk0LhU8Czs7O9SvX19cbtKkiXjpJTs7GykpKejYsaPYLpPJ0KlTJ9y/f7/Sfp2dnRETE4MJEyYgJiYG/fv3BwD06NEDp0+fho+PD86cOYMRI0aI+2zevBnbtm1DcnIyCgsLUVxcDHt7+yrXEh8fj8LCQowdO1bjnpbH+ylPWloavvvuO8TExCA1NRVqtRoFBQW4d++exnalU/YAYGRkBCsrK3GK/ubNm3BwcNCYeXjaO3AELb7H8dSpU1izZg1u3ryJ7OxsqNVqFBcXIyUlBTY2Nhg6dCgmT56MK1euwNXVFW+++SZ69OgBoGrnX3mGDx+O/v37IyYmBnFxcTh69CjWrl2L+fPnw9/fv8L98vLysGLFChw/fhwpKSlQKpUoLCzU+P09q8rGSYaPwYJqre7du2PevHkwMjJCkyZNYGxsrNFer149jWW1Wo3GjRtj06ZNZfoyNzcHABgbG+Ptt9/G3r174e/vj/DwcPTu3Rt169YFUHI5YerUqRg/fjzc3d3RoEEDREREPHW6/WmerF0mk2n1hFcRZ2dnzJkzB1lZWbh06RLmz58PAHBycsL8+fNx8+ZNpKamomfPngCAAwcOYMGCBQgODoajoyPMzc2xadMmrd7FUlr3Dz/8ADs7O402I6OK/5cVHByMtLQ0zJgxA/b29jAxMcEHH3ygcXmrvD5kMtlzBbuWLVtCJpPh5s2b6N27d4Xb3b17F4GBgXjvvfcwadIkWFhY4K+//sLUqVPFGj09PREREYETJ07g9OnTCAwMRN++ffH1119X6fyriLm5OXr16oVevXphypQp+Ne//oVly5ZVGixCQ0MRFRWF4OBgtGzZEnXr1kVwcHCZ3+eT5HJ5mXPvycswlY2TDB+DBdVadevWRcuWLau8ffv27ZGamgq5XI7mzZtXuJ2fnx9GjBiB+Ph4REdHY9WqVWLb+fPnYWNjg/Hjx4vrSq/v60r9+vVhbW2Ny5cvi692BUHA5cuXYW1tXem+zs7OKCoqwrp162BlZSX+vrp27YqkpCSEh4ejXr164gzEuXPn0LlzZ40ZjMTERK1qefnll2FiYoLk5OQyr84rc+7cOcyePRtvvPEGgJK3f6akpFR5/9Jj79q1C3l5eWKwvHjxYqX7WFhYwM3NDRs3bsTIkSPL3Gfx6NEjNGjQAH/++
SeKi4sxY8YMKBQKACj3nhErKyv4+/vD398fHh4emDp1Kr788ssqn39PI5PJ0Lp1a1y5ckVcZ2xsXCZcnT9/Hv7+/ujTpw+AkpmRxMREtGrVSmM/lUqlsZ+lpSXy8/ORk5MjBp6rV69WeZwmJibPPDaqGXjzJlEVvf766+jatSvGjRuHyMhIJCUl4cKFC1i2bBnOnj0rbte1a1fY2dnhk08+gYWFhcaTY6tWrfDgwQOEh4cjKSkJYWFhT72JUgqjRo3CTz/9hN9//x3//PMPFi5cWKUnXXt7ezRr1gy//PILnJycxPVmZmZo3749fvnlF3Tv3l2cBWjVqhWuXLmCyMhI3L59GytXrkRsbKxWtZibm2PMmDFYtGgRtm/fjoSEBFy9ehWbN2/Gr7/+WmGtL730EsLDwxEfH4+4uDh8/PHHZWZynqZfv35QKBSYOXMm/v77b5w8eVIjGFZkzpw5AIBBgwbht99+wz///IObN28iLCwMfn5+AEpmNtRqNTZs2ICkpCTs27cPGzZs0Ojnu+++w5EjR3D79m3cvHkThw8fRvPmzWFiYlLl8+9xV69eRVBQEA4ePIj4+HgkJCRg27Zt2LFjh8bsir29PU6dOoWUlBTxbaitWrXC77//jitXruD69ev49NNPNd69VLrfuXPn8ODBA/FyUufOnVGvXj18++23SEhIwKFDhxAWFlblcZLhY7AgqiKZTIY1a9bA2dkZn3/+Od5++21MmTIFt27dQpMmTTS29fX1xbVr1/DOO++Ir06Bkrf2/etf/8KCBQvg5+eHP/74A5MmTdJ57WPGjIGfnx9mzJiBIUOGAAB69+6NOnXqPHVfZ2dn5ObmwtnZWWN9jx49kJubK14GAYAhQ4bg7bffxrRp0zB48GDcvXsXH374oda1TJkyBRMmTMC6devwzjvv4MMPP8Thw4fRrFmzCutcsGAB8vLyMHDgQEydOhWDBg2q9J6M8piZmWH16tVISEjAgAEDEBoaimnTpj11v+bNm2Pnzp1wdXXFN998Az8/P4wePRoRERH46quvAABt27bFrFmzsH79erzzzjvYtm0bPvvsM41+TExMsGTJEvTv3x9Dhw5Fbm6uGGy0Of9K2djYoHnz5li5ciXee+89+Pv7Y/369RgzZozG/UTBwcGIiYnBG2+8gQEDBgAo+RC5Ro0aYfjw4fjoo4/QuXNndO/eXaP/SZMm4d69e/D29hYDtIWFBRYvXow//vgDvr6+2Lp1KyZPnlzlcZLhkwlSXIglIoPj7++Pbt26aTzBsBYiel68x4KoFrh79y6io6Ph5OQEpVKJrVu34vr16xqfoVAbayEi6TFYENUCcrkcu3fvxqJFi6BWq/HKK69g7dq1Gm/7rI21EJH0eCmEiIiIJMObN4mIiEgyDBZEREQkGQYLIiIikgxv3pRIRkYu1Grpbldp1MgcaWk5kvVnSGrz2IHaPf7aPHagdo+/No8dMLzxy+UyWFqW/TZfgMFCMmq1IGmwKO2ztqrNYwdq9/hr89iB2j3+2jx24MUZPy+FEBERkWQYLIiIiEgyDBZEREQkGQYLIiIikgyDBREREUmGwYKIiIgkw2BBREREkmGwICIiIskwWBAREZFk+MmbJLmwsP8gKSnhmfc3NlaguFglYUWVy8rKBAA0bGhRbcesjDbjb968JYYNG6XjioiIqo7BgiSXlJSA63/HQ2FaM56on0ZVUBIsUh4p9VyJdkrrJiKqSRgsSCcUphao17KXvsuokryEowBgMPWWKq2biKgm4T0WREREJBkGCyIiIpIMgwURERFJhsGCiIiIJMNgQURERJJhsCAiIiLJMFgQERGRZBgsiIiISDIMFkRERCQZBgsiIiKSDIMFERERSYbBgoiIiCTDYEFERESSYbAgIiIiyTBYEBERkWQYLIiIiEgyDBZEREQkGQYLIiIikgyDBREREUmGwYKIiIgkw2BBREREkmGwICIiIskwWBAREZFkGCyIiIhIMkb6L
oDKOnnyBBo0qIuOHZ30XQoRGZiTJ08AAFxdPfRcCdVWDBY1UHR0JIyNFQwWRKS16OhIAAwWpD+8FEJERESSYbAgIiIiyTBYEBERkWQYLIiIiEgyDBZEREQkGQYLIiIikgyDBREREUmGwYKIiIgkw2BBREREkmGwICIiIskwWBAREZFkGCyIiIhIMgwWREREJBkGCyIiIpIMgwURERFJhsGCiIiIJMNgQURERJJhsCAiIiLJMFgQERGRZBgsiIiISDIMFkRERCQZBgsiIiKSjJG+CyAiImmlpDzEmDHDnqsPuVwOtVr9XH3UrVsP+fl5Wu3TqFFjKBQKPHz4AGZm5sjNzUH9+vWRnZ0NY2NjWFpaISXlIQRBgEwmg4+PH/bv3wMAkMlkkMnkUKtVsLS0QmZmBhQKBZRKJQDAyMgIcrkCVlZWSEtLRXFxsXhcIyMjKBQKfPjhR1i/fi1UKhWUSiWsrBohPT0NxsbGkMvlmDlzLpKT72L16hUwNjbB7NlfQhAELFz4FZo0aYqPP/4MDRtalBlXZmYGVqxYApkMmDBhKgRBwIoVSwAAEydOhbV1fXG7VauWIyhoEgRBwKpVyzFs2CiEhf0Hw4aNRljYBgQFTRKPUdn2j697fB9dU8ydO3dudRyouLgYK1aswJw5c7B161Zs3boVt2/fhrOzMxQKRZX7+fnnn9G8eXPUq1cPALB582acO3cOXbp0AQDMmDEDS5YsQWxsLIyNjbFt2za4ublpVevOnTvxww8/4O23367yPvn5RRAErQ5ToZMnT0ChkMPFxV2aDqvZyZMnkP4oD8YWrfVdSpUUZ90CAIOpt1Rx1i1YNawHNzdPfZciGTOzOsjLK9J3GXojxfhPnjyBO3cSn7sWQYL/oSmVxU/f6An5+XnIzc0FABQXl/wuiopK/lWr1WJbqb//vq6xXFp3QUG+uE8ptVoNlUqFnJycMqGptO3ChfMoLi4W2/Pz8zXab9y4hqioSAiCGmp1yfK5c7HIyMhAVlYmiooK0bmzY5lxbdu2GRcunEVGRjqKigpx48Y1jWUXl57IyyvCtm2bcf58LAoLS7Y5fz4WN25cQ0LCbdy4cRUJCbdRWPi/Y1S2/ePrHt9HCjKZDPXqmZTbVm0zFjNmzEBhYSF27NgBc3NzKJVK7NixA0VFRTA2Nn7q/mq1GjKZDP/5z3/w+uuvo1GjRgCAoUOHitukpqbi0KFDOHv2LOTykqs8vXr10s2AiIhqoJSUh/ouwaCpVMpK25OT71a6fOLEcfj5DdSYHcjMzEBU1HFx+fGfS5Yj8eGHo5CZmY3o6EgIgiD+KwiCeIzSf6OjT8DPb6DGduVtHxUVCQD/bT9Rpi5dqZZgcfv2bRw5cgSRkZEwNzcvObCREYYMGQIAWLNmDQ4fPgyVSgUbGxvMmzcP1tbWWL58Of7++2/k5OQgOTkZ/fv3x8OHDzFp0iTUqVMH3377LX777Tfk5eVh/PjxGDVqFAoKCjBgwAAMGDAADRo0wPHjx7Fs2TIAwK5duxAWFgaVSgVzc3PMnTsXrVu3RlFREUJCQnD69GlYWlqiXbt21fFrqVBWViYePcpCaOg8vdbxrBITE6BWVX0Wip6NWlmAxMQEgz1PymNsrEBxsUrfZeiNFONPT0+TqBp6FiqVEuHhOzFy5Bhx3d69u6BU/u9xLb008/jyli1bkJ9fBLVaENdVNGmkVqsRHr7zvz9XvL1SqYRMprnP43XpSrUEi7/++gstW7ZEw4YNy7Tt2bMHSUlJ2Lp1K+RyOcLCwrBw4UJ8++23AIC4uDjs3LkTVlZWAIBt27Zh2bJlaNOmjUY/5ubmWLNmDQYNGoQ9e0qut+3cuVNsP3v2LH777Tds2rQJJiYmiIyMxMyZM7Flyxb8+uuvuHPnDvbv3w+lUonhw4ejWbNmuvp1EBHRC+zUqZMaT+CnTp0EUNmlJQHHjh2DIPxvxqSyS1EqlfK/fT5te0EMG6X7vDDBojIRERH4888/M
WDAAAAQZxNKeXh4iKHieY9z7do1vPvuuwBKHoRHjx4BAGJiYuDv7w9jY2MYGxvDz88P58+ff+5jPquGDS3QuHEjTJ06U281PI/Q0HmIT0rVdxkvPLmRKVo0b4zg4M/1XYpkrK3rIyUlW99l6I0U43/emzbp+bm4uJZZPnbsKCoOFzK8+eabyM8vwokTx6FSKSGTyf4bCsruo1AYiceofHsZZLKS57vH99G1agkWr732GhISEpCVlVVm1kIQBAQFBWHw4MHl7mtmZiZJDYIgYNCgQZg8ebIk/RER1USl72Ig/VAojODnN1Bjna/vAERFHdd4dwoAjeX3338fqakl91ioVCXrBEEoc9kEKHnHzuP3WFS0/f+OUyzuUx2q5XMsWrVqBS8vL3zxxRfIyckBUDIzsW3bNnh5eSEsLAxZWVkASu7+vXbtWoV9mZmZITtb+0Tv5eWFPXv24P79++Lx//zzTwBAz549sWfPHiiVShQUFGDfvn1a909EVBNYWzfRdwkGTaGo/PW2nZ29xjZ2dvaws7MXlz083ihzg6SFhSXc3d8Ql93d33hi2ROWlpawsLCEm5snZDIZ3Nw84e7+BmQyGezs7DX+dXPzQMOGFk/d3t3dE+7unhr7VIdquxSycOFCrFy5EoMGDYKxsTHUajU8PT0xdepUZGZmYsSIEQBKZhaGDh2Ktm3bltvPqFGjMHPmTJiamor3YVSFk5MTpkyZgqCgIKhUKhQXF6Nv377o0KED3nvvPVy/fh0+Pj6wtLREx44dkZbGxE9EhkmKWQt+jkX5n2MRGDhB43MsAgMnaHyORUWzAr6+A5CQcBsyGcTZhoSE2wCgsY+v7wDcvXtH3Obu3TtlPseiqts/vq66ZisAQCZI8WZlQlpajnh37vMKDZ0HY2OFwd9jUa+lYbzVNy/hKAAYTL2l8hKO4hXeY/FCkWL8pe8SMrTzgo+9YY1fLpehUSPz8tuquRYiIiJ6gTFYEBERkWQYLIiIiEgyDBZEREQkGQYLIiIikgyDBREREUmGwYKIiIgkw2BBREREkmGwICIiIskwWBAREZFkGCyIiIhIMgwWREREJBkGCyIiIpIMgwURERFJhsGCiIiIJMNgQURERJJhsCAiIiLJMFgQERGRZBgsiIiISDIMFkRERCQZBgsiIiKSDIMFERERScZI3wVQWW5unmjQoK6+yyAiA+Tm5qnvEqiWY7CogVxdPWBtXR8pKdn6LoWIDIyrq4e+S6BajpdCiIiISDIMFkRERCQZBgsiIiKSDIMFERERSYbBgoiIiCTDYEFERESS0SpYpKenIzc3FwCgUqmwY8cO7Nq1C2q1WifFERERkWHRKlgEBgYiISEBALBkyRKsW7cOP//8MxYuXKiT4oiIiMiwaBUsbt++jXbt2gEAwsPDsXbtWmzYsAEHDhzQSXFERERkWLT65E25XI7i4mLcunUL9evXh52dHdRqtXh5hIiIiGo3rYKFh4cHJk+ejMzMTPj4+AAA4uPjYWNjo5PiiIiIyLBoFSzmz5+PXbt2wcjICP379wcAZGRkYOLEiTopjoiIiAyLVsHCxMQEQ4YM0Vjn7OwsaUFERERkuLQKFp9++ilkMlm5bYsWLZKkICIiIjJcWgWLli1baiynpKTg0KFD8PX1lbQoIiIiMkxaBYsJEyaUWTd48GCsXLlSsoKIiIjIcD33R3q3a9cOZ86ckaIWIiIiMnBazVicOnVKY7mgoAD79+/HK6+8ImlRREREZJi0ChazZs3SWK5Xrx7atm2Lb7/9VtKiiIiIyDBpFSwiIiJ0VQcRERG9ALS6x8Lf37/c9QMHDpSkGCIiIjJsWgWL0m82fZwgCLhz545kBREREZHhqtKlkM8++wwAUFxcLP5c6u7du7x5k4iIiABUMVi0aNGi3J8BoGvXrujbt6+0VREREZFBqlKwKP1grM6dO8Pd3V2nBdGLQVWQibyEo/ouo0pUBZkAYDD1liqpu7G+yyAi0qDVu0Lc3
d1RVFSEW7duISMjA4IgiG0uLi6SF0eGqXnzlk/fqBLGxgoUF6skqubpsrJK/gwaNrSotmNWpurjb/zcv2siIqlpFSzOnj2LKVOmoKioCDk5OTA3N0dubi6aNm2Ko0cN69Ue6c6wYaOea39r6/pIScmWqBrDU9vHT0SGTat3hXz99dcYO3Yszpw5AzMzM5w5cwZBQUEYNmyYruojIiIiA6JVsLh9+zZGjdJ8NRoQEICff/5ZypqIiIjIQGkVLOrXr4+cnBwAgLW1NeLj4/Ho0SPk5eXppDgiIiIyLFrdY9G7d29ERkbC19cXgwYNwqhRo2BkZIQ+ffroqj4iIiIyIM/8JWT/+te/0LlzZ+Tm5sLDw0PywoiIiMjwaHUpJCQkRGO5e/fu8PT0xIIFCyQtioiIiAyTVsFi586d5a4PDw+XpBgiIiIybFW6FLJ9+3YAgEqlEn8ulZSUBAuLmvHBQkRERKRfVQoWe/bsAVDyJWSlPwOATCZD48aNERoaqpvqiIiIyKBUKVj88ssvAIAlS5bg448/1mlBREREZLi0usdi9OjRyM3NBVByWWTHjh3YvXs31Gq1ToojIiIiw6JVsAgMDERCQgIA4N///jfWrVuH9evXY+HChTopjoiIiAyL1h/p3a5dOwDA3r17sXbtWmzYsAEHDhzQSXFERERkWLT6gCy5XI7i4mLcunUL9evXh52dHdRqtXh5hIiIiGo3rYKFh4cHJk+ejMzMTPj4+AAA4uPjYWNjo5PiiIiIyLBoFSzmz5+PXbt2wcjICP7+/gCAjIwMTJw4USfFERERkWHRKliYmJhgyJAhGuucnZ01ln19fbF3797nr4yIiIgMjlY3b1bFnTt3pO6SiIiIDITkwUImk0ndJRERERkIyYMFERER1V4MFkRERCQZyYOFIAhSd0lEREQG4pmChVqtxsOHD8tt++qrr56rICIiIjJcWgWLR48e4ZNPPkGnTp3w1ltvAQCOHj2KJUuWiNv4+vpKWyEREREZDK0+x2LOnDlo0KABIiIi8M477wAAHB0dERoayq9TN3BhYf9BUlKCvssAABgbK1BcrJKkr6ysTABAw4YWkvRXHSoaf/PmLTFs2Cg9VEREVHVaBYtTp04hKioKxsbG4lPcmjcAABq9SURBVNtKrayskJaWppPiqPokJSXgdvw1NDXX6pTQiQIJ+8rIUQIA6uSnStirbpU3/vv/HQcRUU2n1bNI/fr1kZGRgSZNmojrkpOTYW1tLXlhVP2amhvhw05W+i5DUuvj0gHA4MdVOg4ioppOq3ss3n33XUyaNAmnT5+GWq3GhQsXEBwcjPfff19X9REREZEB0WrG4qOPPkKdOnXw1VdfQalUYubMmRgyZAhGjx6tq/qIiIjIgGgVLGQyGUaPHs0gQUREROXS6lLI6dOnkZSUBABISUlBcHAwZsyYgZSUFJ0UR0RERIZFq2Dx5ZdfQqFQAAAWLlwIpVIJmUyGzz//XCfFERERkWHR6lLIgwcPYGdnB6VSiejoaERERMDY2Bju7u66qo+IiIgMiFbBwtzcHKmpqfj777/x8ssvw8zMDEVFRVAq+R57IiIi0jJYjBgxAoMHD0ZxcTFmzpwJADh//jxat26tk+KIiIjIsGgVLAICAtC7d28oFAq0aNECAGBjY4OQkBCdFEdERESGRevPb37ppZcqXSYiIqLaS6tgkZOTg+XLlyM2NhYZGRkQBEFsO378uNS1ERERkYHR6u2mc+fOxV9//YVx48YhMzMTs2fPhq2tLT744AMdlUdERESGRKsZi5MnT+LAgQOwtLSEQqGAt7c3OnbsiP/7v/9juCAiIiLtZizUajXq168PAKhXrx6ys7NhbW2NhIQEnRRHREREhkWrGYu2bdsiNjYWLi4u6N69O+bOnQszMzO0atVKR+URERGRIdFqxiIkJATNmjUDAMyaNQumpqbIzs7G4sWLdVIcERERGRatgsWGDRvELxxr1KgR5s+fj5EjR2LLl
i06KY6IiIgMi1bBYt++fejQoYPGug4dOmDfvn2SFkVERESGSatgIZPJND67AgBUKhXUarWkRREREZFh0ipYdO/eHUuXLhWDhFqtxvLly9G9e3edFEdERESGRat3hcyaNQuBgYFwc3ODnZ0d7t27B2tra6xatUpX9REREZEB0SpYNG3aFLt27UJcXBzu3bsHW1tbdOrUCXK5VhMfRERE9ILS+kvI5HI5unTpgi5duuiiHiIiIjJgnGogIiIiyWg9Y0G6d/LkCTRoUBcdOzrpuxQiqoVOnjwBAHB19dBzJWSIGCxqoOjoSBgbKxgsiEgvoqMjATBY0LPhpRAiIiKSDIMFERERSYbBgoiIiCTDYEFERESSYbAgIiIiyTBYEBERkWQYLIiIiEgyDBZEREQkGQYLIiIikgyDBREREUmGwYKIiIgkw2BBREREkmGwICIiIskwWBAREZFkGCyIiIhIMgwWREREJBkGCyIiIpIMgwURERFJhsGCiIiIJMNgQURERJJhsCAiIiLJMFgQERGRZIz0XQAREdUsBQX5SEi4jcDA0QAAQRCgVCrRqFFjpKenQaFQQC6Xo0kTG8jlChgZGWHixKnIyspEaOg8DB48FL/8sg4ymQyDBr2HHTu24pNPZsDOzh5Lly7Gw4f3MX78VOzduxNBQZPQsKEFACAx8Ta+/vpLCIIgHsvKqhHS09Ngbd0EcrkCxcVFuH//Huzs7PHRR+MRFrZB7CMzM0Oj/127tkKlUkKpVCItLRUzZsxB/foNsGLFEiiVShgbG2HChKni8UtrCA2dh+nTv0Dz5i1x5Uoc/v3vUAQGTkRExGGNegEgMzMDq1Yt16ihdFkQhHJ/fnz/qnh8XDNmzEHz5i213v9Zj/0sFHPnzp2r86NUgZeXFzZv3oytW7di3bp1OHXqFOzs7GBra/vMfX733XdIT0/Hq6++qtV+06dPx/3799GpU6cq75OfXwRB0LbC8p08eQIKhRwuLu7SdFjFYypz0uFoU7fajlkdLj7IBwCDH9fFB/kwMreCm5unvkvRKTOzOsjLK9J3GXpTU8a/ZcsvUKlUUKvV4n8AkJ+fBwBQq9VQqVR49OgRsrIykZGRjqKiQhw+fAAZGemIi7so9vXXX1cAABcvnkdubg4uXToPpVKJS5fO48GD+ygsLETnzo4wM6uDL7+cg8zMDKhUKuTkZGv8W3qsnJxsAEB29iPcuHENCQm3xT62bdus0f/Dhw+QlZWJ7OxHUCqVuHHjGtLSUnHhwlmNujt3dhTrXbx4PjIy0nHjxjV4eb2Fr76ajeLiIly4cA5paSnisUpt27YZ58/HatRQunzjxrVyf358/1KVPfaPj6u0Lm08WaMUZDIZ6tUzKbetRl0KWbZsGcLDw/H7779jwIABCAgIwKVLl565v8mTJ8PHx0fCComIXmyJibdRVKR9uDlx4hiSk+/+d6nsq6y8vFwcP35UY1kQBERHn0BWVib++eefx/avmuTku2IfiYm3ceLEMY3+y9v+xInjGuuioyORlZUJoGTspTUkJ99FRMQRsR+VSqlRL1AyExAdHalRw/+WIxEVdRyCICAqKhJRUZFl9q+KzMwMjZqTk+8iKSlBq/0fr1GbYz+rGnsp5K233kJcXBx++uknfPPNN1iyZAliY2NRVFQEBwcHzJ07F2ZmZpg+fTpMTExw+/Zt3L9/H126dEFoaChkMhmmT5+ODh06YMSIESgqKqqwjwcPHuCzzz5DSkoK7O3tIZfrN29lZWXi0aMshIbOq7ZjJiYmoB7U1XY80k5OkRoPExOq9ZzQB2NjBYqLVfouQ29qwvhv3br5TPupVE+vWyhnWletViM8fCfi468/03FL+1izZmWValCplBrLSqUS4eE7MXLkGKxZs1KjbePG9RXWO3LkGOzduwtqtaBRQ+myUqkUZ7GVSiVksrL7V8XevbvK1Lx69QqEhCyu8v6P16jNsZ9VjZqxeFLnzp0RHx+PH3/8EfXr18f27
dsRHh6OJk2aYM2aNeJ2f//9N9auXYt9+/bhypUr+OOPP8r0VVkfISEhcHJywoEDB/DFF1/gzJkz1TZGIqKa5FlmK56HSqXEqVMnkZSU9Fx9aDvbUUoQBJw6dRIAyumjbBAqrRcATp06KT7pl9ZQulwSogSxn9JQ9fj+VVHettqM9ckatTn2s6qxMxbA/9JtREQEcnJycOjQIQAlJ37btm3F7by9vVGnTh0AwGuvvYbExES4urpq9FVZHzExMZg9ezYAoHnz5nBxcdHtwJ6iYUMLNG7cCFOnzqy2Y4aGzkPB/fhqOx5px9xEjsZNWyI4+HN9l6JT1tb1kZKSre8y9KYmjH/27E+f+Un6WSgURnBxcUV8/PVnDhcKhRFsbGyeqW6ZTAYXl5LnCzs7+yf6kOHJcFFaLwC4uLjixInjUKmUYg0PHjyASqWETCb774yFAEAGmazkOe3x/avCxcUVx44d0VhnZ2ev1f6P16jNsZ9VjZ6xuHz5Ml599VUIgoA5c+Zgz5492LNnD3777TcsWbJE3K40VACAQqEodzrsaX0QEREQEDD+mfZTKBRP3UZWej3gMXK5HH5+AzFt2rRnOm5pHwEB46tUg0Kh+XrayMgIfn4DAZQd+4gRH1ZYLwD4+g6AXC7TqKF02cjICEZGCvHn0uM+vn9V+PoOKFNzYOAErfZ/vEZtjv2samywOHLkCDZv3owxY8bAy8sLP//8MwoKCgAAOTk5uHlTu+uAlfXRs2dP7NixAwCQlJSEU6dOSTgSIiLD0aJFK5iYlH+3f2U8PN587JV02QBRr54Z3nijl8ayTCaDm5sHGja0QOvWrbV6JQ6UvHIv7aNFi1bw8HhTo//ytvfweENjnZubp/gWzBYtWok12NnZw8vLW+xHoTDSqBcALCws4ebmqVHD/5Y94e7+BmQyGdzdPeHu7llm/6qwsLDUqNnOzl6rt5s+WWN1vN20RgWLSZMmwc/PD71798b27duxZs0adO7cGQEBAWjbti0GDx4MX19fDBs2TOtgUVkfs2bNQkxMDHx8fDBv3jw4OzvrYnhERAbB1tYOAGBsbAxjY2MYGZW8Ym7UqDFkMhmMjIxgYmKCZs2ao0WLVmjd+hX4+Q1EQMB41K1bFyNHlrzSl8lkGDx4CGQyGcaNmwxf3wFo0aIVTE1NERQ0Ga++6qDxCjogYDzq1KkDExMT2NraoU6dOuK/pceytbWDTCaDvX0zBARM0Ojjyf5bt34FLVu2gr19M5iamiIwcAJ8fQegdetX0KJFK7z88itlXsGXjqF0ViAoaCJkMhk++mhcmXpLj/lkDaXLFf2srcfHpc1sRUU16ppMKO82XdJaWlqOeOft8woNnQdjY4Ve7rH4sJNVtR2zOqyPSwcAgx/X+rh0mDZ9hfdYvOBqyvhL331UnedbTRm7vhja+OVyGRo1Mi+/rZprISIiohcYgwURERFJhsGCiIiIJMNgQURERJJhsCAiIiLJMFgQERGRZBgsiIiISDIMFkRERCQZBgsiIiKSDIMFERERSYbBgoiIiCTDYEFERESSYbAgIiIiyTBYEBERkWQYLIiIiEgyDBZEREQkGQYLIiIikgyDBREREUmGwYKIiIgkw2BBREREkmGwICIiIskwWBAREZFkjPRdAJXl5uaJBg3q6rsMIqql3Nw89V0CGTAGixrI1dUD1tb1kZKSre9SiKgWcnX10HcJZMB4KYSIiIgkw2BBREREkmGwICIiIskwWBAREZFkGCyIiIhIMgwWREREJBkGCyIiIpIMgwURERFJhsGCiIiIJMNgQURERJJhsCAiIiLJMFgQERGRZBgsiIiISDIMFkRERCQZBgsiIiKSDIMFERERSYbBgoiIiCTDYEFERESSYbAgIiIiyTBYEBERkWQYLIiIiEgyDBZEREQkGQYLIiIikgyDBREREUnGSN8FUM1xP0eJ9XHp+i5DUvdzlABg8OO6n6NEK30XQURUBQwWBABo3rylvksQGRsrU
FyskqQvy6xMAIBpQwtJ+qsO5Y2/FWrWY0REVBEGCwIADBs2St8liKyt6yMlJVvfZehNbR8/ERk23mNBREREkmGwICIiIskwWBAREZFkGCyIiIhIMgwWREREJBkGCyIiIpIMgwURERFJhsGCiIiIJMNgQURERJJhsCAiIiLJ8CO9JSKXywyiT0NRm8cO1O7x1+axA7V7/LV57IBhjb+yWmWCIAjVWAsRERG9wHgphIiIiCTDYEFERESSYbAgIiIiyTBYEBERkWQYLIiIiEgyDBZEREQkGQYLIiIikgyDBREREUmGwYKIiIgkw2BRw9y6dQtDhgxBnz59MGTIENy+fVvfJT0TLy8v9O3bF/3790f//v0RFRUFALh48SL8/PzQp08fjBkzBmlpaeI+umirDqGhofDy8oKDgwNu3Lghrq/ssazuNl2qaPwVnQPAi3MeZGRk4KOPPkKfPn3g6+uLCRMmID09XS9jrO7xVzZ2BwcH+Pr6io/99evXxf0iIiLQt29f9O7dG1OmTEF+fr5O23Rp3Lhx8PPzg7+/P4YNG4arV68CqD1/+xUSqEYZOXKksHv3bkEQBGH37t3CyJEj9VzRs3nzzTeF69eva6xTqVSCt7e3EBsbKwiCIKxcuVKYPn26ztqqS2xsrJCcnFxmzJU9ltXdpksVjb+8c0AQXqzzICMjQzh9+rS4vHDhQmHGjBnVPkZ9jL+isQuCILRp00bIyckps09OTo7w+uuvC7du3RIEQRBmzpwpLF++XGdtuvbo0SPx599//13w9/cXBKH2/O1XhMGiBklNTRW6desmKJVKQRAEQalUCt26dRPS0tL0XJn2yntSuXTpkvDOO++Iy2lpaUKXLl101lbdHh9zZY9ldbfpY/zlLZd6kc+DgwcPCqNHj672MdaE8ZeOXRAqDhYHDhwQAgICxOW4uDjBx8dHZ23VadeuXcKAAQNq5d/+k/jtpjXIvXv3YGNjA4VCAQBQKBRo0qQJ7t27BysrKz1Xp71p06ZBEAR069YNU6dOxb1792BnZye2W1lZQa1WIzMzUydtFhYW1TPQclT2WAqCUK1t+jx3njwHGjRo8MKeB2q1Gps3b4aXl1e1j1Hf43987KVGjhwJlUoFDw8PTJw4ESYmJmXqtLOzw7179wBAJ23VYdasWTh58iQEQcCPP/7Iv33wHgvSkU2bNiE8PBw7duyAIAj46quv9F0SVbPadg7MmzcP9erVw4gRI/RdSrV7cuzHjx/Hzp07sWnTJsTHx2PlypV6rlB35s+fj+PHj+Pjjz/GokWL9F1OjcBgUYPY2triwYMHUKlUAACVSoWHDx/C1tZWz5Vpr7RmExMTDBs2DOfPn4etrS2Sk5PFbdLT0yGXy2FhYaGTNn2q7LGs7jZ9Ke8cKF3/op0HoaGhSEhIwNKlSyGXy6t9jPoc/5NjB/732Jubm+Pdd9+t8LFPTk4Wt9VFW3Xy9/dHTEwMmjZtWuv/9hksapBGjRqhXbt22LdvHwBg3759aNeuncFdBsnLy0N2djYAQBAEHDhwAO3atUOHDh1QUFCAs2fPAgC2bNmCvn37AoBO2vSpsseyutv0oaJzANDNY63P8+Df//43/vzzT6xcuRImJiZ6GaO+xl/e2LOyslBQUAAAUCqVOHTokPjYu7u74/Lly+K7FrZs2YK3335bZ226lJubq3HJJSIiAg0bNqz1f/sAIBMEQdDb0amMmzdvYvr06Xj06BEaNGiA0NBQtG7dWt9laSUpKQkTJ06ESqWCWq3Gyy+/jNmzZ6NJkyY4f/485syZg8LCQtjb22Px4sVo3LgxAOikrTqEhITg8OHDSE1NhaWlJSwsLLB///5KH8vqbqvu8a9atarCcwDQzWOtj/Pg77//Rr9+/dCqVSuYmpoCAJo1a4aVK1dW+xire/wVjX3s2LH44osvIJPJoFQq4ejoiJkzZ8LMzAwAcOTIESxevBhqtRrt2rXDwoULUa9ePZ216UpqairGjRuH/
Px8yOVyNGzYEMHBwWjfvn2t+duvCIMFERERSYaXQoiIiEgyDBZEREQkGQYLIiIikgyDBREREUmGwYKIiIgkw2BBRHq3atUqzJo1CwBw584dODg4QKlUAgDGjh2LXbt26bM8ItIC325KRDXKnTt30KtXL1y5cgVGRvw6IyJDwxkLIqqS0hkEIqLKMFgQ1XJeXl5YvXo1fHx84OTkhBkzZqCwsBAxMTHw8PDAmjVr4OrqihkzZkCtVmPNmjXw9vaGs7MzJk+ejMzMTAAllyw2btyo0befnx8OHz4MoOQTOj09PdG1a1cMHDhQ/PhpAFi+fDmmTZtWbn0jR47Etm3bAAA7d+7E0KFDERoaCicnJ3h5eSEyMlLcNikpCcOHD4ejoyM++OADfPnllxX2Wyo4OBjr1q0DADx48AAODg7YtGkTACAxMRE9evSAWq1GVlYWAgMD0bNnTzg5OSEwMBD379+v8rEvXryI999/H927d4efnx9iYmIqf2CIDBSDBRFh7969+Omnn/D777/j1q1b+P777wGUfGxxVlYWjh07hnnz5uGXX37BkSNHsHHjRkRFRaFhw4bit5b269dP/L4CAIiPj0dycjLeeOMNAEDHjh2xe/dunDlzBv369cPkyZNRWFioda1xcXF46aWXcPr0aYwdOxazZs1C6RXdadOmoVOnToiJicGECROwZ8+ep/bn5OSEM2fOAADOnDmD5s2bIzY2Vlzu1q0b5HI51Go1Bg4ciGPHjuHYsWOoU6eOxje2VnbsBw8eIDAwEEFBQThz5gyCg4MxadIkpKenaz1+opqOwYKIMHz4cNja2sLCwgJBQUHYv38/AEAul2PSpEkwMTGBqakptmzZgo8//hhNmzaFiYkJJkyYgEOHDkGpVMLb2xvXrl3D3bt3AZSEld69e4tfTtW/f39YWlrCyMgIY8aMQVFREW7duqV1rXZ2dnjvvfegUCgwYMAApKSkIDU1FcnJybh8+bJYb/fu3eHl5fXU/nr06IFz585BrVYjNjYWY8eOFb+NMzY2Fj169AAAWFpaok+fPqhbty7Mzc0RFBQkBpCnHXvPnj3w8PCAp6cn5HI5XF1d0aFDB43ZFqIXBe+MIiKNr1i2s7PDw4cPAZQ8mdapU0dsS05Oxvjx48WvxwZKwkdaWhpsbGzg6emJ/fv3IyAgAPv27UNISIi43U8//YTt27fj4cOHkMlkyMnJQUZGhta1Pv7FWnXr1gVQ8m2qGRkZaNiwobiudFyPfwNleVq0aIG6devi6tWrOHfuHMaPH4/t27fjn3/+QWxsLEaOHAkAyM/Px9dff42oqChkZWUBKPmGy9Kvqa7s2MnJyTh48CCOHTsmtiuVSjg7O2s9fqKajsGCiDSefJOTk8VvIZXJZBrbNW3aFAsWLEC3bt3K7adfv35YsWIFnJycUFhYKD5xnj17Fj/++CN+/vlnvPrqq5DL5XBycoKUb0qztrZGVlYW8vPzxSf4p4WKUk5OTjh06BCKi4thY2MDJycn7N69G1lZWeJXfq9btw63bt3C1q1bYW1tjatXr8Lf3x+CIDz12La2tujfv79G0CJ6UfFSCBEhLCwM9+/fR2ZmJlatWgUfH59ytxs6dCiWLl0qXu5IT0/HkSNHxHZPT08kJydj2bJl8PHxEWc2cnNzoVAoYGVlBaVSiRUrViAnJ0fSMdjb26NDhw5Yvnw5ioqKcOHCBY0Zgsr06NEDGzduRPfu3QEAzs7O2LhxI7p16waFQiGOoU6dOmjQoAEyMzOxYsWKKh/bz88Px44dQ1RUFFQqlXhz7OM3fxK9KBgsiAj9+vXDmDFj4O3tjRYtWiAoKKjc7UaNGgUvLy+MGTMGjo6OeO+99xAXFye2m5iYoHfv3vjjjz/Qr18/cb2bmxvc3d3Rp08feHl5oU6dOhqXX6TyzTff4OLFi3B2dsbSpUvh4+Mj3uNRGScnJ+Tm5sLJyQkA0K1bNxQUFIhBAwBGjx6NwsJC9OzZE0OGD
IG7u3uVj21ra4vvv/8eq1evhouLCzw9PfHTTz9BrVZLOHqimoEfkEVUy3l5eSEkJASvv/66vkuR3JQpU9C6dWtMmjSpVh2bSJ84Y0FEL4y4uDgkJiZCrVbjxIkTOHr0KLy9vV/4YxPVJLx5k4heGKmpqZg4cSIyMzPRtGlTzJ07F6+99hrCw8MxZ86cMtvb2dmJb63V1bGJahteCiEiIiLJ8FIIERERSYbBgoiIiCTDYEFERESSYbAgIiIiyTBYEBERkWQYLIiIiEgy/w+RKtRtHjrkcAAAAABJRU5ErkJggg==\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "visa.groupby('case_status')['prevailing_wage'].describe().T" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 300 }, "id": "GAA2oCQKi2ZO", "outputId": "88d87f35-e113-49e0-84d7-dbbe3ba5eb01" }, "id": "GAA2oCQKi2ZO", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "case_status Certified Denied\n", "count 17018.000000 8462.000000\n", "mean 77293.619243 68748.681580\n", "std 52042.715576 53890.166031\n", "min 2.136700 2.956100\n", "25% 38375.330000 23497.295000\n", "50% 72486.270000 65431.460000\n", "75% 108879.107500 105097.640000\n", "max 318446.050000 319210.270000" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 51 } ] }, { "cell_type": "markdown", "source": [ "The mean prevailing wage for certified applications is about \\$8,500 higher than that for denied applications, and the middle 50% of certified wages sits higher as well. Interestingly, the whiskers extend further for denied applications. Together with their wider IQR (roughly \\$81,600 versus \\$70,500), this indicates that prevailing wages are more dispersed among denied applications.\n", "\n", "The same holds across various regions." ], "metadata": { "id": "jgq3D4hNiz4B" }, "id": "jgq3D4hNiz4B" }, { "cell_type": "code", "source": [ "# one boxplot panel per region of employment\n", "plt.figure(figsize=(15,15))\n", "for idx,region in enumerate(visa['region_of_employment'].unique()):\n", "    plt.subplot(3,2,idx+1)\n", "    plt.title('Prevailing wage in '+region)\n", "    sns.boxplot(data=visa.loc[visa['region_of_employment']==region],\n", "                x='prevailing_wage',\n", "                y='case_status',\n", "                order=['Certified','Denied'])\n", "plt.tight_layout()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 975 }, "id": "VHiACbj9j9_x", "outputId": "30e838dd-2f8b-4ab5-a339-d86fb3d5569e" }, "id": "VHiACbj9j9_x", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Denied applications in the northeast seem to have some of the lowest prevailing wages. The prevailing wages for the island region, on the other hand, appear generally higher than the rest, with certified cases having the highest middle 50% among the regions. One plausible explanation is the higher cost of living on islands, where most goods must be imported." ], "metadata": { "id": "oqwzvPPsll7k" }, "id": "oqwzvPPsll7k" }, { "cell_type": "markdown", "id": "alleged-spirituality", "metadata": { "id": "alleged-spirituality" }, "source": [ "## Data Preprocessing" ] }, { "cell_type": "markdown", "source": [ "### Feature Engineering" ], "metadata": { "id": "D42w4a2XME_w" }, "id": "D42w4a2XME_w" }, { "cell_type": "code", "execution_count": null, "id": "increasing-louisiana", "metadata": { "id": "increasing-louisiana" }, "outputs": [], "source": [ "# years in business, measured relative to 2016\n", "visa['years_in_business']=2016-visa['yr_of_estab']\n", "visa.drop('yr_of_estab',axis=1,inplace=True)" ] }, { "cell_type": "markdown", "source": [ "Rather than keeping the year each business was established, we instead use the number of years in business, measured relative to 2016." ], "metadata": { "id": "5jZn2oPQmVeI" }, "id": "5jZn2oPQmVeI" }, { "cell_type": "markdown", "source": [ "### Outlier detection and treatment" ], "metadata": { "id": "p0zlVioEL_Ur" }, "id": "p0zlVioEL_Ur" }, { "cell_type": "code", "source": [ "visa.loc[visa['no_of_employees']<1].shape" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OMci272TZbRL", "outputId": "ceb4919f-3308-4146-c9b7-8b755d2dd3e5" }, "id": "OMci272TZbRL", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(33, 12)" ] }, "metadata": {}, "execution_count": 54 } ] }, { "cell_type": "markdown", "source": [ "There are 33 records where the application lists a negative number of employees. 
These are clearly data-entry errors, and with no principled way to impute the values, our best recourse is to drop the rows." ], "metadata": { "id": "T1F5fZnSmg7T" }, "id": "T1F5fZnSmg7T" }, { "cell_type": "code", "source": [ "# drop the rows with an invalid employee count\n", "idx=visa.loc[visa['no_of_employees']<1].index\n", "visa.drop(idx,axis=0,inplace=True)" ], "metadata": { "id": "wYh2e25MZ8gK" }, "id": "wYh2e25MZ8gK", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "### Data Prep" ], "metadata": { "id": "98y_YHTcsxfY" }, "id": "98y_YHTcsxfY" }, { "cell_type": "code", "source": [ "visa.drop('case_id',axis=1,inplace=True)" ], "metadata": { "id": "dMdzuF2Ss6Bx" }, "id": "dMdzuF2Ss6Bx", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "We drop `case_id`, since a unique identifier lends no predictive power to our model." ], "metadata": { "id": "-dH6OfgEm8lX" }, "id": "-dH6OfgEm8lX" }, { "cell_type": "code", "source": [ "# convert object dtype to category\n", "for col in visa.select_dtypes('object').columns:\n", "    visa[col]=pd.Categorical(visa[col])" ], "metadata": { "id": "QwEcXShij5xt" }, "id": "QwEcXShij5xt", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "After converting all object columns to the category dtype, we encode them numerically." 
], "metadata": { "id": "bIOODwvqnEqM" }, "id": "bIOODwvqnEqM" }, { "cell_type": "code", "source": [ "# define categorical ordering\n", "order_struct={\n", "    'education_of_employee':{'High School':1,\n", "                             \"Bachelor's\":2,\n", "                             \"Master's\":3,\n", "                             'Doctorate':4},\n", "    'has_job_experience':{'Y':1,'N':0},\n", "    'requires_job_training':{'Y':1,'N':0},\n", "    'full_time_position':{'Y':1,'N':0},\n", "    'case_status':{'Certified':1,'Denied':0}\n", "    }\n", "# a list (not a set) keeps the dummy-column order deterministic;\n", "# recent pandas versions also reject a set as a column indexer\n", "no_order=['continent','region_of_employment','unit_of_wage']" ], "metadata": { "id": "mNz4WKZokEJQ" }, "id": "mNz4WKZokEJQ", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# convert data to numeric\n", "visa.replace(order_struct,inplace=True)\n", "visa=pd.get_dummies(visa,columns=no_order)" ], "metadata": { "id": "qpng9JadkEG9" }, "id": "qpng9JadkEG9", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "We use one-hot encoding for the categorical variables with no inherent order." ], "metadata": { "id": "BMrp-4oNnRWN" }, "id": "BMrp-4oNnRWN" }, { "cell_type": "code", "source": [ "X=visa.drop('case_status',axis=1)\n", "y=visa['case_status']" ], "metadata": { "id": "27lzEJkwl3ks" }, "id": "27lzEJkwl3ks", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# stratified 70/30 train/test split\n", "X_train,X_test,y_train,y_test=train_test_split(X,y,\n", "                                               test_size=0.3,\n", "                                               stratify=y,\n", "                                               random_state=57)" ], "metadata": { "id": "SfnEoAyol8_C" }, "id": "SfnEoAyol8_C", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now that we've split our data, it is ready for modeling." ], "metadata": { "id": "TvDh5DZJzqbx" }, "id": "TvDh5DZJzqbx" }, { "cell_type": "markdown", "id": "difficult-union", "metadata": { "id": "difficult-union" }, "source": [ "## Second EDA\n" ] }, { "cell_type": "markdown", "source": [ "We removed only 33 of the 25,480 records, which is about 0.13% of the data. 
Accordingly, there is not enough change in our data to produce appreciably different visualizations. We only include plots for features that were directly altered/engineered. " ], "metadata": { "id": "7Vcxvad0iiBJ" }, "id": "7Vcxvad0iiBJ" }, { "cell_type": "code", "execution_count": null, "id": "interested-talent", "metadata": { "id": "interested-talent", "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "outputId": "403235c3-abaa-4675-f829-a33baf55be9d" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ], "source": [ "plott('years_in_business')" ] }, { "cell_type": "code", "source": [ "plott()\n", "plt.title('Boxplot of Years Since Founding',fontsize=14)\n", "sns.boxplot(data=visa,x='years_in_business');" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "CUQLZS-aCsAq", "outputId": "819b4097-3aad-4102-a915-c88aeb14ce0f" }, "id": "CUQLZS-aCsAq", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAcwAAAFTCAYAAAC06zwQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3deVxVdf7H8TergGTmgvuSNqJDIihqueSITppCNE6jjUWaGmbiMuaC2lDmkrhkpuSS1tg0mpqluYxjP9vU1DAzK00SE1xKcRdBWe75/cFwp5ssXxW5gq/n4+Hj4T3L9/s533sub77nXO51sSzLEgAAKJSrswsAAKA0IDABADBAYAIAYIDABADAAIEJAIABAhMAAAMEJkpEZGSkXnrpJaf0nZGRoaFDh6pFixby9/fX0aNHnVJHSQkNDdXixYudXcYta/HixQoNDbU/njNnjsLCwpxYEUoLArMMi4mJkb+/v/1f69atNXDgQCUlJTm7tCK9//77Cg4OLpa2Vq1apYSEBC1dulRbt25VjRo1HNZv27ZNAQEB+vrrrx2WX758WV26dNELL7xQLHUUhyNHjmjkyJF64IEHdO+996pdu3aKiorSvn377Nu899576t27txOrlHbu3Olw7uX9mzx5slPryk+/fv30z3/+09lloBRwd3YBuLnatGmjadOmSZJOnjypadOmKTo6Wv/+97+dXFnJSU5OVsOGDeXv75/v+rZt2+qxxx5TTEyMVq9eLW9vb0nSzJkzZVmWxowZU+w1ZWdny83NTS4uLsb7ZGVlqV+/fqpbt65effVVVatWTSdPntS2bdt0/vx5+3aVKlUq9nqv1/r163XnnXfaH+eN7a2kfPnyKl++vLPLQCnADLOM8/T0VNWqVVW1alUFBASob9++OnTokC5fvmzf5sCBA+rbt68CAwPVqlUrxcTE6OLFi5KkL7/8UgEBAdq5c6d9+3fffVfNmzfXkSNHJOVebo2NjdWkSZPUsmVLtWzZUnFxcbLZbAXWdf78eY0ZM0YtW7ZUYGCg+vbtqx9//FFS7uxk7NixSk9Pt89M5syZU2BbmzZtUnh4uO6991516NBB8+bNU94HWEVGRurtt99WQkKC/P39FRkZmW8bo0aNkiTNmDHDftzLli1TXFycvL299cYbb6hz584KDAxUeHi41qxZ47D/jBkz1KVLFwUGBio0NFTTpk3TlStX7OvzLvu9//776ty5s5o2bar09HQlJCSoZ8+eCg4OVosWLfToo48qMTEx3xoPHjyolJQUxcbGqnnz5qpVq5aCg4MVHR2t+++/377dby/J+vv7a/ny5Ro6dKiCgoLUqVOnq+o/ceKEnnvuObVu3VrNmjVTRESEduzYYV//8ccfq0ePHmratKlCQ0M1a9YsZWZmFvic5KlUqZL9/Ktatap8fX0lFf78S/lfYcibtZ45c8Zhm+3btyssLExBQUGKjIy0n5d53njjDbVt21bBwcEaPXq00tPTHdb/9pJsTEyMBg4cqCVLlqh9+/Zq2bKlxo4dq4yMDPs26enpGj16tIKDg9WmTRstWLBAAwcOVExMTJFjgtKLwLyNpKWlacOGDWrUqJG8vLwk5b7w+/fvLx8fH61cuVJz587V119/rXHjxkmSWrVqpf79+2v06NE6f/68kpKSNHXqVP39739XnTp17G2vXbtWlmXp3Xff1YQJE7RixQotWbKkwFpiYmL0zTff6PXXX9fKlSvl5eWlAQMG6PLlywoODta4cePk7e2trVu3auvWrerXr1++7Xz33XcaNmyY/vjHP2rt2rV67rnntHDhQr3zzjuScn8Y9ujRQ8HBwdq6dWuBwevl5aXp06dr+fLl2rx5s8aOHav+/fsrODhYr776qt577z3FxsZq/fr1ioqK0gsvvKBPP/3Uvr+3t7emTJmiDRs26IUXXtCGDRs0b948hz6OHj2qdevWafbs2
VqzZo3KlSunZ599Vi1atNCaNWu0YsUK9enTR25ubvnWWKlSJbm6uuo///mPsrOzCxzb/MTHx9uDslu3bho/fryOHz8uKfcciIyM1LFjxxQfH6+1a9dq8ODB9n23bNmikSNH6vHHH9f69es1ZcoUbdy4UbNmzbqmGn6tsOf/WmRmZmrBggWaMmWK3n33XV28eFEvvviiff2GDRs0e/ZsDRkyRO+//77uvvtuvfXWW0W2u2vXLv3444/6xz/+oVmzZumjjz7S22+/bV8/depUJSQkaO7cuVqyZIl++OEH7dq165pqRylkocwaM2aM1aRJEysoKMgKCgqyGjVqZHXo0ME6cOCAfZvly5dbzZs3ty5evGhftmPHDqtRo0bW4cOHLcuyrMzMTKtHjx7W4MGDrUceecQaNmyYQz9PPPGE9eCDD1o2m82+LD4+3mrfvr3DNhMmTLAsy7J++uknq1GjRtaXX35pX3/hwgWrefPm1ooVKyzLsqxVq1ZZQUFBRR7jiBEjrMjISIdlr732mkPfEyZMsJ544oki27Isy3r11Vetxo0bWxEREVZmZqZ16dIlq2nTplZCQoLDdpMmTbIGDBhQYDtLly61Onfu7FDT73//eys1NdW+7OzZs1ajRo2snTt3GtVmWZb1zjvvWM2aNbOCgoKsxx9/3Jo1a5aVmJjosE3Hjh2tRYsW2R83atTImjFjhv1xVlaWFRgYaK1evdqyrNxzICgoyDp9+nS+ffbu3duaO3euw7KPPvrICgoKcnjOfy3vHMo79/L+HTt27Lqf/7w28+pctWqV1ahRIyspKcm+zZo1a6yAgAB7Xb169bLGjx/v0E6fPn2sjh072h+/9tprVvfu3e2Px4wZYz3wwANWdna2fdn48eOtPn36WJZlWWlpaVZAQIC1bt06+/pLly5ZISEh1pgxY/IdD5QN3MMs40JCQjRx4kRJuZfBli1bpn79+mnlypWqUaOGkpKS5O/vb79UJknBwcFydXXVwYMHVa9ePXl4eGjmzJkKCwtTpUqV8p05NmvWzOF+XHBwsGbPnq20tDSHtiUpKSlJrq6uCgoKsi+744471KhRIx08ePCaju/QoUPq0KGDw7IWLVpo7ty5+fZdlMGDB+v111/X008/LQ8PD+3fv19XrlzRgAEDHI4vKytLtWrVsj/euHGjlixZopSUFKWnpysnJ+eqS9LVqlVTlSpV7I8rVqyoHj16qH///rr//vt1//33q0uXLqpZs2aB9T3++OOKiIjQzp07tXfvXm3evFlvvPGGJk+erEceeaTA/X59/9bd3V2VKlWyX9rct2+f/P39C7z3+f3332vv3r1atGiRfZnNZtPly5eVmpoqPz+/AvtdsmSJwz1MPz8/ffbZZ8X2/Ht6eqpBgwYO7WdlZen8+fOqWLGikpKS9OijjzrsExQUpJSUlELbveeeexxm+n5+fvrmm28k5b7xKisrS4GBgfb1Pj4++t3vfndNtaP0ITDLOG9vb9WrV8/+OCAgQCEhIVq+fLmGDx9e6L6/Dog9e/bIZrPp4sWLOnPmjCpUqHBT6r2WN8HcDO7uuS+JvB+W1n/vhc6bN++qIMvbds+ePRoxYoQGDx6s9u3bq0KFCvr4448VFxfnsL2Pj89V/b388svq06ePPv/8c3388ceaNWuW4uPj1b59+wJr9PX1VadOndSpUycNHz5c/fv312uvvVZoYObVmsfFxaXQe8y/ZrPZFB0dra5du161rqg3GNWuXfua3oSU9/y7urraxz5Pfpeh8zuuvJpvRH7t/rYe3H64h3mbcXFxkYuLi/1eUcOGDZWYmKi0tDT7Nl9//bVsNpsaNmwoKfc36okTJyo2NlZt2rTRqFGjrvrh9c033zj8QNmzZ4/8/PzyneE1bNhQNptNe/bssS9LS0tTYmKivU8PDw/l5OQUeTwNGjTQ7t27HZZ99dVXql69+jXPLvPTsGFDeXp66vjx46pXr57Dv7wZ5u7du1WtWjUNH
jxYgYGBql+/vv3+oInGjRsrKipK//znP9WqVSutXr3aeF8XFxc1aNBAly5duuZjy/P73/9eBw4csM8481t/6NChq46/Xr16VwWLCZPn/6677lJGRobDebl///7r6itvZpjnt4+vVZ06deTh4aFvv/3WviwjI8PhTUsomwjMMi4zM1OpqalKTU1VUlKSJk6cqPT0dHXs2FGSFB4eLi8vL40ZM0YHDhxQQkKCYmNj9eCDD6pevXrKycnR6NGj1bJlSz322GOaNGmSfv75Z82dO9ehn5MnT2ry5Mk6dOiQNm7cqMWLF6tv37751lS/fn116tRJsbGx2rVrlw4cOKCRI0fK19dX4eHhkqRatWrpypUr2rZtm86cOePwDsVf69evnxISEjRnzhz99NNP+vDDD/Xmm29qwIABxTJ+vr6+6tevn6ZNm6b33ntPycnJ2r9/v5YtW6bly5fbj+fEiRP68MMPdeTIES1dulTr1q0rsu0jR45oxowZ2r17t44dO6YdO3bowIED9tD4rf3792vQoEHauHGjDh48qOTkZK1cuVKrVq3SH//4x+s+xrCwMFWuXFnPPvusdu3apSNHjmjz5s32d8kOHjzY/malxMREJSUlaePGjfY/V7pWJs9/s2bN5OPjo5kzZyo5OVn/+c9/tHTp0mvu68knn9QHH3ygFStW6PDhw1qwYMENB2b58uXVo0cPzZgxQ9u3b9fBgwf1/PPPy2azOf0KCW4uLsmWcV988YXatWsnKfeF3qBBA82ePVutW7eWlHvJdvHixZoyZYr+8pe/qFy5curUqZPGjx8vSZo/f75SUlK0du1aSbm/+cfFxSkqKkrt2rVTSEiIpNzgtdls6tmzp1xcXPToo48WGJhS7qXIKVOmaNCgQbpy5YqaN2+uRYsW2d+927x5cz322GMaMWKEzp07p+joaA0ZMuSqdgICAjR79mzNmTNHCxYsUOXKlRUVFaUnnnii2MZw+PDhqlKlit588029+OKL8vX1VZMmTeyhHBoaqv79+2vKlCm6cuWK2rZtq6FDh2rChAmFtuvt7a3Dhw9r2LBhOnv2rKpUqaLw8HA9/fTT+W5frVo11alTR/Hx8Tp27Jgsy1KNGjXUr18/RUVFXffx+fj46J133tHUqVP1zDPPKCsrS3fffbfGjh0rSWrfvr0WLFig119/XW+++abc3NxUv3599ejR47r7LOr5r1ixoqZPn67p06dr1apVatmypYYNG6bRo0dfUz/dunXTkSNHNGvWLF2+fFmhoaF66qmn9MEHH1x37ZI0ZswYZWRkaNCgQfLx8VHfvn116tQpeXp63lC7uLW5WFyYxw2KjIzU7373O8XGxjq7FMApMjMz1bFjR/Xv37/AP4FC6ccMEwCu0b59+5SUlKTAwEBdunRJb7zxhi5duqRu3bo5uzTcRAQmAFyHt956Sz/99JPc3d3VuHFjvfPOO6pevbqzy8JNxCVZAAAM8C5ZAAAMEJgAABggMAEAMFDkm37Onr0km614bnNWruyr06fTit4QxYpxdw7G3TkYd+coK+Pu6uqiu+7K//tRiwxMm80qtsDMaw8lj3F3DsbdORh35yjr484lWQAADBCYAAAYIDABADBAYAIAYIDABADAAIEJAIABAhMAAAMEJgAABghMAAAMEJgAABggMAEAMEBgAgBggMAEAMAAgQkAgAECEwAAAwQmAAAGCEwAAAwQmAAAGCAwAQAwQGACAGCAwAQAwACBCQCAAQITAAAD7s4u4EYsXfq2jhxJvq59z58/J0m6886KxVlSsapTp556937S2WUAAFTKA/PIkWQd+PGg3LyuPfRyLucGZuqF7OIuq1jk1QcAuDWU6sCUJDevivKp1+ma90tP3ixJ17VvScirDwBwa+AeJgAABghMAAAMEJgAABggMAEAMEBgAgBggMAEAMAAgQkAgAECEwAAAwQmAAAGCEwAAAwQmAAAGCAwAQAwQGACAGCAwAQAwACBCQCAAQITA
AADBCYAAAYITAAADBCYAAAYIDABADBAYAIAYIDABADAAIEJAIABAhMAAAMEJgAABghMAAAMEJgAABggMAEAMEBgAgBggMAEAMAAgQkAgAECEwAAAwQmAAAGCEwAAAwQmAAAGCAwAQAwQGACAGCAwAQAwACBCQCAAQITAAADBCYAAAYITAAADBCYAAAYIDABADBAYAIAYIDABADAAIEJAIABAhMAAAMEJgAABghMAAAMEJgAABhwL6mOtm37XBUqeKtp05Yl1SVuY9u2fS5Jatv2ASdXAqCsKLHA3Lr1M3l4uBGYKBFbt34micAEUHy4JAsAgAECEwAAAwQmAAAGCEwAAAwQmAAAGCAwAQAwQGACAGCAwAQAwACBCQCAAQITAAADBCYAAAYITAAADBCYAAAYIDABADBAYAIAYIDABADAAIEJAIABAhMAAAMEJgAABghMAAAMEJgAABggMAEAMEBgAgBggMAEAMAAgQkAgAECEwAAAwQmAAAGCEwAAAwQmAAAGCAwAQAwQGACAGCAwAQAwACBCQCAAQITAAADBCYAAAYITAAADBCYAAAYIDABADBAYAIAYIDABADAAIEJAIABAhMAAAMEJgAABghMAAAMEJgAABggMAEAMEBgAgBggMAEAMAAgQkAgAF3ZxcA3EwDB/ZVVlams8u45Xl6llNm5hVnl1Fi7ryzos6fP6eqVaspNfVEAVu5SLKKtd9atWrr2LGjatq0mf72tzGKjR2no0cPS5KqVPHTqVMn5ebmrpyc7ALb8PHxVXp6mu67r41OnTqjgwd/kLe3tzIyMuzH4+rqKpvNZm+rZ8+/qmvX8EJrS0k5rLi4iWrevJW2bftMQUEh2rNnlwICAvX993vVsOE9Sko6KG9vX2VkpF21/9q1a43H4fvv92rmzKmSpAce6Ki+fZ/WunVr9P77y6+qddSo4Tp9+qT8/Kpr6tRXJEnTp7+s/fu/lYeHh7KyslS7dl299NJU4/6vFzNMlGmEpZnbKSwl6fz5c5JUSFhKxR2WknTs2FFJ0rfffiNJ9rCUpFOnTkpSoWEpSenpuWG1Y8cXOnjwB0lSRkaGpP8dj81mc2hrxYplRda2cGG8MjIytG3bZ5KkPXt2ScoNN0lKSjr4376uDstrNW/eHPv/P//8E0nS++8vz7fW06dzx+XkyV/sy/bv/1aSlJWVJUk6ejTlhmsyQWCizDpwYL+zSwAK9MwzfUu0v40bC54BpqQc1vHjx26o/fDwwmeweb7/fq/S0y85LHvxxfEOj/NqHTVquMPymJgRmj795XzbjY2NMS31upXYJdnz58/pwoXzioubWGxtpqQky5bjVmzt3Ups2ZeVkpJcLOPl4eGmrKycYqiq9EhJSXZ2CUChMjNL9urHihXLCrwsu3BhfInV8evZZZ6UlJ8cHufVmje7zHPy5C8OM81fK4lZJjNMALjN3ejs8lr8dnZZmpTYDPPOOyuqSpXKGjFiXLG1GRc3UQePnCq29m4lru5eqlunisaM+fsNt1W16h1KTb1YDFWVHnFxE7kkCxiqWbNWiYWmj0/5UhuazDABwAk8PT1LtL+ePf9a4LqoqMElVsegQUOuWla37t0Oj/NqrVzZz2G5n191NWnSNN92a9euW0wVFozARJnl79/E2SUABZo//x8l2l9hf1ZSt2591axZ64baN/2zkoCAQPn4lHdY9uKLkx0e59U6ffqrDsunTn1Fo0aNzbdd/qwEuEEeHiX7W3xp5elZztkllKg776woSapatVohW7kUe7+1atWWJDVt2kySVLt2ffu6KlVyZ1NuboXfKfPx8ZUk3XdfG91zT2NJkre3t6T/HY+rq6tDW4XNLvNERQ2Wt7e32rbtIEkKCgqRlBtwktSw4T3/7cu3yLaK8utZ5gMPdJQk9ejRK99a82aZfn7V7cvyZpkeHh6SSmZ2KfHBBSjjFiz4h9P6vh3vHd8KStO4v/TSFGeXYFe3bn3Fxy+WJPXvP/Cm9hUQE
Kg331zqsCwsLEJhYRFXbfvbWaakAmeZNxszTAAADBCYAAAYIDABADBAYAIAYIDABADAAIEJAIABAhMAAAMEJgAABghMAAAMEJgAABggMAEAMEBgAgBggMAEAMAAgQkAgAECEwAAAwQmAAAGCEwAAAwQmAAAGCAwAQAwQGACAGCAwAQAwACBCQCAAQITAAADBCYAAAYITAAADBCYAAAYIDABADBAYAIAYIDABADAAIEJAIABAhMAAAMEJgAABghMAAAMEJgAABggMAEAMEBgAgBggMAEAMAAgQkAgAECEwAAAwQmAAAGCEwAAAwQmAAAGCAwAQAwQGACAGCAwAQAwACBCQCAAQITAAADBCYAAAYITAAADLiXVEft2nVQhQreJdUdbnPt2nVwdgkAypgSC8y2bR9Q1ap3KDX1Ykl1idtY27YPOLsEAGUMl2QBADBAYAIAYIDABADAAIEJAIABAhMAAAMEJgAABghMAAAMEJgAABggMAEAMEBgAgBggMAEAMAAgQkAgAECEwAAAwQmAAAGCEwAAAwQmAAAGCAwAQAwQGACAGCAwAQAwACBCQCAAQITAAADBCYAAAYITAAADBCYAAAYIDABADBAYAIAYIDABADAAIEJAIABAhMAAAMEJgAABghMAAAMEJgAABggMAEAMEBgAgBggMAEAMAAgQkAgAECEwAAAwQmAAAGCEwAAAwQmAAAGCAwAQAwQGACAGCAwAQAwACBCQCAAQITAAADBCYAAAYITAAADBCYAAAYIDABADBAYAIAYIDABADAgLuzC7hROZfPKT1583XtJ+m69i0JufVVcXYZAID/KtWBWadOveve9/z53EO/886KxVVOMatyQ8cHAChepTowe/d+0tklAABuE9zDBADAAIEJAIABAhMAAAMEJgAABghMAAAMEJgAABggMAEAMEBgAgBggMAEAMAAgQkAgAECEwAAAwQmAAAGCEwAAAwQmAAAGCAwAQAwQGACAGCAwAQAwACBCQCAAQITAAADBCYAAAYITAAADBCYAAAYIDABADBAYAIAYMC9qA1cXV2KtcPibg9mGHfnYNydg3F3jrIw7oUdg4tlWVYJ1gIAQKnEJVkAAAwQmAAAGCAwAQAwQGACAGCAwAQAwACBCQCAAQITAAADBCYAAAYITAAADJRYYP7000/q1auXunTpol69eunw4cMl1fVtJTQ0VF27dlVERIQiIiK0ZcsWSdKePXv08MMPq0uXLurXr59Onz7t5EpLt7i4OIWGhsrf31+JiYn25YWd57wGblxB417QeS9x7heHs2fP6umnn1aXLl0UHh6u6OhonTlzRlLh41vmxt4qIZGRkdbq1asty7Ks1atXW5GRkSXV9W2lY8eO1oEDBxyW5eTkWJ07d7YSEhIsy7Ks+Ph4KyYmxhnllRkJCQnW8ePHrxrvws5zXgM3rqBxz++8tyzO/eJy9uxZa8eOHfbHU6dOtcaOHVvo+JbFsS+RGebp06e1b98+hYWFSZLCwsK0b98++28ouLm+++47lStXTiEhIZKkxx57TBs3bnRyVaVbSEiIatSo4bCssPOc10DxyG/cC8O5XzwqVqyo1q1b2x8HBQXp+PHjhY5vWRz7Ir+tpDj8/PPPqlatmtzc3CRJbm5u8vPz088//6xKlSqVRAm3lZEjR8qyLLVo0UIjRozQzz//rJo1a9rXV6pUSTabTefOnVPFihWdWGnZUth5blkWr4Gb7LfnfYUKFTj3bwKbzaZly5YpNDS00PEti2PPm37KmH/961/68MMPtWrVKlmWpZdeesnZJQE3Hed9yZk4caJ8fHz0xBNPOLuUElcigVmjRg2dOHFCOTk5kqScnBydPHnymi6twEzemHp6eqp3797avXu3atSooePHj9u3OXPmjFxdXUvtb3m3qsLOc14DN1d+533ecs794hMXF6fk5GS9+uqrcnV1LXR8y+LYl0hgVq5cWU2aNNG6deskSevWrVOTJk24FFXM0tPTd
fHiRUmSZVnasGGDmjRponvvvVeXL1/Wrl27JEnvvvuuunbt6sxSy6TCznNeAzdPQee9JM79YvTKK6/ou+++U3x8vDw9PSUVPr5lcexL7Aukk5KSFBMTowsXLqhChQqKi4tTgwYNSqLr28aRI0c0ZMgQ5eTkyGazqWHDhnr++efl5+en3bt364UXXtCVK1dUq1YtTZ8+XVWqVHF2yaXWpEmTtGnTJp06dUp33XWXKlasqPXr1xd6nvMauHH5jfv8+fMLPO8lce4Xgx9//FFhYWGqX7++vLy8JEm1a9dWfHx8oeNb1sa+xAITAIDSjDf9AABggMAEAMAAgQkAgAECEwAAAwQmAAAGCEzgJpk/f77Gjx9/Q23MmTNHI0eOLKaK/qd79+7auXNnsbcLlGUl8lmywO3omWeecXYJBVq/fr2zSwBKHWaYQBGys7OdXQKAWwCBiVJt0aJFGjJkiMOySZMmadKkSbp48aLGjRundu3aqX379po1a5b9s1xTUlL05JNPqnXr1mrdurWee+45Xbhwwd5GaGioFi5cqPDwcAUFBSk7O1sLFy5U+/btFRwcrC5dumj79u2F1vbry6lHjx6Vv7+/PvjgA/3hD39Q69atNW/ePKNjzMzM1PDhwxUcHKw//elP+uGHH+zr/P39lZycbH8cExOjWbNmScr97M6BAwcqJCRErVq1Uu/evWWz2ezH98UXX9jrHDZsmEaPHq3g4GB1795d3377rb3NEydOaMiQIbrvvvsUGhqqt99+275u79696tGjh5o3b642bdro5ZdfliRduXJFI0eOVOvWrRUSEqI///nPOnXqlNHxArcqAhOl2sMPP6wtW7bYwy47O1vr16/XI488opiYGLm7u2vTpk1avXq1tm3bppUrV0rK/czRgQMHasuWLfr3v/+tX375RXPmzHFoe/369Vq4cKF27dqllJQU/etf/9J7772nr7/+WosXL1atWrWuud6vvvpKGzdu1JIlSxQfH6+kpKQi99m8ebO6du2qL7/8UmFhYXr22WeVlZVV5H5vvfWWqlWrpu3bt2vbtm0aMWKEXFxc8t32448/Vvfu3bVr1y6FhoZq4sSJknK/ymnQoEHy9/fX559/riVLlmjJkiXasmWLJI23DPEAAAUiSURBVGny5Ml68skntXv3bn300Ud66KGHJEkffPCB0tLS9Omnn2rnzp2aMGGC/SPVgNKKwESp5ufnp5CQEPsX027ZskV33XWXqlevrs8++0zjxo2Tj4+PKleurL59+9rv3dWrV09t27aVp6enKlWqpKeeekoJCQkObUdGRqpGjRry8vKSm5ubMjMzlZSUpKysLNWuXVt169a95nqjo6Pl5eWlxo0bq3Hjxg6zxYIEBASoa9eu8vDw0FNPPaXMzEx98803Re7n7u6u1NRUHT9+XB4eHgoJCSkwMFu0aKEOHTrIzc1NERER9rq+/fZbnTlzRtHR0fL09FSdOnXUs2dPbdiwwd5HSkqKzpw5o/LlyysoKMi+/Ny5c0pOTpabm5vuvfde+fr6mg4TcEviTT8o9f70pz9p2bJl6tmzpz788ENFRETo+PHjys7OVrt27ezb2Ww2+9dAnTp1SpMnT9auXbt06dIlWZalChUqOLT766/eqlevnsaNG6c5c+bo4MGDateunWJiYlStWrVrqvXXHzzt7e2t9PT0IvepXr26/f+urq6qVq2aTp48WeR+/fv319y5c9WvXz9JUq9evRQVFVVkXV5eXrpy5Yqys7N17NgxnTx5UiEhIfb1OTk59seTJ0/Wa6+9poceeki1a9dWdHS0OnbsqIiICP3yyy8aMWKELly4oIcfflh/+9vf5OHhUWTdwK2KwESp17lzZ7344otKTEzUp59+qlGjRsnd3V2enp7asWOH3N2vPs1feeUVubi4aO3atapYsaL+7//+76ovHf7tbCw8PFzh4eFKS0tTbGysZsyYoenTp9/UY5OkX375xf5/m82mEydO2L+Jw
9vbWxkZGfb1qamp9hD39fVVTEyMYmJilJiYqD59+qhp06a6//77jfuuUaOGateurU2bNuW7vn79+nrllVdks9m0adMmDR06VDt37pSPj4+io6MVHR2to0ePKioqSnfffbf+8pe/XM8QALcELsmi1CtXrpy6dOmi5557Tk2bNlXNmjXl5+entm3baurUqUpLS5PNZlNKSoq+/PJLSdKlS5fk4+OjO+64QydOnNCiRYsK7ePQoUPavn27MjMz5enpqXLlysnVtWRePt9//702bdqk7OxsLVmyRJ6enmrWrJkkqXHjxlq3bp1ycnL0+eefO1xW/uSTT5ScnCzLsnTHHXfIzc2twEuyBQkMDFT58uW1cOFCXb58WTk5OUpMTNTevXslSWvWrLF/MXDeDN3V1VU7duzQgQMHlJOTI19fX7m7u5fYeAE3C2cwyoRHHnlEiYmJioiIsC+bNm2asrKy1K1bN7Vs2VJDhw5VamqqpNx7ifv27VNISIiioqL04IMPFtp+ZmamZs6cqdatW6tdu3Y6c+aMRowYcVOPKU+nTp20YcMGtWzZUmvWrNGcOXPslzbHjx+vTz75RCEhIVq7dq06d+5s3y85OVlPPfWUgoOD1atXL/31r3/Vfffdd019u7m5af78+frhhx/UqVMn3XfffXr++eeVlpYmKfeecffu3RUcHKzJkydr1qxZ8vLy0qlTpzR06FC1aNFC3bp1U6tWrRyeG6A04vswUSYcP35cDz30kLZt28abSwDcFMwwUerZbDa99dZb6tatG2EJ4KbhTT8o1dLT09W2bVvVrFmzyPuQN8OAAQP01VdfXbV84MCBxh+NVxxtALj5uCQLAIABLskCAGCAwAQAwACBCQCAAQITAAADBCYAAAYITAAADPw/7kHvrVMd9AkAAAAASUVORK5CYII=\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "* We find that most businesses in our records are less than 50 years old.\n", "* There are a steady supply of businesses greater than 75 years old, about the same amount for each year. It only starts to trail off after around 175 years." 
], "metadata": { "id": "_ceJO-gBCl1z" }, "id": "_ceJO-gBCl1z" }, { "cell_type": "code", "source": [ "plt.figure(figsize=(16,5))\n", "plt.suptitle('Feature: no_of_employees',fontsize=18)\n", "\n", "plt.subplot(1,2,1)\n", "plt.title('Boxplot (with fliers)',fontsize=14)\n", "sns.boxplot(data=visa,x='no_of_employees')\n", "\n", "plt.subplot(1,2,2)\n", "plt.title('Boxplot (fliers removed)',fontsize=14)\n", "sns.boxplot(data=visa,x='no_of_employees',showfliers=False)\n", "\n", "plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 375 }, "id": "q9XjYFQVigdt", "outputId": "fc605a9c-0c5e-4feb-b32f-69cec4a74467" }, "id": "q9XjYFQVigdt", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "With so few records removed, there is little difference in the plots for number of employees. Put another way, the comments on this feature from the first EDA are still applicable. The middle 50% of observations is between about 1000 and 3500. There are also tiny companies, with not more than several employees, and massive companies, with over half a million employees, in our data set." ], "metadata": { "id": "QwZR8EcYDvtv" }, "id": "QwZR8EcYDvtv" }, { "cell_type": "markdown", "id": "domestic-iceland", "metadata": { "id": "domestic-iceland" }, "source": [ "## Building bagging and boosting models" ] }, { "cell_type": "markdown", "source": [ "When scoring our models, we must find a good balance between false positives and false negatives.\n", "* A false positive is a certified visa that should have been denied. We want to reduce false positives because these workers may be ineffective and not benefit the company.\n", "* On the other hand, a false negative constitutes a denied visa that should have been certified. This deprives the workforce of an asset, someone who could benefit US companies.\n", "\n", "With this in mind, we will score our models on F1, as it balances recall and precision, reducing both false positives and false negatives." 
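, "\n",
"\n",
"For reference, F1 is the harmonic mean of precision and recall. It can also be written with the counts of true positives (TP), false positives (FP), and false negatives (FN), where a positive is a certified visa:\n",
"\n",
"$$F_1 = \\frac{2 \\cdot \\text{precision} \\cdot \\text{recall}}{\\text{precision} + \\text{recall}} = \\frac{2\\,TP}{2\\,TP + FP + FN}$$\n",
"\n",
"Both error types sit in the denominator with equal weight, so the score drops whenever either one grows."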
], "metadata": { "id": "dfMMD01bMkKI" }, "id": "dfMMD01bMkKI" }, { "cell_type": "markdown", "source": [ "### Functions" ], "metadata": { "id": "9V9RR10TmfdZ" }, "id": "9V9RR10TmfdZ" }, { "cell_type": "code", "source": [
"def scores(model):\n",
"    '''Return a DataFrame of training and testing\n",
"    metrics for the specified model.'''\n",
"\n",
"    score_idx=['Accuracy','Precision','Recall','F1']\n",
"\n",
"    # metrics on the training data\n",
"    X_train_pred=model.predict(X_train)\n",
"    score_list_train=[metrics.accuracy_score(y_train,X_train_pred),\n",
"                      metrics.precision_score(y_train,X_train_pred),\n",
"                      metrics.recall_score(y_train,X_train_pred),\n",
"                      metrics.f1_score(y_train,X_train_pred)]\n",
"\n",
"    # metrics on the testing data\n",
"    X_test_pred=model.predict(X_test)\n",
"    score_list_test=[metrics.accuracy_score(y_test,X_test_pred),\n",
"                     metrics.precision_score(y_test,X_test_pred),\n",
"                     metrics.recall_score(y_test,X_test_pred),\n",
"                     metrics.f1_score(y_test,X_test_pred)]\n",
"\n",
"    df=pd.DataFrame(index=score_idx)\n",
"    df['Train']=score_list_train\n",
"    df['Test']=score_list_test\n",
"    return df"
], "metadata": { "id": "H92Kf99dmey8" }, "id": "H92Kf99dmey8", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "This function returns a DataFrame of accuracy, precision, recall, and F1 for a model on the training data and the testing data." ], "metadata": { "id": "5E894mMhyiGz" }, "id": "5E894mMhyiGz" }, { "cell_type": "code", "source": [
"# needed by tabulate; harmless if already imported earlier\n",
"from statsmodels.stats.proportion import proportions_ztest\n",
"\n",
"model_comp_table=pd.DataFrame(columns=['Train Acc','Test Acc','Train F1','Test F1','Status'])\n",
"y_train_test_lens=[y_train.shape[0],y_test.shape[0]]\n",
"\n",
"def tabulate(model,name):\n",
"    '''Compute train/test accuracy and\n",
"    F1 for a given model. Add to table.'''\n",
"\n",
"    # run predictions with model\n",
"    X_train_pred=model.predict(X_train)\n",
"    X_test_pred=model.predict(X_test)\n",
"\n",
"    # run overfitting test: count correct predictions\n",
"    # (xor is False where the 0/1 labels match), then compare proportions\n",
"    train_acc_count=np.logical_not(np.logical_xor(y_train,X_train_pred)).sum()\n",
"    test_acc_count=np.logical_not(np.logical_xor(y_test,X_test_pred)).sum()\n",
"    t,p_val=proportions_ztest([train_acc_count,test_acc_count],y_train_test_lens)\n",
"\n",
"    # assign rating based on p-value\n",
"    rating=str()\n",
"    if p_val<0.05:\n",
"        rating='Overfit'\n",
"    else:\n",
"        rating='General'\n",
"\n",
"    # collect data for new table row\n",
"    model_comp_table.loc[name]=[metrics.accuracy_score(y_train,X_train_pred),\n",
"                                metrics.accuracy_score(y_test,X_test_pred),\n",
"                                metrics.f1_score(y_train,X_train_pred),\n",
"                                metrics.f1_score(y_test,X_test_pred),\n",
"                                rating]\n",
"\n",
"    return model_comp_table"
], "metadata": { "id": "w1AaK_l2MpHa" }, "id": "w1AaK_l2MpHa", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "This function adds accuracy and F1 scores to a DataFrame that we can use to compare models. It also runs a two-proportion z-test on the counts of correct predictions to check whether training accuracy and testing accuracy differ by more than chance would explain. At a 5% level of significance, a p-value below 0.05 marks the model as overfit. The test relies on a large-sample normal approximation, which we assume is adequate for both splits."
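, "\n",
"\n",
"As a sketch of what the test computes, assuming the pooled-variance form of the two-proportion z-test (the symbols are introduced here for illustration): let $x_{tr}$ and $x_{te}$ be the counts of correct predictions, and $n_{tr}$ and $n_{te}$ the sizes of the training and testing sets.\n",
"\n",
"$$\\hat p_{tr} = \\frac{x_{tr}}{n_{tr}}, \\quad \\hat p_{te} = \\frac{x_{te}}{n_{te}}, \\quad \\hat p = \\frac{x_{tr} + x_{te}}{n_{tr} + n_{te}}$$\n",
"\n",
"$$z = \\frac{\\hat p_{tr} - \\hat p_{te}}{\\sqrt{\\hat p\\,(1 - \\hat p)\\left(\\frac{1}{n_{tr}} + \\frac{1}{n_{te}}\\right)}}$$\n",
"\n",
"The two-sided p-value comes from comparing $z$ to a standard normal distribution. A p-value below 0.05 is labeled Overfit; otherwise the model is labeled General."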
], "metadata": { "id": "z-SFO41TyoAE" }, "id": "z-SFO41TyoAE" }, { "cell_type": "code", "source": [
"def confusion_heatmap(model,show_scores=False):\n",
"    '''Heatmap of confusion matrix of\n",
"    model predictions on test data.'''\n",
"\n",
"    actual=y_test\n",
"    predicted=model.predict(X_test)\n",
"    # generate confusion matrix\n",
"    # sklearn returns [[TN,FP],[FN,TP]] with actual classes on the rows;\n",
"    # flipping and transposing gives [[TP,FP],[FN,TN]] with predicted classes on the rows\n",
"    cm=metrics.confusion_matrix(actual,predicted)\n",
"    cm=np.flip(cm).T\n",
"\n",
"    # heatmap labels\n",
"    labels=['TP','FP','FN','TN']\n",
"    cm_labels=np.array(cm).flatten()\n",
"    cm_percents=np.round((cm_labels/np.sum(cm))*100,3)\n",
"    annot_labels=[]\n",
"    for i in range(4):\n",
"        annot_labels.append(str(labels[i])+'\\nCount:'+str(cm_labels[i])+'\\n'+str(cm_percents[i])+'%')\n",
"    annot_labels=np.array(annot_labels).reshape(2,2)\n",
"\n",
"    # print figure\n",
"    plt.figure(figsize=(8,5))\n",
"    plt.title('Confusion Matrix',fontsize=20)\n",
"    sns.heatmap(data=cm,\n",
"                annot=annot_labels,\n",
"                annot_kws={'fontsize':'x-large'},\n",
"                xticklabels=[1,0],\n",
"                yticklabels=[1,0],\n",
"                cmap='Greens',\n",
"                fmt='s')\n",
"    plt.xlabel('Actual',fontsize=14)\n",
"    plt.ylabel('Predicted',fontsize=14)\n",
"    plt.tight_layout();\n",
"    return"
], "metadata": { "id": "2EmydL_wKjLV" }, "id": "2EmydL_wKjLV", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "This function plots a heatmap of the confusion matrix for a given model's predictions on the test data, with predicted classes on the rows and actual classes on the columns."
], "metadata": { "id": "NyHKiMj7yuEU" }, "id": "NyHKiMj7yuEU" }, { "cell_type": "markdown", "source": [ "### Bagging" ], "metadata": { "id": "rq-dyuyXmnsG" }, "id": "rq-dyuyXmnsG" }, { "cell_type": "markdown", "source": [ "#### Baseline: Decision Tree" ], "metadata": { "id": "ZxNLV1DPMKB6" }, "id": "ZxNLV1DPMKB6" }, { "cell_type": "code", "source": [ "dtree=tree.DecisionTreeClassifier(random_state=1)\n", "dtree.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "GRaA0HEoMO31", "outputId": "a2028ded-ecac-4181-f7a9-e94003b8825e" }, "id": "GRaA0HEoMO31", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "DecisionTreeClassifier(random_state=1)" ], "text/html": [ "
DecisionTreeClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ] }, "metadata": {}, "execution_count": 68 } ] }, { "cell_type": "markdown", "source": [ "As a baseline comparison, we train a single decision tree." ], "metadata": { "id": "t4-JxFmwowRs" }, "id": "t4-JxFmwowRs" }, { "cell_type": "code", "source": [ "scores(dtree)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "oqeHvfkfMfuN", "outputId": "875c6c2b-7d4b-49d2-96c2-c97c5591b89d" }, "id": "oqeHvfkfMfuN", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 1.0 0.649902\n", "Precision 1.0 0.741256\n", "Recall 1.0 0.731229\n", "F1 1.0 0.736208" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TrainTest
Accuracy1.00.649902
Precision1.00.741256
Recall1.00.731229
F11.00.736208
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 69 } ] }, { "cell_type": "markdown", "source": [ "Predictably, this model is comically overfit." ], "metadata": { "id": "KyPG350poz1_" }, "id": "KyPG350poz1_" }, { "cell_type": "code", "source": [ "tabulate(dtree,'dTree (baseline)')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 81 }, "id": "J0gMS3utNk7J", "outputId": "d61fa348-ecc0-4f41-9d71-42f751ef746b" }, "id": "J0gMS3utNk7J", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.0 0.649902 1.0 0.736208 Overfit" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Train AccTest AccTrain F1Test F1Status
dTree (baseline)1.00.6499021.00.736208Overfit
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 70 } ] }, { "cell_type": "markdown", "source": [ "We add this model to a DataFrame that we will use to compare the different models." ], "metadata": { "id": "RFRVW1HapqYG" }, "id": "RFRVW1HapqYG" }, { "cell_type": "markdown", "source": [ "#### Bagging Classifier" ], "metadata": { "id": "7kEKiBjBmzTl" }, "id": "7kEKiBjBmzTl" }, { "cell_type": "code", "execution_count": null, "id": "unknown-institution", "metadata": { "id": "unknown-institution", "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "outputId": "0d948351-e024-49ca-bdee-7e2c75807cec" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "BaggingClassifier(random_state=1)" ], "text/html": [ "
BaggingClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ] }, "metadata": {}, "execution_count": 71 } ], "source": [ "bag=BaggingClassifier(random_state=1)\n", "bag.fit(X_train,y_train)" ] }, { "cell_type": "markdown", "source": [ "The default estimator on a bagging classifier is a decision tree classifier, so this model consists of many parallel decision trees trained on samples taken with replacement (bootstrap samples)." ], "metadata": { "id": "AJyhm9Nmp8x3" }, "id": "AJyhm9Nmp8x3" }, { "cell_type": "code", "source": [ "scores(bag)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "z9-gRQ_gm-Fd", "outputId": "cb265cfb-5771-4cf5-ac77-e8a5b67d0d20" }, "id": "z9-gRQ_gm-Fd", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.984168 0.698232\n", "Precision 0.990625 0.770660\n", "Recall 0.985630 0.780631\n", "F1 0.988121 0.775614" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TrainTest
Accuracy0.9841680.698232
Precision0.9906250.770660
Recall0.9856300.780631
F10.9881210.775614
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 72 } ] }, { "cell_type": "code", "source": [ "tabulate(bag,'Bagging Classifier')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "OMnQipoMN-kE", "outputId": "5cf498ee-c005-466e-8d21-93615ba7125f" }, "id": "OMnQipoMN-kE", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Train AccTest AccTrain F1Test F1Status
dTree (baseline)1.0000000.6499021.0000000.736208Overfit
Bagging Classifier0.9841680.6982320.9881210.775614Overfit
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 73 } ] }, { "cell_type": "markdown", "source": [ "This model is also overfit. Its test performance is nonetheless slightly better than that of the single decision tree." ], "metadata": { "id": "clAMCwDfqa7x" }, "id": "clAMCwDfqa7x" }, { "cell_type": "markdown", "source": [ "#### Random Forest Classifier" ], "metadata": { "id": "ijwUZPIonA_E" }, "id": "ijwUZPIonA_E" }, { "cell_type": "code", "source": [ "rf=RandomForestClassifier(random_state=1)\n", "rf.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "EADQ2fMgm-Cw", "outputId": "aa89cb64-e0e3-4c2b-a6dd-f581752545c1" }, "id": "EADQ2fMgm-Cw", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "RandomForestClassifier(random_state=1)" ], "text/html": [ "
RandomForestClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ] }, "metadata": {}, "execution_count": 74 } ] }, { "cell_type": "code", "source": [ "scores(rf)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "wIZL3NH5m9_0", "outputId": "6cc771d2-2af2-4b79-fdd6-62dad7e16e1e" }, "id": "wIZL3NH5m9_0", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 1.0 0.725082\n", "Precision 1.0 0.771920\n", "Recall 1.0 0.835326\n", "F1 1.0 0.802373" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TrainTest
Accuracy1.00.725082
Precision1.00.771920
Recall1.00.835326
F11.00.802373
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 75 } ] }, { "cell_type": "markdown", "source": [ "As with the others, the random forest classifier overfits on the training data. Test performance is better than before, though this model is not generalized." ], "metadata": { "id": "EOoBbE-jzVSR" }, "id": "EOoBbE-jzVSR" }, { "cell_type": "code", "source": [ "tabulate(rf,'Random Forest')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "MKve2C7POJhY", "outputId": "0c4a7864-811f-4234-f173-e8caaa21aef8" }, "id": "MKve2C7POJhY", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Train AccTest AccTrain F1Test F1Status
dTree (baseline)1.0000000.6499021.0000000.736208Overfit
Bagging Classifier0.9841680.6982320.9881210.775614Overfit
Random Forest1.0000000.7250821.0000000.802373Overfit
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 76 } ] }, { "cell_type": "markdown", "source": [ "### Boosting" ], "metadata": { "id": "a4dgTVuqn6xH" }, "id": "a4dgTVuqn6xH" }, { "cell_type": "markdown", "source": [ "#### AdaBoost" ], "metadata": { "id": "K3q-FvDXn4R_" }, "id": "K3q-FvDXn4R_" }, { "cell_type": "code", "source": [ "abc=AdaBoostClassifier(random_state=1)\n", "abc.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "F8XdF1ODm98q", "outputId": "26bac9d2-65ba-47d5-f7d2-0d30d8494333" }, "id": "F8XdF1ODm98q", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "AdaBoostClassifier(random_state=1)" ], "text/html": [ "
AdaBoostClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ] }, "metadata": {}, "execution_count": 77 } ] }, { "cell_type": "code", "source": [ "scores(abc)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "dSgxaywDm958", "outputId": "4afda85b-2411-42d3-f950-40a64f81a63f" }, "id": "dSgxaywDm958", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.734786 0.733857\n", "Precision 0.756469 0.755537\n", "Recall 0.889328 0.889433\n", "F1 0.817536 0.817036" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TrainTest
Accuracy0.7347860.733857
Precision0.7564690.755537
Recall0.8893280.889433
F10.8175360.817036
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 78 } ] }, { "cell_type": "markdown", "source": [ "The basic AdaBoost Classifier has good performance and does not suffer from overfitting. Accuracy on training and testing is around 73% and F1 is about 82%. This is a much more promising baseline." ], "metadata": { "id": "eXyKTQ9G0DRq" }, "id": "eXyKTQ9G0DRq" }, { "cell_type": "code", "source": [ "tabulate(abc,'AdaBoost')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "lZkWK7QfOOUU", "outputId": "fb443ed2-73c0-425f-c261-658dd10f0d32" }, "id": "lZkWK7QfOOUU", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit\n", "AdaBoost 0.734786 0.733857 0.817536 0.817036 General" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Train AccTest AccTrain F1Test F1Status
dTree (baseline)1.0000000.6499021.0000000.736208Overfit
Bagging Classifier0.9841680.6982320.9881210.775614Overfit
Random Forest1.0000000.7250821.0000000.802373Overfit
AdaBoost0.7347860.7338570.8175360.817036General
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 79 } ] }, { "cell_type": "markdown", "source": [ "#### Gradient Boosting" ], "metadata": { "id": "jtmIXFVioEce" }, "id": "jtmIXFVioEce" }, { "cell_type": "code", "source": [ "gbc=GradientBoostingClassifier(random_state=1)\n", "gbc.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "6TnlfLAaoIz8", "outputId": "d7c03baa-0172-44b5-dcf1-9ea0a8c28721" }, "id": "6TnlfLAaoIz8", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "GradientBoostingClassifier(random_state=1)" ], "text/html": [ "
GradientBoostingClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ] }, "metadata": {}, "execution_count": 80 } ] }, { "cell_type": "code", "source": [ "scores(gbc)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "73U_JdKdoIxT", "outputId": "49230ecd-1c7d-4dc3-ba1d-fb4a3c7c4043" }, "id": "73U_JdKdoIxT", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.754997 0.749312\n", "Precision 0.782289 0.780200\n", "Recall 0.877479 0.869829\n", "F1 0.827155 0.822581" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TrainTest
Accuracy0.7549970.749312
Precision0.7822890.780200
Recall0.8774790.869829
F10.8271550.822581
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 81 } ] }, { "cell_type": "markdown", "source": [ "Gradient Boosting also does not appear overfit, with the added bonus of slightly better performance over AdaBoost.\n", "* Train and test accuracy is around 75%.\n", "* Train and test F1 is around 82%." ], "metadata": { "id": "ot4jmucd0aL4" }, "id": "ot4jmucd0aL4" }, { "cell_type": "code", "source": [ "tabulate(gbc,'Gradient Boosting')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "TOwnGrNNOS23", "outputId": "1f49e7bb-ae14-4f9d-a342-2e9539d0e482" }, "id": "TOwnGrNNOS23", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit\n", "AdaBoost 0.734786 0.733857 0.817536 0.817036 General\n", "Gradient Boosting 0.754997 0.749312 0.827155 0.822581 General" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Train AccTest AccTrain F1Test F1Status
dTree (baseline)1.0000000.6499021.0000000.736208Overfit
Bagging Classifier0.9841680.6982320.9881210.775614Overfit
Random Forest1.0000000.7250821.0000000.802373Overfit
AdaBoost0.7347860.7338570.8175360.817036General
Gradient Boosting0.7549970.7493120.8271550.822581General
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 82 } ] }, { "cell_type": "markdown", "source": [ "#### XGBoost" ], "metadata": { "id": "NOVeGYmToWe1" }, "id": "NOVeGYmToWe1" }, { "cell_type": "code", "source": [ "xgbc=XGBClassifier(random_state=1)\n", "xgbc.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 75 }, "id": "jLxVwUOWoIuA", "outputId": "8984d247-6e24-459c-ea40-2f993ce433f4" }, "id": "jLxVwUOWoIuA", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "XGBClassifier(random_state=1)" ], "text/html": [ "
XGBClassifier(random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ] }, "metadata": {}, "execution_count": 83 } ] }, { "cell_type": "code", "source": [ "scores(xgbc)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "HtXSPfq-ofCD", "outputId": "f990e132-ed89-4293-d556-5b9649b7db9a" }, "id": "HtXSPfq-ofCD", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.753762 0.750098\n", "Precision 0.780624 0.779646\n", "Recall 0.878235 0.872574\n", "F1 0.826558 0.823497" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TrainTest
Accuracy0.7537620.750098
Precision0.7806240.779646
Recall0.8782350.872574
F10.8265580.823497
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 84 } ] }, { "cell_type": "markdown", "source": [ "XGBoost offers near identical performance to the gradient boosting above." ], "metadata": { "id": "NY0R6k53mzP8" }, "id": "NY0R6k53mzP8" }, { "cell_type": "code", "source": [ "tabulate(xgbc,'XGBoost')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 238 }, "id": "-FG0Bp1bOYSP", "outputId": "7f60bb93-d39b-4226-99ed-867904455ed9" }, "id": "-FG0Bp1bOYSP", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit\n", "AdaBoost 0.734786 0.733857 0.817536 0.817036 General\n", "Gradient Boosting 0.754997 0.749312 0.827155 0.822581 General\n", "XGBoost 0.753762 0.750098 0.826558 0.823497 General" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Train AccTest AccTrain F1Test F1Status
dTree (baseline)1.0000000.6499021.0000000.736208Overfit
Bagging Classifier0.9841680.6982320.9881210.775614Overfit
Random Forest1.0000000.7250821.0000000.802373Overfit
AdaBoost0.7347860.7338570.8175360.817036General
Gradient Boosting0.7549970.7493120.8271550.822581General
XGBoost0.7537620.7500980.8265580.823497General
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 85 } ] }, { "cell_type": "markdown", "id": "prime-athletics", "metadata": { "id": "prime-athletics" }, "source": [ "## Will tuning the hyperparameters improve the model performance?" ] }, { "cell_type": "markdown", "source": [ "### Bagging" ], "metadata": { "id": "7wYCv0lbFh3g" }, "id": "7wYCv0lbFh3g" }, { "cell_type": "markdown", "source": [ "#### Tuned Decision Tree" ], "metadata": { "id": "Yhhn7CVBFslx" }, "id": "Yhhn7CVBFslx" }, { "cell_type": "code", "execution_count": null, "id": "banned-difficulty", "metadata": { "id": "banned-difficulty" }, "outputs": [], "source": [ "dtree_tuned=tree.DecisionTreeClassifier(random_state=1)\n", "\n", "params={'max_depth':np.arange(3,10),\n", " 'min_samples_leaf':np.arange(5,10),\n", " 'max_features':[None,'sqrt'],\n", " 'max_leaf_nodes':np.arange(5,25,5),\n", " 'min_impurity_decrease':[0.0,0.0005,0.001],\n", " 'class_weight':[None,'balanced']}" ] }, { "cell_type": "markdown", "source": [ "* We will test values of tree depth from 3 to 9.\n", "* We will consider limiting the maximum number of features to consider when calculating the best split.\n", "* We will try balancing the class weights, though I don't expect this will improve performance." 
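, "\n",
"\n",
"The size of the grid is the product of the option counts, taken in the order the parameters are listed above (depth, leaf samples, features, leaf nodes, impurity decrease, class weight). Each candidate is then fit once per fold, here assuming 5-fold cross-validation:\n",
"\n",
"$$7 \\times 5 \\times 2 \\times 4 \\times 3 \\times 2 = 1680 \\text{ candidates}, \\qquad 1680 \\times 5 = 8400 \\text{ fits}$$"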
], "metadata": { "id": "iKD6FUP5kPxs" }, "id": "iKD6FUP5kPxs" }, { "cell_type": "code", "source": [ "go=GridSearchCV(estimator=dtree_tuned,\n", " param_grid=params,\n", " scoring='f1',\n", " n_jobs=-1,\n", " cv=5,\n", " verbose=1)\n", "go.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 135 }, "id": "UOf6QwdoGDqv", "outputId": "b54a3b37-2395-46be-f2da-2c7b98a7d94e" }, "id": "UOf6QwdoGDqv", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Fitting 5 folds for each of 1680 candidates, totalling 8400 fits\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=1), n_jobs=-1,\n", " param_grid={'class_weight': [None, 'balanced'],\n", " 'max_depth': array([3, 4, 5, 6, 7, 8, 9]),\n", " 'max_features': [None, 'sqrt'],\n", " 'max_leaf_nodes': array([ 5, 10, 15, 20]),\n", " 'min_impurity_decrease': [0.0, 0.0005, 0.001],\n", " 'min_samples_leaf': array([5, 6, 7, 8, 9])},\n", " scoring='f1', verbose=1)" ], "text/html": [ "
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=1), n_jobs=-1,\n",
              "             param_grid={'class_weight': [None, 'balanced'],\n",
              "                         'max_depth': array([3, 4, 5, 6, 7, 8, 9]),\n",
              "                         'max_features': [None, 'sqrt'],\n",
              "                         'max_leaf_nodes': array([ 5, 10, 15, 20]),\n",
              "                         'min_impurity_decrease': [0.0, 0.0005, 0.001],\n",
              "                         'min_samples_leaf': array([5, 6, 7, 8, 9])},\n",
              "             scoring='f1', verbose=1)
" ] }, "metadata": {}, "execution_count": 87 } ] }, { "cell_type": "code", "source": [ "go.best_params_" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nwa8W4TNH-Im", "outputId": "2aa1119a-6fcf-499d-a047-5b94eaa3a322" }, "id": "nwa8W4TNH-Im", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'class_weight': None,\n", " 'max_depth': 7,\n", " 'max_features': None,\n", " 'max_leaf_nodes': 20,\n", " 'min_impurity_decrease': 0.0,\n", " 'min_samples_leaf': 5}" ] }, "metadata": {}, "execution_count": 88 } ] }, { "cell_type": "markdown", "source": [ "Grid Search Findings\n", "* Our response variable is not so unbalanced that a rebalancing of class weights is required.\n", "* A maximum tree depth of 7 seems to be ideal, with 20 as the upper bound for leaf nodes.\n", "* The model requires at least 5 samples for each leaf. Along with limiting the depth and maximum number of leaves, this reduces overfitting.\n", "* We get best performance when considering all features to determine the best split." ], "metadata": { "id": "JsZXlrv32-c8" }, "id": "JsZXlrv32-c8" }, { "cell_type": "code", "source": [ "dtree_tuned=go.best_estimator_\n", "\n", "dtree_tuned.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 92 }, "id": "7PBKlY9g0baz", "outputId": "81437f1e-157b-45b4-ea76-599ec7b46217" }, "id": "7PBKlY9g0baz", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "DecisionTreeClassifier(max_depth=7, max_leaf_nodes=20, min_samples_leaf=5,\n", " random_state=1)" ], "text/html": [ "
DecisionTreeClassifier(max_depth=7, max_leaf_nodes=20, min_samples_leaf=5,\n",
              "                       random_state=1)
" ] }, "metadata": {}, "execution_count": 89 } ] }, { "cell_type": "markdown", "source": [ "The explicit `fit` call above is redundant but harmless: with the default `refit=True`, `GridSearchCV` already refits the best parameter combination on the full training set once the search finishes, and exposes that model as `best_estimator_`. (During the search itself, each candidate is trained on four of the five folds and scored on the held-out fold, which makes the comparison between candidates more reliable.)" ], "metadata": { "id": "-t8REJ-elCjQ" }, "id": "-t8REJ-elCjQ" }, { "cell_type": "code", "source": [ "scores(dtree_tuned)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "SAYQ5m16II0W", "outputId": "a2b59de8-892a-48ca-bebc-f775c6a313f6" }, "id": "SAYQ5m16II0W", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.749382 0.750098\n", "Precision 0.777421 0.779646\n", "Recall 0.875546 0.872574\n", "F1 0.823571 0.823497" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
              Train      Test
Accuracy   0.749382  0.750098
Precision  0.777421  0.779646
Recall     0.875546  0.872574
F1         0.823571  0.823497
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 90 } ] }, { "cell_type": "code", "source": [ "tabulate(dtree_tuned,'dTree (tuned)')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 269 }, "id": "8cXyWbtO0Zrb", "outputId": "f0cd5279-e127-4537-e796-3be5c600fa3b" }, "id": "8cXyWbtO0Zrb", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit\n", "AdaBoost 0.734786 0.733857 0.817536 0.817036 General\n", "Gradient Boosting 0.754997 0.749312 0.827155 0.822581 General\n", "XGBoost 0.753762 0.750098 0.826558 0.823497 General\n", "dTree (tuned) 0.749382 0.750098 0.823571 0.823497 General" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                    Train Acc  Test Acc  Train F1   Test F1   Status
dTree (baseline)     1.000000  0.649902  1.000000  0.736208  Overfit
Bagging Classifier   0.984168  0.698232  0.988121  0.775614  Overfit
Random Forest        1.000000  0.725082  1.000000  0.802373  Overfit
AdaBoost             0.734786  0.733857  0.817536  0.817036  General
Gradient Boosting    0.754997  0.749312  0.827155  0.822581  General
XGBoost              0.753762  0.750098  0.826558  0.823497  General
dTree (tuned)        0.749382  0.750098  0.823571  0.823497  General
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 91 } ] }, { "cell_type": "markdown", "source": [ "With around 75% accuracy, the tuned decision tree has similar performance to untuned boosting models. F1 score is around 82%. Importantly, tuning this tree has eliminated overfitting." ], "metadata": { "id": "srhmxkXM3qjx" }, "id": "srhmxkXM3qjx" }, { "cell_type": "markdown", "source": [ "#### Tuned Bagging Classifier" ], "metadata": { "id": "BfZwNqvgsK4I" }, "id": "BfZwNqvgsK4I" }, { "cell_type": "code", "source": [ "params={'base_estimator':[tree.DecisionTreeClassifier(random_state=2)],\n", " 'base_estimator__max_depth':[None,1,2,3,4,5],\n", " 'n_estimators':[10,20,30,40,50],\n", " 'max_samples':[0.7,0.8,0.9,1.0],\n", " 'max_features':[0.7,0.8,0.9,1.0],}" ], "metadata": { "id": "kvtyHI7fszix" }, "id": "kvtyHI7fszix", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "* We will test different depths for the base estimator (Decision Tree Classifier).\n", "* We will consider between 10 and 50 estimators.\n", "* Max samples and max features will also be adjusted, testing different fractions to include." 
], "metadata": { "id": "7P6GifmERS4w" }, "id": "7P6GifmERS4w" }, { "cell_type": "code", "source": [ "bag_tuned=BaggingClassifier(random_state=1)\n", "\n", "go=GridSearchCV(estimator=bag_tuned,\n", " param_grid=params,\n", " scoring='f1',\n", " n_jobs=-1,\n", " cv=5,\n", " verbose=1)\n", "\n", "go.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 135 }, "id": "AoI3vy0gtn4V", "outputId": "26ecdf7f-c169-4e73-cbe4-a30c6dc6c42d" }, "id": "AoI3vy0gtn4V", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Fitting 5 folds for each of 480 candidates, totalling 2400 fits\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "GridSearchCV(cv=5, estimator=BaggingClassifier(random_state=1), n_jobs=-1,\n", " param_grid={'base_estimator': [DecisionTreeClassifier(max_depth=5,\n", " random_state=2)],\n", " 'base_estimator__max_depth': [None, 1, 2, 3, 4, 5],\n", " 'max_features': [0.7, 0.8, 0.9, 1.0],\n", " 'max_samples': [0.7, 0.8, 0.9, 1.0],\n", " 'n_estimators': [10, 20, 30, 40, 50]},\n", " scoring='f1', verbose=1)" ], "text/html": [ "
GridSearchCV(cv=5, estimator=BaggingClassifier(random_state=1), n_jobs=-1,\n",
              "             param_grid={'base_estimator': [DecisionTreeClassifier(max_depth=5,\n",
              "                                                                   random_state=2)],\n",
              "                         'base_estimator__max_depth': [None, 1, 2, 3, 4, 5],\n",
              "                         'max_features': [0.7, 0.8, 0.9, 1.0],\n",
              "                         'max_samples': [0.7, 0.8, 0.9, 1.0],\n",
              "                         'n_estimators': [10, 20, 30, 40, 50]},\n",
              "             scoring='f1', verbose=1)
" ] }, "metadata": {}, "execution_count": 93 } ] }, { "cell_type": "code", "source": [ "go.best_params_" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ZnXGPydTvMfu", "outputId": "607a03bd-d7df-44f0-e266-3fc4ffb919c1" }, "id": "ZnXGPydTvMfu", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'base_estimator': DecisionTreeClassifier(max_depth=5, random_state=2),\n", " 'base_estimator__max_depth': 5,\n", " 'max_features': 0.9,\n", " 'max_samples': 0.8,\n", " 'n_estimators': 10}" ] }, "metadata": {}, "execution_count": 94 } ] }, { "cell_type": "markdown", "source": [ "* Trees with a depth of 5 score best on F1.\n", "* Our model performs better with fewer estimators.\n", "* Taking less than 100% of the samples yields a higher-scoring model." ], "metadata": { "id": "Gf12bCFhRoYt" }, "id": "Gf12bCFhRoYt" }, { "cell_type": "code", "source": [ "bag_tuned=go.best_estimator_\n", "\n", "bag_tuned.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 118 }, "id": "oGHuXI1-0udc", "outputId": "02f080bf-6817-4dfa-b2b7-b2c11cac117c" }, "id": "oGHuXI1-0udc", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=5,\n", " random_state=2),\n", " max_features=0.9, max_samples=0.8, random_state=1)" ], "text/html": [ "
BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=5,\n",
              "                                                        random_state=2),\n",
              "                  max_features=0.9, max_samples=0.8, random_state=1)
" ] }, "metadata": {}, "execution_count": 95 } ] }, { "cell_type": "code", "source": [ "scores(bag_tuned)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "WdpMRiVrvPb0", "outputId": "d4b5984b-ef9f-46e4-ec35-ac911646257f" }, "id": "WdpMRiVrvPb0", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.738098 0.735298\n", "Precision 0.747520 0.746795\n", "Recall 0.918067 0.913546\n", "F1 0.824062 0.821797" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
              Train      Test
Accuracy   0.738098  0.735298
Precision  0.747520  0.746795
Recall     0.918067  0.913546
F1         0.824062  0.821797
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 96 } ] }, { "cell_type": "markdown", "source": [ "Note that the tuned bagging classifier has great recall. This model does a great job at reducing false negatives, meaning the US job market is not losing out on promising talent. On the other hand, precision is lower, which leaves us with an F1 score comparable to other generalized models." ], "metadata": { "id": "al5mYAiP1-Qz" }, "id": "al5mYAiP1-Qz" }, { "cell_type": "code", "source": [ "tabulate(bag_tuned,'Bagging clfr (tuned)')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 300 }, "id": "G7K98RXD1DVl", "outputId": "06a36460-6970-4b6d-a0f5-f6bc7d37b4a7" }, "id": "G7K98RXD1DVl", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit\n", "AdaBoost 0.734786 0.733857 0.817536 0.817036 General\n", "Gradient Boosting 0.754997 0.749312 0.827155 0.822581 General\n", "XGBoost 0.753762 0.750098 0.826558 0.823497 General\n", "dTree (tuned) 0.749382 0.750098 0.823571 0.823497 General\n", "Bagging clfr (tuned) 0.738098 0.735298 0.824062 0.821797 General" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                      Train Acc  Test Acc  Train F1   Test F1   Status
dTree (baseline)       1.000000  0.649902  1.000000  0.736208  Overfit
Bagging Classifier     0.984168  0.698232  0.988121  0.775614  Overfit
Random Forest          1.000000  0.725082  1.000000  0.802373  Overfit
AdaBoost               0.734786  0.733857  0.817536  0.817036  General
Gradient Boosting      0.754997  0.749312  0.827155  0.822581  General
XGBoost                0.753762  0.750098  0.826558  0.823497  General
dTree (tuned)          0.749382  0.750098  0.823571  0.823497  General
Bagging clfr (tuned)   0.738098  0.735298  0.824062  0.821797  General
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 97 } ] }, { "cell_type": "markdown", "source": [ "As an experiment, we try another grid search with a second option for the base estimator: Logistic Regression." ], "metadata": { "id": "k3SKSEPJ1fJ0" }, "id": "k3SKSEPJ1fJ0" }, { "cell_type": "code", "source": [ "params={'base_estimator':[tree.DecisionTreeClassifier(random_state=2),\n", " LogisticRegression(random_state=2,max_iter=1000)],\n", " 'n_estimators':[10,20,30,40,50],\n", " 'max_samples':[0.7,0.8,0.9,1.0],\n", " 'max_features':[0.7,0.8,0.9,1.0],}" ], "metadata": { "id": "2-QkrVVYvY6c" }, "id": "2-QkrVVYvY6c", "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "bag_tuned=BaggingClassifier(random_state=1)\n", "\n", "go=GridSearchCV(estimator=bag_tuned,\n", " param_grid=params,\n", " scoring='f1',\n", " n_jobs=-1,\n", " cv=5,\n", " verbose=1)\n", "\n", "go.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 135 }, "id": "_8wlJa1Tv9Uj", "outputId": "585dd445-9a78-4e3c-87fc-2f89a03dbc01" }, "id": "_8wlJa1Tv9Uj", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Fitting 5 folds for each of 160 candidates, totalling 800 fits\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "GridSearchCV(cv=5, estimator=BaggingClassifier(random_state=1), n_jobs=-1,\n", " param_grid={'base_estimator': [DecisionTreeClassifier(random_state=2),\n", " LogisticRegression(max_iter=1000,\n", " random_state=2)],\n", " 'max_features': [0.7, 0.8, 0.9, 1.0],\n", " 'max_samples': [0.7, 0.8, 0.9, 1.0],\n", " 'n_estimators': [10, 20, 30, 40, 50]},\n", " scoring='f1', verbose=1)" ], "text/html": [ "
GridSearchCV(cv=5, estimator=BaggingClassifier(random_state=1), n_jobs=-1,\n",
              "             param_grid={'base_estimator': [DecisionTreeClassifier(random_state=2),\n",
              "                                            LogisticRegression(max_iter=1000,\n",
              "                                                               random_state=2)],\n",
              "                         'max_features': [0.7, 0.8, 0.9, 1.0],\n",
              "                         'max_samples': [0.7, 0.8, 0.9, 1.0],\n",
              "                         'n_estimators': [10, 20, 30, 40, 50]},\n",
              "             scoring='f1', verbose=1)
" ] }, "metadata": {}, "execution_count": 99 } ] }, { "cell_type": "code", "source": [ "go.best_params_" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5-4pEIVJwlEq", "outputId": "59ccc460-6f25-4021-f65f-aa6d8c91b9bb" }, "id": "5-4pEIVJwlEq", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'base_estimator': DecisionTreeClassifier(random_state=2),\n", " 'max_features': 0.7,\n", " 'max_samples': 0.8,\n", " 'n_estimators': 50}" ] }, "metadata": {}, "execution_count": 100 } ] }, { "cell_type": "markdown", "source": [ "After fitting the grid search, we find that the decision tree classifier was still the strongest base estimator for our metric. However, this time more estimators were required." ], "metadata": { "id": "kNmsKPH51nDz" }, "id": "kNmsKPH51nDz" }, { "cell_type": "code", "source": [ "scores(go)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "SkP7EhTHwn9K", "outputId": "23f5d355-d13e-4d10-9841-26ace6bede53" }, "id": "SkP7EhTHwn9K", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.996351 0.722200\n", "Precision 0.995313 0.749916\n", "Recall 0.999244 0.876495\n", "F1 0.997274 0.808280" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
              Train      Test
Accuracy   0.996351  0.722200
Precision  0.995313  0.749916
Recall     0.999244  0.876495
F1         0.997274  0.808280
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 101 } ] }, { "cell_type": "markdown", "source": [ "It appears our model now overfits, as we have not limited the tree depth. We will not consider this model." ], "metadata": { "id": "iBRJKc-21vH9" }, "id": "iBRJKc-21vH9" }, { "cell_type": "markdown", "source": [ "#### Tuned Random Forest" ], "metadata": { "id": "tF2Sz8y2wJVH" }, "id": "tF2Sz8y2wJVH" }, { "cell_type": "code", "source": [ "params={'n_estimators':np.append(np.arange(150,350,50),1000),\n", " 'max_depth':np.arange(1,6),\n", " 'min_samples_split':np.arange(4,9),\n", " 'max_features':['sqrt','log2'],\n", " 'max_samples':[0.25,0.5,0.75,1.0]}" ], "metadata": { "id": "z96RaViKwJC1" }, "id": "z96RaViKwJC1", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "* We consider a number of estimators between 150 and 300. We also consider a huge model: 1000 trees.\n", "* Various depths will be studied.\n", "* As with the previous bagging model, we will test various values for the maximum number of features and samples." 
], "metadata": { "id": "C1JhNGBtUb7Y" }, "id": "C1JhNGBtUb7Y" }, { "cell_type": "code", "source": [ "rf_tuned=RandomForestClassifier(random_state=1,warm_start=True)\n", "\n", "go=GridSearchCV(estimator=rf_tuned,\n", " param_grid=params,\n", " scoring='f1',\n", " n_jobs=-1,\n", " cv=5,\n", " verbose=1)\n", "\n", "go.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 135 }, "id": "2RLLslbkyNyM", "outputId": "c3abbc60-7752-4a3c-8a16-ecd0854ba090" }, "id": "2RLLslbkyNyM", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Fitting 5 folds for each of 1000 candidates, totalling 5000 fits\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "GridSearchCV(cv=5,\n", " estimator=RandomForestClassifier(random_state=1, warm_start=True),\n", " n_jobs=-1,\n", " param_grid={'max_depth': array([1, 2, 3, 4, 5]),\n", " 'max_features': ['sqrt', 'log2'],\n", " 'max_samples': [0.25, 0.5, 0.75, 1.0],\n", " 'min_samples_split': array([4, 5, 6, 7, 8]),\n", " 'n_estimators': array([ 150, 200, 250, 300, 1000])},\n", " scoring='f1', verbose=1)" ], "text/html": [ "
GridSearchCV(cv=5,\n",
              "             estimator=RandomForestClassifier(random_state=1, warm_start=True),\n",
              "             n_jobs=-1,\n",
              "             param_grid={'max_depth': array([1, 2, 3, 4, 5]),\n",
              "                         'max_features': ['sqrt', 'log2'],\n",
              "                         'max_samples': [0.25, 0.5, 0.75, 1.0],\n",
              "                         'min_samples_split': array([4, 5, 6, 7, 8]),\n",
              "                         'n_estimators': array([ 150,  200,  250,  300, 1000])},\n",
              "             scoring='f1', verbose=1)
" ] }, "metadata": {}, "execution_count": 103 } ] }, { "cell_type": "code", "source": [ "go.best_params_" ], "metadata": { "id": "yUrmvgxG0-of", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "b7c0276d-3ec6-4134-8469-2249eca03a71" }, "id": "yUrmvgxG0-of", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'max_depth': 5,\n", " 'max_features': 'sqrt',\n", " 'max_samples': 0.75,\n", " 'min_samples_split': 7,\n", " 'n_estimators': 300}" ] }, "metadata": {}, "execution_count": 104 } ] }, { "cell_type": "markdown", "source": [ "* Limiting the number of features considered at each split to the square root of the total number of features scored better than using log base 2.\n", "* A maximum tree depth of 5 worked best.\n", "* The displayed search output selects 300 estimators and a `min_samples_split` of 7. The refit below uses 250 estimators and a `min_samples_split` of 8, which differ from that output." ], "metadata": { "id": "U6niOgWXU_Af" }, "id": "U6niOgWXU_Af" }, { "cell_type": "code", "source": [ "# train best estimator omitting warm_start parameter\n", "rf_tuned=RandomForestClassifier(max_depth=5,\n", " max_features='sqrt',\n", " max_samples=0.75,\n", " min_samples_split=8,\n", " n_estimators=250,\n", " random_state=1)\n", "\n", "rf_tuned.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 92 }, "id": "z-KnX-k9bURO", "outputId": "36064664-0dfd-477e-eaf2-a3a117dd169a" }, "id": "z-KnX-k9bURO", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "RandomForestClassifier(max_depth=5, max_samples=0.75, min_samples_split=8,\n", " n_estimators=250, random_state=1)" ], "text/html": [ "
RandomForestClassifier(max_depth=5, max_samples=0.75, min_samples_split=8,\n",
              "                       n_estimators=250, random_state=1)
" ] }, "metadata": {}, "execution_count": 105 } ] }, { "cell_type": "code", "source": [ "scores(rf_tuned)" ], "metadata": { "id": "f3wdjOpl08Iq", "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "outputId": "cf489f28-dcf1-405e-89ab-b0f4a8445100" }, "id": "f3wdjOpl08Iq", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.725803 0.723772\n", "Precision 0.729101 0.727011\n", "Recall 0.938151 0.939228\n", "F1 0.820520 0.819605" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
              Train      Test
Accuracy   0.725803  0.723772
Precision  0.729101  0.727011
Recall     0.938151  0.939228
F1         0.820520  0.819605
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 106 } ] }, { "cell_type": "markdown", "source": [ "Similar to the tuned bagging classifier above, we're getting over 90% recall with this model. Unfortunately, accuracy is still below 75% and F1 is no higher than before." ], "metadata": { "id": "0OsQqVeFWF1V" }, "id": "0OsQqVeFWF1V" }, { "cell_type": "code", "source": [ "tabulate(rf_tuned,'Random Forest (tuned)')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 332 }, "id": "VhjFfUp8bb3A", "outputId": "7dc85199-135f-4238-cb4f-4f073d147084" }, "id": "VhjFfUp8bb3A", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit\n", "AdaBoost 0.734786 0.733857 0.817536 0.817036 General\n", "Gradient Boosting 0.754997 0.749312 0.827155 0.822581 General\n", "XGBoost 0.753762 0.750098 0.826558 0.823497 General\n", "dTree (tuned) 0.749382 0.750098 0.823571 0.823497 General\n", "Bagging clfr (tuned) 0.738098 0.735298 0.824062 0.821797 General\n", "Random Forest (tuned) 0.725803 0.723772 0.820520 0.819605 General" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
                       Train Acc  Test Acc  Train F1   Test F1   Status
dTree (baseline)        1.000000  0.649902  1.000000  0.736208  Overfit
Bagging Classifier      0.984168  0.698232  0.988121  0.775614  Overfit
Random Forest           1.000000  0.725082  1.000000  0.802373  Overfit
AdaBoost                0.734786  0.733857  0.817536  0.817036  General
Gradient Boosting       0.754997  0.749312  0.827155  0.822581  General
XGBoost                 0.753762  0.750098  0.826558  0.823497  General
dTree (tuned)           0.749382  0.750098  0.823571  0.823497  General
Bagging clfr (tuned)    0.738098  0.735298  0.824062  0.821797  General
Random Forest (tuned)   0.725803  0.723772  0.820520  0.819605  General
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 107 } ] }, { "cell_type": "markdown", "source": [ "### Boosting" ], "metadata": { "id": "cwGoAvY3lZ9c" }, "id": "cwGoAvY3lZ9c" }, { "cell_type": "markdown", "source": [ "#### Tuned AdaBoost" ], "metadata": { "id": "D_WCWGywlfOp" }, "id": "D_WCWGywlfOp" }, { "cell_type": "code", "source": [ "params={'base_estimator':[tree.DecisionTreeClassifier(random_state=2)],\n", " 'base_estimator__max_depth':[1,2,3],\n", " 'n_estimators':np.arange(10,80,10),\n", " 'learning_rate':np.linspace(0.1,1,10)}" ], "metadata": { "id": "blU6ewjOl_rS" }, "id": "blU6ewjOl_rS", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Next we'll train an AdaBoost classifier, using a decision tree as the base estimator. We will consider between 10 and 70 estimators within the ensemble, various depths of the trees, and several different learning rates." ], "metadata": { "id": "z7WWoW2zW-ud" }, "id": "z7WWoW2zW-ud" }, { "cell_type": "code", "source": [ "abc_tuned=AdaBoostClassifier(random_state=1)\n", "\n", "go=GridSearchCV(estimator=abc_tuned,\n", " param_grid=params,\n", " scoring='f1',\n", " n_jobs=-1,\n", " cv=5,\n", " verbose=1)\n", "\n", "go.fit(X_train,y_train)" ], "metadata": { "id": "X-VgWmCnmaOX", "colab": { "base_uri": "https://localhost:8080/", "height": 135 }, "outputId": "70af74f0-05b0-4a09-cee8-bdd57f6ca537" }, "id": "X-VgWmCnmaOX", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Fitting 5 folds for each of 210 candidates, totalling 1050 fits\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "GridSearchCV(cv=5, estimator=AdaBoostClassifier(random_state=1), n_jobs=-1,\n", " param_grid={'base_estimator': [DecisionTreeClassifier(max_depth=3,\n", " random_state=2)],\n", " 'base_estimator__max_depth': [1, 2, 3],\n", " 'learning_rate': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. 
]),\n", " 'n_estimators': array([10, 20, 30, 40, 50, 60, 70])},\n", " scoring='f1', verbose=1)" ], "text/html": [ "
GridSearchCV(cv=5, estimator=AdaBoostClassifier(random_state=1), n_jobs=-1,\n",
              "             param_grid={'base_estimator': [DecisionTreeClassifier(max_depth=3,\n",
              "                                                                   random_state=2)],\n",
              "                         'base_estimator__max_depth': [1, 2, 3],\n",
              "                         'learning_rate': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),\n",
              "                         'n_estimators': array([10, 20, 30, 40, 50, 60, 70])},\n",
              "             scoring='f1', verbose=1)
" ] }, "metadata": {}, "execution_count": 109 } ] }, { "cell_type": "code", "source": [ "go.best_params_" ], "metadata": { "id": "5CkHBoR7mnyW", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "6fb2d370-0cc6-4fbe-bd3b-82718ea741ae" }, "id": "5CkHBoR7mnyW", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=2),\n", " 'base_estimator__max_depth': 3,\n", " 'learning_rate': 0.1,\n", " 'n_estimators': 50}" ] }, "metadata": {}, "execution_count": 110 } ] }, { "cell_type": "markdown", "source": [ "With a depth of 3 and a slower learning rate, our best model required 50 estimators." ], "metadata": { "id": "g5GxTXqkXVqO" }, "id": "g5GxTXqkXVqO" }, { "cell_type": "code", "source": [ "abc_tuned=go.best_estimator_\n", "\n", "abc_tuned.fit(X_train,y_train)" ], "metadata": { "id": "KZIRckXI4cvD", "colab": { "base_uri": "https://localhost:8080/", "height": 118 }, "outputId": "faf026d4-e8d8-4a30-810d-b7cf551747ef" }, "id": "KZIRckXI4cvD", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,\n", " random_state=2),\n", " learning_rate=0.1, random_state=1)" ], "text/html": [ "
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,\n",
              "                                                         random_state=2),\n",
              "                   learning_rate=0.1, random_state=1)
" ] }, "metadata": {}, "execution_count": 111 } ] }, { "cell_type": "code", "source": [ "scores(abc_tuned)" ], "metadata": { "id": "68hdXW9hmpPL", "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "outputId": "d1684263-14e9-49d5-f2ad-875c8cb288ed" }, "id": "68hdXW9hmpPL", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.750561 0.749050\n", "Precision 0.778933 0.778166\n", "Recall 0.874958 0.873358\n", "F1 0.824158 0.823019" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 112 } ] }, { "cell_type": "markdown", "source": [ "We get similar F1 performance to the tuned random forest, but better accuracy. Recall is lower than some other models." ], "metadata": { "id": "lC1GTEDiXffM" }, "id": "lC1GTEDiXffM" }, { "cell_type": "code", "source": [ "tabulate(abc_tuned,'AdaBoost (tuned)')" ], "metadata": { "id": "aaR_kAlk4cKs", "colab": { "base_uri": "https://localhost:8080/", "height": 363 }, "outputId": "24a11f68-fecc-4bcd-8d99-ab8347dc7a52" }, "id": "aaR_kAlk4cKs", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit\n", "AdaBoost 0.734786 0.733857 0.817536 0.817036 General\n", "Gradient Boosting 0.754997 0.749312 0.827155 0.822581 General\n", "XGBoost 0.753762 0.750098 0.826558 0.823497 General\n", "dTree (tuned) 0.749382 0.750098 0.823571 0.823497 General\n", "Bagging clfr (tuned) 0.738098 0.735298 0.824062 0.821797 General\n", "Random Forest (tuned) 0.725803 0.723772 0.820520 0.819605 General\n", "AdaBoost (tuned) 0.750561 0.749050 0.824158 0.823019 General" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 113 } ] }, { "cell_type": "markdown", "source": [ "#### Tuned Gradient Boosting" ], "metadata": { "id": "s2c61dRfnsud" }, "id": "s2c61dRfnsud" }, { "cell_type": "code", "source": [ "params={'init':[None,'zero',AdaBoostClassifier(random_state=2)],\n", " 'n_estimators':[50,75,100,150],\n", " 'max_features':[None,'sqrt']+np.linspace(0.7,1.0,4).tolist()}" ], "metadata": { "id": "AKOorlEHnxCZ" }, "id": "AKOorlEHnxCZ", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "We will next tune the gradient booster with several different initialization conditions. We will also check the number of estimators and the maximum number of features when looking for the best split." ], "metadata": { "id": "HgFe3YXZXptg" }, "id": "HgFe3YXZXptg" }, { "cell_type": "code", "source": [ "gbc_tuned=GradientBoostingClassifier(random_state=1,\n", " warm_start=True)\n", "\n", "go=GridSearchCV(estimator=gbc_tuned,\n", " param_grid=params,\n", " scoring='f1',\n", " n_jobs=-1,\n", " cv=5,\n", " verbose=1)\n", "\n", "go.fit(X_train,y_train)" ], "metadata": { "id": "WwkTVB2LoN2Q", "colab": { "base_uri": "https://localhost:8080/", "height": 135 }, "outputId": "24a7822a-6189-471c-fd9b-3d396eb291f6" }, "id": "WwkTVB2LoN2Q", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Fitting 5 folds for each of 72 candidates, totalling 360 fits\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "GridSearchCV(cv=5,\n", " estimator=GradientBoostingClassifier(random_state=1,\n", " warm_start=True),\n", " n_jobs=-1,\n", " param_grid={'init': [None, 'zero',\n", " AdaBoostClassifier(random_state=2)],\n", " 'max_features': [None, 'sqrt', 0.7, 0.7999999999999999,\n", " 0.9, 1.0],\n", " 'n_estimators': [50, 75, 100, 150]},\n", " scoring='f1', verbose=1)" ], "text/html": [ "
GridSearchCV(cv=5,\n",
              "             estimator=GradientBoostingClassifier(random_state=1,\n",
              "                                                  warm_start=True),\n",
              "             n_jobs=-1,\n",
              "             param_grid={'init': [None, 'zero',\n",
              "                                  AdaBoostClassifier(random_state=2)],\n",
              "                         'max_features': [None, 'sqrt', 0.7, 0.7999999999999999,\n",
              "                                          0.9, 1.0],\n",
              "                         'n_estimators': [50, 75, 100, 150]},\n",
              "             scoring='f1', verbose=1)
" ] }, "metadata": {}, "execution_count": 115 } ] }, { "cell_type": "code", "source": [ "go.best_params_" ], "metadata": { "id": "7rZupNsYoWRR", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "65e593ca-1a7a-489d-8194-e5b22deffc65" }, "id": "7rZupNsYoWRR", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'init': 'zero', 'max_features': 0.7, 'n_estimators': 50}" ] }, "metadata": {}, "execution_count": 116 } ] }, { "cell_type": "code", "source": [ "# train best estimator omitting warm_start parameter\n", "gbc_tuned=GradientBoostingClassifier(random_state=1,\n", " init=AdaBoostClassifier(random_state=2),\n", " max_features=0.7,\n", " n_estimators=100)\n", "\n", "gbc_tuned.fit(X_train,y_train)" ], "metadata": { "id": "mRqe4OIO4q-h", "colab": { "base_uri": "https://localhost:8080/", "height": 118 }, "outputId": "365bd4f6-f45c-4f3b-f036-bb94b166319a" }, "id": "mRqe4OIO4q-h", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "GradientBoostingClassifier(init=AdaBoostClassifier(random_state=2),\n", " max_features=0.7, random_state=1)" ], "text/html": [ "
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=2),\n",
              "                           max_features=0.7, random_state=1)
" ] }, "metadata": {}, "execution_count": 117 } ] }, { "cell_type": "code", "source": [ "scores(gbc_tuned)" ], "metadata": { "id": "iNYNLWrzoYGm", "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "outputId": "d99f5cc0-0c7a-4d49-9134-57d0a6dcf20d" }, "id": "iNYNLWrzoYGm", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.754379 0.751801\n", "Precision 0.782449 0.782020\n", "Recall 0.875882 0.871398\n", "F1 0.826533 0.824293" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 118 } ] }, { "cell_type": "markdown", "source": [ "Accuracy is just above 75% on both the training and test sets, slightly better than the tuned AdaBoost model. The F1 score still does not improve much beyond 0.82." ], "metadata": { "id": "N5RntnrUYmm7" }, "id": "N5RntnrUYmm7" }, { "cell_type": "code", "source": [ "tabulate(gbc_tuned,'Grad Boost (tuned)')" ], "metadata": { "id": "4oKx6N6L4xq8", "colab": { "base_uri": "https://localhost:8080/", "height": 394 }, "outputId": "9ce68fac-255a-4e90-de2a-dcdc6b0bb59b" }, "id": "4oKx6N6L4xq8", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit\n", "AdaBoost 0.734786 0.733857 0.817536 0.817036 General\n", "Gradient Boosting 0.754997 0.749312 0.827155 0.822581 General\n", "XGBoost 0.753762 0.750098 0.826558 0.823497 General\n", "dTree (tuned) 0.749382 0.750098 0.823571 0.823497 General\n", "Bagging clfr (tuned) 0.738098 0.735298 0.824062 0.821797 General\n", "Random Forest (tuned) 0.725803 0.723772 0.820520 0.819605 General\n", "AdaBoost (tuned) 0.750561 0.749050 0.824158 0.823019 General\n", "Grad Boost (tuned) 0.754379 0.751801 0.826533 0.824293 General" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 119 } ] }, { "cell_type": "markdown", "source": [ "#### Tuned XGBoost" ], "metadata": { "id": "WxsPtYuTp9S4" }, "id": "WxsPtYuTp9S4" }, { "cell_type": "code", "source": [ "params={'eta':[0.1,0.2,0.3],\n", " 'gamma':[0,2],\n", " 'subsample':[0.5,0.75,1.0],\n", " 'colsample_by_tree':[0.5,0.75,1.0],\n", " 'colsample_bylevel':[0.5,0.75,1.0],\n", " 'scale_pos_weight':[0.5,1.0]}" ], "metadata": { "id": "bW4SuH0As4sw" }, "id": "bW4SuH0As4sw", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "We will search over several XGBoost parameters.\n", "* The parameter eta is the learning rate for the model.\n", "* The parameter gamma is the minimum loss reduction required to further partition a leaf node of a tree.\n", "* Several sampling parameters control how the model samples the training data during fitting. Note that XGBoost spells the per-tree column sampling parameter ```colsample_bytree```; the key ```colsample_by_tree``` used in this grid is not a documented XGBoost parameter, so searching over it may have no effect.\n", "* The last parameter (```scale_pos_weight```) can help with imbalanced classes in the response variable." ], "metadata": { "id": "Z79wJl01YwoV" }, "id": "Z79wJl01YwoV" }, { "cell_type": "code", "source": [ "xgbc_tuned=XGBClassifier(random_state=1)\n", "\n", "go=GridSearchCV(estimator=xgbc_tuned,\n", " param_grid=params,\n", " scoring='f1',\n", " cv=5,\n", " verbose=1)\n", "\n", "go.fit(X_train,y_train)" ], "metadata": { "id": "LNP6jSd-qAB6", "colab": { "base_uri": "https://localhost:8080/", "height": 135 }, "outputId": "d4c8aa95-2bf2-4a7a-ca19-7b00393c714d" }, "id": "LNP6jSd-qAB6", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Fitting 5 folds for each of 324 candidates, totalling 1620 fits\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "GridSearchCV(cv=5, estimator=XGBClassifier(random_state=1),\n", " param_grid={'colsample_by_tree': [0.5, 0.75, 1.0],\n", " 'colsample_bylevel': [0.5, 0.75, 1.0],\n", " 'eta': [0.1, 0.2, 0.3], 'gamma': [0, 2],\n", " 'scale_pos_weight': [0.5, 1.0],\n", " 'subsample': [0.5, 0.75, 1.0]},\n", " scoring='f1', verbose=1)" 
], "text/html": [ "
GridSearchCV(cv=5, estimator=XGBClassifier(random_state=1),\n",
              "             param_grid={'colsample_by_tree': [0.5, 0.75, 1.0],\n",
              "                         'colsample_bylevel': [0.5, 0.75, 1.0],\n",
              "                         'eta': [0.1, 0.2, 0.3], 'gamma': [0, 2],\n",
              "                         'scale_pos_weight': [0.5, 1.0],\n",
              "                         'subsample': [0.5, 0.75, 1.0]},\n",
              "             scoring='f1', verbose=1)
" ] }, "metadata": {}, "execution_count": 121 } ] }, { "cell_type": "code", "source": [ "go.best_params_" ], "metadata": { "id": "ALq-QBkB1Q-m", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "29dc0d09-0cd8-4a65-aed6-4e58d932a800" }, "id": "ALq-QBkB1Q-m", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'colsample_by_tree': 0.5,\n", " 'colsample_bylevel': 1.0,\n", " 'eta': 0.1,\n", " 'gamma': 0,\n", " 'scale_pos_weight': 1.0,\n", " 'subsample': 1.0}" ] }, "metadata": {}, "execution_count": 122 } ] }, { "cell_type": "markdown", "source": [ "* A lower learning rate ended up being better for F1.\n", "* The classifier did not need to adjust class weights for better performance.\n", "* While ```colsample_by_tree``` is less than 100%, both ```colsample_by_level``` and ```subsample``` are maxed out at 1." ], "metadata": { "id": "mPeh54OAZmYX" }, "id": "mPeh54OAZmYX" }, { "cell_type": "code", "source": [ "xgbc_tuned=go.best_estimator_\n", "\n", "xgbc_tuned.fit(X_train,y_train)" ], "metadata": { "id": "lDFYa2pA44n_", "colab": { "base_uri": "https://localhost:8080/", "height": 92 }, "outputId": "6256ea64-4be5-4e1e-dc55-145a6b181377" }, "id": "lDFYa2pA44n_", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "XGBClassifier(colsample_by_tree=0.5, colsample_bylevel=1.0, eta=0.1,\n", " random_state=1, scale_pos_weight=1.0, subsample=1.0)" ], "text/html": [ "
XGBClassifier(colsample_by_tree=0.5, colsample_bylevel=1.0, eta=0.1,\n",
              "              random_state=1, scale_pos_weight=1.0, subsample=1.0)
" ] }, "metadata": {}, "execution_count": 123 } ] }, { "cell_type": "code", "source": [ "scores(xgbc_tuned)" ], "metadata": { "id": "abyxG0VW1Y8e", "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "outputId": "dc6ea90f-ce4a-4982-b98d-e42f087147bb" }, "id": "abyxG0VW1Y8e", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.753762 0.750098\n", "Precision 0.780624 0.779646\n", "Recall 0.878235 0.872574\n", "F1 0.826558 0.823497" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 124 } ] }, { "cell_type": "markdown", "source": [ "After training the best estimator on the full training set, we find that performance is nearly identical to the other tuned boosting models, and the scores match the untuned XGBoost model exactly." ], "metadata": { "id": "P-FQrBmCaIEP" }, "id": "P-FQrBmCaIEP" }, { "cell_type": "code", "source": [ "tabulate(xgbc_tuned,'XGBoost (tuned)')" ], "metadata": { "id": "MusEIuCk5ArY", "colab": { "base_uri": "https://localhost:8080/", "height": 426 }, "outputId": "c46dd4dd-d6e8-4fd7-cba9-23ba0740a076" }, "id": "MusEIuCk5ArY", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit\n", "AdaBoost 0.734786 0.733857 0.817536 0.817036 General\n", "Gradient Boosting 0.754997 0.749312 0.827155 0.822581 General\n", "XGBoost 0.753762 0.750098 0.826558 0.823497 General\n", "dTree (tuned) 0.749382 0.750098 0.823571 0.823497 General\n", "Bagging clfr (tuned) 0.738098 0.735298 0.824062 0.821797 General\n", "Random Forest (tuned) 0.725803 0.723772 0.820520 0.819605 General\n", "AdaBoost (tuned) 0.750561 0.749050 0.824158 0.823019 General\n", "Grad Boost (tuned) 0.754379 0.751801 0.826533 0.824293 General\n", "XGBoost (tuned) 0.753762 0.750098 0.826558 0.823497 General" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 125 } ] }, { "cell_type": "markdown", "source": [ "#### Stacking Classifier" ], "metadata": { "id": "n2H7XaAw5LO7" }, "id": "n2H7XaAw5LO7" }, { "cell_type": "code", "source": [ "ests=[('Decision Tree',dtree_tuned),\n", " ('Bagging Classifier',bag_tuned),\n", " ('Random Forest',rf_tuned),\n", " ('AdaBoost',abc_tuned),\n", " ('Gradient Boosting',gbc_tuned),\n", " ('XGBoost',xgbc_tuned)]" ], "metadata": { "id": "I1I3qC6RcwOW" }, "id": "I1I3qC6RcwOW", "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "With all six ensemble models tuned, we will now use a stacking classifier to hopefully gain additional predictive power. " ], "metadata": { "id": "OzbwAZDMaY7W" }, "id": "OzbwAZDMaY7W" }, { "cell_type": "code", "source": [ "stack=StackingClassifier(estimators=ests,\n", " final_estimator=XGBClassifier(random_state=1),\n", " cv=5)\n", "stack.fit(X_train,y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 258 }, "id": "W-BXivhnchcf", "outputId": "7c4dd587-b183-441c-9528-111d5d7c4595" }, "id": "W-BXivhnchcf", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "StackingClassifier(cv=5,\n", " estimators=[('Decision Tree',\n", " DecisionTreeClassifier(max_depth=7,\n", " max_leaf_nodes=20,\n", " min_samples_leaf=5,\n", " random_state=1)),\n", " ('Bagging Classifier',\n", " BaggingClassifier(random_state=1)),\n", " ('Random Forest',\n", " RandomForestClassifier(max_depth=5,\n", " max_samples=0.75,\n", " min_samples_split=8,\n", " n_estimators=250,\n", " random_state=1)),\n", " ('AdaBoost',\n", " AdaBoostClassifier(...3,\n", " random_state=2),\n", " learning_rate=0.1,\n", " random_state=1)),\n", " ('Gradient Boosting',\n", " GradientBoostingClassifier(init=AdaBoostClassifier(random_state=2),\n", " max_features=0.7,\n", " random_state=1)),\n", " ('XGBoost',\n", " XGBClassifier(colsample_by_tree=0.5,\n", " 
colsample_bylevel=1.0, eta=0.1,\n", " random_state=1,\n", " scale_pos_weight=1.0,\n", " subsample=1.0))],\n", " final_estimator=XGBClassifier(random_state=1))" ], "text/html": [ "
StackingClassifier(cv=5,\n",
              "                   estimators=[('Decision Tree',\n",
              "                                DecisionTreeClassifier(max_depth=7,\n",
              "                                                       max_leaf_nodes=20,\n",
              "                                                       min_samples_leaf=5,\n",
              "                                                       random_state=1)),\n",
              "                               ('Bagging Classifier',\n",
              "                                BaggingClassifier(random_state=1)),\n",
              "                               ('Random Forest',\n",
              "                                RandomForestClassifier(max_depth=5,\n",
              "                                                       max_samples=0.75,\n",
              "                                                       min_samples_split=8,\n",
              "                                                       n_estimators=250,\n",
              "                                                       random_state=1)),\n",
              "                               ('AdaBoost',\n",
              "                                AdaBoostClassifier(...3,\n",
              "                                                                                         random_state=2),\n",
              "                                                   learning_rate=0.1,\n",
              "                                                   random_state=1)),\n",
              "                               ('Gradient Boosting',\n",
              "                                GradientBoostingClassifier(init=AdaBoostClassifier(random_state=2),\n",
              "                                                           max_features=0.7,\n",
              "                                                           random_state=1)),\n",
              "                               ('XGBoost',\n",
              "                                XGBClassifier(colsample_by_tree=0.5,\n",
              "                                              colsample_bylevel=1.0, eta=0.1,\n",
              "                                              random_state=1,\n",
              "                                              scale_pos_weight=1.0,\n",
              "                                              subsample=1.0))],\n",
              "                   final_estimator=XGBClassifier(random_state=1))
" ] }, "metadata": {}, "execution_count": 128 } ] }, { "cell_type": "code", "source": [ "scores(stack)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 175 }, "id": "OOTOzgUyddm_", "outputId": "270ffaf6-a89d-4601-b73c-7b2bac5ae91d" }, "id": "OOTOzgUyddm_", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Test\n", "Accuracy 0.754772 0.749836\n", "Precision 0.783841 0.782640\n", "Recall 0.873950 0.866105\n", "F1 0.826446 0.822259" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 129 } ] }, { "cell_type": "markdown", "source": [ "The stacking classifier's test F1 (0.8223) is within a few thousandths of the other well-performing models but does not exceed the best of them: tuned gradient boosting reaches 0.8243 and tuned XGBoost 0.8235. Stacking did not deliver the improvement we hoped for." ], "metadata": { "id": "w0RuVkiLar3h" }, "id": "w0RuVkiLar3h" }, { "cell_type": "code", "source": [ "tabulate(stack,'Stacking')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 457 }, "id": "eQ2G9SUUddkS", "outputId": "ca813758-6727-4b95-c160-295b26e4542b" }, "id": "eQ2G9SUUddkS", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit\n", "AdaBoost 0.734786 0.733857 0.817536 0.817036 General\n", "Gradient Boosting 0.754997 0.749312 0.827155 0.822581 General\n", "XGBoost 0.753762 0.750098 0.826558 0.823497 General\n", "dTree (tuned) 0.749382 0.750098 0.823571 0.823497 General\n", "Bagging clfr (tuned) 0.738098 0.735298 0.824062 0.821797 General\n", "Random Forest (tuned) 0.725803 0.723772 0.820520 0.819605 General\n", "AdaBoost (tuned) 0.750561 0.749050 0.824158 0.823019 General\n", "Grad Boost (tuned) 0.754379 0.751801 0.826533 0.824293 General\n", "XGBoost (tuned) 0.753762 0.750098 0.826558 0.823497 General\n", "Stacking 0.754772 0.749836 0.826446 0.822259 General" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 130 } ] }, { "cell_type": "markdown", "id": "obvious-maine", "metadata": { "id": "obvious-maine" }, "source": [ "## Model Performance Comparison and Conclusions" ] }, { "cell_type": "markdown", "source": [ "### Best Model" ], "metadata": { "id": "in-ISvTOv8BX" }, "id": "in-ISvTOv8BX" }, { "cell_type": "markdown", "source": [ "We will examine the best model and compare it with some of the other high performing models." ], "metadata": { "id": "QeYSt8eSa5Fb" }, "id": "QeYSt8eSa5Fb" }, { "cell_type": "code", "execution_count": null, "id": "everyday-kinase", "metadata": { "id": "everyday-kinase", "colab": { "base_uri": "https://localhost:8080/", "height": 457 }, "outputId": "823aef45-accc-4519-cdb2-796577ec3cdd" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Train Acc Test Acc Train F1 Test F1 Status\n", "dTree (baseline) 1.000000 0.649902 1.000000 0.736208 Overfit\n", "Bagging Classifier 0.984168 0.698232 0.988121 0.775614 Overfit\n", "Random Forest 1.000000 0.725082 1.000000 0.802373 Overfit\n", "AdaBoost 0.734786 0.733857 0.817536 0.817036 General\n", "Gradient Boosting 0.754997 0.749312 0.827155 0.822581 General\n", "XGBoost 0.753762 0.750098 0.826558 0.823497 General\n", "dTree (tuned) 0.749382 0.750098 0.823571 0.823497 General\n", "Bagging clfr (tuned) 0.738098 0.735298 0.824062 0.821797 General\n", "Random Forest (tuned) 0.725803 0.723772 0.820520 0.819605 General\n", "AdaBoost (tuned) 0.750561 0.749050 0.824158 0.823019 General\n", "Grad Boost (tuned) 0.754379 0.751801 0.826533 0.824293 General\n", "XGBoost (tuned) 0.753762 0.750098 0.826558 0.823497 General\n", "Stacking 0.754772 0.749836 0.826446 0.822259 General" ], "text/html": [ "\n", "
\n", " " ] }, "metadata": {}, "execution_count": 154 } ], "source": [ "mct=model_comp_table\n", "mct" ] }, { "cell_type": "markdown", "source": [ "We will define the 'best model' as that which has the highest F1 score on testing data." ], "metadata": { "id": "B7rf3R3vrXLz" }, "id": "B7rf3R3vrXLz" }, { "cell_type": "code", "source": [ "mct.loc[mct['Status']=='General']['Test F1'].idxmax()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "GEb9M8JcrBmi", "outputId": "d83c22c4-8c3b-4bf7-85ee-3c309abb70bb" }, "id": "GEb9M8JcrBmi", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'Grad Boost (tuned)'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 155 } ] }, { "cell_type": "markdown", "source": [ "The best generalized model is the tuned gradient boosting classifier." ], "metadata": { "id": "SNVNHrqCrRqC" }, "id": "SNVNHrqCrRqC" }, { "cell_type": "code", "source": [ "mct.loc[mct['Status']=='General']['Test Acc'].idxmax()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "hhuDYfT7r9DH", "outputId": "0a08879b-3c71-4395-d32c-6c1af7433917" }, "id": "hhuDYfT7r9DH", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'Grad Boost (tuned)'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 156 } ] }, { "cell_type": "code", "source": [ "mct.loc['Grad Boost (tuned)']" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BB1Swd69rzd8", "outputId": "18d2114e-5a17-4ce3-86fe-654b258a511d" }, "id": "BB1Swd69rzd8", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Train Acc 0.754379\n", "Test Acc 0.751801\n", "Train F1 0.826533\n", "Test F1 0.824293\n", "Status General\n", "Name: Grad Boost (tuned), dtype: 
object" ] }, "metadata": {}, "execution_count": 157 } ] }, { "cell_type": "markdown", "source": [ "Incidentally, the tuned gradient boosting classifier is also the most accurate on test data, with over 75% accuracy. The F1 score of this model is a bit over 82%." ], "metadata": { "id": "HXull7Zmr2YW" }, "id": "HXull7Zmr2YW" }, { "cell_type": "code", "source": [ "confusion_heatmap(gbc_tuned)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 365 }, "id": "kiFYuB5Arhkk", "outputId": "b162dc8e-38a5-4a00-b9ae-4563e8edfd41" }, "id": "kiFYuB5Arhkk", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "The tuned gradient boosting classifier produces few false negatives but comparatively more false positives. This explains why its recall is higher than its precision.\n", "\n", "The percentage of true negatives is nearly identical to the percentage of false positives, around 16%. This means our model does a poor job of distinguishing Denied cases, as its predictions for them are no better than random guessing. Put another way, there is only about a 50% chance that our model will correctly identify a Denied case." ], "metadata": { "id": "BLkr2LdqrsUe" }, "id": "BLkr2LdqrsUe" }, { "cell_type": "code", "source": [ "# two-proportions z-test: true negatives vs. false positives among Denied cases\n", "num_obs=1239+1295\n", "z_stat,p_val=proportions_ztest([1239,1295],[num_obs,num_obs])\n", "print('The p-value is',p_val)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "eN7AScs70aA_", "outputId": "4d0be5f6-c4df-4100-a07f-c65ca259e575" }, "id": "eN7AScs70aA_", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "The p-value is 0.11565928243244895\n" ] } ] }, { "cell_type": "markdown", "source": [ "The two-proportions z-test above checks our model's ability to predict Denied cases. With a p-value of about 0.12, greater than 0.05, we cannot reject the hypothesis that true negatives and false positives are equally frequent, so on Denied cases the model is not demonstrably better than random guessing."
], "metadata": { "id": "Xc1wMwrg0g-D" }, "id": "Xc1wMwrg0g-D" }, { "cell_type": "code", "source": [ "# feature importance\n", "ser=pd.Series(gbc_tuned.feature_importances_,index=X_train.columns)\n", "ser=ser.sort_values(ascending=False)\n", "print('Feature importance')\n", "print('-'*10)\n", "ser" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "CijVFR1EtzgG", "outputId": "0cabc501-fbb4-4c06-f250-d16f4fa65a77" }, "id": "CijVFR1EtzgG", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Feature importance\n", "----------\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "education_of_employee 0.443459\n", "has_job_experience 0.150752\n", "unit_of_wage_Hour 0.100343\n", "continent_Europe 0.061272\n", "prevailing_wage 0.054921\n", "unit_of_wage_Year 0.030446\n", "region_of_employment_Midwest 0.029068\n", "continent_North America 0.022887\n", "region_of_employment_South 0.021740\n", "no_of_employees 0.018451\n", "region_of_employment_West 0.017778\n", "years_in_business 0.012811\n", "full_time_position 0.008663\n", "requires_job_training 0.008135\n", "continent_South America 0.006894\n", "region_of_employment_Northeast 0.005634\n", "continent_Asia 0.003787\n", "region_of_employment_Island 0.000889\n", "unit_of_wage_Week 0.000589\n", "continent_Oceania 0.000537\n", "unit_of_wage_Month 0.000521\n", "continent_Africa 0.000423\n", "dtype: float64" ] }, "metadata": {}, "execution_count": 171 } ] }, { "cell_type": "markdown", "source": [ "* Education of the applicant is most important, i.e., has the greatest predictive impact on ```case_status```.\n", "* Job experience is the next most influential feature.\n", "* Continent of origin is less important, as are the years in business and the full-time status of the role."
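, "

Because `continent`, `unit_of_wage`, and `region_of_employment` each appear above as several one-hot columns, their importance is spread across those columns. One way to compare them with single-column features is to sum the importances per original feature. The snippet below is a minimal sketch of that idea: the values are copied from the output above into a plain dict so it runs on its own, and `ONE_HOT_PARENTS` and `parent_feature` are hypothetical names, not part of the notebook. It assumes these three are the only one-hot encoded features, and summing impurity-based importances is a heuristic rather than an exact decomposition.

```python
from collections import defaultdict

# Importances copied from the tuned gradient boosting output above.
importances = {
    'education_of_employee': 0.443459,
    'has_job_experience': 0.150752,
    'unit_of_wage_Hour': 0.100343,
    'continent_Europe': 0.061272,
    'prevailing_wage': 0.054921,
    'unit_of_wage_Year': 0.030446,
    'region_of_employment_Midwest': 0.029068,
    'continent_North America': 0.022887,
    'region_of_employment_South': 0.021740,
    'no_of_employees': 0.018451,
    'region_of_employment_West': 0.017778,
    'years_in_business': 0.012811,
    'full_time_position': 0.008663,
    'requires_job_training': 0.008135,
    'continent_South America': 0.006894,
    'region_of_employment_Northeast': 0.005634,
    'continent_Asia': 0.003787,
    'region_of_employment_Island': 0.000889,
    'unit_of_wage_Week': 0.000589,
    'continent_Oceania': 0.000537,
    'unit_of_wage_Month': 0.000521,
    'continent_Africa': 0.000423,
}

# Assumption: only these features were one-hot encoded.
ONE_HOT_PARENTS = ('continent', 'unit_of_wage', 'region_of_employment')


def parent_feature(column):
    '''Map a one-hot column name back to its original feature name.'''
    for parent in ONE_HOT_PARENTS:
        if column.startswith(parent + '_'):
            return parent
    return column


# Sum the importance of every column that belongs to the same original feature.
grouped = defaultdict(float)
for column, value in importances.items():
    grouped[parent_feature(column)] += value

ranked = sorted(grouped.items(), key=lambda item: item[1], reverse=True)
for name, value in ranked:
    print(f'{name:<25}{value:.4f}')
```

With these numbers, `unit_of_wage` (about 0.13) and `continent` (about 0.10) rank third and fourth, behind education and job experience."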
], "metadata": { "id": "6aBP-XfjuNO_" }, "id": "6aBP-XfjuNO_" }, { "cell_type": "code", "source": [ "plott()\n", "plt.title('Feature Importances in Tuned Gradient Boosting Classifier',fontsize=14)\n", "sns.barplot(x=ser,y=ser.index)\n", "plt.xlabel('Importance');" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "DJk2tTTVu02Z", "outputId": "12e0f203-da03-4d9a-ff26-34bd3aa3ad3a" }, "id": "DJk2tTTVu02Z", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "The plot above shows the relative importance of each feature. Education accounts for about 44% of total importance, roughly three times that of job experience, the next most important feature." ], "metadata": { "id": "9_Mmh35qvGhk" }, "id": "9_Mmh35qvGhk" }, { "cell_type": "markdown", "source": [ "### Other Good Models" ], "metadata": { "id": "UZkFTwn7wC1w" }, "id": "UZkFTwn7wC1w" }, { "cell_type": "markdown", "source": [ "#### Tuned Decision Tree" ], "metadata": { "id": "KvhRCNQAx8wM" }, "id": "KvhRCNQAx8wM" }, { "cell_type": "code", "source": [ "mct.loc['dTree (tuned)']" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2oRsYgmIwFon", "outputId": "9adf54dd-b6d6-4bd9-9fc2-1c9d49ab0106" }, "id": "2oRsYgmIwFon", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Train Acc 0.749382\n", "Test Acc 0.750098\n", "Train F1 0.823571\n", "Test F1 0.823497\n", "Status General\n", "Name: dTree (tuned), dtype: object" ] }, "metadata": {}, "execution_count": 173 } ] }, { "cell_type": "markdown", "source": [ "The tuned decision tree also performed well on test data, with around 75% accuracy and an F1 score of about 82%. This appears to be the performance ceiling for the models tried here." ], "metadata": { "id": "JaT9XRmgwdPN" }, "id": "JaT9XRmgwdPN" }, { "cell_type": "code", "source": [ "confusion_heatmap(dtree_tuned)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 365 }, "id": "eIusXiLjw-bN", "outputId": "cb52664d-dd7d-4b0c-aaff-36f980641db2" }, "id": "eIusXiLjw-bN", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "The confusion matrix for the tuned decision tree looks quite similar to that of the tuned gradient boosting classifier. Our models appear to share a weakness: they struggle to identify Denied applications." ], "metadata": { "id": "NghPUQCGxDGE" }, "id": "NghPUQCGxDGE" }, { "cell_type": "code", "source": [ "# feature importance of the tuned decision tree\n", "ser=pd.Series(dtree_tuned.feature_importances_,index=X_train.columns)\n", "ser=ser.sort_values(ascending=False)\n", "\n", "plott()\n", "plt.title('Feature Importances in Tuned Decision Tree',fontsize=14)\n", "sns.barplot(x=ser,y=ser.index)\n", "plt.xlabel('Importance');" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 356 }, "id": "09icVx6hxVBA", "outputId": "ae1f5640-71ad-4650-dbba-047cb690dff1" }, "id": "09icVx6hxVBA", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Once more, education and job experience are the most important features in predicting ```case_status```." ], "metadata": { "id": "0skR8ezqxlEv" }, "id": "0skR8ezqxlEv" }, { "cell_type": "markdown", "source": [ "#### Stacking Classifier" ], "metadata": { "id": "3msMRJmxx_O2" }, "id": "3msMRJmxx_O2" }, { "cell_type": "code", "source": [ "mct.loc['Stacking']" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Kb9XRo0LyBX-", "outputId": "05742599-980d-40b4-e7dd-75df92fb9c98" }, "id": "Kb9XRo0LyBX-", "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Train Acc 0.754772\n", "Test Acc 0.749836\n", "Train F1 0.826446\n", "Test F1 0.822259\n", "Status General\n", "Name: Stacking, dtype: object" ] }, "metadata": {}, "execution_count": 176 } ] }, { "cell_type": "markdown", "source": [ "Despite stacking multiple models together, nothing was gained by implementing the stacking classifier. Accuracy is still around 75% and F1 is at 82%.\n", "\n", "It is possible that most of the models misclassified the same records, i.e., had the same false positives and false negatives. If this is the case, then stacking them together would not add any predictive power.\n", "\n", "We can get an idea from the confusion matrix." ], "metadata": { "id": "bsJYDyfayJiB" }, "id": "bsJYDyfayJiB" }, { "cell_type": "code", "source": [ "confusion_heatmap(stack)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 365 }, "id": "7u90xFquyxLm", "outputId": "5f127702-030c-4ab1-879a-355a33336379" }, "id": "7u90xFquyxLm", "execution_count": null, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": [ "Here, it appears the stacking classifier is slightly better than earlier models at predicting Denied cases. We can check whether its performance is better than random guessing with a two-proportions z-test (significance level 0.05)." ], "metadata": { "id": "j7Tuq5fHzWF5" }, "id": "j7Tuq5fHzWF5" }, { "cell_type": "code", "source": [ "# two-proportions z-test: true negatives vs. false positives among Denied cases\n", "num_obs=1227+1307\n", "z_stat,p_val=proportions_ztest([1227,1307],[num_obs,num_obs])\n", "print('The p-value is',p_val)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QiFTWQXDy7Ur", "outputId": "ab372386-f243-4e2b-b757-8291a3ad2e79" }, "id": "QiFTWQXDy7Ur", "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "The p-value is 0.024607436753421648\n" ] } ] }, { "cell_type": "markdown", "source": [ "With a p-value less than 0.05, we find a significant difference between the number of true negatives and the number of false positives. Thus, by this test, the stacking classifier is better at predicting Denied cases than random guessing." ], "metadata": { "id": "0FIZ43OVzxZ5" }, "id": "0FIZ43OVzxZ5" }, { "cell_type": "markdown", "id": "nasty-retailer", "metadata": { "id": "nasty-retailer" }, "source": [ "## Actionable Insights and Recommendations" ] }, { "cell_type": "markdown", "source": [ "* Our model correctly predicts about 75% of case statuses. While this is fairly good, and certainly better than guessing, there is much room for improvement.\n", "* Our chosen scoring metric was F1, as it provides a good balance of recall and precision. Models maxed out at 82% for the F1 score. Meta-ensemble methods like the stacking classifier could not further improve this score.\n", "* Level of education has the greatest influence on case status prediction. One way for OFLC to manage the number of applications is to process an initial short-form application to immediately rule out candidates.
If this approach were used, level of education would surely be one of the most helpful questions to ask.\n", "* Previous job experience also has a big impact on case status. This too would be good to check in an initial automated screening. If the candidate passes the screening, they would be permitted to submit a complete application for review by a case reviewer at OFLC.\n", "* As was seen in the EDA, a higher level of education is associated with a greater chance of being Certified. The ideal candidate has a Doctorate and previous job experience, is from Europe, and is seeking part-time employment in the Midwest.\n", "* Using machine learning models like those developed here, OFLC could in fact automate the entire screening! We could, for example, use a stacking classifier with a number of the estimators found above and logistic regression as the final estimator. We would then define two cutoffs: one for a high probability of being Certified and one for a high probability of being Denied. We can then assign statuses to the most obvious cases, leaving the borderline cases for the reviewers at OFLC to consider. In effect, we would cut down on the number of applications that need human review, while still allowing the US job market to grow." ], "metadata": { "id": "ZXxLTxqT1GfD" }, "id": "ZXxLTxqT1GfD" } ], "metadata": { "colab": { "provenance": [], "collapsed_sections": [ "difficult-union", "obvious-maine", "nasty-retailer" ], "include_colab_link": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "gpuClass": "premium" }, "nbformat": 4, "nbformat_minor": 5 }