{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 02\n", "\n", "Estimate a regression using the Income data\n", "\n", "\n", "## Forecast of income\n", "\n", "We'll be working with a dataset of US Census income data ([data dictionary](https://archive.ics.uci.edu/ml/datasets/Adult)).\n", "\n", "Many businesses would like to personalize their offers based on a customer’s income. High-income customers could, for instance, be shown premium products. As a customer’s income is not always explicitly known, a predictive model could estimate a person’s income from other information.\n", "\n", "Our goal is to create a predictive model that outputs an estimate of a person’s income." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>Age</th><th>Workclass</th><th>fnlwgt</th><th>Education</th><th>Education-Num</th><th>Martial Status</th><th>Occupation</th><th>Relationship</th><th>Race</th><th>Sex</th><th>Capital Gain</th><th>Capital Loss</th><th>Hours per week</th><th>Country</th><th>Income</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>39</td><td>State-gov</td><td>77516</td><td>Bachelors</td><td>13</td><td>Never-married</td><td>Adm-clerical</td><td>Not-in-family</td><td>White</td><td>Male</td><td>2174</td><td>0</td><td>40</td><td>United-States</td><td>51806.0</td></tr>\n",
 "    <tr><th>1</th><td>50</td><td>Self-emp-not-inc</td><td>83311</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Exec-managerial</td><td>Husband</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>13</td><td>United-States</td><td>68719.0</td></tr>\n",
 "    <tr><th>2</th><td>38</td><td>Private</td><td>215646</td><td>HS-grad</td><td>9</td><td>Divorced</td><td>Handlers-cleaners</td><td>Not-in-family</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>51255.0</td></tr>\n",
 "    <tr><th>3</th><td>53</td><td>Private</td><td>234721</td><td>11th</td><td>7</td><td>Married-civ-spouse</td><td>Handlers-cleaners</td><td>Husband</td><td>Black</td><td>Male</td><td>0</td><td>0</td><td>40</td><td>United-States</td><td>47398.0</td></tr>\n",
 "    <tr><th>4</th><td>28</td><td>Private</td><td>338409</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Prof-specialty</td><td>Wife</td><td>Black</td><td>Female</td><td>0</td><td>0</td><td>40</td><td>Cuba</td><td>30493.0</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ " Age Workclass fnlwgt Education Education-Num \\\n", "0 39 State-gov 77516 Bachelors 13 \n", "1 50 Self-emp-not-inc 83311 Bachelors 13 \n", "2 38 Private 215646 HS-grad 9 \n", "3 53 Private 234721 11th 7 \n", "4 28 Private 338409 Bachelors 13 \n", "\n", " Martial Status Occupation Relationship Race Sex \\\n", "0 Never-married Adm-clerical Not-in-family White Male \n", "1 Married-civ-spouse Exec-managerial Husband White Male \n", "2 Divorced Handlers-cleaners Not-in-family White Male \n", "3 Married-civ-spouse Handlers-cleaners Husband Black Male \n", "4 Married-civ-spouse Prof-specialty Wife Black Female \n", "\n", " Capital Gain Capital Loss Hours per week Country Income \n", "0 2174 0 40 United-States 51806.0 \n", "1 0 0 13 United-States 68719.0 \n", "2 0 0 40 United-States 51255.0 \n", "3 0 0 40 United-States 47398.0 \n", "4 0 0 40 Cuba 30493.0 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "\n", "# read the data from the zipped CSV, using the first column as the row index\n", "import zipfile\n", "with zipfile.ZipFile('../datasets/income.csv.zip', 'r') as z:\n", "    f = z.open('income.csv')\n", "    income = pd.read_csv(f, index_col=0)\n", "\n", "income.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(32561, 15)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "income.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 2.1 \n", "\n", "What is the relationship between Age and Income?\n", "\n", "For a one percent increase in Age, by how much does Income increase?\n", "\n", "Using sklearn, estimate a linear regression and predict the income when Age is 30 and 40 years." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "income.plot(x='Age', y='Income', kind='scatter')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 2.2\n", "Evaluate the model using the mean squared error (MSE)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Exercise 2.3\n", "\n", "Fit a regression model with Age and Age$^2$ as features, using the OLS equations." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 2.4\n", "\n", "\n", "Estimate a regression using more features.\n", "\n", "How does the performance compare with using only Age?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 2.5\n", "\n", "\n", "Estimate a logistic regression to predict whether a person's country is the United States.\n", "\n", "What is the performance of the model?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0 29170\n", "0.0 3391\n", "Name: isUS, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "income['isUS'] = (income['Country'] == 'United-States')*1.0\n", "income['isUS'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }