{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 13\n", "\n", "The Automobile Data Set contains a good mix of categorical and continuous values and is relatively easy to understand. Because domain understanding matters when deciding how to encode categorical values, this data set makes a good case study." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the data into pandas." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th><th>symboling</th><th>normalized_losses</th><th>make</th><th>fuel_type</th><th>aspiration</th><th>num_doors</th><th>body_style</th><th>drive_wheels</th><th>engine_location</th><th>wheel_base</th><th>...</th><th>engine_size</th><th>fuel_system</th><th>bore</th><th>stroke</th><th>compression_ratio</th><th>horsepower</th><th>peak_rpm</th><th>city_mpg</th><th>highway_mpg</th><th>price</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>3</td><td>NaN</td><td>alfa-romero</td><td>gas</td><td>std</td><td>two</td><td>convertible</td><td>rwd</td><td>front</td><td>88.6</td><td>...</td><td>130</td><td>mpfi</td><td>3.47</td><td>2.68</td><td>9.0</td><td>111.0</td><td>5000.0</td><td>21</td><td>27</td><td>13495.0</td></tr>\n",
"    <tr><th>1</th><td>3</td><td>NaN</td><td>alfa-romero</td><td>gas</td><td>std</td><td>two</td><td>convertible</td><td>rwd</td><td>front</td><td>88.6</td><td>...</td><td>130</td><td>mpfi</td><td>3.47</td><td>2.68</td><td>9.0</td><td>111.0</td><td>5000.0</td><td>21</td><td>27</td><td>16500.0</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>NaN</td><td>alfa-romero</td><td>gas</td><td>std</td><td>two</td><td>hatchback</td><td>rwd</td><td>front</td><td>94.5</td><td>...</td><td>152</td><td>mpfi</td><td>2.68</td><td>3.47</td><td>9.0</td><td>154.0</td><td>5000.0</td><td>19</td><td>26</td><td>16500.0</td></tr>\n",
"    <tr><th>3</th><td>2</td><td>164.0</td><td>audi</td><td>gas</td><td>std</td><td>four</td><td>sedan</td><td>fwd</td><td>front</td><td>99.8</td><td>...</td><td>109</td><td>mpfi</td><td>3.19</td><td>3.40</td><td>10.0</td><td>102.0</td><td>5500.0</td><td>24</td><td>30</td><td>13950.0</td></tr>\n",
"    <tr><th>4</th><td>2</td><td>164.0</td><td>audi</td><td>gas</td><td>std</td><td>four</td><td>sedan</td><td>4wd</td><td>front</td><td>99.4</td><td>...</td><td>136</td><td>mpfi</td><td>3.19</td><td>3.40</td><td>8.0</td><td>115.0</td><td>5500.0</td><td>18</td><td>22</td><td>17450.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 26 columns</p>\n",
"</div>
" ], "text/plain": [ " symboling normalized_losses make fuel_type aspiration num_doors \\\n", "0 3 NaN alfa-romero gas std two \n", "1 3 NaN alfa-romero gas std two \n", "2 1 NaN alfa-romero gas std two \n", "3 2 164.0 audi gas std four \n", "4 2 164.0 audi gas std four \n", "\n", " body_style drive_wheels engine_location wheel_base ... engine_size \\\n", "0 convertible rwd front 88.6 ... 130 \n", "1 convertible rwd front 88.6 ... 130 \n", "2 hatchback rwd front 94.5 ... 152 \n", "3 sedan fwd front 99.8 ... 109 \n", "4 sedan 4wd front 99.4 ... 136 \n", "\n", " fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg \\\n", "0 mpfi 3.47 2.68 9.0 111.0 5000.0 21 \n", "1 mpfi 3.47 2.68 9.0 111.0 5000.0 21 \n", "2 mpfi 2.68 3.47 9.0 154.0 5000.0 19 \n", "3 mpfi 3.19 3.40 10.0 102.0 5500.0 24 \n", "4 mpfi 3.19 3.40 8.0 115.0 5500.0 18 \n", "\n", " highway_mpg price \n", "0 27 13495.0 \n", "1 27 16500.0 \n", "2 26 16500.0 \n", "3 30 13950.0 \n", "4 22 17450.0 \n", "\n", "[5 rows x 26 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# Define the headers since the data does not have any\n", "headers = [\"symboling\", \"normalized_losses\", \"make\", \"fuel_type\", \"aspiration\",\n", " \"num_doors\", \"body_style\", \"drive_wheels\", \"engine_location\",\n", " \"wheel_base\", \"length\", \"width\", \"height\", \"curb_weight\",\n", " \"engine_type\", \"num_cylinders\", \"engine_size\", \"fuel_system\",\n", " \"bore\", \"stroke\", \"compression_ratio\", \"horsepower\", \"peak_rpm\",\n", " \"city_mpg\", \"highway_mpg\", \"price\"]\n", "\n", "# Read in the CSV file and convert \"?\" to NaN\n", "df = pd.read_csv(\"http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data\",\n", " header=None, names=headers, na_values=\"?\" )\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(205, 26)" ] }, 
"execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "symboling int64\n", "normalized_losses float64\n", "make object\n", "fuel_type object\n", "aspiration object\n", "num_doors object\n", "body_style object\n", "drive_wheels object\n", "engine_location object\n", "wheel_base float64\n", "length float64\n", "width float64\n", "height float64\n", "curb_weight int64\n", "engine_type object\n", "num_cylinders object\n", "engine_size int64\n", "fuel_system object\n", "bore float64\n", "stroke float64\n", "compression_ratio float64\n", "horsepower float64\n", "peak_rpm float64\n", "city_mpg int64\n", "highway_mpg int64\n", "price float64\n", "dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th><th>make</th><th>fuel_type</th><th>aspiration</th><th>num_doors</th><th>body_style</th><th>drive_wheels</th><th>engine_location</th><th>engine_type</th><th>num_cylinders</th><th>fuel_system</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>alfa-romero</td><td>gas</td><td>std</td><td>two</td><td>convertible</td><td>rwd</td><td>front</td><td>dohc</td><td>four</td><td>mpfi</td></tr>\n",
"    <tr><th>1</th><td>alfa-romero</td><td>gas</td><td>std</td><td>two</td><td>convertible</td><td>rwd</td><td>front</td><td>dohc</td><td>four</td><td>mpfi</td></tr>\n",
"    <tr><th>2</th><td>alfa-romero</td><td>gas</td><td>std</td><td>two</td><td>hatchback</td><td>rwd</td><td>front</td><td>ohcv</td><td>six</td><td>mpfi</td></tr>\n",
"    <tr><th>3</th><td>audi</td><td>gas</td><td>std</td><td>four</td><td>sedan</td><td>fwd</td><td>front</td><td>ohc</td><td>four</td><td>mpfi</td></tr>\n",
"    <tr><th>4</th><td>audi</td><td>gas</td><td>std</td><td>four</td><td>sedan</td><td>4wd</td><td>front</td><td>ohc</td><td>five</td><td>mpfi</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " make fuel_type aspiration num_doors body_style drive_wheels \\\n", "0 alfa-romero gas std two convertible rwd \n", "1 alfa-romero gas std two convertible rwd \n", "2 alfa-romero gas std two hatchback rwd \n", "3 audi gas std four sedan fwd \n", "4 audi gas std four sedan 4wd \n", "\n", " engine_location engine_type num_cylinders fuel_system \n", "0 front dohc four mpfi \n", "1 front dohc four mpfi \n", "2 front ohcv six mpfi \n", "3 front ohc four mpfi \n", "4 front ohc five mpfi " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obj_df = df.select_dtypes(include=['object']).copy()\n", "obj_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 13.1\n", "\n", "Does the data set contain missing values? If so, replace them using one of the methods explained in class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 13.2\n", "\n", "Split the data into training and testing sets.\n", "\n", "Train a Random Forest Regressor to predict the price of a car using the numeric features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 13.3\n", "\n", "Create dummy variables for the categorical features.\n", "\n", "Train a Random Forest Regressor on the encoded data and compare its results with those from Exercise 13.2." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 13.4\n", "\n", "Apply two other methods of categorical encoding.\n", "\n", "Compare the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": 
"Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 1 }