{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Forest Cover Type Prediction\n", "\n", "My first quick attempt at the [Forest Cover Type Prediction Kaggle Competition](https://www.kaggle.com/c/forest-cover-type-prediction), one of the [recommended starter projects](https://www.quora.com/What-Kaggle-competitions-should-a-beginner-start-with-1) I'm working through as part of my [ML Study Curriculum](http://karlrosaen.com/ml). The goal is to preprocess the data, explore the performance of a few algorithms, and make some initial submissions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading and preprocessing\n", "\n", "Let's load the labeled training dataset into a pandas DataFrame and take a gander. The [dataset section](https://www.kaggle.com/c/forest-cover-type-prediction/data) of the competition also summarizes the variables in the dataset." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "labeled_df = pd.read_csv('train.csv')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
|   | Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 2 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 3 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | 4 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 5 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |

5 rows × 56 columns
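
The table that follows shows the same rows after rescaling: the continuous columns now hover around zero and the `Soil_Type` indicators read -1 instead of 0. The cell that produced it is not shown here, so the following is only a minimal sketch of one way to get that shape of output, assuming z-score standardization for continuous columns and a 0/1 to -1/+1 mapping for the binary ones. The helper names `standardize` and `to_plus_minus_one` are hypothetical, and the sketch uses only the standard library so it runs without the notebook's environment.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def to_plus_minus_one(flags):
    """Map 0/1 indicator values to -1/+1 (assumed encoding, see lead-in)."""
    return [2 * f - 1 for f in flags]


# Elevation values from the five rows shown above.
elevation = [2596, 2590, 2804, 2785, 2595]
scaled_elevation = standardize(elevation)

# Illustrative indicator values, not taken from the dataset.
soil_flags = to_plus_minus_one([0, 0, 1, 0])
```

Because the mean and standard deviation here come from five rows only, `scaled_elevation` will not reproduce the numbers in the next table, which appear to be based on statistics from a much larger set of rows.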

|   | Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -0.367095 | -0.959980 | -1.597132 | 0.146639 | -0.834074 | -0.908681 | 0.271454 | 0.571653 | 0.281259 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 5 |
| 1 | 2 | -0.381461 | -0.914559 | -1.715424 | -0.072337 | -0.932054 | -0.999246 | 0.238732 | 0.703225 | 0.346627 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 5 |
| 2 | 3 | 0.130912 | -0.160577 | -0.887379 | 0.194243 | 0.227369 | 1.106379 | 0.696843 | 0.834797 | -0.002005 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 2 |
| 3 | 4 | 0.085421 | -0.015231 | 0.177250 | 0.070474 | 1.092853 | 1.038455 | 0.827731 | 0.834797 | -0.285268 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 2 |
| 4 | 5 | -0.369489 | -1.014485 | -1.715424 | -0.353198 | -0.850404 | -0.998491 | 0.238732 | 0.659368 | 0.324838 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 5 |

5 rows × 56 columns

|   | Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2902 | 2903 | 0.443033 | 1.089822 | -0.771255 | -1.080497 | -0.837000 | 0.090223 | -0.630140 | 1.009898 | 1.145001 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 5 |
| 10499 | 10500 | 1.792946 | 0.989782 | -0.533998 | 0.584584 | -0.033990 | -0.209066 | -0.762352 | 1.141218 | 1.298067 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | 7 |
| 14973 | 14974 | -0.616240 | -0.501711 | -0.296740 | -0.938182 | -0.837000 | -0.485624 | 0.989457 | 0.046887 | -0.691794 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 3 |
| 14364 | 14365 | 1.807358 | -0.056082 | -0.415369 | 0.484964 | 0.719856 | 1.676076 | 0.758086 | 0.878579 | -0.079529 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | 7 |
| 1421 | 1422 | 0.784114 | -1.111040 | -0.771255 | -0.444825 | -0.935328 | 0.387996 | 0.196185 | -0.040659 | -0.013929 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 |

5 rows × 56 columns

|   | Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type31 | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15121 | -0.165977 | 1.792510 | -0.295918 | -1.081532 | -0.834074 | 0.732045 | -0.546602 | -0.217778 | 0.455575 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 1 | 15122 | -0.158794 | -1.423270 | -0.414210 | -1.081532 | -0.834074 | 0.709404 | -0.382991 | -0.130064 | 0.368417 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 2 | 15123 | -0.086966 | -1.277924 | -0.177626 | -1.081532 | -0.834074 | 0.955438 | -0.219380 | -0.480922 | 0.041574 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 3 | 15124 | -0.096543 | -1.205251 | 0.058958 | -1.081532 | -0.834074 | 0.932797 | -0.153935 | -0.787923 | -0.219900 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 4 | 15125 | -0.103726 | -1.159831 | 0.295543 | -1.081532 | -0.834074 | 0.910156 | -0.088491 | -1.051067 | -0.437795 | ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |

5 rows × 55 columns

| Algorithm | 70/30 training fit | 70/30 test accuracy | training time | prediction time | full training fit | full test accuracy (Kaggle submission) |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.68 | 0.67 | 3.07 | 0.01 | 0.60 | 0.55999 |
| Decision Tree Depth 6 | 0.70 | 0.68 | 0.06 | 0.00 | 0.69 | 0.57956 |
| Random Forest Depth 10 | 0.99 | 0.82 | 0.23 | 0.04 | 0.99 | 0.71758 |
| Kernel SVM | 0.91 | 0.82 | 3.44 | 6.2 | 0.90 | 0.72143 |
| Kernel SVM on PCA reduced data | 0.90 | 0.82 | 2.27 | 3.26 |  |  |
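
For reference, the first four numeric columns of a table like this can be gathered with a small harness that splits the labeled data 70/30, times fitting and prediction, and scores both halves. The sketch below is not the code that produced the numbers above. The names `split_70_30`, `evaluate` and `MajorityClass` are hypothetical, and the majority-class baseline only stands in for a real classifier so the example runs with the standard library alone. Any object with `fit` and `predict` methods could take its place.

```python
import random
import time
from collections import Counter


def split_70_30(rows, labels, seed=0):
    """Shuffle indices reproducibly and split rows/labels 70/30."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(0.7 * len(indices))
    train_idx, test_idx = indices[:cut], indices[cut:]
    return (
        [rows[i] for i in train_idx],
        [rows[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


class MajorityClass:
    """Stand-in model: always predicts the most common training label."""

    def fit(self, rows, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, rows):
        return [self.label_ for _ in rows]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def evaluate(model, rows, labels):
    """Return training fit, test accuracy and wall-clock timings."""
    x_train, x_test, y_train, y_test = split_70_30(rows, labels)

    start = time.perf_counter()
    model.fit(x_train, y_train)
    training_time = time.perf_counter() - start

    start = time.perf_counter()
    test_predictions = model.predict(x_test)
    prediction_time = time.perf_counter() - start

    return {
        "training fit": accuracy(model.predict(x_train), y_train),
        "test accuracy": accuracy(test_predictions, y_test),
        "training time": training_time,
        "prediction time": prediction_time,
    }


# Tiny illustrative dataset, not taken from the competition data.
rows = [[i] for i in range(10)]
labels = [5, 5, 2, 2, 5, 1, 5, 2, 5, 5]
report = evaluate(MajorityClass(), rows, labels)
```

Keeping the timing around `fit` and `predict` separately is what makes a contrast like the kernel SVM's slow prediction step visible next to the other models.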