{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Forest Cover Type Prediction: take 2\n", "\n", "In my first exploration / submission to the [Forest Cover Type Prediction Kaggle Competition](https://www.kaggle.com/c/forest-cover-type-prediction), one of the [recommended starter projects](https://www.quora.com/What-Kaggle-competitions-should-a-beginner-start-with-1) I'm working through as part of my [ML Study Curriculum](http://karlrosaen.com/ml), I made some progress:\n", "- developed a preprocessing function to scale each feature\n", "- observed that non-linear models performed significantly better than linear models\n", "- observed that applying tree-based methods to scaled data performs the same as on unscaled data (so working with the scaled data all the time is fine)\n", "- saw that the first 25 (out of 55) components from PCA are sufficient to preserve 95% of the variance of the data, and it is worth training / predicting based on the reduced dataset with the SVM models, as they are quite slow\n", "- got up to 72% accuracy on submission to Kaggle using kernel SVM. A random forest performed almost as well and was *much* faster to train and predict\n", "\n", "In this notebook I'm going to see if I can improve performance by tuning the parameters of the two best-performing models.\n", "\n", "## Loading and preprocessing the data\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
|   | Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 2 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 3 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | 4 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 5 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
5 rows × 56 columns

| Model | Untuned Test Accuracy | Tuned Test Accuracy | Untuned Kaggle Score | Tuned Kaggle Score |
|---|---|---|---|---|
| Logistic Regression | 0.658 | 0.658 | 0.56 |  |
| Kernel SVM | 0.815 | 0.825 | 0.72143 | 0.73570 |
| Random Forest | 0.82 | 0.85 | 0.71758 | 0.74463 |
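
To make the effect of tuning easier to read off the table, here is a minimal sketch that computes the tuned-minus-untuned gain for the two models that have both Kaggle scores. The numbers are copied from the table; the `results` dictionary and the `tuning_gains` helper are names made up for this illustration and are not part of the original notebook. Logistic regression is left out because its tuned Kaggle score is blank.

```python
# Scores copied from the results table: (untuned, tuned) pairs per metric.
# The dictionary layout and helper name are illustrative assumptions.
results = {
    "Kernel SVM": {"test": (0.815, 0.825), "kaggle": (0.72143, 0.73570)},
    "Random Forest": {"test": (0.82, 0.85), "kaggle": (0.71758, 0.74463)},
}


def tuning_gains(scores):
    """Return the tuned-minus-untuned gain per metric, rounded to 5 places."""
    return {
        metric: round(tuned - untuned, 5)
        for metric, (untuned, tuned) in scores.items()
    }


gains = {model: tuning_gains(scores) for model, scores in results.items()}

for model, gain in gains.items():
    print(f"{model}: test +{gain['test']:.3f}, Kaggle +{gain['kaggle']:.5f}")
```

On these figures, tuning helped the random forest more than the kernel SVM on both the held-out test set and the Kaggle submission, which is why the random forest ends up with the best Kaggle score in the table.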