{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Intro to Random Forests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## About this course"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Teaching approach"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This course is being taught by Jeremy Howard, and was developed by Jeremy along with Rachel Thomas. Rachel has been dealing with a life-threatening illness so will not be teaching as originally planned this year.\n",
    "\n",
    "Jeremy has worked in a number of different areas - feel free to ask about anything that he might be able to help you with at any time, even if not directly related to the current topic:\n",
    "\n",
    "- Management consultant (McKinsey; AT Kearney)\n",
    "- Self-funded startup entrepreneur (Fastmail: first consumer synchronized email; Optimal Decisions: first optimized insurance pricing)\n",
    "- VC-funded startup entrepreneur: (Kaggle; Enlitic: first deep-learning medical company)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I'll be using a *top-down* teaching method, which is different from how most math courses operate.  Typically, in a *bottom-up* approach, you first learn all the separate components you will be using, and then you gradually build them up into more complex structures.  The problems with this are that students often lose motivation, don't have a sense of the \"big picture\", and don't know what they'll need.\n",
    "\n",
    "If you took the fast.ai deep learning course, that is what we used.  You can hear more about my teaching philosophy [in this blog post](http://www.fast.ai/2016/10/08/teaching-philosophy/) or [in this talk](https://vimeo.com/214233053).\n",
    "\n",
    "Harvard Professor David Perkins has a book, [Making Learning Whole](https://www.amazon.com/Making-Learning-Whole-Principles-Transform/dp/0470633719) in which he uses baseball as an analogy.  We don't require kids to memorize all the rules of baseball and understand all the technical details before we let them play the game.  Rather, they start playing with a just general sense of it, and then gradually learn more rules/details as time goes on.\n",
    "\n",
    "All that to say, don't worry if you don't understand everything at first!  You're not supposed to.  We will start using some \"black boxes\" such as random forests that haven't yet been explained in detail, and then we'll dig into the lower level details later.\n",
    "\n",
    "To start, focus on what things DO, not what they ARE."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Your practice"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "People learn by:\n",
    "1. **doing** (coding and building)\n",
    "2. **explaining** what they've learned (by writing or helping others)\n",
    "\n",
    "Therefore, we suggest that you practice these skills on Kaggle by:\n",
    "1. Entering competitions (*doing*)\n",
    "2. Creating Kaggle kernels (*explaining*)\n",
    "\n",
    "It's OK if you don't get good competition ranks or any kernel votes at first - that's totally normal! Just try to keep improving every day, and you'll see the results over time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get better at technical writing, study the top ranked Kaggle kernels from past competitions, and read posts from well-regarded technical bloggers. Some good role models include:\n",
    "\n",
    "- [Peter Norvig](http://nbviewer.jupyter.org/url/norvig.com/ipython/ProbabilityParadox.ipynb) (more [here](http://norvig.com/ipython/))\n",
    "- [Stephen Merity](https://smerity.com/articles/2017/deepcoder_and_ai_hype.html)\n",
    "- [Julia Evans](https://codewords.recurse.com/issues/five/why-do-neural-networks-think-a-panda-is-a-vulture) (more [here](https://jvns.ca/blog/2014/08/12/what-happens-if-you-write-a-tcp-stack-in-python/))\n",
    "- [Julia Ferraioli](http://blog.juliaferraioli.com/2016/02/exploring-world-using-vision-twilio.html)\n",
    "- [Edwin Chen](http://blog.echen.me/2014/10/07/moving-beyond-ctr-better-recommendations-through-human-evaluation/)\n",
    "- [Slav Ivanov](https://blog.slavv.com/picking-an-optimizer-for-style-transfer-86e7b8cba84b) (fast.ai student)\n",
    "- [Brad Kenstler](https://hackernoon.com/non-artistic-style-transfer-or-how-to-draw-kanye-using-captain-picards-face-c4a50256b814) (fast.ai and USF MSAN student)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Books"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The more familiarity you have with numeric programming in Python, the better. If you're looking to improve in this area, we strongly suggest Wes McKinney's [Python for Data Analysis, 2nd ed](https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/ref=asap_bc?ie=UTF8).\n",
    "\n",
    "For machine learning with Python, we recommend:\n",
    "\n",
    "- [Introduction to Machine Learning with Python](https://www.amazon.com/Introduction-Machine-Learning-Andreas-Mueller/dp/1449369413): From one of the scikit-learn authors, which is the main library we'll be using\n",
    "- [Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition](https://www.amazon.com/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1787125939/ref=dp_ob_title_bk): New version of a very successful book. A lot of the new material however covers deep learning in Tensorflow, which isn't relevant to this course\n",
    "- [Hands-On Machine Learning with Scikit-Learn and TensorFlow](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291/ref=pd_lpo_sbs_14_t_0?_encoding=UTF8&psc=1&refRID=MBV2QMFH3EZ6B3YBY40K)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Syllabus in brief"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Depending on time and class interests, we'll cover something like (not necessarily in this order):\n",
    "\n",
    "- Train vs test\n",
    "  - Effective validation set construction\n",
    "- Trees and ensembles\n",
    "  - Creating random forests\n",
    "  - Interpreting random forests\n",
    "- What is ML?  Why do we use it?\n",
    "  - What makes a good ML project?\n",
    "  - Structured vs unstructured data\n",
    "  - Examples of failures/mistakes\n",
    "- Feature engineering\n",
    "  - Domain specific - dates, URLs, text\n",
    "  - Embeddings / latent factors\n",
    "- Regularized models trained with SGD\n",
    "  - GLMs, Elasticnet, etc (NB: see what James covered)\n",
    "- Basic neural nets\n",
    "  - PyTorch\n",
    "  - Broadcasting, Matrix Multiplication\n",
    "  - Training loop, backpropagation\n",
    "- KNN\n",
    "- CV / bootstrap (Diabetes data set?)\n",
    "- Ethical considerations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Skip:\n",
    "\n",
    "- Dimensionality reduction\n",
    "- Interactions\n",
    "- Monitoring training\n",
    "- Collaborative filtering\n",
    "- Momentum and LR annealing\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/jhoward/anaconda3/lib/python3.6/site-packages/sklearn/ensemble/weight_boosting.py:29: DeprecationWarning: numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.\n",
      "  from numpy.core.umath_tests import inner1d\n"
     ]
    }
   ],
   "source": [
    "from fastai.imports import *\n",
    "from fastai.structured import *\n",
    "\n",
    "from pandas_summary import DataFrameSummary\n",
    "from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier\n",
    "from IPython.display import display\n",
    "\n",
    "from sklearn import metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PATH = \"data/bulldozers/\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data%20Dictionary.xlsx\t\t  Test.csv\t     Train.csv\r\n",
      "Machine_Appendix.csv\t\t  tmp\t\t     Valid.csv\r\n",
      "median_benchmark.csv\t\t  TrainAndValid.7z   ValidSolution.csv\r\n",
      "models\t\t\t\t  TrainAndValid.csv\r\n",
      "random_forest_benchmark_test.csv  TrainAndValid.zip\r\n"
     ]
    }
   ],
   "source": [
    "!ls {PATH}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to *Blue Book for Bulldozers*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## About..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ...our teaching"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At fast.ai we have a distinctive [teaching philosophy](http://www.fast.ai/2016/10/08/teaching-philosophy/) of [\"the whole game\"](https://www.amazon.com/Making-Learning-Whole-Principles-Transform/dp/0470633719/ref=sr_1_1?ie=UTF8&qid=1505094653).  This is different from how most traditional math & technical courses are taught, where you have to learn all the individual elements before you can combine them (Harvard professor David Perkins call this *elementitis*), but it is similar to how topics like *driving* and *baseball* are taught.  That is, you can start driving without [knowing how an internal combustion engine works](https://medium.com/towards-data-science/thoughts-after-taking-the-deeplearning-ai-courses-8568f132153), and children begin playing baseball before they learn all the formal rules."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ...our approach to machine learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most machine learning courses will throw at you dozens of different algorithms, with a brief technical description of the math behind them, and maybe a toy example. You're left confused by the enormous range of techniques shown and have little practical understanding of how to apply them.\n",
    "\n",
    "The good news is that modern machine learning can be distilled down to a couple of key techniques that are of very wide applicability. Recent studies have shown that the vast majority of datasets can be best modeled with just two methods:\n",
    "\n",
    "- *Ensembles of decision trees* (i.e. Random Forests and Gradient Boosting Machines), mainly for structured data (such as you might find in a database table at most companies)\n",
    "- *Multi-layered neural networks learnt with SGD* (i.e. shallow and/or deep learning), mainly for unstructured data (such as audio, vision, and natural language)\n",
    "\n",
    "In this course we'll be doing a deep dive into random forests, and simple models learnt with SGD. You'll be learning about gradient boosting and deep learning in part 2."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ...this dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will be looking at the Blue Book for Bulldozers Kaggle Competition: \"The goal of the contest is to predict the sale price of a particular piece of heavy equiment at auction based on it's usage, equipment type, and configuaration.  The data is sourced from auction result postings and includes information on usage and equipment configurations.\"\n",
    "\n",
    "This is a very common type of dataset and prediciton problem, and similar to what you may see in your project or workplace."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ...Kaggle Competitions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Kaggle is an awesome resource for aspiring data scientists or anyone looking to improve their machine learning skills.  There is nothing like being able to get hands-on practice and receiving real-time feedback to help you improve your skills.\n",
    "\n",
    "Kaggle provides:\n",
    "\n",
    "1. Interesting data sets\n",
    "2. Feedback on how you're doing\n",
    "3. A leader board to see what's good, what's possible, and what's state-of-art.\n",
    "4. Blog posts by winning contestants share useful tips and techniques."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Look at the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Kaggle provides info about some of the fields of our dataset; on the [Kaggle Data info](https://www.kaggle.com/c/bluebook-for-bulldozers/data) page they say the following:\n",
    "\n",
    "For this competition, you are predicting the sale price of bulldozers sold at auctions. The data for this competition is split into three parts:\n",
    "\n",
    "- **Train.csv** is the training set, which contains data through the end of 2011.\n",
    "- **Valid.csv** is the validation set, which contains data from January 1, 2012 - April 30, 2012 You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.\n",
    "- **Test.csv** is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.\n",
    "\n",
    "The key fields are in train.csv are:\n",
    "\n",
    "- SalesID: the uniue identifier of the sale\n",
    "- MachineID: the unique identifier of a machine.  A machine can be sold multiple times\n",
    "- saleprice: what the machine sold for at auction (only provided in train.csv)\n",
    "- saledate: the date of the sale"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Question*\n",
    "\n",
    "What stands out to you from the above description?  What needs to be true of our training and validation sets?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, \n",
    "                     parse_dates=[\"saledate\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In any sort of data science work, it's **important to look at your data**, to make sure you understand the format, how it's stored, what type of values it holds, etc. Even if you've read descriptions about your data, the actual data may not be what you expect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def display_all(df):\n",
    "    with pd.option_context(\"display.max_rows\", 1000, \"display.max_columns\", 1000): \n",
    "        display(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>401120</th>\n",
       "      <th>401121</th>\n",
       "      <th>401122</th>\n",
       "      <th>401123</th>\n",
       "      <th>401124</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>SalesID</th>\n",
       "      <td>6333336</td>\n",
       "      <td>6333337</td>\n",
       "      <td>6333338</td>\n",
       "      <td>6333341</td>\n",
       "      <td>6333342</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SalePrice</th>\n",
       "      <td>10500</td>\n",
       "      <td>11000</td>\n",
       "      <td>11500</td>\n",
       "      <td>9000</td>\n",
       "      <td>7750</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MachineID</th>\n",
       "      <td>1840702</td>\n",
       "      <td>1830472</td>\n",
       "      <td>1887659</td>\n",
       "      <td>1903570</td>\n",
       "      <td>1926965</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ModelID</th>\n",
       "      <td>21439</td>\n",
       "      <td>21439</td>\n",
       "      <td>21439</td>\n",
       "      <td>21435</td>\n",
       "      <td>21435</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>datasource</th>\n",
       "      <td>149</td>\n",
       "      <td>149</td>\n",
       "      <td>149</td>\n",
       "      <td>149</td>\n",
       "      <td>149</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>auctioneerID</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>YearMade</th>\n",
       "      <td>2005</td>\n",
       "      <td>2005</td>\n",
       "      <td>2005</td>\n",
       "      <td>2005</td>\n",
       "      <td>2005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MachineHoursCurrentMeter</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>UsageBand</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>saledate</th>\n",
       "      <td>2011-11-02 00:00:00</td>\n",
       "      <td>2011-11-02 00:00:00</td>\n",
       "      <td>2011-11-02 00:00:00</td>\n",
       "      <td>2011-10-25 00:00:00</td>\n",
       "      <td>2011-10-25 00:00:00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiModelDesc</th>\n",
       "      <td>35NX2</td>\n",
       "      <td>35NX2</td>\n",
       "      <td>35NX2</td>\n",
       "      <td>30NX</td>\n",
       "      <td>30NX</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiBaseModel</th>\n",
       "      <td>35</td>\n",
       "      <td>35</td>\n",
       "      <td>35</td>\n",
       "      <td>30</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiSecondaryDesc</th>\n",
       "      <td>NX</td>\n",
       "      <td>NX</td>\n",
       "      <td>NX</td>\n",
       "      <td>NX</td>\n",
       "      <td>NX</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiModelSeries</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiModelDescriptor</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ProductSize</th>\n",
       "      <td>Mini</td>\n",
       "      <td>Mini</td>\n",
       "      <td>Mini</td>\n",
       "      <td>Mini</td>\n",
       "      <td>Mini</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiProductClassDesc</th>\n",
       "      <td>Hydraulic Excavator, Track - 3.0 to 4.0 Metric...</td>\n",
       "      <td>Hydraulic Excavator, Track - 3.0 to 4.0 Metric...</td>\n",
       "      <td>Hydraulic Excavator, Track - 3.0 to 4.0 Metric...</td>\n",
       "      <td>Hydraulic Excavator, Track - 2.0 to 3.0 Metric...</td>\n",
       "      <td>Hydraulic Excavator, Track - 2.0 to 3.0 Metric...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>state</th>\n",
       "      <td>Maryland</td>\n",
       "      <td>Maryland</td>\n",
       "      <td>Maryland</td>\n",
       "      <td>Florida</td>\n",
       "      <td>Florida</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ProductGroup</th>\n",
       "      <td>TEX</td>\n",
       "      <td>TEX</td>\n",
       "      <td>TEX</td>\n",
       "      <td>TEX</td>\n",
       "      <td>TEX</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ProductGroupDesc</th>\n",
       "      <td>Track Excavators</td>\n",
       "      <td>Track Excavators</td>\n",
       "      <td>Track Excavators</td>\n",
       "      <td>Track Excavators</td>\n",
       "      <td>Track Excavators</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Drive_System</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Enclosure</th>\n",
       "      <td>EROPS</td>\n",
       "      <td>EROPS</td>\n",
       "      <td>EROPS</td>\n",
       "      <td>EROPS</td>\n",
       "      <td>EROPS</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Forks</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pad_Type</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Ride_Control</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Stick</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Transmission</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Turbocharged</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Blade_Extension</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Blade_Width</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Enclosure_Type</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Engine_Horsepower</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Hydraulics</th>\n",
       "      <td>Auxiliary</td>\n",
       "      <td>Standard</td>\n",
       "      <td>Auxiliary</td>\n",
       "      <td>Standard</td>\n",
       "      <td>Standard</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pushblock</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Ripper</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Scarifier</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tip_Control</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tire_Size</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Coupler</th>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Coupler_System</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Grouser_Tracks</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Hydraulics_Flow</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Track_Type</th>\n",
       "      <td>Steel</td>\n",
       "      <td>Steel</td>\n",
       "      <td>Steel</td>\n",
       "      <td>Steel</td>\n",
       "      <td>Steel</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Undercarriage_Pad_Width</th>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Stick_Length</th>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Thumb</th>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pattern_Changer</th>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>None or Unspecified</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Grouser_Type</th>\n",
       "      <td>Double</td>\n",
       "      <td>Double</td>\n",
       "      <td>Double</td>\n",
       "      <td>Double</td>\n",
       "      <td>Double</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Backhoe_Mounting</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Blade_Type</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Travel_Controls</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Differential_Type</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Steering_Controls</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                                     401120  \\\n",
       "SalesID                                                             6333336   \n",
       "SalePrice                                                             10500   \n",
       "MachineID                                                           1840702   \n",
       "ModelID                                                               21439   \n",
       "datasource                                                              149   \n",
       "auctioneerID                                                              1   \n",
       "YearMade                                                               2005   \n",
       "MachineHoursCurrentMeter                                                NaN   \n",
       "UsageBand                                                               NaN   \n",
       "saledate                                                2011-11-02 00:00:00   \n",
       "fiModelDesc                                                           35NX2   \n",
       "fiBaseModel                                                              35   \n",
       "fiSecondaryDesc                                                          NX   \n",
       "fiModelSeries                                                             2   \n",
       "fiModelDescriptor                                                       NaN   \n",
       "ProductSize                                                            Mini   \n",
       "fiProductClassDesc        Hydraulic Excavator, Track - 3.0 to 4.0 Metric...   \n",
       "state                                                              Maryland   \n",
       "ProductGroup                                                            TEX   \n",
       "ProductGroupDesc                                           Track Excavators   \n",
       "Drive_System                                                            NaN   \n",
       "Enclosure                                                             EROPS   \n",
       "Forks                                                                   NaN   \n",
       "Pad_Type                                                                NaN   \n",
       "Ride_Control                                                            NaN   \n",
       "Stick                                                                   NaN   \n",
       "Transmission                                                            NaN   \n",
       "Turbocharged                                                            NaN   \n",
       "Blade_Extension                                                         NaN   \n",
       "Blade_Width                                                             NaN   \n",
       "Enclosure_Type                                                          NaN   \n",
       "Engine_Horsepower                                                       NaN   \n",
       "Hydraulics                                                        Auxiliary   \n",
       "Pushblock                                                               NaN   \n",
       "Ripper                                                                  NaN   \n",
       "Scarifier                                                               NaN   \n",
       "Tip_Control                                                             NaN   \n",
       "Tire_Size                                                               NaN   \n",
       "Coupler                                                 None or Unspecified   \n",
       "Coupler_System                                                          NaN   \n",
       "Grouser_Tracks                                                          NaN   \n",
       "Hydraulics_Flow                                                         NaN   \n",
       "Track_Type                                                            Steel   \n",
       "Undercarriage_Pad_Width                                 None or Unspecified   \n",
       "Stick_Length                                            None or Unspecified   \n",
       "Thumb                                                   None or Unspecified   \n",
       "Pattern_Changer                                         None or Unspecified   \n",
       "Grouser_Type                                                         Double   \n",
       "Backhoe_Mounting                                                        NaN   \n",
       "Blade_Type                                                              NaN   \n",
       "Travel_Controls                                                         NaN   \n",
       "Differential_Type                                                       NaN   \n",
       "Steering_Controls                                                       NaN   \n",
       "\n",
       "                                                                     401121  \\\n",
       "SalesID                                                             6333337   \n",
       "SalePrice                                                             11000   \n",
       "MachineID                                                           1830472   \n",
       "ModelID                                                               21439   \n",
       "datasource                                                              149   \n",
       "auctioneerID                                                              1   \n",
       "YearMade                                                               2005   \n",
       "MachineHoursCurrentMeter                                                NaN   \n",
       "UsageBand                                                               NaN   \n",
       "saledate                                                2011-11-02 00:00:00   \n",
       "fiModelDesc                                                           35NX2   \n",
       "fiBaseModel                                                              35   \n",
       "fiSecondaryDesc                                                          NX   \n",
       "fiModelSeries                                                             2   \n",
       "fiModelDescriptor                                                       NaN   \n",
       "ProductSize                                                            Mini   \n",
       "fiProductClassDesc        Hydraulic Excavator, Track - 3.0 to 4.0 Metric...   \n",
       "state                                                              Maryland   \n",
       "ProductGroup                                                            TEX   \n",
       "ProductGroupDesc                                           Track Excavators   \n",
       "Drive_System                                                            NaN   \n",
       "Enclosure                                                             EROPS   \n",
       "Forks                                                                   NaN   \n",
       "Pad_Type                                                                NaN   \n",
       "Ride_Control                                                            NaN   \n",
       "Stick                                                                   NaN   \n",
       "Transmission                                                            NaN   \n",
       "Turbocharged                                                            NaN   \n",
       "Blade_Extension                                                         NaN   \n",
       "Blade_Width                                                             NaN   \n",
       "Enclosure_Type                                                          NaN   \n",
       "Engine_Horsepower                                                       NaN   \n",
       "Hydraulics                                                         Standard   \n",
       "Pushblock                                                               NaN   \n",
       "Ripper                                                                  NaN   \n",
       "Scarifier                                                               NaN   \n",
       "Tip_Control                                                             NaN   \n",
       "Tire_Size                                                               NaN   \n",
       "Coupler                                                 None or Unspecified   \n",
       "Coupler_System                                                          NaN   \n",
       "Grouser_Tracks                                                          NaN   \n",
       "Hydraulics_Flow                                                         NaN   \n",
       "Track_Type                                                            Steel   \n",
       "Undercarriage_Pad_Width                                 None or Unspecified   \n",
       "Stick_Length                                            None or Unspecified   \n",
       "Thumb                                                   None or Unspecified   \n",
       "Pattern_Changer                                         None or Unspecified   \n",
       "Grouser_Type                                                         Double   \n",
       "Backhoe_Mounting                                                        NaN   \n",
       "Blade_Type                                                              NaN   \n",
       "Travel_Controls                                                         NaN   \n",
       "Differential_Type                                                       NaN   \n",
       "Steering_Controls                                                       NaN   \n",
       "\n",
       "                                                                     401122  \\\n",
       "SalesID                                                             6333338   \n",
       "SalePrice                                                             11500   \n",
       "MachineID                                                           1887659   \n",
       "ModelID                                                               21439   \n",
       "datasource                                                              149   \n",
       "auctioneerID                                                              1   \n",
       "YearMade                                                               2005   \n",
       "MachineHoursCurrentMeter                                                NaN   \n",
       "UsageBand                                                               NaN   \n",
       "saledate                                                2011-11-02 00:00:00   \n",
       "fiModelDesc                                                           35NX2   \n",
       "fiBaseModel                                                              35   \n",
       "fiSecondaryDesc                                                          NX   \n",
       "fiModelSeries                                                             2   \n",
       "fiModelDescriptor                                                       NaN   \n",
       "ProductSize                                                            Mini   \n",
       "fiProductClassDesc        Hydraulic Excavator, Track - 3.0 to 4.0 Metric...   \n",
       "state                                                              Maryland   \n",
       "ProductGroup                                                            TEX   \n",
       "ProductGroupDesc                                           Track Excavators   \n",
       "Drive_System                                                            NaN   \n",
       "Enclosure                                                             EROPS   \n",
       "Forks                                                                   NaN   \n",
       "Pad_Type                                                                NaN   \n",
       "Ride_Control                                                            NaN   \n",
       "Stick                                                                   NaN   \n",
       "Transmission                                                            NaN   \n",
       "Turbocharged                                                            NaN   \n",
       "Blade_Extension                                                         NaN   \n",
       "Blade_Width                                                             NaN   \n",
       "Enclosure_Type                                                          NaN   \n",
       "Engine_Horsepower                                                       NaN   \n",
       "Hydraulics                                                        Auxiliary   \n",
       "Pushblock                                                               NaN   \n",
       "Ripper                                                                  NaN   \n",
       "Scarifier                                                               NaN   \n",
       "Tip_Control                                                             NaN   \n",
       "Tire_Size                                                               NaN   \n",
       "Coupler                                                 None or Unspecified   \n",
       "Coupler_System                                                          NaN   \n",
       "Grouser_Tracks                                                          NaN   \n",
       "Hydraulics_Flow                                                         NaN   \n",
       "Track_Type                                                            Steel   \n",
       "Undercarriage_Pad_Width                                 None or Unspecified   \n",
       "Stick_Length                                            None or Unspecified   \n",
       "Thumb                                                   None or Unspecified   \n",
       "Pattern_Changer                                         None or Unspecified   \n",
       "Grouser_Type                                                         Double   \n",
       "Backhoe_Mounting                                                        NaN   \n",
       "Blade_Type                                                              NaN   \n",
       "Travel_Controls                                                         NaN   \n",
       "Differential_Type                                                       NaN   \n",
       "Steering_Controls                                                       NaN   \n",
       "\n",
       "                                                                     401123  \\\n",
       "SalesID                                                             6333341   \n",
       "SalePrice                                                              9000   \n",
       "MachineID                                                           1903570   \n",
       "ModelID                                                               21435   \n",
       "datasource                                                              149   \n",
       "auctioneerID                                                              2   \n",
       "YearMade                                                               2005   \n",
       "MachineHoursCurrentMeter                                                NaN   \n",
       "UsageBand                                                               NaN   \n",
       "saledate                                                2011-10-25 00:00:00   \n",
       "fiModelDesc                                                            30NX   \n",
       "fiBaseModel                                                              30   \n",
       "fiSecondaryDesc                                                          NX   \n",
       "fiModelSeries                                                           NaN   \n",
       "fiModelDescriptor                                                       NaN   \n",
       "ProductSize                                                            Mini   \n",
       "fiProductClassDesc        Hydraulic Excavator, Track - 2.0 to 3.0 Metric...   \n",
       "state                                                               Florida   \n",
       "ProductGroup                                                            TEX   \n",
       "ProductGroupDesc                                           Track Excavators   \n",
       "Drive_System                                                            NaN   \n",
       "Enclosure                                                             EROPS   \n",
       "Forks                                                                   NaN   \n",
       "Pad_Type                                                                NaN   \n",
       "Ride_Control                                                            NaN   \n",
       "Stick                                                                   NaN   \n",
       "Transmission                                                            NaN   \n",
       "Turbocharged                                                            NaN   \n",
       "Blade_Extension                                                         NaN   \n",
       "Blade_Width                                                             NaN   \n",
       "Enclosure_Type                                                          NaN   \n",
       "Engine_Horsepower                                                       NaN   \n",
       "Hydraulics                                                         Standard   \n",
       "Pushblock                                                               NaN   \n",
       "Ripper                                                                  NaN   \n",
       "Scarifier                                                               NaN   \n",
       "Tip_Control                                                             NaN   \n",
       "Tire_Size                                                               NaN   \n",
       "Coupler                                                 None or Unspecified   \n",
       "Coupler_System                                                          NaN   \n",
       "Grouser_Tracks                                                          NaN   \n",
       "Hydraulics_Flow                                                         NaN   \n",
       "Track_Type                                                            Steel   \n",
       "Undercarriage_Pad_Width                                 None or Unspecified   \n",
       "Stick_Length                                            None or Unspecified   \n",
       "Thumb                                                   None or Unspecified   \n",
       "Pattern_Changer                                         None or Unspecified   \n",
       "Grouser_Type                                                         Double   \n",
       "Backhoe_Mounting                                                        NaN   \n",
       "Blade_Type                                                              NaN   \n",
       "Travel_Controls                                                         NaN   \n",
       "Differential_Type                                                       NaN   \n",
       "Steering_Controls                                                       NaN   \n",
       "\n",
       "                                                                     401124  \n",
       "SalesID                                                             6333342  \n",
       "SalePrice                                                              7750  \n",
       "MachineID                                                           1926965  \n",
       "ModelID                                                               21435  \n",
       "datasource                                                              149  \n",
       "auctioneerID                                                              2  \n",
       "YearMade                                                               2005  \n",
       "MachineHoursCurrentMeter                                                NaN  \n",
       "UsageBand                                                               NaN  \n",
       "saledate                                                2011-10-25 00:00:00  \n",
       "fiModelDesc                                                            30NX  \n",
       "fiBaseModel                                                              30  \n",
       "fiSecondaryDesc                                                          NX  \n",
       "fiModelSeries                                                           NaN  \n",
       "fiModelDescriptor                                                       NaN  \n",
       "ProductSize                                                            Mini  \n",
       "fiProductClassDesc        Hydraulic Excavator, Track - 2.0 to 3.0 Metric...  \n",
       "state                                                               Florida  \n",
       "ProductGroup                                                            TEX  \n",
       "ProductGroupDesc                                           Track Excavators  \n",
       "Drive_System                                                            NaN  \n",
       "Enclosure                                                             EROPS  \n",
       "Forks                                                                   NaN  \n",
       "Pad_Type                                                                NaN  \n",
       "Ride_Control                                                            NaN  \n",
       "Stick                                                                   NaN  \n",
       "Transmission                                                            NaN  \n",
       "Turbocharged                                                            NaN  \n",
       "Blade_Extension                                                         NaN  \n",
       "Blade_Width                                                             NaN  \n",
       "Enclosure_Type                                                          NaN  \n",
       "Engine_Horsepower                                                       NaN  \n",
       "Hydraulics                                                         Standard  \n",
       "Pushblock                                                               NaN  \n",
       "Ripper                                                                  NaN  \n",
       "Scarifier                                                               NaN  \n",
       "Tip_Control                                                             NaN  \n",
       "Tire_Size                                                               NaN  \n",
       "Coupler                                                 None or Unspecified  \n",
       "Coupler_System                                                          NaN  \n",
       "Grouser_Tracks                                                          NaN  \n",
       "Hydraulics_Flow                                                         NaN  \n",
       "Track_Type                                                            Steel  \n",
       "Undercarriage_Pad_Width                                 None or Unspecified  \n",
       "Stick_Length                                            None or Unspecified  \n",
       "Thumb                                                   None or Unspecified  \n",
       "Pattern_Changer                                         None or Unspecified  \n",
       "Grouser_Type                                                         Double  \n",
       "Backhoe_Mounting                                                        NaN  \n",
       "Blade_Type                                                              NaN  \n",
       "Travel_Controls                                                         NaN  \n",
       "Differential_Type                                                       NaN  \n",
       "Steering_Controls                                                       NaN  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display_all(df_raw.tail().T)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>unique</th>\n",
       "      <th>top</th>\n",
       "      <th>freq</th>\n",
       "      <th>first</th>\n",
       "      <th>last</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>min</th>\n",
       "      <th>25%</th>\n",
       "      <th>50%</th>\n",
       "      <th>75%</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>SalesID</th>\n",
       "      <td>401125</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.91971e+06</td>\n",
       "      <td>909021</td>\n",
       "      <td>1.13925e+06</td>\n",
       "      <td>1.41837e+06</td>\n",
       "      <td>1.63942e+06</td>\n",
       "      <td>2.24271e+06</td>\n",
       "      <td>6.33334e+06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>SalePrice</th>\n",
       "      <td>401125</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>31099.7</td>\n",
       "      <td>23036.9</td>\n",
       "      <td>4750</td>\n",
       "      <td>14500</td>\n",
       "      <td>24000</td>\n",
       "      <td>40000</td>\n",
       "      <td>142000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MachineID</th>\n",
       "      <td>401125</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.2179e+06</td>\n",
       "      <td>440992</td>\n",
       "      <td>0</td>\n",
       "      <td>1.0887e+06</td>\n",
       "      <td>1.27949e+06</td>\n",
       "      <td>1.46807e+06</td>\n",
       "      <td>2.48633e+06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ModelID</th>\n",
       "      <td>401125</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6889.7</td>\n",
       "      <td>6221.78</td>\n",
       "      <td>28</td>\n",
       "      <td>3259</td>\n",
       "      <td>4604</td>\n",
       "      <td>8724</td>\n",
       "      <td>37198</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>datasource</th>\n",
       "      <td>401125</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>134.666</td>\n",
       "      <td>8.96224</td>\n",
       "      <td>121</td>\n",
       "      <td>132</td>\n",
       "      <td>132</td>\n",
       "      <td>136</td>\n",
       "      <td>172</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>auctioneerID</th>\n",
       "      <td>380989</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6.55604</td>\n",
       "      <td>16.9768</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>YearMade</th>\n",
       "      <td>401125</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1899.16</td>\n",
       "      <td>291.797</td>\n",
       "      <td>1000</td>\n",
       "      <td>1985</td>\n",
       "      <td>1995</td>\n",
       "      <td>2000</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MachineHoursCurrentMeter</th>\n",
       "      <td>142765</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3457.96</td>\n",
       "      <td>27590.3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>3025</td>\n",
       "      <td>2.4833e+06</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>UsageBand</th>\n",
       "      <td>69639</td>\n",
       "      <td>3</td>\n",
       "      <td>Medium</td>\n",
       "      <td>33985</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>saledate</th>\n",
       "      <td>401125</td>\n",
       "      <td>3919</td>\n",
       "      <td>2009-02-16 00:00:00</td>\n",
       "      <td>1932</td>\n",
       "      <td>1989-01-17 00:00:00</td>\n",
       "      <td>2011-12-30 00:00:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiModelDesc</th>\n",
       "      <td>401125</td>\n",
       "      <td>4999</td>\n",
       "      <td>310G</td>\n",
       "      <td>5039</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiBaseModel</th>\n",
       "      <td>401125</td>\n",
       "      <td>1950</td>\n",
       "      <td>580</td>\n",
       "      <td>19798</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiSecondaryDesc</th>\n",
       "      <td>263934</td>\n",
       "      <td>175</td>\n",
       "      <td>C</td>\n",
       "      <td>43235</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiModelSeries</th>\n",
       "      <td>56908</td>\n",
       "      <td>122</td>\n",
       "      <td>II</td>\n",
       "      <td>13202</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiModelDescriptor</th>\n",
       "      <td>71919</td>\n",
       "      <td>139</td>\n",
       "      <td>L</td>\n",
       "      <td>15875</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ProductSize</th>\n",
       "      <td>190350</td>\n",
       "      <td>6</td>\n",
       "      <td>Medium</td>\n",
       "      <td>62274</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiProductClassDesc</th>\n",
       "      <td>401125</td>\n",
       "      <td>74</td>\n",
       "      <td>Backhoe Loader - 14.0 to 15.0 Ft Standard Digg...</td>\n",
       "      <td>56166</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>state</th>\n",
       "      <td>401125</td>\n",
       "      <td>53</td>\n",
       "      <td>Florida</td>\n",
       "      <td>63944</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ProductGroup</th>\n",
       "      <td>401125</td>\n",
       "      <td>6</td>\n",
       "      <td>TEX</td>\n",
       "      <td>101167</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ProductGroupDesc</th>\n",
       "      <td>401125</td>\n",
       "      <td>6</td>\n",
       "      <td>Track Excavators</td>\n",
       "      <td>101167</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Drive_System</th>\n",
       "      <td>104361</td>\n",
       "      <td>4</td>\n",
       "      <td>Two Wheel Drive</td>\n",
       "      <td>46139</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Enclosure</th>\n",
       "      <td>400800</td>\n",
       "      <td>6</td>\n",
       "      <td>OROPS</td>\n",
       "      <td>173932</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Forks</th>\n",
       "      <td>192077</td>\n",
       "      <td>2</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>178300</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pad_Type</th>\n",
       "      <td>79134</td>\n",
       "      <td>4</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>70614</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Ride_Control</th>\n",
       "      <td>148606</td>\n",
       "      <td>3</td>\n",
       "      <td>No</td>\n",
       "      <td>77685</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Stick</th>\n",
       "      <td>79134</td>\n",
       "      <td>2</td>\n",
       "      <td>Standard</td>\n",
       "      <td>48829</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Transmission</th>\n",
       "      <td>183230</td>\n",
       "      <td>8</td>\n",
       "      <td>Standard</td>\n",
       "      <td>140328</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Turbocharged</th>\n",
       "      <td>79134</td>\n",
       "      <td>2</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>75211</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Blade_Extension</th>\n",
       "      <td>25219</td>\n",
       "      <td>2</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>24692</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Blade_Width</th>\n",
       "      <td>25219</td>\n",
       "      <td>6</td>\n",
       "      <td>14'</td>\n",
       "      <td>9615</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Enclosure_Type</th>\n",
       "      <td>25219</td>\n",
       "      <td>3</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>21923</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Engine_Horsepower</th>\n",
       "      <td>25219</td>\n",
       "      <td>2</td>\n",
       "      <td>No</td>\n",
       "      <td>23937</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Hydraulics</th>\n",
       "      <td>320570</td>\n",
       "      <td>12</td>\n",
       "      <td>2 Valve</td>\n",
       "      <td>141404</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pushblock</th>\n",
       "      <td>25219</td>\n",
       "      <td>2</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>19463</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Ripper</th>\n",
       "      <td>104137</td>\n",
       "      <td>4</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>83452</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Scarifier</th>\n",
       "      <td>25230</td>\n",
       "      <td>2</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>12719</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tip_Control</th>\n",
       "      <td>25219</td>\n",
       "      <td>3</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>16207</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tire_Size</th>\n",
       "      <td>94718</td>\n",
       "      <td>17</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>46339</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Coupler</th>\n",
       "      <td>213952</td>\n",
       "      <td>3</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>184582</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Coupler_System</th>\n",
       "      <td>43458</td>\n",
       "      <td>2</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>40430</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Grouser_Tracks</th>\n",
       "      <td>43362</td>\n",
       "      <td>2</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>40515</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Hydraulics_Flow</th>\n",
       "      <td>43362</td>\n",
       "      <td>3</td>\n",
       "      <td>Standard</td>\n",
       "      <td>42784</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Track_Type</th>\n",
       "      <td>99153</td>\n",
       "      <td>2</td>\n",
       "      <td>Steel</td>\n",
       "      <td>84880</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Undercarriage_Pad_Width</th>\n",
       "      <td>99872</td>\n",
       "      <td>19</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>79651</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Stick_Length</th>\n",
       "      <td>99218</td>\n",
       "      <td>29</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>78820</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Thumb</th>\n",
       "      <td>99288</td>\n",
       "      <td>3</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>83093</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pattern_Changer</th>\n",
       "      <td>99218</td>\n",
       "      <td>3</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>90255</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Grouser_Type</th>\n",
       "      <td>99153</td>\n",
       "      <td>3</td>\n",
       "      <td>Double</td>\n",
       "      <td>84653</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Backhoe_Mounting</th>\n",
       "      <td>78672</td>\n",
       "      <td>2</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>78652</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Blade_Type</th>\n",
       "      <td>79833</td>\n",
       "      <td>10</td>\n",
       "      <td>PAT</td>\n",
       "      <td>38612</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Travel_Controls</th>\n",
       "      <td>79834</td>\n",
       "      <td>7</td>\n",
       "      <td>None or Unspecified</td>\n",
       "      <td>69923</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Differential_Type</th>\n",
       "      <td>69411</td>\n",
       "      <td>4</td>\n",
       "      <td>Standard</td>\n",
       "      <td>68073</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Steering_Controls</th>\n",
       "      <td>69369</td>\n",
       "      <td>5</td>\n",
       "      <td>Conventional</td>\n",
       "      <td>68679</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           count unique  \\\n",
       "SalesID                   401125    NaN   \n",
       "SalePrice                 401125    NaN   \n",
       "MachineID                 401125    NaN   \n",
       "ModelID                   401125    NaN   \n",
       "datasource                401125    NaN   \n",
       "auctioneerID              380989    NaN   \n",
       "YearMade                  401125    NaN   \n",
       "MachineHoursCurrentMeter  142765    NaN   \n",
       "UsageBand                  69639      3   \n",
       "saledate                  401125   3919   \n",
       "fiModelDesc               401125   4999   \n",
       "fiBaseModel               401125   1950   \n",
       "fiSecondaryDesc           263934    175   \n",
       "fiModelSeries              56908    122   \n",
       "fiModelDescriptor          71919    139   \n",
       "ProductSize               190350      6   \n",
       "fiProductClassDesc        401125     74   \n",
       "state                     401125     53   \n",
       "ProductGroup              401125      6   \n",
       "ProductGroupDesc          401125      6   \n",
       "Drive_System              104361      4   \n",
       "Enclosure                 400800      6   \n",
       "Forks                     192077      2   \n",
       "Pad_Type                   79134      4   \n",
       "Ride_Control              148606      3   \n",
       "Stick                      79134      2   \n",
       "Transmission              183230      8   \n",
       "Turbocharged               79134      2   \n",
       "Blade_Extension            25219      2   \n",
       "Blade_Width                25219      6   \n",
       "Enclosure_Type             25219      3   \n",
       "Engine_Horsepower          25219      2   \n",
       "Hydraulics                320570     12   \n",
       "Pushblock                  25219      2   \n",
       "Ripper                    104137      4   \n",
       "Scarifier                  25230      2   \n",
       "Tip_Control                25219      3   \n",
       "Tire_Size                  94718     17   \n",
       "Coupler                   213952      3   \n",
       "Coupler_System             43458      2   \n",
       "Grouser_Tracks             43362      2   \n",
       "Hydraulics_Flow            43362      3   \n",
       "Track_Type                 99153      2   \n",
       "Undercarriage_Pad_Width    99872     19   \n",
       "Stick_Length               99218     29   \n",
       "Thumb                      99288      3   \n",
       "Pattern_Changer            99218      3   \n",
       "Grouser_Type               99153      3   \n",
       "Backhoe_Mounting           78672      2   \n",
       "Blade_Type                 79833     10   \n",
       "Travel_Controls            79834      7   \n",
       "Differential_Type          69411      4   \n",
       "Steering_Controls          69369      5   \n",
       "\n",
       "                                                                        top  \\\n",
       "SalesID                                                                 NaN   \n",
       "SalePrice                                                               NaN   \n",
       "MachineID                                                               NaN   \n",
       "ModelID                                                                 NaN   \n",
       "datasource                                                              NaN   \n",
       "auctioneerID                                                            NaN   \n",
       "YearMade                                                                NaN   \n",
       "MachineHoursCurrentMeter                                                NaN   \n",
       "UsageBand                                                            Medium   \n",
       "saledate                                                2009-02-16 00:00:00   \n",
       "fiModelDesc                                                            310G   \n",
       "fiBaseModel                                                             580   \n",
       "fiSecondaryDesc                                                           C   \n",
       "fiModelSeries                                                            II   \n",
       "fiModelDescriptor                                                         L   \n",
       "ProductSize                                                          Medium   \n",
       "fiProductClassDesc        Backhoe Loader - 14.0 to 15.0 Ft Standard Digg...   \n",
       "state                                                               Florida   \n",
       "ProductGroup                                                            TEX   \n",
       "ProductGroupDesc                                           Track Excavators   \n",
       "Drive_System                                                Two Wheel Drive   \n",
       "Enclosure                                                             OROPS   \n",
       "Forks                                                   None or Unspecified   \n",
       "Pad_Type                                                None or Unspecified   \n",
       "Ride_Control                                                             No   \n",
       "Stick                                                              Standard   \n",
       "Transmission                                                       Standard   \n",
       "Turbocharged                                            None or Unspecified   \n",
       "Blade_Extension                                         None or Unspecified   \n",
       "Blade_Width                                                             14'   \n",
       "Enclosure_Type                                          None or Unspecified   \n",
       "Engine_Horsepower                                                        No   \n",
       "Hydraulics                                                          2 Valve   \n",
       "Pushblock                                               None or Unspecified   \n",
       "Ripper                                                  None or Unspecified   \n",
       "Scarifier                                               None or Unspecified   \n",
       "Tip_Control                                             None or Unspecified   \n",
       "Tire_Size                                               None or Unspecified   \n",
       "Coupler                                                 None or Unspecified   \n",
       "Coupler_System                                          None or Unspecified   \n",
       "Grouser_Tracks                                          None or Unspecified   \n",
       "Hydraulics_Flow                                                    Standard   \n",
       "Track_Type                                                            Steel   \n",
       "Undercarriage_Pad_Width                                 None or Unspecified   \n",
       "Stick_Length                                            None or Unspecified   \n",
       "Thumb                                                   None or Unspecified   \n",
       "Pattern_Changer                                         None or Unspecified   \n",
       "Grouser_Type                                                         Double   \n",
       "Backhoe_Mounting                                        None or Unspecified   \n",
       "Blade_Type                                                              PAT   \n",
       "Travel_Controls                                         None or Unspecified   \n",
       "Differential_Type                                                  Standard   \n",
       "Steering_Controls                                              Conventional   \n",
       "\n",
       "                            freq                first                 last  \\\n",
       "SalesID                      NaN                  NaN                  NaN   \n",
       "SalePrice                    NaN                  NaN                  NaN   \n",
       "MachineID                    NaN                  NaN                  NaN   \n",
       "ModelID                      NaN                  NaN                  NaN   \n",
       "datasource                   NaN                  NaN                  NaN   \n",
       "auctioneerID                 NaN                  NaN                  NaN   \n",
       "YearMade                     NaN                  NaN                  NaN   \n",
       "MachineHoursCurrentMeter     NaN                  NaN                  NaN   \n",
       "UsageBand                  33985                  NaN                  NaN   \n",
       "saledate                    1932  1989-01-17 00:00:00  2011-12-30 00:00:00   \n",
       "fiModelDesc                 5039                  NaN                  NaN   \n",
       "fiBaseModel                19798                  NaN                  NaN   \n",
       "fiSecondaryDesc            43235                  NaN                  NaN   \n",
       "fiModelSeries              13202                  NaN                  NaN   \n",
       "fiModelDescriptor          15875                  NaN                  NaN   \n",
       "ProductSize                62274                  NaN                  NaN   \n",
       "fiProductClassDesc         56166                  NaN                  NaN   \n",
       "state                      63944                  NaN                  NaN   \n",
       "ProductGroup              101167                  NaN                  NaN   \n",
       "ProductGroupDesc          101167                  NaN                  NaN   \n",
       "Drive_System               46139                  NaN                  NaN   \n",
       "Enclosure                 173932                  NaN                  NaN   \n",
       "Forks                     178300                  NaN                  NaN   \n",
       "Pad_Type                   70614                  NaN                  NaN   \n",
       "Ride_Control               77685                  NaN                  NaN   \n",
       "Stick                      48829                  NaN                  NaN   \n",
       "Transmission              140328                  NaN                  NaN   \n",
       "Turbocharged               75211                  NaN                  NaN   \n",
       "Blade_Extension            24692                  NaN                  NaN   \n",
       "Blade_Width                 9615                  NaN                  NaN   \n",
       "Enclosure_Type             21923                  NaN                  NaN   \n",
       "Engine_Horsepower          23937                  NaN                  NaN   \n",
       "Hydraulics                141404                  NaN                  NaN   \n",
       "Pushblock                  19463                  NaN                  NaN   \n",
       "Ripper                     83452                  NaN                  NaN   \n",
       "Scarifier                  12719                  NaN                  NaN   \n",
       "Tip_Control                16207                  NaN                  NaN   \n",
       "Tire_Size                  46339                  NaN                  NaN   \n",
       "Coupler                   184582                  NaN                  NaN   \n",
       "Coupler_System             40430                  NaN                  NaN   \n",
       "Grouser_Tracks             40515                  NaN                  NaN   \n",
       "Hydraulics_Flow            42784                  NaN                  NaN   \n",
       "Track_Type                 84880                  NaN                  NaN   \n",
       "Undercarriage_Pad_Width    79651                  NaN                  NaN   \n",
       "Stick_Length               78820                  NaN                  NaN   \n",
       "Thumb                      83093                  NaN                  NaN   \n",
       "Pattern_Changer            90255                  NaN                  NaN   \n",
       "Grouser_Type               84653                  NaN                  NaN   \n",
       "Backhoe_Mounting           78652                  NaN                  NaN   \n",
       "Blade_Type                 38612                  NaN                  NaN   \n",
       "Travel_Controls            69923                  NaN                  NaN   \n",
       "Differential_Type          68073                  NaN                  NaN   \n",
       "Steering_Controls          68679                  NaN                  NaN   \n",
       "\n",
       "                                 mean      std          min          25%  \\\n",
       "SalesID                   1.91971e+06   909021  1.13925e+06  1.41837e+06   \n",
       "SalePrice                     31099.7  23036.9         4750        14500   \n",
       "MachineID                  1.2179e+06   440992            0   1.0887e+06   \n",
       "ModelID                        6889.7  6221.78           28         3259   \n",
       "datasource                    134.666  8.96224          121          132   \n",
       "auctioneerID                  6.55604  16.9768            0            1   \n",
       "YearMade                      1899.16  291.797         1000         1985   \n",
       "MachineHoursCurrentMeter      3457.96  27590.3            0            0   \n",
       "UsageBand                         NaN      NaN          NaN          NaN   \n",
       "saledate                          NaN      NaN          NaN          NaN   \n",
       "fiModelDesc                       NaN      NaN          NaN          NaN   \n",
       "fiBaseModel                       NaN      NaN          NaN          NaN   \n",
       "fiSecondaryDesc                   NaN      NaN          NaN          NaN   \n",
       "fiModelSeries                     NaN      NaN          NaN          NaN   \n",
       "fiModelDescriptor                 NaN      NaN          NaN          NaN   \n",
       "ProductSize                       NaN      NaN          NaN          NaN   \n",
       "fiProductClassDesc                NaN      NaN          NaN          NaN   \n",
       "state                             NaN      NaN          NaN          NaN   \n",
       "ProductGroup                      NaN      NaN          NaN          NaN   \n",
       "ProductGroupDesc                  NaN      NaN          NaN          NaN   \n",
       "Drive_System                      NaN      NaN          NaN          NaN   \n",
       "Enclosure                         NaN      NaN          NaN          NaN   \n",
       "Forks                             NaN      NaN          NaN          NaN   \n",
       "Pad_Type                          NaN      NaN          NaN          NaN   \n",
       "Ride_Control                      NaN      NaN          NaN          NaN   \n",
       "Stick                             NaN      NaN          NaN          NaN   \n",
       "Transmission                      NaN      NaN          NaN          NaN   \n",
       "Turbocharged                      NaN      NaN          NaN          NaN   \n",
       "Blade_Extension                   NaN      NaN          NaN          NaN   \n",
       "Blade_Width                       NaN      NaN          NaN          NaN   \n",
       "Enclosure_Type                    NaN      NaN          NaN          NaN   \n",
       "Engine_Horsepower                 NaN      NaN          NaN          NaN   \n",
       "Hydraulics                        NaN      NaN          NaN          NaN   \n",
       "Pushblock                         NaN      NaN          NaN          NaN   \n",
       "Ripper                            NaN      NaN          NaN          NaN   \n",
       "Scarifier                         NaN      NaN          NaN          NaN   \n",
       "Tip_Control                       NaN      NaN          NaN          NaN   \n",
       "Tire_Size                         NaN      NaN          NaN          NaN   \n",
       "Coupler                           NaN      NaN          NaN          NaN   \n",
       "Coupler_System                    NaN      NaN          NaN          NaN   \n",
       "Grouser_Tracks                    NaN      NaN          NaN          NaN   \n",
       "Hydraulics_Flow                   NaN      NaN          NaN          NaN   \n",
       "Track_Type                        NaN      NaN          NaN          NaN   \n",
       "Undercarriage_Pad_Width           NaN      NaN          NaN          NaN   \n",
       "Stick_Length                      NaN      NaN          NaN          NaN   \n",
       "Thumb                             NaN      NaN          NaN          NaN   \n",
       "Pattern_Changer                   NaN      NaN          NaN          NaN   \n",
       "Grouser_Type                      NaN      NaN          NaN          NaN   \n",
       "Backhoe_Mounting                  NaN      NaN          NaN          NaN   \n",
       "Blade_Type                        NaN      NaN          NaN          NaN   \n",
       "Travel_Controls                   NaN      NaN          NaN          NaN   \n",
       "Differential_Type                 NaN      NaN          NaN          NaN   \n",
       "Steering_Controls                 NaN      NaN          NaN          NaN   \n",
       "\n",
       "                                  50%          75%          max  \n",
       "SalesID                   1.63942e+06  2.24271e+06  6.33334e+06  \n",
       "SalePrice                       24000        40000       142000  \n",
       "MachineID                 1.27949e+06  1.46807e+06  2.48633e+06  \n",
       "ModelID                          4604         8724        37198  \n",
       "datasource                        132          136          172  \n",
       "auctioneerID                        2            4           99  \n",
       "YearMade                         1995         2000         2013  \n",
       "MachineHoursCurrentMeter            0         3025   2.4833e+06  \n",
       "UsageBand                         NaN          NaN          NaN  \n",
       "saledate                          NaN          NaN          NaN  \n",
       "fiModelDesc                       NaN          NaN          NaN  \n",
       "fiBaseModel                       NaN          NaN          NaN  \n",
       "fiSecondaryDesc                   NaN          NaN          NaN  \n",
       "fiModelSeries                     NaN          NaN          NaN  \n",
       "fiModelDescriptor                 NaN          NaN          NaN  \n",
       "ProductSize                       NaN          NaN          NaN  \n",
       "fiProductClassDesc                NaN          NaN          NaN  \n",
       "state                             NaN          NaN          NaN  \n",
       "ProductGroup                      NaN          NaN          NaN  \n",
       "ProductGroupDesc                  NaN          NaN          NaN  \n",
       "Drive_System                      NaN          NaN          NaN  \n",
       "Enclosure                         NaN          NaN          NaN  \n",
       "Forks                             NaN          NaN          NaN  \n",
       "Pad_Type                          NaN          NaN          NaN  \n",
       "Ride_Control                      NaN          NaN          NaN  \n",
       "Stick                             NaN          NaN          NaN  \n",
       "Transmission                      NaN          NaN          NaN  \n",
       "Turbocharged                      NaN          NaN          NaN  \n",
       "Blade_Extension                   NaN          NaN          NaN  \n",
       "Blade_Width                       NaN          NaN          NaN  \n",
       "Enclosure_Type                    NaN          NaN          NaN  \n",
       "Engine_Horsepower                 NaN          NaN          NaN  \n",
       "Hydraulics                        NaN          NaN          NaN  \n",
       "Pushblock                         NaN          NaN          NaN  \n",
       "Ripper                            NaN          NaN          NaN  \n",
       "Scarifier                         NaN          NaN          NaN  \n",
       "Tip_Control                       NaN          NaN          NaN  \n",
       "Tire_Size                         NaN          NaN          NaN  \n",
       "Coupler                           NaN          NaN          NaN  \n",
       "Coupler_System                    NaN          NaN          NaN  \n",
       "Grouser_Tracks                    NaN          NaN          NaN  \n",
       "Hydraulics_Flow                   NaN          NaN          NaN  \n",
       "Track_Type                        NaN          NaN          NaN  \n",
       "Undercarriage_Pad_Width           NaN          NaN          NaN  \n",
       "Stick_Length                      NaN          NaN          NaN  \n",
       "Thumb                             NaN          NaN          NaN  \n",
       "Pattern_Changer                   NaN          NaN          NaN  \n",
       "Grouser_Type                      NaN          NaN          NaN  \n",
       "Backhoe_Mounting                  NaN          NaN          NaN  \n",
       "Blade_Type                        NaN          NaN          NaN  \n",
       "Travel_Controls                   NaN          NaN          NaN  \n",
       "Differential_Type                 NaN          NaN          NaN  \n",
       "Steering_Controls                 NaN          NaN          NaN  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display_all(df_raw.describe(include='all').T)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's important to note what metric is being used for a project. Generally, selecting the metric(s) is an important part of the project setup. However, in this case Kaggle tells us what metric to use: RMSLE (root mean squared log error) between the actual and predicted auction prices. Therefore we take the log of the prices, so that RMSE will give us what we need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_raw.SalePrice = np.log(df_raw.SalePrice)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initial processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "ename": "ValueError",
     "evalue": "could not convert string to float: 'Conventional'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-10-6e70335c9573>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0mm\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mRandomForestRegressor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn_jobs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0;31m# The following code is supposed to fail due to string values in the input data\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf_raw\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdrop\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'SalePrice'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdf_raw\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSalePrice\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/sklearn/ensemble/forest.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m    245\u001b[0m         \"\"\"\n\u001b[1;32m    246\u001b[0m         \u001b[0;31m# Validate or convert input data\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 247\u001b[0;31m         \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"csc\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mDTYPE\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    248\u001b[0m         \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'csc'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mensure_2d\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    249\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0msample_weight\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_array\u001b[0;34m(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)\u001b[0m\n\u001b[1;32m    431\u001b[0m                                       force_all_finite)\n\u001b[1;32m    432\u001b[0m     \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 433\u001b[0;31m         \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcopy\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    434\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    435\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mensure_2d\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mValueError\u001b[0m: could not convert string to float: 'Conventional'"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_jobs=-1)\n",
    "# The following code is supposed to fail due to string values in the input data\n",
    "m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This dataset contains a mix of **continuous** and **categorical** variables.\n",
    "\n",
    "The following method extracts particular date fields from a complete datetime for the purpose of constructing categoricals.  You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    2006\n",
       "1    2004\n",
       "2    2004\n",
       "3    2011\n",
       "4    2009\n",
       "Name: saleYear, dtype: int64"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "add_datepart(df_raw, 'saledate')\n",
    "df_raw.saleYear.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The categorical variables are currently stored as strings, which is inefficient, and doesn't provide the numeric coding required for a random forest. Therefore we call `train_cats` to convert strings to pandas categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_cats(df_raw)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can specify the order to use for categorical variables if we wish:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['High', 'Low', 'Medium'], dtype='object')"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_raw.UsageBand.cat.categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Normally, pandas will continue displaying the text categories, while treating them as numerical data internally. Optionally, we can replace the text categories with numbers, which will make this variable non-categorical, like so:."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_raw.UsageBand = df_raw.UsageBand.cat.codes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're still not quite done - for instance we have lots of missing values, which we can't pass directly to a random forest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Backhoe_Mounting            0.803872\n",
       "Blade_Extension             0.937129\n",
       "Blade_Type                  0.800977\n",
       "Blade_Width                 0.937129\n",
       "Coupler                     0.466620\n",
       "Coupler_System              0.891660\n",
       "Differential_Type           0.826959\n",
       "Drive_System                0.739829\n",
       "Enclosure                   0.000810\n",
       "Enclosure_Type              0.937129\n",
       "Engine_Horsepower           0.937129\n",
       "Forks                       0.521154\n",
       "Grouser_Tracks              0.891899\n",
       "Grouser_Type                0.752813\n",
       "Hydraulics                  0.200823\n",
       "Hydraulics_Flow             0.891899\n",
       "MachineHoursCurrentMeter    0.644089\n",
       "MachineID                   0.000000\n",
       "ModelID                     0.000000\n",
       "Pad_Type                    0.802720\n",
       "Pattern_Changer             0.752651\n",
       "ProductGroup                0.000000\n",
       "ProductGroupDesc            0.000000\n",
       "ProductSize                 0.525460\n",
       "Pushblock                   0.937129\n",
       "Ride_Control                0.629527\n",
       "Ripper                      0.740388\n",
       "SalePrice                   0.000000\n",
       "SalesID                     0.000000\n",
       "Scarifier                   0.937102\n",
       "Steering_Controls           0.827064\n",
       "Stick                       0.802720\n",
       "Stick_Length                0.752651\n",
       "Thumb                       0.752476\n",
       "Tip_Control                 0.937129\n",
       "Tire_Size                   0.763869\n",
       "Track_Type                  0.752813\n",
       "Transmission                0.543210\n",
       "Travel_Controls             0.800975\n",
       "Turbocharged                0.802720\n",
       "Undercarriage_Pad_Width     0.751020\n",
       "UsageBand                   0.000000\n",
       "YearMade                    0.000000\n",
       "auctioneerID                0.050199\n",
       "datasource                  0.000000\n",
       "fiBaseModel                 0.000000\n",
       "fiModelDesc                 0.000000\n",
       "fiModelDescriptor           0.820707\n",
       "fiModelSeries               0.858129\n",
       "fiProductClassDesc          0.000000\n",
       "fiSecondaryDesc             0.342016\n",
       "saleDay                     0.000000\n",
       "saleDayofweek               0.000000\n",
       "saleDayofyear               0.000000\n",
       "saleElapsed                 0.000000\n",
       "saleIs_month_end            0.000000\n",
       "saleIs_month_start          0.000000\n",
       "saleIs_quarter_end          0.000000\n",
       "saleIs_quarter_start        0.000000\n",
       "saleIs_year_end             0.000000\n",
       "saleIs_year_start           0.000000\n",
       "saleMonth                   0.000000\n",
       "saleWeek                    0.000000\n",
       "saleYear                    0.000000\n",
       "state                       0.000000\n",
       "dtype: float64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display_all(df_raw.isnull().sum().sort_index()/len(df_raw))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But let's save this file for now, since it's already in format can we be stored and accessed efficiently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.makedirs('tmp', exist_ok=True)\n",
    "df_raw.to_feather('tmp/bulldozers-raw')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pre-processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the future we can simply read it from this fast format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_raw = pd.read_feather('tmp/bulldozers-raw')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll replace categories with their numeric codes, handle missing continuous values, and split the dependent variable into a separate variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df, y, nas = proc_df(df_raw, 'SalePrice')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have something we can pass to a random forest!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9830962179413453"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_jobs=-1)\n",
    "m.fit(df, y)\n",
    "m.score(df,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In statistics, the coefficient of determination, denoted R2 or r2 and pronounced \"R squared\", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). https://en.wikipedia.org/wiki/Coefficient_of_determination"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wow, an r^2 of 0.98 - that's great, right? Well, perhaps not...\n",
    "\n",
    "Possibly **the most important idea** in machine learning is that of having separate training & validation data sets. As motivation, suppose you don't divide up your data, but instead use all of it.  And suppose you have lots of parameters:\n",
    "\n",
    "<img src=\"images/overfitting2.png\" alt=\"\" style=\"width: 70%\"/>\n",
    "<center>\n",
    "[Underfitting and Overfitting](https://datascience.stackexchange.com/questions/361/when-is-a-model-underfitted)\n",
    "</center>\n",
    "\n",
    "The error for the pictured data points is lowest for the model on the far right (the blue curve passes through the red points almost perfectly), yet it's not the best choice.  Why is that?  If you were to gather some new data points, they most likely would not be on that curve in the graph on the right, but would be closer to the curve in the middle graph.\n",
    "\n",
    "This illustrates how using all our data can lead to **overfitting**. A validation set helps diagnose this problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((389125, 66), (389125,), (12000, 66))"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def split_vals(a,n): return a[:n].copy(), a[n:].copy()\n",
    "\n",
    "n_valid = 12000  # same as Kaggle's test set size\n",
    "n_trn = len(df)-n_valid\n",
    "raw_train, raw_valid = split_vals(df_raw, n_trn)\n",
    "X_train, X_valid = split_vals(df, n_trn)\n",
    "y_train, y_valid = split_vals(y, n_trn)\n",
    "\n",
    "X_train.shape, y_train.shape, X_valid.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Random Forests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Base model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try our model again, this time with separate training and validation sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def rmse(x,y): return math.sqrt(((x-y)**2).mean())\n",
    "\n",
    "def print_score(m):\n",
    "    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),\n",
    "                m.score(X_train, y_train), m.score(X_valid, y_valid)]\n",
    "    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)\n",
    "    print(res)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1min 4s, sys: 372 ms, total: 1min 4s\n",
      "Wall time: 8.56 s\n",
      "[0.0904611534175684, 0.2517003033389636, 0.9828975209204237, 0.8868601882297901]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_jobs=-1)\n",
    "%time m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An r^2 in the high-80's isn't bad at all (and the RMSLE puts us around rank 100 of 470 on the Kaggle leaderboard), but we can see from the validation set score that we're over-fitting badly. To understand this issue, let's simplify things down to a single small tree."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Speeding things up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)\n",
    "X_train, _ = split_vals(df_trn, 20000)\n",
    "y_train, _ = split_vals(y_trn, 20000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 6.45 s, sys: 132 ms, total: 6.58 s\n",
      "Wall time: 539 ms\n",
      "[0.11129876897034846, 0.3501178589366683, 0.9730338701182212, 0.7810845051996299]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_jobs=-1)\n",
    "%time m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Single tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.4965829795739235, 0.5246832258551836, 0.50149617735615859, 0.5083655198087873]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/svg+xml": [
       "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
       "<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
       " \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
       "<!-- Generated by graphviz version 2.38.0 (20140413.2041)\n",
       " -->\n",
       "<!-- Title: Tree Pages: 1 -->\n",
       "<svg width=\"720pt\" height=\"434pt\"\n",
       " viewBox=\"0.00 0.00 720.00 434.49\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
       "<g id=\"graph0\" class=\"graph\" transform=\"scale(0.778659 0.778659) rotate(0) translate(4 554)\">\n",
       "<title>Tree</title>\n",
       "<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-554 920.667,-554 920.667,4 -4,4\"/>\n",
       "<!-- 0 -->\n",
       "<g id=\"node1\" class=\"node\"><title>0</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.701961\" stroke=\"black\" points=\"167.667,-308.5 27.6667,-308.5 27.6667,-240.5 167.667,-240.5 167.667,-308.5\"/>\n",
       "<text text-anchor=\"start\" x=\"35.6667\" y=\"-293.3\" font-family=\"Times,serif\" font-size=\"14.00\">Coupler_System ≤ 0.5</text>\n",
       "<text text-anchor=\"start\" x=\"62.6667\" y=\"-278.3\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.495</text>\n",
       "<text text-anchor=\"start\" x=\"50.6667\" y=\"-263.3\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 20000</text>\n",
       "<text text-anchor=\"start\" x=\"56.1667\" y=\"-248.3\" font-family=\"Times,serif\" font-size=\"14.00\">value = 10.189</text>\n",
       "</g>\n",
       "<!-- 1 -->\n",
       "<g id=\"node2\" class=\"node\"><title>1</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.788235\" stroke=\"black\" points=\"391.667,-361.5 281.667,-361.5 281.667,-293.5 391.667,-293.5 391.667,-361.5\"/>\n",
       "<text text-anchor=\"start\" x=\"292.667\" y=\"-346.3\" font-family=\"Times,serif\" font-size=\"14.00\">Enclosure ≤ 2.0</text>\n",
       "<text text-anchor=\"start\" x=\"301.667\" y=\"-331.3\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.414</text>\n",
       "<text text-anchor=\"start\" x=\"289.667\" y=\"-316.3\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 16815</text>\n",
       "<text text-anchor=\"start\" x=\"295.167\" y=\"-301.3\" font-family=\"Times,serif\" font-size=\"14.00\">value = 10.345</text>\n",
       "</g>\n",
       "<!-- 0&#45;&gt;1 -->\n",
       "<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M167.854,-289.972C200.525,-297.278 239.282,-305.946 271.315,-313.109\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"270.94,-316.612 281.462,-315.378 272.467,-309.78 270.94,-316.612\"/>\n",
       "<text text-anchor=\"middle\" x=\"260.353\" y=\"-325.072\" font-family=\"Times,serif\" font-size=\"14.00\">True</text>\n",
       "</g>\n",
       "<!-- 8 -->\n",
       "<g id=\"node9\" class=\"node\"><title>8</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.254902\" stroke=\"black\" points=\"400.667,-255.5 272.667,-255.5 272.667,-187.5 400.667,-187.5 400.667,-255.5\"/>\n",
       "<text text-anchor=\"start\" x=\"280.667\" y=\"-240.3\" font-family=\"Times,serif\" font-size=\"14.00\">YearMade ≤ 1999.5</text>\n",
       "<text text-anchor=\"start\" x=\"301.667\" y=\"-225.3\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.109</text>\n",
       "<text text-anchor=\"start\" x=\"292.667\" y=\"-210.3\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 3185</text>\n",
       "<text text-anchor=\"start\" x=\"298.667\" y=\"-195.3\" font-family=\"Times,serif\" font-size=\"14.00\">value = 9.363</text>\n",
       "</g>\n",
       "<!-- 0&#45;&gt;8 -->\n",
       "<g id=\"edge8\" class=\"edge\"><title>0&#45;&gt;8</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M167.854,-259.028C197.56,-252.385 232.298,-244.616 262.438,-237.876\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"263.474,-241.231 272.469,-235.633 261.946,-234.4 263.474,-241.231\"/>\n",
       "<text text-anchor=\"middle\" x=\"251.359\" y=\"-218.539\" font-family=\"Times,serif\" font-size=\"14.00\">False</text>\n",
       "</g>\n",
       "<!-- 2 -->\n",
       "<g id=\"node3\" class=\"node\"><title>2</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.576471\" stroke=\"black\" points=\"658.667,-486.5 538.667,-486.5 538.667,-418.5 658.667,-418.5 658.667,-486.5\"/>\n",
       "<text text-anchor=\"start\" x=\"546.667\" y=\"-471.3\" font-family=\"Times,serif\" font-size=\"14.00\">ModelID ≤ 4573.0</text>\n",
       "<text text-anchor=\"start\" x=\"563.667\" y=\"-456.3\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.331</text>\n",
       "<text text-anchor=\"start\" x=\"554.667\" y=\"-441.3\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 4400</text>\n",
       "<text text-anchor=\"start\" x=\"560.667\" y=\"-426.3\" font-family=\"Times,serif\" font-size=\"14.00\">value = 9.955</text>\n",
       "</g>\n",
       "<!-- 1&#45;&gt;2 -->\n",
       "<g id=\"edge2\" class=\"edge\"><title>1&#45;&gt;2</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M391.855,-353.552C431.786,-372.75 486.387,-399 529.316,-419.639\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"528.038,-422.908 538.567,-424.087 531.071,-416.599 528.038,-422.908\"/>\n",
       "</g>\n",
       "<!-- 5 -->\n",
       "<g id=\"node6\" class=\"node\"><title>5</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.862745\" stroke=\"black\" points=\"653.667,-361.5 543.667,-361.5 543.667,-293.5 653.667,-293.5 653.667,-361.5\"/>\n",
       "<text text-anchor=\"start\" x=\"554.667\" y=\"-346.3\" font-family=\"Times,serif\" font-size=\"14.00\">Enclosure ≤ 4.5</text>\n",
       "<text text-anchor=\"start\" x=\"567.167\" y=\"-331.3\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.37</text>\n",
       "<text text-anchor=\"start\" x=\"551.667\" y=\"-316.3\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 12415</text>\n",
       "<text text-anchor=\"start\" x=\"557.167\" y=\"-301.3\" font-family=\"Times,serif\" font-size=\"14.00\">value = 10.484</text>\n",
       "</g>\n",
       "<!-- 1&#45;&gt;5 -->\n",
       "<g id=\"edge5\" class=\"edge\"><title>1&#45;&gt;5</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M391.855,-327.5C433.051,-327.5 489.86,-327.5 533.358,-327.5\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"533.548,-331 543.548,-327.5 533.548,-324 533.548,-331\"/>\n",
       "</g>\n",
       "<!-- 3 -->\n",
       "<g id=\"node4\" class=\"node\"><title>3</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.721569\" stroke=\"black\" points=\"895.667,-550 791.667,-550 791.667,-497 895.667,-497 895.667,-550\"/>\n",
       "<text text-anchor=\"start\" x=\"808.667\" y=\"-534.8\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.315</text>\n",
       "<text text-anchor=\"start\" x=\"799.667\" y=\"-519.8\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 2002</text>\n",
       "<text text-anchor=\"start\" x=\"802.167\" y=\"-504.8\" font-family=\"Times,serif\" font-size=\"14.00\">value = 10.226</text>\n",
       "</g>\n",
       "<!-- 2&#45;&gt;3 -->\n",
       "<g id=\"edge3\" class=\"edge\"><title>2&#45;&gt;3</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M658.8,-469.778C696.07,-480.667 744.082,-494.695 781.697,-505.686\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"780.937,-509.11 791.517,-508.555 782.9,-502.391 780.937,-509.11\"/>\n",
       "</g>\n",
       "<!-- 4 -->\n",
       "<g id=\"node5\" class=\"node\"><title>4</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.450980\" stroke=\"black\" points=\"895.667,-479 791.667,-479 791.667,-426 895.667,-426 895.667,-479\"/>\n",
       "<text text-anchor=\"start\" x=\"808.667\" y=\"-463.8\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.232</text>\n",
       "<text text-anchor=\"start\" x=\"799.667\" y=\"-448.8\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 2398</text>\n",
       "<text text-anchor=\"start\" x=\"805.667\" y=\"-433.8\" font-family=\"Times,serif\" font-size=\"14.00\">value = 9.728</text>\n",
       "</g>\n",
       "<!-- 2&#45;&gt;4 -->\n",
       "<g id=\"edge4\" class=\"edge\"><title>2&#45;&gt;4</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M658.8,-452.5C695.912,-452.5 743.673,-452.5 781.216,-452.5\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"781.517,-456 791.517,-452.5 781.517,-449 781.517,-456\"/>\n",
       "</g>\n",
       "<!-- 6 -->\n",
       "<g id=\"node7\" class=\"node\"><title>6</title>\n",
       "<polygon fill=\"#e58139\" stroke=\"black\" points=\"895.667,-408 791.667,-408 791.667,-355 895.667,-355 895.667,-408\"/>\n",
       "<text text-anchor=\"start\" x=\"812.167\" y=\"-392.8\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.29</text>\n",
       "<text text-anchor=\"start\" x=\"799.667\" y=\"-377.8\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 7193</text>\n",
       "<text text-anchor=\"start\" x=\"802.167\" y=\"-362.8\" font-family=\"Times,serif\" font-size=\"14.00\">value = 10.736</text>\n",
       "</g>\n",
       "<!-- 5&#45;&gt;6 -->\n",
       "<g id=\"edge6\" class=\"edge\"><title>5&#45;&gt;6</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M653.898,-339.551C691.729,-347.958 742.266,-359.189 781.528,-367.914\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"780.988,-371.379 791.509,-370.132 782.507,-364.546 780.988,-371.379\"/>\n",
       "</g>\n",
       "<!-- 7 -->\n",
       "<g id=\"node8\" class=\"node\"><title>7</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.674510\" stroke=\"black\" points=\"895.667,-337 791.667,-337 791.667,-284 895.667,-284 895.667,-337\"/>\n",
       "<text text-anchor=\"start\" x=\"808.667\" y=\"-321.8\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.272</text>\n",
       "<text text-anchor=\"start\" x=\"799.667\" y=\"-306.8\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 5222</text>\n",
       "<text text-anchor=\"start\" x=\"802.167\" y=\"-291.8\" font-family=\"Times,serif\" font-size=\"14.00\">value = 10.137</text>\n",
       "</g>\n",
       "<!-- 5&#45;&gt;7 -->\n",
       "<g id=\"edge7\" class=\"edge\"><title>5&#45;&gt;7</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M653.898,-323.706C691.729,-321.059 742.266,-317.524 781.528,-314.777\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"781.778,-318.268 791.509,-314.079 781.29,-311.285 781.778,-318.268\"/>\n",
       "</g>\n",
       "<!-- 9 -->\n",
       "<g id=\"node10\" class=\"node\"><title>9</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.050980\" stroke=\"black\" points=\"680.167,-255.5 517.167,-255.5 517.167,-187.5 680.167,-187.5 680.167,-255.5\"/>\n",
       "<text text-anchor=\"start\" x=\"525.167\" y=\"-240.3\" font-family=\"Times,serif\" font-size=\"14.00\">fiProductClassDesc ≤ 40.5</text>\n",
       "<text text-anchor=\"start\" x=\"563.667\" y=\"-225.3\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.101</text>\n",
       "<text text-anchor=\"start\" x=\"558.167\" y=\"-210.3\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 468</text>\n",
       "<text text-anchor=\"start\" x=\"560.667\" y=\"-195.3\" font-family=\"Times,serif\" font-size=\"14.00\">value = 8.988</text>\n",
       "</g>\n",
       "<!-- 8&#45;&gt;9 -->\n",
       "<g id=\"edge9\" class=\"edge\"><title>8&#45;&gt;9</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M400.937,-221.5C432.677,-221.5 471.698,-221.5 506.595,-221.5\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"506.799,-225 516.798,-221.5 506.798,-218 506.799,-225\"/>\n",
       "</g>\n",
       "<!-- 12 -->\n",
       "<g id=\"node13\" class=\"node\"><title>12</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.290196\" stroke=\"black\" points=\"685.667,-131.5 511.667,-131.5 511.667,-63.5 685.667,-63.5 685.667,-131.5\"/>\n",
       "<text text-anchor=\"start\" x=\"519.667\" y=\"-116.3\" font-family=\"Times,serif\" font-size=\"14.00\">saleElapsed ≤ 1217462400.0</text>\n",
       "<text text-anchor=\"start\" x=\"563.667\" y=\"-101.3\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.082</text>\n",
       "<text text-anchor=\"start\" x=\"554.667\" y=\"-86.3\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 2717</text>\n",
       "<text text-anchor=\"start\" x=\"560.667\" y=\"-71.3\" font-family=\"Times,serif\" font-size=\"14.00\">value = 9.427</text>\n",
       "</g>\n",
       "<!-- 8&#45;&gt;12 -->\n",
       "<g id=\"edge12\" class=\"edge\"><title>8&#45;&gt;12</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M400.937,-191.325C435.857,-174.671 479.59,-153.813 516.949,-135.996\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"518.781,-139 526.301,-131.536 515.768,-132.682 518.781,-139\"/>\n",
       "</g>\n",
       "<!-- 10 -->\n",
       "<g id=\"node11\" class=\"node\"><title>10</title>\n",
       "<polygon fill=\"none\" stroke=\"black\" points=\"892.167,-266 795.167,-266 795.167,-213 892.167,-213 892.167,-266\"/>\n",
       "<text text-anchor=\"start\" x=\"808.667\" y=\"-250.8\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.069</text>\n",
       "<text text-anchor=\"start\" x=\"803.167\" y=\"-235.8\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 293</text>\n",
       "<text text-anchor=\"start\" x=\"805.667\" y=\"-220.8\" font-family=\"Times,serif\" font-size=\"14.00\">value = 8.896</text>\n",
       "</g>\n",
       "<!-- 9&#45;&gt;10 -->\n",
       "<g id=\"edge10\" class=\"edge\"><title>9&#45;&gt;10</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M680.245,-227.469C714.378,-229.997 753.391,-232.887 784.626,-235.201\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"784.671,-238.713 794.902,-235.962 785.188,-231.733 784.671,-238.713\"/>\n",
       "</g>\n",
       "<!-- 11 -->\n",
       "<g id=\"node12\" class=\"node\"><title>11</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.133333\" stroke=\"black\" points=\"892.167,-195 795.167,-195 795.167,-142 892.167,-142 892.167,-195\"/>\n",
       "<text text-anchor=\"start\" x=\"808.667\" y=\"-179.8\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.118</text>\n",
       "<text text-anchor=\"start\" x=\"803.167\" y=\"-164.8\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 175</text>\n",
       "<text text-anchor=\"start\" x=\"805.667\" y=\"-149.8\" font-family=\"Times,serif\" font-size=\"14.00\">value = 9.143</text>\n",
       "</g>\n",
       "<!-- 9&#45;&gt;11 -->\n",
       "<g id=\"edge11\" class=\"edge\"><title>9&#45;&gt;11</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M680.245,-203.925C714.527,-196.448 753.731,-187.898 785.033,-181.07\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"785.877,-184.469 794.902,-178.918 784.386,-177.629 785.877,-184.469\"/>\n",
       "</g>\n",
       "<!-- 13 -->\n",
       "<g id=\"node14\" class=\"node\"><title>13</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.345098\" stroke=\"black\" points=\"895.667,-124 791.667,-124 791.667,-71 895.667,-71 895.667,-124\"/>\n",
       "<text text-anchor=\"start\" x=\"808.667\" y=\"-108.8\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.062</text>\n",
       "<text text-anchor=\"start\" x=\"799.667\" y=\"-93.8\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 1347</text>\n",
       "<text text-anchor=\"start\" x=\"805.667\" y=\"-78.8\" font-family=\"Times,serif\" font-size=\"14.00\">value = 9.529</text>\n",
       "</g>\n",
       "<!-- 12&#45;&gt;13 -->\n",
       "<g id=\"edge13\" class=\"edge\"><title>12&#45;&gt;13</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M685.681,-97.5C717.175,-97.5 752.156,-97.5 781.089,-97.5\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"781.457,-101 791.457,-97.5 781.457,-94.0001 781.457,-101\"/>\n",
       "</g>\n",
       "<!-- 14 -->\n",
       "<g id=\"node15\" class=\"node\"><title>14</title>\n",
       "<polygon fill=\"#e58139\" fill-opacity=\"0.235294\" stroke=\"black\" points=\"895.667,-53 791.667,-53 791.667,-0 895.667,-0 895.667,-53\"/>\n",
       "<text text-anchor=\"start\" x=\"808.667\" y=\"-37.8\" font-family=\"Times,serif\" font-size=\"14.00\">mse = 0.081</text>\n",
       "<text text-anchor=\"start\" x=\"799.667\" y=\"-22.8\" font-family=\"Times,serif\" font-size=\"14.00\">samples = 1370</text>\n",
       "<text text-anchor=\"start\" x=\"805.667\" y=\"-7.8\" font-family=\"Times,serif\" font-size=\"14.00\">value = 9.328</text>\n",
       "</g>\n",
       "<!-- 12&#45;&gt;14 -->\n",
       "<g id=\"edge14\" class=\"edge\"><title>12&#45;&gt;14</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M685.681,-72.3681C717.451,-63.0856 752.77,-52.7662 781.849,-44.2699\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"782.84,-47.6267 791.457,-41.4626 780.876,-40.9076 782.84,-47.6267\"/>\n",
       "</g>\n",
       "</g>\n",
       "</svg>\n"
      ],
      "text/plain": [
       "<graphviz.files.Source at 0x7f4c5ef607f0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "draw_tree(m.estimators_[0], df_trn, precision=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what happens if we create a bigger tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[6.526751786450488e-17, 0.38473652894699306, 1.0, 0.73565273648797624]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_estimators=1, bootstrap=False, n_jobs=-1)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training set result looks great! But the validation set is worse than our original model. This is why we need to use *bagging* of multiple trees to get more generalizable results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bagging"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Intro to bagging"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To learn about bagging in random forests, let's start with our basic model again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.11745691160954547, 0.27959279688230376, 0.97139456205050101, 0.86039533492251219]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_jobs=-1)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll grab the predictions for each individual tree, and look at one example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(array([ 9.21034,  8.9872 ,  8.9872 ,  8.9872 ,  8.9872 ,  9.21034,  8.92266,  9.21034,  9.21034,  8.9872 ]),\n",
       " 9.0700003890739005,\n",
       " 9.1049798563183568)"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preds = np.stack([t.predict(X_valid) for t in m.estimators_])\n",
    "preds[:,0], np.mean(preds[:,0]), y_valid[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(10, 12000)"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preds.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD8CAYAAACb4nSYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xt01Od95/H3F10QEghJIECAANncjS/YMrbjS2KwE+K4\nJpt2E0ibBCeN09PY6Xrdi9vjZr1Oe5JtmzhJ6/Updm3HTjfE8bZdUkNwLOzEF2JLxMYGJIwAA0JC\nF0ASul/mu3/MCAYhoQFGzO3zOmcO8/vN85v5zgAfPXrm9/wec3dERCQ1jIl1ASIicvEo9EVEUohC\nX0QkhSj0RURSiEJfRCSFKPRFRFKIQl9EJIUo9EVEUohCX0QkhaTHuoDBJk+e7HPmzIl1GSIiCWXb\ntm1N7l44Uru4C/05c+ZQUVER6zJERBKKmR2IpJ2Gd0REUohCX0QkhSj0RURSiEJfRCSFKPRFRFJI\nRKFvZivNbLeZVZvZg0M8PsvMXjGzd8zsPTO7I+yxK8xsq5ntNLP3zSwrmm9AREQiN+Ipm2aWBjwG\n3A7UAOVmtsHdd4U1ewh43t0fN7PFwEZgjpmlAz8GvuDu281sEtAb9XchIiIRieQ8/WVAtbvvAzCz\n9cAqIDz0HcgN3Z8I1Ibufxx4z923A7j70WgULSKSqHr6AjR39HC8o5fjHT0n7x9r7yE/O5PPXzdr\nVF8/ktCfARwK264BrhvU5mHgJTO7D8gBbgvtnw+4mW0GCoH17v53F1SxiEgccHfae/o53t5DcyjA\ngyEeDPDTg/3Un23dfcM+59Wz8uIi9G2IfYNXU18DPOPu3zWzG4DnzGxJ6PlvAq4FOoAyM9vm7mWn\nvYDZPcA9ALNmje4bFhEZrD/gtHSG9bzbTw/r40Psa+7opac/MOxz5malk5+TSV52JpPHZzJvynjy\nsjPJz84gLyeTgoH72Znk52SQn51JVkbaqL/XSEK/BigO257JqeGbAV8BVgK4+9bQl7WTQ8f+yt2b\nAMxsI3A1cFrou/s6YB1AaWnp4B8oIiJncHc6e/tp6+qjrTvs1tVHe09faH8/bd29tHf3c6Krj/bu\n09u2h9q39fThwyRP2hgjPzsYyvnZmcyelM1VxXnk5WSEgjuTvOwM8nMyT7abOC6D9LT4PDkyktAv\nB+aZWQlwGFgNfH5Qm4PACuAZM1sEZAGNwGbgz80sG+gBPgo8GqXaRSRBuTvHO3qpbe6ktSsYym3d\nvcGQHhzOoRAfHNrt3X0EIugijjEYPzY9eMtKJ2dsOhOy0imamEVOaH9uVvrJHndeduapMM/JYMLY\ndMyGGvBITCOGvrv3mdm9BAM8DXjK3Xea2SNAhbtvAB4AnjCz+wkO/ax1dweOm9n3CP7gcGCju784\nWm9GROJHT1+Aw82dHDzWEbwdbQ/d7+TQsY6zjm2PMYLhPDYY0uOzzgzq00J8oN1p+9OYMDaDrIwx\nSRXaF8p8uN9pYqS0tNR1lU2R+OfuNHf0cvBYBweOdXDoWAcHj3acDPm6ls7TeuKZ6WOYVZB98lZc\nkM2MvHHkjktnwtgMcsamMT4rGNrjMtIU1Oco9H1p6Ujt4u7SyiISP3r7Axw+fqq3fuhYBweOnrp/\nYlBvffL4scwqGMeykgKKwwJ+VkE2UyaMZcwYBXmsKfRFUph78KyVgVA/cDTUYw/dapvP7K0X549j\nVkE2187Jp7ggm9mTckI993FkZypS4p3+hkSSWH/AaTjRRW1zF0dauqhr6aS2OfjnoePBkD/RNbi3\nnklxQTals/OZtXTGqR77pGymTshSbz3BKfRFElQg4DS2dVPX0kVdcye1LV0caQn+WdfcyZGWLupP\ndNM/6BSXcRlpFOVlMasgm6tn5Z8cX589KZvi/GxyxioWkpn+dkXiUCDgHG3vOdkzP9LSSV1L18lA\nr2vpor61i75BgT42fQzT88ZRNDGL6y+dxPSJ4yjKy2L6xHFMmxj8M3dccp2CKOdGoS9ykbk7x9p7\ngj30sCGXk730lk7qW7rPmO2ZmT6GoolZTMvNYllJAUUTsyjKG0dRbtbJYM/LzlCgy1kp9EVGUeOJ\nbnbVtbKztoVdta1U1rVy6HgnPX2nB3pGmjE1NxjcV8/Kp2hisLdeNDGL6XnBXvqknEwFulwwhb5I\nFAQCzsFjHeysbWVXXUvwz9pWGk50n2wzM38ci4pyWbFoaijQQ8Gel8XkHJ3OKBeHQl/kHHX39bOn\nvo1dta0ne/GVdSdOzjBNG2PMmzKem+ZNZnFRLpdNn8jiolwmZmfEuHIRhb7IWbV29QbDvbY11Itv\nZU/9iZNfoOZkprGoKJfPXD2Dy6bnsrhoIvOmjr8oV0sUOR8KfRGCX64eae06Fe61reysa+HQsc6T\nbSaPH8tl03O5dUEhi6cHe/CzC7I1LCMJRaEvKac/4OxvajsV7qEe/LH2npNtSibncMWMPFZfOysU\n8LlMmaDlnSXxKfQl6bV09FJWVU/FgePsqm2l6kgrXb3Bs2cy08Ywf9p4bl809WS4LyzKZbwmKEmS\n0r9sSUpH27p5aVc9m3Yc4c3qJvoCTm5WOoun5/L5ZbOD4+/Tc5k7ZTwZcbrYhchoUOhL0qhv7WLz\nziNsfL+Ot/cfI+AwqyCbr9xcwieXFHHlzIk6z11SnkJfElrN8Q5+seMIm3YcYduB4wDMnTKer986\nl5VLprG4KFdBLxImotA3s5XADwiunPWku39n0OOzgB8BeaE2D7r7xkGP7wIedvd/iFLtkqL2N7Wz\naUcdm94/wvuHWwBYVJTLA7fP55OXT2PulAkxrlAkfo0Y+maWBjwG3E5wofNyM9vg7rvCmj0EPO/u\nj5vZYmAjMCfs8UeBTVGrWlKKu/NBfRubdtTxix1HqDpyAoAri/N48JML+eSSacyelBPjKkUSQyQ9\n/WVAtbvvAzCz9cAqgj33AQ7khu5PBGoHHjCzTwP7gPZoFCypwd3ZWdt6ske/r6kdMyidnc8371zM\nyiXTmJ43LtZliiScSEJ/BnAobLsGuG5Qm4eBl8zsPiAHuA3AzHKAvyD4W8KfDvcCZnYPcA/ArFmz\nIixdkk0g4LxzqJlf7Khj044j1BzvJG2Mcf0lBdx9UwmfuGyqzpUXuUCRhP5Q34INXk19DfCMu3/X\nzG4AnjOzJcD/BB5197azfZnm7uuAdRBcGD2iyiUp9Aec8g+P8YsdR/jFjiMcae0iI824ce5kvrF8\nHrctnkpBTmasyxRJGpGEfg1QHLY9k7Dhm5CvACsB3H2rmWUBkwn+RvB7ZvZ3BL/kDZhZl7v/0wVX\nLgmrtz/A1r1H2bTjCL/cdYSmth7Gpo/ho/ML+YvLF7B84VQmjtPFyURGQyShXw7MM7MS4DCwGvj8\noDYHgRXAM2a2CMgCGt395oEGZvYw0KbAT03dff28vqcpFPT1tHT2kp2Zxq0Lp3DHkiI+tqBQy/SJ\nXAQj/i9z9z4zuxfYTPB0zKfcfaeZPQJUuPsG4AHgCTO7n+DQz1p31zBNinN33tx7lJ+WH2JLVQNt\n3X1MyErn9kVTWblkGrfML9TVKEUuMou3bC4tLfWKiopYlyEXwN351QeN/LBsD7892Ex+dgYfXzyN\nlZdP48ZLJ5OZrsseiESbmW1z99KR2un3aYkad2dLVQM/LNvD9poWZuSN428+vYT/WjqTsenq0YvE\nA4W+XLBAwHlpVz3/uGUPO2tbKS4Yx3c+czmfuXqmevUicUahL+ctEHA27TjCP27ZQ9WRE8yZlM3f\n/94VfHrpDF25UiROKfTlnPUHnP98r5Z/2lLNnoY2Li3M4fufu4o7rygiXWEvEtcU+hKxvv4AG7YH\nw35fUzvzp47nH9cs5Y7Li0jTkoEiCUGhLyPq7Q/w7789zGOvVnPgaAcLp03g8d+/mk9cNk3rw4ok\nGIW+DKunL8AL22r4369WU3O8kyUzcln3hWu4bdFUhb1IglLoyxm6evv5WcUhHn91L7UtXVxZnMe3\nVi3hYwsKtSCJSIJT6MtJXb39/J+3DvLPv95LfWs318zO5zu/ewU3z5ussBdJEgp9oaOnj3/9zUH+\n+df7aGrr5rqSAh797FXccOkkhb1IklHop7C27j6e23qAJ17bx7H2Hm6aO5n7li/luksmxbo0ERkl\nCv0U1NrVy7NvfsiTr++nuaOXj84v5Bsr5nLN7IJYlyYio0yhn0JaOnp56o39PP3Gflq7+lixcAr3\nrZjHVcV5sS5NRC4ShX4KON7ew7+8vp8fvfkhJ7r7+PjiqXxjxTyWzJgY69JE5CJT6CexprZunnxt\nP89t/ZCO3n7uWFLEvcvnsqgod8RjRSQ5RRT6ZrYS+AHBRVSedPfvDHp8FvAjgksipgEPuvtGM7sd\n+A6QCfQAf+buW6JYvwxjw/Za/uKF9+jq6+d3rpjOvcvnMn/qhFiXJSIxNmLom1ka8BhwO8H1csvN\nbIO77wpr9hDwvLs/bmaLgY3AHKAJ+B13rw0tlL4ZmBHl9yCDvL6niQeef5erivP49meuYO6U8bEu\nSUTiRCQ9/WVAtbvvAzCz9cAqIDz0HRgYM5hIaOF0d38nrM1OIMvMxrp794UWLkPbWdvCH/14G5cW\njufJL12rBcZF5DSRhP4M4FDYdg1w3aA2DwMvmdl9QA5w2xDP87vAOwr80VNzvIO7ny5nQlY6T9+t\nwBeRM0Vy8fOhpmQOXlh3DfCMu88E7gCeM7OTz21mlwH/C/jakC9gdo+ZVZhZRWNjY2SVy2maO3pY\n+3Q5nb39/OjLyyiaOC7WJYlIHIok9GuA4rDtmYSGb8J8BXgewN23AlnAZAAzmwn8O/BFd9871Au4\n+zp3L3X30sLCwnN7B0JXbz9ffbaCg0c7eOKLpfrCVkSGFUnolwPzzKzEzDKB1cCGQW0OAisAzGwR\nwdBvNLM84EXgL939jeiVLQP6A879P32X8g+P873PXcn1uoSCiJzFiKHv7n3AvQTPvKkkeJbOTjN7\nxMzuCjV7APiqmW0HfgKsdXcPHTcX+Gszezd0mzIq7yQFuTvf+s9dbNpxhIc+tYg7r5ge65JEJM5Z\nMJvjR2lpqVdUVMS6jITwz7/ay7c3VfGHN5Xw0J2LY12OiMSQmW1z99KR2mkV6wT1/949zLc3VXHn\nFUX81R2LYl2OiCQIhX4CerO6iT/92XauKyngu5+9UksXikjEFPoJprKula89t42SyTms+2IpY9PT\nYl2SiCQQhX4COdzcydqn3yZnbDrP3L1Mk69E5JzpKpsJoqWjl7VPvU1HTz8/+6MbmJ6nyVcicu7U\n008AA5OvDhztYN0XSlk4TZdGFpHzo55+nAsEnAee387bHx7jh2uWcsOlmnwlIudPPf045u5868Vd\nvPh+HQ99ahF3XanJVyJyYRT6cezJ1/bz9Bsf8uUbS/jDmy+JdTkikgQU+nFqw/Za/nZjJZ+6vIiH\nPqXJVyISHQr9OPTm3uDKV8s0+UpEokyhH2eqjrTytWeDk6+e+EIpWRmafCUi0aPQjyO1zZ2sfaqc\n7LFpwclX2Zp8JSLRpdCPEy2dvax9+m3au/t45u5lmnwlIqNC5+nHge6+fu55toL9Te386MvLWFSk\nyVciMjoU+jEWCDj//fntvLX/GD9YfRUfuXRyrEsSkSQW0fCOma00s91mVm1mDw7x+Cwze8XM3jGz\n98zsjrDH/jJ03G4z+0Q0i08Gf7uxkhffq+Ov7ljIqqtmxLocEUlyI/b0zSwNeAy4neAi6eVmtsHd\nd4U1e4jgMoqPm9liYCMwJ3R/NXAZMB142czmu3t/tN9IInrytX38y+v7WfuROXxVk69E5CKIpKe/\nDKh2933u3gOsB1YNauPAwED0RKA2dH8VsN7du919P1Ader6U9/PttfzNi5Xccfk0/vrOxZjpXHwR\nGX2RhP4M4FDYdk1oX7iHgT8wsxqCvfz7zuFYzOweM6sws4rGxsYIS09cW/ce5YHnt7NsTgHf++xV\npGnylYhcJJGE/lCJNHg19TXAM+4+E7gDeM7MxkR4LO6+zt1L3b20sLAwgpIS1+4jJ7jnuQpmTcpm\n3Rev0eQrEbmoIjl7pwYoDtueyanhmwFfAVYCuPtWM8sCJkd4bMqoawmufJWdmcaPvryMvOzMWJck\nIikmkp5+OTDPzErMLJPgF7MbBrU5CKwAMLNFQBbQGGq32szGmlkJMA94O1rFJ5KWzl7WPlXOia4+\nnl67jBmafCUiMTBiT9/d+8zsXmAzkAY85e47zewRoMLdNwAPAE+Y2f0Eh2/WursDO83seWAX0Ad8\nPRXP3Onu6+drz1Wwr6mNZ+5exuLpmnwlIrFhwWyOH6WlpV5RURHrMqImEHD+5Kfv8vPttXz/c1fx\n6aU6F19Eos/Mtrl76UjtdO2dUfbtTZX8fHstD35yoQJfRGJOoT+Knnp9P0+8tp8v3TCbr92iyVci\nEnsK/VHy4nt1fOvFXay8bBrf/J3LNPlKROKCQn8UvLXvKPf/9F2umZXP91dr8pWIxA+FfpTVNnfy\n1WcrKC4Yx5Nf0spXIhJfdGnlKPv59lpau/r4j6/fqMlXIhJ31NOPsrKqBhYX5XJJ4fhYlyIicgaF\nfhQ1d/Sw7cBxViyaEutSRESGpNCPold3N9IfcFYsmhrrUkREhqTQj6KyqgYmjx/LFTMmxroUEZEh\nKfSjpLc/wKu7G1i+sJAxOkVTROKUQj9Kyj88xomuPg3tiEhcU+hHyZbKBjLTxnDT3MmxLkVEZFgK\n/Sgpq2rghksnkTNWUx9EJH4p9KNgb2Mb+5vauU2naopInIso9M1spZntNrNqM3twiMcfNbN3Q7cP\nzKw57LG/M7OdZlZpZj+0JLzy2JbKBgBuXajQF5H4NuJYhJmlAY8BtxNc87bczDa4+66BNu5+f1j7\n+4ClofsfAW4Ergg9/DrwUeDVKNUfF16urGfhtAnMzM+OdSkiImcVSU9/GVDt7vvcvQdYD6w6S/s1\nwE9C953germZwFggA6g//3LjT0tHLxUHjnObztoRkQQQSejPAA6FbdeE9p3BzGYDJcAWAHffCrwC\n1IVum9298kIKjjevftBAf8BZrvF8EUkAkYT+UGPwwy2suxp4YWDxczObCywCZhL8QbHczG454wXM\n7jGzCjOraGxsjKzyOFFW2cDk8ZlcNTMv1qWIiIwoktCvAYrDtmcCtcO0Xc2poR2A/wL8xt3b3L0N\n2ARcP/ggd1/n7qXuXlpYWBhZ5XFgYBburQumaBauiCSESEK/HJhnZiVmlkkw2DcMbmRmC4B8YGvY\n7oPAR80s3cwyCH6JmzTDO9sOHKe1q09X1RSRhDFi6Lt7H3AvsJlgYD/v7jvN7BEzuyus6RpgvbuH\nD/28AOwF3ge2A9vd/edRqz7GyirryUwbw83zEue3ExFJbRFNH3X3jcDGQfu+OWj74SGO6we+dgH1\nxbWyygau1yxcEUkgmpF7nvY1trGvqZ0VmpAlIglEoX+etlQFZ+EuV+iLSAJR6J+nlyvrWTB1AsUF\nmoUrIolDoX8eWjp7Kf9Qa+GKSOJR6J+HX32gtXBFJDEp9M9DWWU9BTmZXFWsWbgiklgU+ueorz/A\nq7sbuXXBFNI0C1dEEoxC/xxtO3Ccls5eLZgiIglJoX+OtlQ1kJFm3DRPa+GKSOJR6J+jlyvruf6S\nSUzIyoh1KSIi50yhfw4+bGpnb6Nm4YpI4lLon4Oy0CxcnaopIolKoX8OyirrmT91vGbhikjCUuhH\nqLWrl7f3H1MvX0QSmkI/Qr/+oJG+gGs8X0QSWkShb2YrzWy3mVWb2YNDPP6omb0bun1gZs1hj80y\ns5fMrNLMdpnZnOiVf/GUVTZQkJPJ0ln5sS5FROS8jbj6h5mlAY8BtxNcL7fczDa4+66BNu5+f1j7\n+4ClYU/xLPC37v5LMxsPBKJV/MXS1x/gld0NLF+oWbgiktgi6ekvA6rdfZ+79wDrgVVnab+G0OLo\nZrYYSHf3XwKEFkjvuMCaL7p3DjXT3NHLioUazxeRxBZJ6M8ADoVt14T2ncHMZgMlwJbQrvlAs5n9\nm5m9Y2Z/H/rNIaG8XFlPRppxy3zNwhWRxBZJ6A81nuFD7ANYDbwQWhsXgsNHNwN/ClwLXAKsPeMF\nzO4xswozq2hsbIygpIurrLKB60o0C1dEEl8koV8DFIdtzwRqh2m7mtDQTtix74SGhvqA/wCuHnyQ\nu69z91J3Ly0sLIys8ovkwNF2qhvatCyiiCSFSEK/HJhnZiVmlkkw2DcMbmRmC4B8YOugY/PNbCDJ\nlwO7Bh8bz8oqB2bhKvRFJPGNGPqhHvq9wGagEnje3Xea2SNmdldY0zXAenf3sGP7CQ7tlJnZ+wSH\nip6I5hsYbWVV9cydMp7Zk3JiXYqIyAUb8ZRNAHffCGwctO+bg7YfHubYXwJXnGd9MXWiq5e39h3j\nKzeXxLoUEZGo0Izcs/j1B030BZzbdOkFEUkSCv2zKKusJy87g6VaC1dEkoRCfxj9AeeV3Q3cumAK\n6Wn6mEQkOSjNhvHOweMc7+jVWTsiklQU+sN4ubKB9DHGLfPja96AiMiFUOgPY0tVPctKCsjVLFwR\nSSIK/SEcOtbBB/VtWjBFRJKOQn8IL1fWA2jBFBFJOgr9IWypauDSwhzmTNYsXBFJLgr9QU509fKb\nfUc1IUtEkpJCf5DX9jTR2++6qqaIJCWF/iBllQ1MHJfBNbO1Fq6IJB+FfphTs3ALNQtXRJKSki3M\nu4eOc6y9h+UazxeRJKXQD1MWmoX7Uc3CFZEkpdAPU1bZwLVzCpg4TrNwRSQ5RRT6ZrbSzHabWbWZ\nPTjE44+a2buh2wdm1jzo8VwzO2xm/xStwqPt0LEOdtef0AXWRCSpjbhylpmlAY8BtxNc6LzczDa4\n+8m1bt39/rD29wFLBz3Nt4BfRaXiUbKlamAtXI3ni0jyiqSnvwyodvd97t4DrAdWnaX9GuAnAxtm\ndg0wFXjpQgodbS9X1nPJ5BxKNAtXRJJYJKE/AzgUtl0T2ncGM5sNlABbQttjgO8Cf3ZhZY6utu4+\n3tp3TEM7IpL0Igl9G2KfD9N2NfCCu/eHtv8Y2Ojuh4ZpH3wBs3vMrMLMKhobGyMoKbpe39NIT39A\nQzsikvRGHNMn2LMvDtueCdQO03Y18PWw7RuAm83sj4HxQKaZtbn7aV8Gu/s6YB1AaWnpcD9QRs3L\nlQ3kZqVrFq6IJL1IQr8cmGdmJcBhgsH++cGNzGwBkA9sHdjn7r8f9vhaoHRw4Mdaf8B5paqBjy2Y\nQoZm4YpIkhsx5dy9D7gX2AxUAs+7+04ze8TM7gprugZY7+4Xvad+IbbXNHO0vUfj+SKSEiLp6ePu\nG4GNg/Z9c9D2wyM8xzPAM+dU3UVQVllP2hjjY/MV+iKS/FJ+PKOssoHS2flMzNYsXBFJfikd+jXH\nO6g6ckILpohIykjp0B+Yhbtc4/kikiJSOvTLKhsomZzDpYXjY12KiMhFkbKh397dx9a9R1mhZRFF\nJIWkbOi/tqeJnv6AhnZEJKWkbOhvqapnQlY6184piHUpIiIXTUqGfiDgbKlq1CxcEUk5KZl422ua\naWrr1ni+iKSclAz9LVUNwVm4C7QWroiklpQM/ZcrG7hmdj552ZmxLkVE5KJKudA/3NxJZV2rhnZE\nJCWlXOhrLVwRSWUpF/pllfXMnpTNpYVaC1dEUk9KhX5HTx9v7j3KioVTMRtqFUgRkeQWUeib2Uoz\n221m1WZ2xspXZvaomb0bun1gZs2h/VeZ2VYz22lm75nZ56L9Bs7F63ua6OkLcJtm4YpIihpxERUz\nSwMeA24nuF5uuZltcPddA23c/f6w9vcBS0ObHcAX3X2PmU0HtpnZZndvjuabiFRZZQMTxqZTqlm4\nIpKiIunpLwOq3X2fu/cA64FVZ2m/BvgJgLt/4O57QvdrgQYgJifHBwJOWVUDtywoJDM9pUa1RERO\niiT9ZgCHwrZrQvvOYGazgRJgyxCPLQMygb3nXuaFe/9wC01t3RraEZGUFknoD/WN53CLn68GXnD3\n/tOewKwIeA64290DZ7yA2T1mVmFmFY2NjRGUdO7KKusZY2gtXBFJaZGEfg1QHLY9E6gdpu1qQkM7\nA8wsF3gReMjdfzPUQe6+zt1L3b20sHB0Rn8GZuHm52gWroikrkhCvxyYZ2YlZpZJMNg3DG5kZguA\nfGBr2L5M4N+BZ939Z9Ep+dzVtXSyq65VE7JEJOWNGPru3gfcC2wGKoHn3X2nmT1iZneFNV0DrHf3\n8KGfzwK3AGvDTum8Kor1R6SsMjQLV5deEJEUN+IpmwDuvhHYOGjfNwdtPzzEcT8GfnwB9UVFWWU9\nswqymTtFa+GKSGpL+nMXO3r6eGPvUVYsmqJZuCKS8pI+9N+oPkpPX4AVCzWeLyKS9KFfVlnPhLHp\nLCvRLFwRkaQO/eBauA3cMl+zcEVEIMlDf0dtCw0nulmus3ZERIAkD/2XKxswg1sV+iIiQJKH/paq\neq6elU+BZuGKiABJHPpHWrrYcbiVFbrAmojISUkb+mVV9QDcpksviIiclLShv6WygZn545inWbgi\nIiclZeh39vTzenUTty3SWrgiIuGSMvTfqG6iuy+g8XwRkUGSMvTLqhrIyUzTLFwRkUGSLvTdnS1V\n9dwyv5Cx6WmxLkdEJK4kXejvONxKfWu3FkwRERlC0oV+WVU9ZvCxBaOz7KKISCKLKPTNbKWZ7Taz\najN7cIjHHw1bGesDM2sOe+xLZrYndPtSNIsfSlllA0uL85g8fuxov5SISMIZceUsM0sDHgNuJ7hI\nermZbXD3XQNt3P3+sPb3AUtD9wuA/wGUAg5sCx17PKrvIqS+tYv3D7fwZ59YMBpPLyKS8CLp6S8D\nqt19n7v3AOuBVWdpvwb4Sej+J4BfuvuxUND/Elh5IQWfzZaq0Fq4OlVTRGRIkYT+DOBQ2HZNaN8Z\nzGw2UAJsOddjo6Gssp4ZeeNYMHXCaL2EiEhCiyT0h5rS6sO0XQ284O7953Ksmd1jZhVmVtHY2BhB\nSWfq6h2Yhau1cEVEhhNJ6NcAxWHbM4HaYdqu5tTQTsTHuvs6dy9199LCwvM766a1s5ePL57GyiVF\n53W8iEgcAukSAAADoklEQVQqiCT0y4F5ZlZiZpkEg33D4EZmtgDIB7aG7d4MfNzM8s0sH/h4aF/U\nTcnN4odrlnLDpZNG4+lFRJLCiGfvuHufmd1LMKzTgKfcfaeZPQJUuPvAD4A1wHp397Bjj5nZtwj+\n4AB4xN2PRfctiIhIpCwso+NCaWmpV1RUxLoMEZGEYmbb3L10pHZJNyNXRESGp9AXEUkhCn0RkRSi\n0BcRSSEKfRGRFKLQFxFJIXF3yqaZNQIHLuApJgNNUSon0emzOJ0+j9Pp8zglGT6L2e4+4iUN4i70\nL5SZVURyrmoq0GdxOn0ep9PncUoqfRYa3hERSSEKfRGRFJKMob8u1gXEEX0Wp9PncTp9HqekzGeR\ndGP6IiIyvGTs6YuIyDCSJvTNbKWZ7TazajN7MNb1xJKZFZvZK2ZWaWY7zexPYl1TrJlZmpm9Y2b/\nGetaYs3M8szsBTOrCv0buSHWNcWSmd0f+n+yw8x+YmZZsa5pNCVF6JtZGvAY8ElgMbDGzBbHtqqY\n6gMecPdFwPXA11P88wD4E6Ay1kXEiR8Av3D3hcCVpPDnYmYzgG8Ape6+hOCaIatjW9XoSorQB5YB\n1e6+z917gPXAqhjXFDPuXufuvw3dP0HwP/WoLUgf78xsJvAp4MlY1xJrZpYL3AL8C4C797h7c2yr\nirl0YJyZpQPZDL8cbFJIltCfARwK264hhUMunJnNAZYCb8W2kpj6PvDnQCDWhcSBS4BG4OnQcNeT\nZpYT66Jixd0PA/8AHATqgBZ3fym2VY2uZAl9G2Jfyp+WZGbjgf8L/Dd3b411PbFgZncCDe6+Lda1\nxIl04GrgcXdfCrQDKfsdWGjt7lVACTAdyDGzP4htVaMrWUK/BigO255Jkv+KNhIzyyAY+P/q7v8W\n63pi6EbgLjP7kOCw33Iz+3FsS4qpGqDG3Qd+83uB4A+BVHUbsN/dG929F/g34CMxrmlUJUvolwPz\nzKzEzDIJfhGzYYRjkpaZGcEx20p3/16s64kld/9Ld5/p7nMI/rvY4u5J3ZM7G3c/AhwyswWhXSuA\nXTEsKdYOAtebWXbo/80KkvyL7fRYFxAN7t5nZvcCmwl++/6Uu++McVmxdCPwBeB9M3s3tO+v3H1j\nDGuS+HEf8K+hDtI+4O4Y1xMz7v6Wmb0A/JbgWW/vkOSzczUjV0QkhSTL8I6IiERAoS8ikkIU+iIi\nKUShLyKSQhT6IiIpRKEvIpJCFPoiIilEoS8ikkL+P5ShaozSBnTsAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7f2a7db76c18>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "plt.plot([metrics.r2_score(y_valid, np.mean(preds[:i+1], axis=0)) for i in range(10)]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The shape of this curve suggests that adding more trees isn't going to help us much. Let's check. (Compare this to our original model on a sample)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.10721195540628872, 0.2777358026154778, 0.9761670456844791, 0.86224362387001874]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_estimators=20, n_jobs=-1)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.1029319603663909, 0.2725488716109634, 0.97803192843821529, 0.86734099039701873]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_estimators=40, n_jobs=-1)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.09942284423261978, 0.27026457977935875, 0.97950425012208453, 0.86955536025947799]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_estimators=80, n_jobs=-1)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Out-of-bag (OOB) score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Is our validation set worse than our training set because we're over-fitting, or because the validation set is for a different time period, or a bit of both? With the existing information we've shown, we can't tell. However, random forests have a very clever trick called *out-of-bag (OOB) error* which can handle this (and more!)\n",
    "\n",
    "The idea is to calculate error on the training set, but only include the trees in the calculation of a row's error where that row was *not* included in training that tree. This allows us to see whether the model is over-fitting, without needing a separate validation set.\n",
    "\n",
    "This also has the benefit of allowing us to see whether our model generalizes, even if we only have a small amount of data so want to avoid separating some out to create a validation set.\n",
    "\n",
    "This is as simple as adding one more parameter to our model constructor. We print the OOB error last in our `print_score` function below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.10198464613020647, 0.2714485881623037, 0.9786192457999483, 0.86840992079038759, 0.84831537630038534]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This shows that our validation set time difference is making an impact, as is model over-fitting."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reducing over-fitting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Subsampling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It turns out that one of the easiest ways to avoid over-fitting is also one of the best ways to speed up analysis: *subsampling*. Let's return to using our full dataset, so that we can demonstrate the impact of this technique."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')\n",
    "X_train, X_valid = split_vals(df_trn, n_trn)\n",
    "y_train, y_valid = split_vals(y_trn, n_trn)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The basic idea is this: rather than limit the total amount of data that our model can access, let's instead limit it to a *different* random subset per tree. That way, given enough trees, the model can still see *all* the data, but for each individual tree it'll be just as fast as if we had cut down our dataset as before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set_rf_samples(20000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 8.38 s, sys: 428 ms, total: 8.81 s\n",
      "Wall time: 3.49 s\n",
      "[0.24021020451254516, 0.2780622994610262, 0.87937208432405256, 0.86191954999425424, 0.86692047674867767]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_jobs=-1, oob_score=True)\n",
    "%time m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since each additional tree allows the model to see more data, this approach can make additional trees more useful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.2317315086850927, 0.26334275954117264, 0.89225792718146846, 0.87615150359885019, 0.88097587673696554]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tree building parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We revert to using a full bootstrap sample in order to show the impact of other over-fitting avoidance methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "reset_rf_samples()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get a baseline for this full set to compare to."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def dectree_max_depth(tree):\n",
    "    children_left = tree.children_left\n",
    "    children_right = tree.children_right\n",
    "\n",
    "    def walk(node_id):\n",
    "        if (children_left[node_id] != children_right[node_id]):\n",
    "            left_max = 1 + walk(children_left[node_id])\n",
    "            right_max = 1 + walk(children_right[node_id])\n",
    "            return max(left_max, right_max)\n",
    "        else: # leaf\n",
    "            return 1\n",
    "\n",
    "    root_node_id = 0\n",
    "    return walk(root_node_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.07828713008286803, 0.23818558990341943, 0.9871909898049919, 0.8986837887808402, 0.9085077721150765]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t=m.estimators_[0].tree_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "45"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dectree_max_depth(t)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.14073508031497292, 0.23337403295759937, 0.9586057939941005, 0.9027357960501001, 0.9068706269691232]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_estimators=40, min_samples_leaf=5, n_jobs=-1, oob_score=True)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "t=m.estimators_[0].tree_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "35"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dectree_max_depth(t)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way to reduce over-fitting is to grow our trees less deeply. We do this by specifying (with `min_samples_leaf`) that we require some minimum number of rows in every leaf node. This has two benefits:\n",
    "\n",
    "- There are less decision rules for each leaf node; simpler models should generalize better\n",
    "- The predictions are made by averaging more rows in the leaf node, resulting in less volatility"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.11595869956476182, 0.23427349924625201, 0.97209195463880227, 0.90198460308551043, 0.90843297242839738]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, n_jobs=-1, oob_score=True)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also increase the amount of variation amongst the trees by not only use a sample of rows for each tree, but to also using a sample of *columns* for each *split*. We do this by specifying `max_features`, which is the proportion of features to randomly select from at each split."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- None\n",
    "- 0.5\n",
    "- 'sqrt'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 1, 3, 5, 10, 25, 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.11926975747908228, 0.22869111042050522, 0.97026995966445684, 0.9066000722129437, 0.91144914977164715]\n"
     ]
    }
   ],
   "source": [
    "m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)\n",
    "m.fit(X_train, y_train)\n",
    "print_score(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can't compare our results directly with the Kaggle competition, since it used a different validation set (and we can no longer to submit to this competition) - but we can at least see that we're getting similar results to the winners based on the dataset we have.\n",
    "\n",
    "The sklearn docs [show an example](http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html) of different `max_features` methods with increasing numbers of trees - as you see, using a subset of features on each split requires using more trees, but results in better models:\n",
    "![sklearn max_features chart](http://scikit-learn.org/stable/_images/sphx_glr_plot_ensemble_oob_001.png)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}