{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# eBay Kleinanzeigen: Analysis of used car listing\n",
    "\n",
    "## About:\n",
    "In this project our aim is to analyse the dataset of used cars from eBay Kleinanzeigen (A Classified section of the German eBay [website](https://en.wikipedia.org/wiki/Classified_advertising)).\n",
    "\n",
    "Initially, we will step through the data cleansing process as dataset itself needs extensive cleansing and is a significant part of the project.\n",
    "\n",
    "Finally, after our data cleansing, we will answer the question of top brands and their average prices and compare that alongside average mileage to draw inferences.\n",
    "\n",
    "## Dataset:\n",
    "For the purposes of this project, the dataset has been limited to 50,000 rows.\n",
    "\n",
    "The full dataset can be downloaded from [here](https://data.world/data-society/used-cars-data).\n",
    "\n",
    "Below we can find the metadata about this dataset:\n",
    "\n",
    "`dateCrawled` - When this ad was first crawled. All field-values are taken from this date.\n",
    "    \n",
    "`name` - Name of the car.\n",
    "\n",
    "`seller` - Whether the seller is private or a dealer.\n",
    "\n",
    "`offerType` - The type of listing\n",
    "\n",
    "`price` - The price on the ad to sell the car.\n",
    "\n",
    "`abtest` - Whether the listing is included in an A/B test.\n",
    "\n",
    "`vehicleType` - The vehicle Type.\n",
    "\n",
    "`yearOfRegistration` - The year in which the car was first registered.\n",
    "\n",
    "`gearbox` - The transmission type.\n",
    "\n",
    "`powerPS` - The power of the car in PS.\n",
    "\n",
    "`model` - The car model name.\n",
    "\n",
    "`kilometer` - How many kilometers the car has driven.\n",
    "\n",
    "`monthOfRegistration` - The month in which the car was first registered.\n",
    "\n",
    "`fuelType` - What type of fuel the car uses.\n",
    "\n",
    "`brand` - The brand of the car.\n",
    "\n",
    "`notRepairedDamage` - If the car has a damage which is not yet repaired.\n",
    "\n",
    "`dateCreated` - The date on which the eBay listing was created.\n",
    "\n",
    "`nrOfPictures` - The number of pictures in the ad.\n",
    "\n",
    "`postalCode` - The postal code for the location of the vehicle.\n",
    "\n",
    "`lastSeenOnline` - When the crawler saw this ad last online.\n",
    "\n",
    "\n",
    "## Initial Data Exploration\n",
    "\n",
    "We will start by importing the `pandas` and `NumPy` libraries and read our csv data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "autos = pd.read_csv(\"autos.csv\", encoding=\"Latin-1\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 50000 entries, 0 to 49999\n",
      "Data columns (total 20 columns):\n",
      " #   Column               Non-Null Count  Dtype \n",
      "---  ------               --------------  ----- \n",
      " 0   dateCrawled          50000 non-null  object\n",
      " 1   name                 50000 non-null  object\n",
      " 2   seller               50000 non-null  object\n",
      " 3   offerType            50000 non-null  object\n",
      " 4   price                50000 non-null  object\n",
      " 5   abtest               50000 non-null  object\n",
      " 6   vehicleType          44905 non-null  object\n",
      " 7   yearOfRegistration   50000 non-null  int64 \n",
      " 8   gearbox              47320 non-null  object\n",
      " 9   powerPS              50000 non-null  int64 \n",
      " 10  model                47242 non-null  object\n",
      " 11  odometer             50000 non-null  object\n",
      " 12  monthOfRegistration  50000 non-null  int64 \n",
      " 13  fuelType             45518 non-null  object\n",
      " 14  brand                50000 non-null  object\n",
      " 15  notRepairedDamage    40171 non-null  object\n",
      " 16  dateCreated          50000 non-null  object\n",
      " 17  nrOfPictures         50000 non-null  int64 \n",
      " 18  postalCode           50000 non-null  int64 \n",
      " 19  lastSeen             50000 non-null  object\n",
      "dtypes: int64(5), object(15)\n",
      "memory usage: 7.6+ MB\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>dateCrawled</th>\n",
       "      <th>name</th>\n",
       "      <th>seller</th>\n",
       "      <th>offerType</th>\n",
       "      <th>price</th>\n",
       "      <th>abtest</th>\n",
       "      <th>vehicleType</th>\n",
       "      <th>yearOfRegistration</th>\n",
       "      <th>gearbox</th>\n",
       "      <th>powerPS</th>\n",
       "      <th>model</th>\n",
       "      <th>odometer</th>\n",
       "      <th>monthOfRegistration</th>\n",
       "      <th>fuelType</th>\n",
       "      <th>brand</th>\n",
       "      <th>notRepairedDamage</th>\n",
       "      <th>dateCreated</th>\n",
       "      <th>nrOfPictures</th>\n",
       "      <th>postalCode</th>\n",
       "      <th>lastSeen</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2016-03-26 17:47:46</td>\n",
       "      <td>Peugeot_807_160_NAVTECH_ON_BOARD</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>$5,000</td>\n",
       "      <td>control</td>\n",
       "      <td>bus</td>\n",
       "      <td>2004</td>\n",
       "      <td>manuell</td>\n",
       "      <td>158</td>\n",
       "      <td>andere</td>\n",
       "      <td>150,000km</td>\n",
       "      <td>3</td>\n",
       "      <td>lpg</td>\n",
       "      <td>peugeot</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-03-26 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>79588</td>\n",
       "      <td>2016-04-06 06:45:54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-04-04 13:38:56</td>\n",
       "      <td>BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>$8,500</td>\n",
       "      <td>control</td>\n",
       "      <td>limousine</td>\n",
       "      <td>1997</td>\n",
       "      <td>automatik</td>\n",
       "      <td>286</td>\n",
       "      <td>7er</td>\n",
       "      <td>150,000km</td>\n",
       "      <td>6</td>\n",
       "      <td>benzin</td>\n",
       "      <td>bmw</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-04-04 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>71034</td>\n",
       "      <td>2016-04-06 14:45:08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-03-26 18:57:24</td>\n",
       "      <td>Volkswagen_Golf_1.6_United</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>$8,990</td>\n",
       "      <td>test</td>\n",
       "      <td>limousine</td>\n",
       "      <td>2009</td>\n",
       "      <td>manuell</td>\n",
       "      <td>102</td>\n",
       "      <td>golf</td>\n",
       "      <td>70,000km</td>\n",
       "      <td>7</td>\n",
       "      <td>benzin</td>\n",
       "      <td>volkswagen</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-03-26 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>35394</td>\n",
       "      <td>2016-04-06 20:15:37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-03-12 16:58:10</td>\n",
       "      <td>Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>$4,350</td>\n",
       "      <td>control</td>\n",
       "      <td>kleinwagen</td>\n",
       "      <td>2007</td>\n",
       "      <td>automatik</td>\n",
       "      <td>71</td>\n",
       "      <td>fortwo</td>\n",
       "      <td>70,000km</td>\n",
       "      <td>6</td>\n",
       "      <td>benzin</td>\n",
       "      <td>smart</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-03-12 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>33729</td>\n",
       "      <td>2016-03-15 03:16:28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-04-01 14:38:50</td>\n",
       "      <td>Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>$1,350</td>\n",
       "      <td>test</td>\n",
       "      <td>kombi</td>\n",
       "      <td>2003</td>\n",
       "      <td>manuell</td>\n",
       "      <td>0</td>\n",
       "      <td>focus</td>\n",
       "      <td>150,000km</td>\n",
       "      <td>7</td>\n",
       "      <td>benzin</td>\n",
       "      <td>ford</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-04-01 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>39218</td>\n",
       "      <td>2016-04-01 14:38:50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           dateCrawled                                               name  \\\n",
       "0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   \n",
       "1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   \n",
       "2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   \n",
       "3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   \n",
       "4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   \n",
       "\n",
       "   seller offerType   price   abtest vehicleType  yearOfRegistration  \\\n",
       "0  privat   Angebot  $5,000  control         bus                2004   \n",
       "1  privat   Angebot  $8,500  control   limousine                1997   \n",
       "2  privat   Angebot  $8,990     test   limousine                2009   \n",
       "3  privat   Angebot  $4,350  control  kleinwagen                2007   \n",
       "4  privat   Angebot  $1,350     test       kombi                2003   \n",
       "\n",
       "     gearbox  powerPS   model   odometer  monthOfRegistration fuelType  \\\n",
       "0    manuell      158  andere  150,000km                    3      lpg   \n",
       "1  automatik      286     7er  150,000km                    6   benzin   \n",
       "2    manuell      102    golf   70,000km                    7   benzin   \n",
       "3  automatik       71  fortwo   70,000km                    6   benzin   \n",
       "4    manuell        0   focus  150,000km                    7   benzin   \n",
       "\n",
       "        brand notRepairedDamage          dateCreated  nrOfPictures  \\\n",
       "0     peugeot              nein  2016-03-26 00:00:00             0   \n",
       "1         bmw              nein  2016-04-04 00:00:00             0   \n",
       "2  volkswagen              nein  2016-03-26 00:00:00             0   \n",
       "3       smart              nein  2016-03-12 00:00:00             0   \n",
       "4        ford              nein  2016-04-01 00:00:00             0   \n",
       "\n",
       "   postalCode             lastSeen  \n",
       "0       79588  2016-04-06 06:45:54  \n",
       "1       71034  2016-04-06 14:45:08  \n",
       "2       35394  2016-04-06 20:15:37  \n",
       "3       33729  2016-03-15 03:16:28  \n",
       "4       39218  2016-04-01 14:38:50  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "autos.info()\n",
    "autos.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* `seller`, `no_of_pics` and `offer_type` column seems to have only one value for 99.99% of the data and hence can be ignored safely\n",
    "* `registration_year` seem to have some odd values suggested by the min being 1000 and max being 9999. This needs further investigation\n",
    "* `registration_month` as well have invalid values suggested by the min value being 0. This needs further investigation\n",
    "* `powerPS` column has **0** for about 11% of the data and that is invalid for that column\n",
    "* `fuel_type` column has **NaN** for about 10% of the data which we would want to exclude for any analysis based on that column\n",
    "* `unrepaired_damage` column has **NaN** for about 20% of the data which we want to exclude for any analysis based on that column\n",
    "* Columns `price`, `odometer` ideally should be numeric for our analysis\n",
    "* Columns `dateCrawled`,`dateCreated`, and `lastSeen` should be Date type\n",
    "\n",
    "Additionally, the column names use camelCase instead of the preferred snake_case."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Cleansing\n",
    "\n",
    "\n",
    "### 1. Rename labels\n",
    "\n",
    "We will clean/rename the column names so that we have consistent snake_case and also reduce some lengthy column names to more apt shorter ones so that it will aid when we use those as labels in our data analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date_crawled</th>\n",
       "      <th>name</th>\n",
       "      <th>seller</th>\n",
       "      <th>offer_type</th>\n",
       "      <th>price</th>\n",
       "      <th>abtest</th>\n",
       "      <th>vehicle_type</th>\n",
       "      <th>registration_year</th>\n",
       "      <th>gearbox</th>\n",
       "      <th>power_ps</th>\n",
       "      <th>model</th>\n",
       "      <th>odometer</th>\n",
       "      <th>registration_month</th>\n",
       "      <th>fuel_type</th>\n",
       "      <th>brand</th>\n",
       "      <th>unrepaired_damage</th>\n",
       "      <th>ad_created</th>\n",
       "      <th>pictures_count</th>\n",
       "      <th>postal_code</th>\n",
       "      <th>last_seen</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2016-03-26 17:47:46</td>\n",
       "      <td>Peugeot_807_160_NAVTECH_ON_BOARD</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>$5,000</td>\n",
       "      <td>control</td>\n",
       "      <td>bus</td>\n",
       "      <td>2004</td>\n",
       "      <td>manuell</td>\n",
       "      <td>158</td>\n",
       "      <td>andere</td>\n",
       "      <td>150,000km</td>\n",
       "      <td>3</td>\n",
       "      <td>lpg</td>\n",
       "      <td>peugeot</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-03-26 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>79588</td>\n",
       "      <td>2016-04-06 06:45:54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-04-04 13:38:56</td>\n",
       "      <td>BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>$8,500</td>\n",
       "      <td>control</td>\n",
       "      <td>limousine</td>\n",
       "      <td>1997</td>\n",
       "      <td>automatik</td>\n",
       "      <td>286</td>\n",
       "      <td>7er</td>\n",
       "      <td>150,000km</td>\n",
       "      <td>6</td>\n",
       "      <td>benzin</td>\n",
       "      <td>bmw</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-04-04 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>71034</td>\n",
       "      <td>2016-04-06 14:45:08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-03-26 18:57:24</td>\n",
       "      <td>Volkswagen_Golf_1.6_United</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>$8,990</td>\n",
       "      <td>test</td>\n",
       "      <td>limousine</td>\n",
       "      <td>2009</td>\n",
       "      <td>manuell</td>\n",
       "      <td>102</td>\n",
       "      <td>golf</td>\n",
       "      <td>70,000km</td>\n",
       "      <td>7</td>\n",
       "      <td>benzin</td>\n",
       "      <td>volkswagen</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-03-26 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>35394</td>\n",
       "      <td>2016-04-06 20:15:37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-03-12 16:58:10</td>\n",
       "      <td>Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>$4,350</td>\n",
       "      <td>control</td>\n",
       "      <td>kleinwagen</td>\n",
       "      <td>2007</td>\n",
       "      <td>automatik</td>\n",
       "      <td>71</td>\n",
       "      <td>fortwo</td>\n",
       "      <td>70,000km</td>\n",
       "      <td>6</td>\n",
       "      <td>benzin</td>\n",
       "      <td>smart</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-03-12 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>33729</td>\n",
       "      <td>2016-03-15 03:16:28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-04-01 14:38:50</td>\n",
       "      <td>Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>$1,350</td>\n",
       "      <td>test</td>\n",
       "      <td>kombi</td>\n",
       "      <td>2003</td>\n",
       "      <td>manuell</td>\n",
       "      <td>0</td>\n",
       "      <td>focus</td>\n",
       "      <td>150,000km</td>\n",
       "      <td>7</td>\n",
       "      <td>benzin</td>\n",
       "      <td>ford</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-04-01 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>39218</td>\n",
       "      <td>2016-04-01 14:38:50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          date_crawled                                               name  \\\n",
       "0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   \n",
       "1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   \n",
       "2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   \n",
       "3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   \n",
       "4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   \n",
       "\n",
       "   seller offer_type   price   abtest vehicle_type  registration_year  \\\n",
       "0  privat    Angebot  $5,000  control          bus               2004   \n",
       "1  privat    Angebot  $8,500  control    limousine               1997   \n",
       "2  privat    Angebot  $8,990     test    limousine               2009   \n",
       "3  privat    Angebot  $4,350  control   kleinwagen               2007   \n",
       "4  privat    Angebot  $1,350     test        kombi               2003   \n",
       "\n",
       "     gearbox  power_ps   model   odometer  registration_month fuel_type  \\\n",
       "0    manuell       158  andere  150,000km                   3       lpg   \n",
       "1  automatik       286     7er  150,000km                   6    benzin   \n",
       "2    manuell       102    golf   70,000km                   7    benzin   \n",
       "3  automatik        71  fortwo   70,000km                   6    benzin   \n",
       "4    manuell         0   focus  150,000km                   7    benzin   \n",
       "\n",
       "        brand unrepaired_damage           ad_created  pictures_count  \\\n",
       "0     peugeot              nein  2016-03-26 00:00:00               0   \n",
       "1         bmw              nein  2016-04-04 00:00:00               0   \n",
       "2  volkswagen              nein  2016-03-26 00:00:00               0   \n",
       "3       smart              nein  2016-03-12 00:00:00               0   \n",
       "4        ford              nein  2016-04-01 00:00:00               0   \n",
       "\n",
       "   postal_code            last_seen  \n",
       "0        79588  2016-04-06 06:45:54  \n",
       "1        71034  2016-04-06 14:45:08  \n",
       "2        35394  2016-04-06 20:15:37  \n",
       "3        33729  2016-03-15 03:16:28  \n",
       "4        39218  2016-04-01 14:38:50  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',\n",
    "       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',\n",
    "       'odometer', 'registration_month', 'fuel_type', 'brand',\n",
    "       'unrepaired_damage', 'ad_created', 'pictures_count', 'postal_code',\n",
    "       'last_seen']\n",
    "autos.columns = new_columns\n",
    "autos.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Datatype conversions\n",
    "\n",
    "We will describe the statistics of the data across the columns to:\n",
    "    - Drop any text columns which have almost all values as the same\n",
    "    - Convert the data on any columns which have numeric data which are stored as text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date_crawled</th>\n",
       "      <th>name</th>\n",
       "      <th>seller</th>\n",
       "      <th>offer_type</th>\n",
       "      <th>price</th>\n",
       "      <th>abtest</th>\n",
       "      <th>vehicle_type</th>\n",
       "      <th>registration_year</th>\n",
       "      <th>gearbox</th>\n",
       "      <th>power_ps</th>\n",
       "      <th>model</th>\n",
       "      <th>odometer</th>\n",
       "      <th>registration_month</th>\n",
       "      <th>fuel_type</th>\n",
       "      <th>brand</th>\n",
       "      <th>unrepaired_damage</th>\n",
       "      <th>ad_created</th>\n",
       "      <th>pictures_count</th>\n",
       "      <th>postal_code</th>\n",
       "      <th>last_seen</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>50000</td>\n",
       "      <td>50000</td>\n",
       "      <td>50000</td>\n",
       "      <td>50000</td>\n",
       "      <td>50000</td>\n",
       "      <td>50000</td>\n",
       "      <td>44905</td>\n",
       "      <td>50000.000000</td>\n",
       "      <td>47320</td>\n",
       "      <td>50000.000000</td>\n",
       "      <td>47242</td>\n",
       "      <td>50000</td>\n",
       "      <td>50000.000000</td>\n",
       "      <td>45518</td>\n",
       "      <td>50000</td>\n",
       "      <td>40171</td>\n",
       "      <td>50000</td>\n",
       "      <td>50000.0</td>\n",
       "      <td>50000.000000</td>\n",
       "      <td>50000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>48213</td>\n",
       "      <td>38754</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2357</td>\n",
       "      <td>2</td>\n",
       "      <td>8</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>NaN</td>\n",
       "      <td>245</td>\n",
       "      <td>13</td>\n",
       "      <td>NaN</td>\n",
       "      <td>7</td>\n",
       "      <td>40</td>\n",
       "      <td>2</td>\n",
       "      <td>76</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>39481</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>2016-03-16 21:50:53</td>\n",
       "      <td>Ford_Fiesta</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>$0</td>\n",
       "      <td>test</td>\n",
       "      <td>limousine</td>\n",
       "      <td>NaN</td>\n",
       "      <td>manuell</td>\n",
       "      <td>NaN</td>\n",
       "      <td>golf</td>\n",
       "      <td>150,000km</td>\n",
       "      <td>NaN</td>\n",
       "      <td>benzin</td>\n",
       "      <td>volkswagen</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-04-03 00:00:00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2016-04-07 06:17:27</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>3</td>\n",
       "      <td>78</td>\n",
       "      <td>49999</td>\n",
       "      <td>49999</td>\n",
       "      <td>1421</td>\n",
       "      <td>25756</td>\n",
       "      <td>12859</td>\n",
       "      <td>NaN</td>\n",
       "      <td>36993</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4024</td>\n",
       "      <td>32424</td>\n",
       "      <td>NaN</td>\n",
       "      <td>30107</td>\n",
       "      <td>10687</td>\n",
       "      <td>35232</td>\n",
       "      <td>1946</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2005.073280</td>\n",
       "      <td>NaN</td>\n",
       "      <td>116.355920</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5.723360</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>50813.627300</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>105.712813</td>\n",
       "      <td>NaN</td>\n",
       "      <td>209.216627</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.711984</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>25779.747957</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1000.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1067.000000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1999.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>70.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>30451.000000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2003.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>105.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>49577.000000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2008.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>150.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>71540.000000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>9999.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>17700.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.0</td>\n",
       "      <td>99998.000000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               date_crawled         name  seller offer_type  price abtest  \\\n",
       "count                 50000        50000   50000      50000  50000  50000   \n",
       "unique                48213        38754       2          2   2357      2   \n",
       "top     2016-03-16 21:50:53  Ford_Fiesta  privat    Angebot     $0   test   \n",
       "freq                      3           78   49999      49999   1421  25756   \n",
       "mean                    NaN          NaN     NaN        NaN    NaN    NaN   \n",
       "std                     NaN          NaN     NaN        NaN    NaN    NaN   \n",
       "min                     NaN          NaN     NaN        NaN    NaN    NaN   \n",
       "25%                     NaN          NaN     NaN        NaN    NaN    NaN   \n",
       "50%                     NaN          NaN     NaN        NaN    NaN    NaN   \n",
       "75%                     NaN          NaN     NaN        NaN    NaN    NaN   \n",
       "max                     NaN          NaN     NaN        NaN    NaN    NaN   \n",
       "\n",
       "       vehicle_type  registration_year  gearbox      power_ps  model  \\\n",
       "count         44905       50000.000000    47320  50000.000000  47242   \n",
       "unique            8                NaN        2           NaN    245   \n",
       "top       limousine                NaN  manuell           NaN   golf   \n",
       "freq          12859                NaN    36993           NaN   4024   \n",
       "mean            NaN        2005.073280      NaN    116.355920    NaN   \n",
       "std             NaN         105.712813      NaN    209.216627    NaN   \n",
       "min             NaN        1000.000000      NaN      0.000000    NaN   \n",
       "25%             NaN        1999.000000      NaN     70.000000    NaN   \n",
       "50%             NaN        2003.000000      NaN    105.000000    NaN   \n",
       "75%             NaN        2008.000000      NaN    150.000000    NaN   \n",
       "max             NaN        9999.000000      NaN  17700.000000    NaN   \n",
       "\n",
       "         odometer  registration_month fuel_type       brand unrepaired_damage  \\\n",
       "count       50000        50000.000000     45518       50000             40171   \n",
       "unique         13                 NaN         7          40                 2   \n",
       "top     150,000km                 NaN    benzin  volkswagen              nein   \n",
       "freq        32424                 NaN     30107       10687             35232   \n",
       "mean          NaN            5.723360       NaN         NaN               NaN   \n",
       "std           NaN            3.711984       NaN         NaN               NaN   \n",
       "min           NaN            0.000000       NaN         NaN               NaN   \n",
       "25%           NaN            3.000000       NaN         NaN               NaN   \n",
       "50%           NaN            6.000000       NaN         NaN               NaN   \n",
       "75%           NaN            9.000000       NaN         NaN               NaN   \n",
       "max           NaN           12.000000       NaN         NaN               NaN   \n",
       "\n",
       "                 ad_created  pictures_count   postal_code            last_seen  \n",
       "count                 50000         50000.0  50000.000000                50000  \n",
       "unique                   76             NaN           NaN                39481  \n",
       "top     2016-04-03 00:00:00             NaN           NaN  2016-04-07 06:17:27  \n",
       "freq                   1946             NaN           NaN                    8  \n",
       "mean                    NaN             0.0  50813.627300                  NaN  \n",
       "std                     NaN             0.0  25779.747957                  NaN  \n",
       "min                     NaN             0.0   1067.000000                  NaN  \n",
       "25%                     NaN             0.0  30451.000000                  NaN  \n",
       "50%                     NaN             0.0  49577.000000                  NaN  \n",
       "75%                     NaN             0.0  71540.000000                  NaN  \n",
       "max                     NaN             0.0  99998.000000                  NaN  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "autos.describe(include=\"all\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the above table, we can infer the following:\n",
    "- Columns `seller`, `offer_type` and `pictures_count` have the same text data for almost the entire dataset and hence can safely be ignored\n",
    "- Columns `price`, `odometer` are numeric values stored as text with its metric\n",
    "\n",
    "Let's now look at converting the text to numeric on the columns `price` and `odometer`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "===== Odometer =====\n",
      "150,000km    32424\n",
      "125,000km     5170\n",
      "100,000km     2169\n",
      "90,000km      1757\n",
      "80,000km      1436\n",
      "70,000km      1230\n",
      "60,000km      1164\n",
      "50,000km      1027\n",
      "5,000km        967\n",
      "40,000km       819\n",
      "30,000km       789\n",
      "20,000km       784\n",
      "10,000km       264\n",
      "Name: odometer, dtype: int64\n",
      "\n",
      "\n",
      "===== Price =====\n",
      "$0         1421\n",
      "$500        781\n",
      "$1,500      734\n",
      "$2,500      643\n",
      "$1,000      639\n",
      "           ... \n",
      "$23,790       1\n",
      "$9,655        1\n",
      "$29,445       1\n",
      "$29,699       1\n",
      "$35,800       1\n",
      "Name: price, Length: 2357, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(\"=\"*5,\"Odometer\",\"=\"*5)\n",
    "print(autos[\"odometer\"].value_counts())\n",
    "print('\\n')\n",
    "print(\"=\"*5,\"Price\",\"=\"*5)\n",
    "print(autos[\"price\"].value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we need to remove the non-numeric characters as seen above on both of these columns and then we can convert to a numeric dtype."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>price</th>\n",
       "      <th>registration_year</th>\n",
       "      <th>power_ps</th>\n",
       "      <th>odometer_km</th>\n",
       "      <th>registration_month</th>\n",
       "      <th>pictures_count</th>\n",
       "      <th>postal_code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>5.000000e+04</td>\n",
       "      <td>50000.000000</td>\n",
       "      <td>50000.000000</td>\n",
       "      <td>50000.000000</td>\n",
       "      <td>50000.000000</td>\n",
       "      <td>50000.0</td>\n",
       "      <td>50000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>9.840044e+03</td>\n",
       "      <td>2005.073280</td>\n",
       "      <td>116.355920</td>\n",
       "      <td>125732.700000</td>\n",
       "      <td>5.723360</td>\n",
       "      <td>0.0</td>\n",
       "      <td>50813.627300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>4.811044e+05</td>\n",
       "      <td>105.712813</td>\n",
       "      <td>209.216627</td>\n",
       "      <td>40042.211706</td>\n",
       "      <td>3.711984</td>\n",
       "      <td>0.0</td>\n",
       "      <td>25779.747957</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000e+00</td>\n",
       "      <td>1000.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>5000.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1067.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1.100000e+03</td>\n",
       "      <td>1999.000000</td>\n",
       "      <td>70.000000</td>\n",
       "      <td>125000.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>30451.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>2.950000e+03</td>\n",
       "      <td>2003.000000</td>\n",
       "      <td>105.000000</td>\n",
       "      <td>150000.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>49577.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>7.200000e+03</td>\n",
       "      <td>2008.000000</td>\n",
       "      <td>150.000000</td>\n",
       "      <td>150000.000000</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>71540.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1.000000e+08</td>\n",
       "      <td>9999.000000</td>\n",
       "      <td>17700.000000</td>\n",
       "      <td>150000.000000</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>99998.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              price  registration_year      power_ps    odometer_km  \\\n",
       "count  5.000000e+04       50000.000000  50000.000000   50000.000000   \n",
       "mean   9.840044e+03        2005.073280    116.355920  125732.700000   \n",
       "std    4.811044e+05         105.712813    209.216627   40042.211706   \n",
       "min    0.000000e+00        1000.000000      0.000000    5000.000000   \n",
       "25%    1.100000e+03        1999.000000     70.000000  125000.000000   \n",
       "50%    2.950000e+03        2003.000000    105.000000  150000.000000   \n",
       "75%    7.200000e+03        2008.000000    150.000000  150000.000000   \n",
       "max    1.000000e+08        9999.000000  17700.000000  150000.000000   \n",
       "\n",
       "       registration_month  pictures_count   postal_code  \n",
       "count        50000.000000         50000.0  50000.000000  \n",
       "mean             5.723360             0.0  50813.627300  \n",
       "std              3.711984             0.0  25779.747957  \n",
       "min              0.000000             0.0   1067.000000  \n",
       "25%              3.000000             0.0  30451.000000  \n",
       "50%              6.000000             0.0  49577.000000  \n",
       "75%              9.000000             0.0  71540.000000  \n",
       "max             12.000000             0.0  99998.000000  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Remove all non-numeric characters and then convert to numeric\n",
    "autos[\"odometer\"] = autos[\"odometer\"].str.replace(\",\",\"\").str.replace(\"km\",\"\").astype(int)\n",
    "autos[\"price\"] = autos[\"price\"].str.replace(\"$\",\"\").str.replace(\",\",\"\").str.replace(\".\",\"\").astype(int)\n",
    "# Rename the column so that the metric is evident\n",
    "autos.rename(columns = {\"odometer\":\"odometer_km\"}, inplace = True)\n",
    "autos.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have ensured that odometer and price columns are numeric reflecting the values.\n",
    "\n",
    "### 3. Remove outliers\n",
    "\n",
    "Now we will look at the values in these columns a bit deeper to see if there are any outliers that skew the statistics and remove them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "===== Top 20 - High price listing =====\n",
      "99999999    1\n",
      "27322222    1\n",
      "12345678    3\n",
      "11111111    2\n",
      "10000000    1\n",
      "3890000     1\n",
      "1300000     1\n",
      "1234566     1\n",
      "999999      2\n",
      "999990      1\n",
      "350000      1\n",
      "345000      1\n",
      "299000      1\n",
      "295000      1\n",
      "265000      1\n",
      "259000      1\n",
      "250000      1\n",
      "220000      1\n",
      "198000      1\n",
      "197000      1\n",
      "Name: price, dtype: int64\n",
      "\n",
      "\n",
      "===== Bottom 20 - Low price listing =====\n",
      "0     1421\n",
      "1      156\n",
      "2        3\n",
      "3        1\n",
      "5        2\n",
      "8        1\n",
      "9        1\n",
      "10       7\n",
      "11       2\n",
      "12       3\n",
      "13       2\n",
      "14       1\n",
      "15       2\n",
      "17       3\n",
      "18       1\n",
      "20       4\n",
      "25       5\n",
      "29       1\n",
      "30       7\n",
      "35       1\n",
      "Name: price, dtype: int64\n",
      "\n",
      "\n",
      "===== Statistics of low prices =====\n",
      "count    1762.000000\n",
      "mean        5.632804\n",
      "std        18.374263\n",
      "min         0.000000\n",
      "25%         0.000000\n",
      "50%         0.000000\n",
      "75%         0.000000\n",
      "max        99.000000\n",
      "Name: price, dtype: float64\n",
      "\n",
      "\n",
      "===== Distribution % of high prices =====\n",
      "12345678    0.200000\n",
      "999999      0.133333\n",
      "11111111    0.133333\n",
      "999990      0.066667\n",
      "3890000     0.066667\n",
      "27322222    0.066667\n",
      "1234566     0.066667\n",
      "350000      0.066667\n",
      "99999999    0.066667\n",
      "10000000    0.066667\n",
      "1300000     0.066667\n",
      "Name: price, dtype: float64\n",
      "\n",
      "\n",
      "===== Listing 350,000 and above =====\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>name</th>\n",
       "      <th>brand</th>\n",
       "      <th>vehicle_type</th>\n",
       "      <th>price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>514</th>\n",
       "      <td>Ford_Focus_Turnier_1.6_16V_Style</td>\n",
       "      <td>ford</td>\n",
       "      <td>kombi</td>\n",
       "      <td>999999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2897</th>\n",
       "      <td>Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000</td>\n",
       "      <td>ford</td>\n",
       "      <td>limousine</td>\n",
       "      <td>11111111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7814</th>\n",
       "      <td>Ferrari_F40</td>\n",
       "      <td>sonstige_autos</td>\n",
       "      <td>coupe</td>\n",
       "      <td>1300000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11137</th>\n",
       "      <td>suche_maserati_3200_gt_Zustand_unwichtig_laufe...</td>\n",
       "      <td>sonstige_autos</td>\n",
       "      <td>coupe</td>\n",
       "      <td>10000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22947</th>\n",
       "      <td>Bmw_530d_zum_ausschlachten</td>\n",
       "      <td>bmw</td>\n",
       "      <td>kombi</td>\n",
       "      <td>1234566</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24384</th>\n",
       "      <td>Schlachte_Golf_3_gt_tdi</td>\n",
       "      <td>volkswagen</td>\n",
       "      <td>NaN</td>\n",
       "      <td>11111111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27371</th>\n",
       "      <td>Fiat_Punto</td>\n",
       "      <td>fiat</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12345678</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36818</th>\n",
       "      <td>Porsche_991</td>\n",
       "      <td>porsche</td>\n",
       "      <td>coupe</td>\n",
       "      <td>350000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37585</th>\n",
       "      <td>Volkswagen_Jetta_GT</td>\n",
       "      <td>volkswagen</td>\n",
       "      <td>limousine</td>\n",
       "      <td>999990</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39377</th>\n",
       "      <td>Tausche_volvo_v40_gegen_van</td>\n",
       "      <td>volvo</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12345678</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39705</th>\n",
       "      <td>Tausch_gegen_gleichwertiges</td>\n",
       "      <td>mercedes_benz</td>\n",
       "      <td>limousine</td>\n",
       "      <td>99999999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42221</th>\n",
       "      <td>Leasinguebernahme</td>\n",
       "      <td>citroen</td>\n",
       "      <td>limousine</td>\n",
       "      <td>27322222</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43049</th>\n",
       "      <td>2_VW_Busse_T3</td>\n",
       "      <td>volkswagen</td>\n",
       "      <td>bus</td>\n",
       "      <td>999999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47598</th>\n",
       "      <td>Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar...</td>\n",
       "      <td>opel</td>\n",
       "      <td>limousine</td>\n",
       "      <td>12345678</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47634</th>\n",
       "      <td>Ferrari_FXX</td>\n",
       "      <td>sonstige_autos</td>\n",
       "      <td>coupe</td>\n",
       "      <td>3890000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                    name           brand  \\\n",
       "514                     Ford_Focus_Turnier_1.6_16V_Style            ford   \n",
       "2897    Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000            ford   \n",
       "7814                                         Ferrari_F40  sonstige_autos   \n",
       "11137  suche_maserati_3200_gt_Zustand_unwichtig_laufe...  sonstige_autos   \n",
       "22947                         Bmw_530d_zum_ausschlachten             bmw   \n",
       "24384                            Schlachte_Golf_3_gt_tdi      volkswagen   \n",
       "27371                                         Fiat_Punto            fiat   \n",
       "36818                                        Porsche_991         porsche   \n",
       "37585                                Volkswagen_Jetta_GT      volkswagen   \n",
       "39377                        Tausche_volvo_v40_gegen_van           volvo   \n",
       "39705                        Tausch_gegen_gleichwertiges   mercedes_benz   \n",
       "42221                                  Leasinguebernahme         citroen   \n",
       "43049                                      2_VW_Busse_T3      volkswagen   \n",
       "47598  Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar...            opel   \n",
       "47634                                        Ferrari_FXX  sonstige_autos   \n",
       "\n",
       "      vehicle_type     price  \n",
       "514          kombi    999999  \n",
       "2897     limousine  11111111  \n",
       "7814         coupe   1300000  \n",
       "11137        coupe  10000000  \n",
       "22947        kombi   1234566  \n",
       "24384          NaN  11111111  \n",
       "27371          NaN  12345678  \n",
       "36818        coupe    350000  \n",
       "37585    limousine    999990  \n",
       "39377          NaN  12345678  \n",
       "39705    limousine  99999999  \n",
       "42221    limousine  27322222  \n",
       "43049          bus    999999  \n",
       "47598    limousine  12345678  \n",
       "47634        coupe   3890000  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(\"=\"*5,\"Top 20 - High price listing\",\"=\"*5)\n",
    "print(autos[\"price\"].value_counts().sort_index(ascending = False).head(20))\n",
    "print('\\n')\n",
    "print(\"=\"*5,\"Bottom 20 - Low price listing\",\"=\"*5)\n",
    "print(autos[\"price\"].value_counts().sort_index(ascending = True).head(20))\n",
    "print('\\n')\n",
    "print(\"=\"*5,\"Statistics of low prices\",\"=\"*5)\n",
    "print(autos.loc[autos[\"price\"] < 100, \"price\"].describe())\n",
    "print('\\n')\n",
    "print(\"=\"*5,\"Distribution % of high prices\",\"=\"*5)\n",
    "print(autos.loc[autos[\"price\"].between(350000,99999999),\"price\"].value_counts(normalize=True, dropna=False))\n",
    "print('\\n')\n",
    "print(\"=\"*5,\"Listing 350,000 and above\",\"=\"*5)\n",
    "autos.loc[autos[\"price\"] >= 350000,[\"name\",\"brand\",\"vehicle_type\",\"price\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on above display of `price` column, we see few astronomically high prices and lot of listing with prices that are miniscule which would overall skew our results.\n",
    "\n",
    "On closer look at these high price listing, few are true vintage cars or high performance cars (such as Ferrari_F40) that demand these high prices and some are mistakes in the dataset (such as Ford Punto, Escort MK1 or Maserati 3200 GT).\n",
    "\n",
    "Same way if we look at miniscule prices (between 0 and 100) and potentially these might be scrap rather than any potential use of the car.\n",
    "\n",
    "But since large part of these are mistakes, we will remove any listing that has a price of above 350,000, and similarly remove any listing that has a price less than 100.\n",
    "\n",
    "Let's now look at the odometer values and see if there are any outliers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of unique odometer values:  13\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "150000    32424\n",
       "125000     5170\n",
       "100000     2169\n",
       "90000      1757\n",
       "80000      1436\n",
       "70000      1230\n",
       "60000      1164\n",
       "50000      1027\n",
       "40000       819\n",
       "30000       789\n",
       "20000       784\n",
       "10000       264\n",
       "5000        967\n",
       "Name: odometer_km, dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(\"Number of unique odometer values: \",autos[\"odometer_km\"].unique().shape[0])\n",
    "autos[\"odometer_km\"].value_counts().sort_index(ascending = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From above display of `odometer_km` data, we could infer two things:\n",
    "\n",
    "1. It seems based on number of unique values that it possibly is already bucketed and grouped on certain km ranges\n",
    "2. It looks normal for used car listing to have done these kilometer ranges\n",
    "\n",
    "Hence, we will keep these and only remove the outliers based on the price and then display the statistics of our cleaned dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>price</th>\n",
       "      <th>registration_year</th>\n",
       "      <th>power_ps</th>\n",
       "      <th>odometer_km</th>\n",
       "      <th>registration_month</th>\n",
       "      <th>pictures_count</th>\n",
       "      <th>postal_code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>48224.000000</td>\n",
       "      <td>48224.000000</td>\n",
       "      <td>48224.000000</td>\n",
       "      <td>48224.000000</td>\n",
       "      <td>48224.000000</td>\n",
       "      <td>48224.0</td>\n",
       "      <td>48224.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>5930.371433</td>\n",
       "      <td>2004.730964</td>\n",
       "      <td>117.677609</td>\n",
       "      <td>125919.148142</td>\n",
       "      <td>5.801634</td>\n",
       "      <td>0.0</td>\n",
       "      <td>50987.919729</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>9078.372762</td>\n",
       "      <td>87.897388</td>\n",
       "      <td>201.206304</td>\n",
       "      <td>39543.339640</td>\n",
       "      <td>3.676976</td>\n",
       "      <td>0.0</td>\n",
       "      <td>25737.119986</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>100.000000</td>\n",
       "      <td>1000.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>5000.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1067.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1250.000000</td>\n",
       "      <td>1999.000000</td>\n",
       "      <td>73.000000</td>\n",
       "      <td>125000.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>30823.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>3000.000000</td>\n",
       "      <td>2004.000000</td>\n",
       "      <td>107.000000</td>\n",
       "      <td>150000.000000</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>49716.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>7499.000000</td>\n",
       "      <td>2008.000000</td>\n",
       "      <td>150.000000</td>\n",
       "      <td>150000.000000</td>\n",
       "      <td>9.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>71666.750000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>350000.000000</td>\n",
       "      <td>9999.000000</td>\n",
       "      <td>17700.000000</td>\n",
       "      <td>150000.000000</td>\n",
       "      <td>12.000000</td>\n",
       "      <td>0.0</td>\n",
       "      <td>99998.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               price  registration_year      power_ps    odometer_km  \\\n",
       "count   48224.000000       48224.000000  48224.000000   48224.000000   \n",
       "mean     5930.371433        2004.730964    117.677609  125919.148142   \n",
       "std      9078.372762          87.897388    201.206304   39543.339640   \n",
       "min       100.000000        1000.000000      0.000000    5000.000000   \n",
       "25%      1250.000000        1999.000000     73.000000  125000.000000   \n",
       "50%      3000.000000        2004.000000    107.000000  150000.000000   \n",
       "75%      7499.000000        2008.000000    150.000000  150000.000000   \n",
       "max    350000.000000        9999.000000  17700.000000  150000.000000   \n",
       "\n",
       "       registration_month  pictures_count   postal_code  \n",
       "count        48224.000000         48224.0  48224.000000  \n",
       "mean             5.801634             0.0  50987.919729  \n",
       "std              3.676976             0.0  25737.119986  \n",
       "min              0.000000             0.0   1067.000000  \n",
       "25%              3.000000             0.0  30823.000000  \n",
       "50%              6.000000             0.0  49716.000000  \n",
       "75%              9.000000             0.0  71666.750000  \n",
       "max             12.000000             0.0  99998.000000  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "autos = autos.loc[autos[\"price\"].between(100,350000)]\n",
    "autos.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Date column cleansing: Part 1\n",
    "\n",
    "From our dataset, there are 5 columns that represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:\n",
    "\n",
    "- `date_crawled`: added by the crawler\n",
    "- `last_seen`: added by the crawler\n",
    "- `ad_created`: from the website\n",
    "- `registration_month`: from the website\n",
    "- `registration_year`: from the website\n",
    "\n",
    "We see that `date_crawled`, `last_seen` and `ad_created` columns are identified as string values by pandas. \n",
    "\n",
    "We will look at the string format on these columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date_crawled</th>\n",
       "      <th>ad_created</th>\n",
       "      <th>last_seen</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2016-03-26 17:47:46</td>\n",
       "      <td>2016-03-26 00:00:00</td>\n",
       "      <td>2016-04-06 06:45:54</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-04-04 13:38:56</td>\n",
       "      <td>2016-04-04 00:00:00</td>\n",
       "      <td>2016-04-06 14:45:08</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-03-26 18:57:24</td>\n",
       "      <td>2016-03-26 00:00:00</td>\n",
       "      <td>2016-04-06 20:15:37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-03-12 16:58:10</td>\n",
       "      <td>2016-03-12 00:00:00</td>\n",
       "      <td>2016-03-15 03:16:28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-04-01 14:38:50</td>\n",
       "      <td>2016-04-01 00:00:00</td>\n",
       "      <td>2016-04-01 14:38:50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          date_crawled           ad_created            last_seen\n",
       "0  2016-03-26 17:47:46  2016-03-26 00:00:00  2016-04-06 06:45:54\n",
       "1  2016-04-04 13:38:56  2016-04-04 00:00:00  2016-04-06 14:45:08\n",
       "2  2016-03-26 18:57:24  2016-03-26 00:00:00  2016-04-06 20:15:37\n",
       "3  2016-03-12 16:58:10  2016-03-12 00:00:00  2016-03-15 03:16:28\n",
       "4  2016-04-01 14:38:50  2016-04-01 00:00:00  2016-04-01 14:38:50"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "autos[['date_crawled','ad_created','last_seen']][0:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that these 3 columns have the initial 10 characters representing the date (In the format Year-Month-Day).\n",
    "\n",
    "Let's now understand the distribution of data using this date format across the columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "===== Distribution of date_crawled column =====\n",
      "2016-03-05    0.025361\n",
      "2016-03-06    0.014039\n",
      "2016-03-07    0.036061\n",
      "2016-03-08    0.033179\n",
      "2016-03-09    0.033013\n",
      "2016-03-10    0.032287\n",
      "2016-03-11    0.032598\n",
      "2016-03-12    0.036911\n",
      "2016-03-13    0.015677\n",
      "2016-03-14    0.036662\n",
      "2016-03-15    0.034319\n",
      "2016-03-16    0.029467\n",
      "2016-03-17    0.031499\n",
      "2016-03-18    0.012898\n",
      "2016-03-19    0.034734\n",
      "2016-03-20    0.037803\n",
      "2016-03-21    0.037201\n",
      "2016-03-22    0.032888\n",
      "2016-03-23    0.032287\n",
      "2016-03-24    0.029446\n",
      "2016-03-25    0.031499\n",
      "2016-03-26    0.032308\n",
      "2016-03-27    0.031126\n",
      "2016-03-28    0.034962\n",
      "2016-03-29    0.034112\n",
      "2016-03-30    0.033738\n",
      "2016-03-31    0.031851\n",
      "2016-04-01    0.033697\n",
      "2016-04-02    0.035605\n",
      "2016-04-03    0.038611\n",
      "2016-04-04    0.036538\n",
      "2016-04-05    0.013064\n",
      "2016-04-06    0.003173\n",
      "2016-04-07    0.001389\n",
      "Name: date_crawled, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "print(\"=\"*5,\"Distribution of date_crawled column\",\"=\"*5)\n",
    "print(autos[\"date_crawled\"].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending = True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. The data was pulled from the eBay website over a month from 5th March to 7th April, 2016\n",
    "2. Amount of data crawled is evenly spread except the last two days when the data crawled was very less in proportion to others"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "===== Distribution of Top 5 most recent ad_created =====\n",
      "2016-04-07    0.001244\n",
      "2016-04-06    0.003256\n",
      "2016-04-05    0.011799\n",
      "2016-04-04    0.036890\n",
      "2016-04-03    0.038860\n",
      "Name: ad_created, dtype: float64\n",
      "\n",
      "\n",
      "===== Distribution of Bottom 5 most old ad_created =====\n",
      "2015-06-11    0.000021\n",
      "2015-08-10    0.000021\n",
      "2015-09-09    0.000021\n",
      "2015-11-10    0.000021\n",
      "2015-12-05    0.000021\n",
      "Name: ad_created, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "print(\"=\"*5,\"Distribution of Top 5 most recent ad_created\",\"=\"*5)\n",
    "print(autos[\"ad_created\"].str[:10].value_counts(normalize=True).sort_index(ascending=False).head(5))\n",
    "print(\"\\n\")\n",
    "print(\"=\"*5,\"Distribution of Bottom 5 most old ad_created\",\"=\"*5)\n",
    "print(autos[\"ad_created\"].str[:10].value_counts(normalize=True).sort_index(ascending=True).head(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the exploration of the `ad_created` column, we observe the following:\n",
    "1. It shows a pattern that we saw with `date_crawled` column that most of the data is distributed between 9th March to 5th April, 2016\n",
    "2. There are listings with `ad_created` up to an year before the `date_crawled` date distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "===== Distribution of last_seen column =====\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "2016-04-06    0.221964\n",
       "2016-04-07    0.132154\n",
       "2016-04-05    0.125062\n",
       "2016-03-17    0.028098\n",
       "2016-04-03    0.025133\n",
       "2016-04-02    0.024884\n",
       "2016-03-30    0.024697\n",
       "2016-04-04    0.024531\n",
       "2016-03-31    0.023826\n",
       "2016-03-12    0.023785\n",
       "2016-04-01    0.022852\n",
       "2016-03-29    0.022292\n",
       "2016-03-22    0.021359\n",
       "2016-03-28    0.020840\n",
       "2016-03-20    0.020654\n",
       "2016-03-21    0.020550\n",
       "2016-03-24    0.019762\n",
       "2016-03-25    0.019098\n",
       "2016-03-23    0.018580\n",
       "2016-03-26    0.016672\n",
       "2016-03-16    0.016444\n",
       "2016-03-15    0.015863\n",
       "2016-03-19    0.015760\n",
       "2016-03-27    0.015552\n",
       "2016-03-14    0.012629\n",
       "2016-03-11    0.012400\n",
       "2016-03-10    0.010638\n",
       "2016-03-09    0.009580\n",
       "2016-03-13    0.008875\n",
       "2016-03-18    0.007320\n",
       "2016-03-08    0.007320\n",
       "2016-03-07    0.005433\n",
       "2016-03-06    0.004313\n",
       "2016-03-05    0.001078\n",
       "Name: last_seen, dtype: float64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(\"=\"*5,\"Distribution of last_seen column\",\"=\"*5)\n",
    "autos[\"last_seen\"].str[:10].value_counts(normalize=True, ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the distribution on column `last_seen`:\n",
    "1. Concurs with what we saw with `date_crawled` column that distribution is between 5th March to 7th April, 2016\n",
    "2. Here purely based on distribution percentages and most data is distributed towards 6th and 7th of April and could suggest a good few days lag between when the ad was created when it was last seen - So might not be a high traffic consumption of listing being viewed immediately.\n",
    "\n",
    "### 5. Date column cleansing: Part 2\n",
    "\n",
    "Let's now look at the column `registration_year`. Here we are only dealing with the **year** part of the date. \n",
    "\n",
    "It's important to understand the data in this column and any outliers here as this column probably suggests the registration year of the car being listed and in turn useful in our analysis based on the car's age."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "count    48224.000000\n",
      "mean      2004.730964\n",
      "std         87.897388\n",
      "min       1000.000000\n",
      "25%       1999.000000\n",
      "50%       2004.000000\n",
      "75%       2008.000000\n",
      "max       9999.000000\n",
      "Name: registration_year, dtype: float64 \n",
      "\n",
      "1000    0.000021\n",
      "1001    0.000021\n",
      "1111    0.000021\n",
      "1800    0.000041\n",
      "1910    0.000041\n",
      "Name: registration_year, dtype: float64\n",
      "5911    0.000021\n",
      "6200    0.000021\n",
      "8888    0.000021\n",
      "9000    0.000021\n",
      "9999    0.000062\n",
      "Name: registration_year, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "# To understand the statistics of the data behind the series\n",
    "print(autos[\"registration_year\"].describe(),'\\n')\n",
    "\n",
    "# To understand the data distribution percentage on the column\n",
    "print(autos[\"registration_year\"].value_counts(normalize=True).sort_index(ascending=True).head(5))\n",
    "print(autos[\"registration_year\"].value_counts(normalize=True).sort_index(ascending=True).tail(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the statistics on the `registration_year`, it clear that some of the data is not a year evident from the min (1000) and max (9999) values.\n",
    "\n",
    "Also, through the distribution of data we see:\n",
    "1. Large part of the data is distributed between year 2000 and 2014 (~85%)\n",
    "2. Even if we factor some vintage cars in the listing with old registration years, still there are some incorrect values that might have to be corrected. e.g. 1000, 1001, 1111\n",
    "3. There are car registration year beyond when the ad the month/year the actual list was created. e.g. Mar-Apr 2016 being our ad listing date ranges and we can assume these are incorrect values\n",
    "\n",
    "We will make an assumption that infact some vintage cars are in our listing so we will use the range being 1900 to 2016 and anything outside of this range as incorrect.\n",
    "\n",
    "Note: We have ignored the car listings in 1800 as invalid for the following reasons:\n",
    "\n",
    "- The period of 1800 is too early as first automobiles came only about 1886\n",
    "- Brand of the car in these listings says Mitsubishi (they were not founded until 1960's)\n",
    "\n",
    "Let's look at the number of listings between our assumed valid registration year ranges and outside of this range. We will also remove the data outside our year ranges."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Listings within valid year range =  46352 \n",
      "\n",
      "Listings outside valid year range =  1872 \n",
      "\n",
      "1910    0.000043\n",
      "1927    0.000022\n",
      "1929    0.000022\n",
      "1931    0.000022\n",
      "1934    0.000043\n",
      "Name: registration_year, dtype: float64\n",
      "2012    0.028219\n",
      "2013    0.017281\n",
      "2014    0.014282\n",
      "2015    0.008198\n",
      "2016    0.025932\n",
      "Name: registration_year, dtype: float64\n"
     ]
    }
   ],
   "source": [
    "# To see how many listings we have within our assumed valid registration year range.\n",
    "print(\"Listings within valid year range = \",autos[\"registration_year\"].between(1900,2016).sum(),\"\\n\")\n",
    "\n",
    "print(\"Listings outside valid year range = \",(~autos[\"registration_year\"].between(1900,2016)).sum(),\"\\n\")\n",
    "\n",
    "# To remove the values outside of our assumed valid registration year range\n",
    "autos = autos[ autos[\"registration_year\"].between(1900,2016) ]\n",
    "\n",
    "# To see the data distribtion now in our valid data set\n",
    "print(autos[\"registration_year\"].value_counts(normalize=True).sort_index(ascending=True).head(5))\n",
    "print(autos[\"registration_year\"].value_counts(normalize=True).sort_index(ascending=True).tail(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Analysis and Inference\n",
    "### 1. Car brand variation\n",
    "\n",
    "Now we will take a look at the `brand` column to perform some analysis and find the mean prices based on the brand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of car brands in the listing:  40\n"
     ]
    }
   ],
   "source": [
    "print('Number of car brands in the listing: ',autos[\"brand\"].unique().shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that there are 40 unique car brands in our current cleansed dataset.\n",
    "\n",
    "### 2. Average Price and Mileage analysis\n",
    "\n",
    "For the purposes of this project, we are going to consider only the **top 10** car brands by distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "audi             9380.72\n",
      "mercedes_benz    8672.65\n",
      "bmw              8381.68\n",
      "volkswagen       5436.95\n",
      "seat             4433.42\n",
      "ford             3779.27\n",
      "peugeot          3113.86\n",
      "opel             3005.50\n",
      "fiat             2836.87\n",
      "renault          2496.07\n",
      "dtype: float64\n"
     ]
    }
   ],
   "source": [
    "# Select only top 10 brands based on the distribution in the car listing\n",
    "car_brands = autos[\"brand\"].value_counts(normalize = True, dropna = False).head(10).index\n",
    "brand_mean_price = {}\n",
    "brand_mean_mileage = {}\n",
    "# Loop through top 10 brands to calculate mean price of cars\n",
    "for brand in car_brands:\n",
    "    mean_price = autos.loc[autos[\"brand\"] == brand, \"price\"].mean()\n",
    "    mean_mileage = autos.loc[autos[\"brand\"] == brand, \"odometer_km\"].mean()\n",
    "    brand_mean_price[brand] = round(mean_price,2)\n",
    "    brand_mean_mileage[brand] = round(mean_mileage,2)\n",
    "# Convert the brand aggregate data into Panda series to sort by values\n",
    "brand_agg_mean_price = pd.Series(brand_mean_price)\n",
    "brand_agg_mean_mileage = pd.Series(brand_mean_mileage)\n",
    "print(brand_agg_mean_price.sort_values(ascending = False))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the above analysis, it seems among the top 10 car brands by distribution:\n",
    "1. **Audi**, **Mercedes Benz**, and **BMW** seems to have a higher mean price\n",
    "2. **Opel**, **Fiat**, and **Renault** are the lowest by significant margin\n",
    "3. **Volkswagen** and **Seat** are the sweet spot in the middle\n",
    "\n",
    "We will also compare this average price for the top 10 brands against it's average mileage.\n",
    "\n",
    "For this, we add our mean mileage as well and make a dataframe so that we can compare side by side."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>mean_price</th>\n",
       "      <th>mean_mileage</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>volkswagen</th>\n",
       "      <td>5436.95</td>\n",
       "      <td>128799.88</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>bmw</th>\n",
       "      <td>8381.68</td>\n",
       "      <td>132695.32</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>opel</th>\n",
       "      <td>3005.50</td>\n",
       "      <td>129384.43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mercedes_benz</th>\n",
       "      <td>8672.65</td>\n",
       "      <td>131025.67</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>audi</th>\n",
       "      <td>9380.72</td>\n",
       "      <td>129245.40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ford</th>\n",
       "      <td>3779.27</td>\n",
       "      <td>124277.11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>renault</th>\n",
       "      <td>2496.07</td>\n",
       "      <td>128281.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>peugeot</th>\n",
       "      <td>3113.86</td>\n",
       "      <td>127127.89</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>fiat</th>\n",
       "      <td>2836.87</td>\n",
       "      <td>116950.29</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>seat</th>\n",
       "      <td>4433.42</td>\n",
       "      <td>121536.64</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               mean_price  mean_mileage\n",
       "volkswagen        5436.95     128799.88\n",
       "bmw               8381.68     132695.32\n",
       "opel              3005.50     129384.43\n",
       "mercedes_benz     8672.65     131025.67\n",
       "audi              9380.72     129245.40\n",
       "ford              3779.27     124277.11\n",
       "renault           2496.07     128281.39\n",
       "peugeot           3113.86     127127.89\n",
       "fiat              2836.87     116950.29\n",
       "seat              4433.42     121536.64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create a dataframe with a single column from a series\n",
    "brand_agg = pd.DataFrame(brand_agg_mean_price, columns=[\"mean_price\"])\n",
    "# Add a column to the dataframe with another series sharing same label\n",
    "brand_agg[\"mean_mileage\"] = brand_agg_mean_mileage\n",
    "brand_agg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now by comparing the average price alongside the average mileage, we see that the mileage range for **Audi**, **Mercedes Benz**, and **BMW** are all close in the same range.\n",
    "\n",
    "## Conclusion\n",
    "\n",
    "We started with data cleansing as a big part of this project and we did the following:\n",
    "\n",
    "1. Renamed column names to a consistent format\n",
    "2. Converted certain numeric data which were stored as text\n",
    "3. Removed outliers:\n",
    "    - We removed listings that was outside of our price range we assumed was realistic (100 and 350000)\n",
    "    - We removed listings that was outside of our registered_year range that was realistic (1900 to 2016)\n",
    "    \n",
    "After cleansing the data, we started with our data analysis and listed the Top 10 car brands by price.\n",
    "\n",
    "Then upon analysing the average price and mileage of these top 10 car brands, we concluded that:\n",
    "\n",
    "1. **Audi**, **Mercedes Benz**, and **BMW** are the Top 3 brands per average car price and all of these are in the same mileage range\n",
    "2. **Volkswagen** and **Seat** are the best brands for value of money in terms of average price and mileage"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}