{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Exploring eBay Car Sales Data\n", "- This data is taken from [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data)\n", "\n", "### Data Dictionary\n", "\n", "> dateCrawled - When this ad was first crawled. All field values are taken from this date.
\n", "> name - Name of the car.
\n", "> seller - Whether the seller is private or a dealer.
\n", "> offerType - The type of listing.
\n", "> price - The price on the ad to sell the car.
\n", "> abtest - Whether the listing is included in an A/B test.
\n", "> vehicleType - The vehicle type.
\n", "> yearOfRegistration - The year in which the car was first registered.
\n", "> gearbox - The transmission type.
\n", "> powerPS - The power of the car in PS.
\n", "> model - The car model name.
\n", "> odometer - How many kilometers the car has been driven.
\n", "> monthOfRegistration - The month in which the car was first registered.
\n", "> fuelType - What type of fuel the car uses.\n", "> brand - The brand of the car.
\n", "> notRepairedDamage - Whether the car has damage that has not yet been repaired.
\n", "> dateCreated - The date on which the eBay listing was created.
\n", "> nrOfPictures - The number of pictures in the ad.
\n", "> postalCode - The postal code for the location of the vehicle.
\n", "> lastSeen - When the crawler last saw this ad online.
\n", "\n", "### Aim\n", "We aim to clean the data and analyze the included used car listings using `pandas` and `matplotlib`.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Introduction to Data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " dateCrawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offerType price abtest vehicleType yearOfRegistration \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox powerPS model odometer monthOfRegistration fuelType \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand notRepairedDamage dateCreated nrOfPictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postalCode lastSeen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "autos = pd.read_csv('autos.csv', encoding=\"Latin-1\")\n", "autos.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Data columns (total 20 columns):\n", "dateCrawled 50000 non-null object\n", "name 50000 non-null object\n", "seller 
50000 non-null object\n", "offerType 50000 non-null object\n", "price 50000 non-null object\n", "abtest 50000 non-null object\n", "vehicleType 44905 non-null object\n", "yearOfRegistration 50000 non-null int64\n", "gearbox 47320 non-null object\n", "powerPS 50000 non-null int64\n", "model 47242 non-null object\n", "odometer 50000 non-null object\n", "monthOfRegistration 50000 non-null int64\n", "fuelType 45518 non-null object\n", "brand 50000 non-null object\n", "notRepairedDamage 40171 non-null object\n", "dateCreated 50000 non-null object\n", "nrOfPictures 50000 non-null int64\n", "postalCode 50000 non-null int64\n", "lastSeen 50000 non-null object\n", "dtypes: int64(5), object(15)\n", "memory usage: 7.6+ MB\n" ] } ], "source": [ "autos.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _Key Observations:_\n", "\n", "- The dataset consists of 20 columns, most of which are stored as strings (15 `object` columns and 5 `int64` columns).\n", "\n", "- Some columns have null values, but none has more than 20% null values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning Column Names\n", "\n", "- The column names use camelCase instead of Python's preferred snake_case, which means we can't just replace spaces with underscores." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',\n", " 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',\n", " 'odometer', 'monthOfRegistration', 'fuelType', 'brand',\n", " 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',\n", " 'lastSeen'],\n", " dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "column_names = autos.columns\n", "column_names" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',\n", " 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',\n", " 'odometer', 'registration_month', 'fuel_type', 'brand',\n", " 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',\n", " 'last_seen'],\n", " dtype='object')\n" ] }, { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offer_type price abtest vehicle_type registration_year \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox power_ps model odometer registration_month fuel_type \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand unrepaired_damage ad_created nr_of_pictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postal_code last_seen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "def clean_col(col):\n", "    # str.strip() returns a new string, so the result must be assigned back.\n", "    col = col.strip()\n", "    col = col.replace(\"yearOfRegistration\",\n", "                      \"registration_year\")\n", "    col = col.replace(\"monthOfRegistration\", \n", "                      \"registration_month\")\n", "    col = col.replace(\"notRepairedDamage\", \n", "                      \"unrepaired_damage\")\n", "    col = col.replace(\"dateCreated\", \n", "                      \"ad_created\")\n", "    # Insert an underscore at each lowercase/digit-to-uppercase boundary, then lowercase.\n", "    return re.sub('([a-z0-9])([A-Z])', 
r'\\1_\\2',col).lower()\n", " \n", "autos.columns = [clean_col(c) for c in autos.columns]\n", "print(autos.columns)\n", "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***All column names were changed from camelCase to snake_case. For example, 'nrOfPictures' became 'nr_of_pictures'.***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initial Exploration and Cleaning\n", "\n", "Next, we look for other cleaning tasks:\n", "- Text columns where all or almost all values are the same. These can often be dropped because they carry no useful information for analysis. \n", "- Numeric data stored as text, which can be cleaned and converted." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " date_crawled name seller offer_type price abtest \\\n", "count 50000 50000 50000 50000 50000 50000 \n", "unique 48213 38754 2 2 2357 2 \n", "top 2016-03-08 10:40:35 Ford_Fiesta privat Angebot $0 test \n", "freq 3 78 49999 49999 1421 25756 \n", "mean NaN NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN NaN \n", "\n", " vehicle_type registration_year gearbox power_ps model \\\n", "count 44905 50000.000000 47320 50000.000000 47242 \n", "unique 8 NaN 2 NaN 245 \n", "top limousine NaN manuell NaN golf \n", "freq 12859 NaN 36993 NaN 4024 \n", "mean NaN 2005.073280 NaN 116.355920 NaN \n", "std NaN 105.712813 NaN 209.216627 NaN \n", "min NaN 1000.000000 NaN 0.000000 NaN \n", "25% NaN 1999.000000 NaN 70.000000 NaN \n", "50% NaN 2003.000000 NaN 105.000000 NaN \n", "75% NaN 2008.000000 NaN 150.000000 NaN \n", "max NaN 9999.000000 NaN 17700.000000 NaN \n", "\n", " odometer registration_month fuel_type brand unrepaired_damage \\\n", "count 50000 50000.000000 45518 50000 40171 \n", "unique 13 NaN 7 40 2 \n", "top 150,000km NaN benzin volkswagen nein \n", "freq 32424 NaN 30107 10687 35232 \n", "mean NaN 5.723360 NaN NaN NaN \n", "std NaN 3.711984 NaN NaN NaN \n", "min NaN 0.000000 NaN NaN NaN \n", "25% NaN 3.000000 NaN NaN NaN \n", "50% NaN 6.000000 NaN NaN NaN \n", "75% NaN 9.000000 NaN NaN NaN \n", "max NaN 12.000000 NaN NaN NaN \n", "\n", " ad_created nr_of_pictures postal_code last_seen \n", "count 50000 50000.0 50000.000000 50000 \n", "unique 76 NaN NaN 39481 \n", "top 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27 \n", "freq 1946 NaN NaN 8 \n", "mean NaN 0.0 50813.627300 NaN \n", "std NaN 0.0 25779.747957 NaN \n", "min NaN 0.0 1067.000000 NaN \n", "25% NaN 0.0 30451.000000 NaN \n", "50% NaN 0.0 49577.000000 NaN \n", "75% NaN 0.0 71540.000000 NaN \n", "max NaN 0.0 99998.000000 NaN " 
] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _Key Observations:_\n", "- The **nr_of_pictures**, **seller** and **offer_type** columns each contain almost entirely a single value, so they should be dropped.\n", "- The **price** and **odometer** columns are stored as text (`object`), but they hold numeric values and should be converted to integers." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 5000\n", "1 8500\n", "2 8990\n", "3 4350\n", "4 1350\n", "Name: price, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"price\"] = autos[\"price\"].replace({'\\$':'',',':''}, regex=True).astype(int)\n", "autos[\"price\"].head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 150000\n", "1 150000\n", "2 70000\n", "3 70000\n", "4 150000\n", "Name: odometer_km, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"odometer\"] = autos[\"odometer\"].replace({'km':'',',':''}, regex=True).astype(int)\n", "autos.rename(columns={'odometer':'odometer_km'}, inplace=True)\n", "autos['odometer_km'].head()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " date_crawled name price abtest vehicle_type \\\n", "count 50000 50000 5.000000e+04 50000 44905 \n", "unique 48213 38754 NaN 2 8 \n", "top 2016-03-08 10:40:35 Ford_Fiesta NaN test limousine \n", "freq 3 78 NaN 25756 12859 \n", "mean NaN NaN 9.840044e+03 NaN NaN \n", "std NaN NaN 4.811044e+05 NaN NaN \n", "min NaN NaN 0.000000e+00 NaN NaN \n", "25% NaN NaN 1.100000e+03 NaN NaN \n", "50% NaN NaN 2.950000e+03 NaN NaN \n", "75% NaN NaN 7.200000e+03 NaN NaN \n", "max NaN NaN 1.000000e+08 NaN NaN \n", "\n", " registration_year gearbox power_ps model odometer_km \\\n", "count 50000.000000 47320 50000.000000 47242 50000.000000 \n", "unique NaN 2 NaN 245 NaN \n", "top NaN manuell NaN golf NaN \n", "freq NaN 36993 NaN 4024 NaN \n", "mean 2005.073280 NaN 116.355920 NaN 125732.700000 \n", "std 105.712813 NaN 209.216627 NaN 40042.211706 \n", "min 1000.000000 NaN 0.000000 NaN 5000.000000 \n", "25% 1999.000000 NaN 70.000000 NaN 125000.000000 \n", "50% 2003.000000 NaN 105.000000 NaN 150000.000000 \n", "75% 2008.000000 NaN 150.000000 NaN 150000.000000 \n", "max 9999.000000 NaN 17700.000000 NaN 150000.000000 \n", "\n", " registration_month fuel_type brand unrepaired_damage \\\n", "count 50000.000000 45518 50000 40171 \n", "unique NaN 7 40 2 \n", "top NaN benzin volkswagen nein \n", "freq NaN 30107 10687 35232 \n", "mean 5.723360 NaN NaN NaN \n", "std 3.711984 NaN NaN NaN \n", "min 0.000000 NaN NaN NaN \n", "25% 3.000000 NaN NaN NaN \n", "50% 6.000000 NaN NaN NaN \n", "75% 9.000000 NaN NaN NaN \n", "max 12.000000 NaN NaN NaN \n", "\n", " ad_created postal_code last_seen \n", "count 50000 50000.000000 50000 \n", "unique 76 NaN 39481 \n", "top 2016-04-03 00:00:00 NaN 2016-04-07 06:17:27 \n", "freq 1946 NaN 8 \n", "mean NaN 50813.627300 NaN \n", "std NaN 25779.747957 NaN \n", "min NaN 1067.000000 NaN \n", "25% NaN 30451.000000 NaN \n", "50% NaN 49577.000000 NaN \n", "75% NaN 71540.000000 NaN \n", "max NaN 99998.000000 NaN " ] }, "execution_count": 8, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ "autos.drop(columns = ['nr_of_pictures','seller','offer_type'], inplace=True)\n", "autos.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploring the Odometer and Price Columns\n", "\n", "- We continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the `odometer_km` and `price` columns. \n", "- We will analyze the columns using their minimum and maximum values, and look for any values that look unrealistically high or low (outliers) that we might want to remove." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2357,)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].unique().shape" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 5.000000e+04\n", "mean 9.840044e+03\n", "std 4.811044e+05\n", "min 0.000000e+00\n", "25% 1.100000e+03\n", "50% 2.950000e+03\n", "75% 7.200000e+03\n", "max 1.000000e+08\n", "Name: price, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].describe()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1421\n", "1 156\n", "2 3\n", "3 1\n", "5 2\n", "8 1\n", "9 1\n", "10 7\n", "11 2\n", "12 3\n", "Name: price, dtype: int64\n" ] }, { "data": { "text/plain": [ "99999999 1\n", "27322222 1\n", "12345678 3\n", "11111111 2\n", "10000000 1\n", "3890000 1\n", "1300000 1\n", "1234566 1\n", "999999 2\n", "999990 1\n", "Name: price, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(autos['price'].value_counts().sort_index().head(10))\n", "autos['price'].value_counts().sort_index(ascending=False).head(10)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "*We can observe outliers (below 100 or above 1000000) in the `price` column, and we will remove them.*" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2312,)\n" ] }, { "data": { "text/plain": [ "100 134\n", "110 3\n", "111 2\n", "115 2\n", "117 1\n", "120 39\n", "122 1\n", "125 8\n", "129 1\n", "130 15\n", "135 1\n", "139 1\n", "140 9\n", "145 2\n", "149 7\n", "150 224\n", "156 2\n", "160 8\n", "170 7\n", "173 1\n", "175 12\n", "179 1\n", "180 35\n", "185 1\n", "188 1\n", "190 16\n", "193 1\n", "195 2\n", "198 1\n", "199 41\n", " ... \n", "120000 2\n", "128000 1\n", "129000 1\n", "130000 1\n", "135000 1\n", "137999 1\n", "139997 1\n", "145000 1\n", "151990 1\n", "155000 1\n", "163500 1\n", "163991 1\n", "169000 1\n", "169999 1\n", "175000 1\n", "180000 1\n", "190000 1\n", "194000 1\n", "197000 1\n", "198000 1\n", "220000 1\n", "250000 1\n", "259000 1\n", "265000 1\n", "295000 1\n", "299000 1\n", "345000 1\n", "350000 1\n", "999990 1\n", "999999 2\n", "Name: price, Length: 2312, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Keep only listings priced between 100 and 1,000,000 (inclusive)\n", "autos = autos[autos['price'].between(100,1000000)]\n", "\n", "print(autos['price'].unique().shape)\n", "autos['price'].value_counts().sort_index()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(13,)\n" ] }, { "data": { "text/plain": [ "count 48227.000000\n", "mean 125920.127729\n", "std 39542.413981\n", "min 5000.000000\n", "25% 125000.000000\n", "50% 150000.000000\n", "75% 150000.000000\n", "max 150000.000000\n", "Name: odometer_km, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(autos['odometer_km'].unique().shape)\n", "autos['odometer_km'].describe()" ] }, { 
"cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5000 760\n", "10000 245\n", "20000 757\n", "30000 777\n", "40000 814\n", "50000 1009\n", "60000 1153\n", "70000 1214\n", "80000 1412\n", "90000 1733\n", "100000 2101\n", "125000 5038\n", "150000 31214\n", "Name: odometer_km, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].value_counts().sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***We removed outliers from the `price` column and found no outliers in the `odometer_km` column.***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploring the Date Columns\n", "\n", "There are 5 columns that should represent date values. Some of these columns were created by the crawler, and some came from the website itself." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " date_crawled ad_created last_seen \\\n", "0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54 \n", "1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08 \n", "2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37 \n", "3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28 \n", "4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50 \n", "\n", " registration_month registration_year \n", "0 3 2004 \n", "1 6 1997 \n", "2 7 2009 \n", "3 6 2007 \n", "4 7 2003 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[['date_crawled','ad_created','last_seen', 'registration_month', 'registration_year']][0:5]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "date_crawled object\n", "ad_created object\n", "last_seen object\n", "registration_month int64\n", "registration_year int64\n", "dtype: object" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[['date_crawled','ad_created','last_seen', 'registration_month', 'registration_year']].dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `date_crawled`, `last_seen`, and `ad_created` columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert them into a numerical representation so we can understand them quantitatively. \n", "\n", "`registration_month` and `registration_year` are represented as numeric values, so we can use methods like `Series.describe()` and draw graphs to understand the distribution without any extra data processing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning `date_crawled`, `last_seen`, and `ad_created`\n", "We will clean and analyze the date columns first. Note that the first 10 characters represent the day (e.g. 2016-03-12). 
To understand the date range, we can extract just the date by slicing those 10 characters and removing the dashes." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " date_crawled name price \\\n", "0 20160326 Peugeot_807_160_NAVTECH_ON_BOARD 5000 \n", "1 20160404 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500 \n", "2 20160326 Volkswagen_Golf_1.6_United 8990 \n", "3 20160312 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350 \n", "4 20160401 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350 \n", "\n", " abtest vehicle_type registration_year gearbox power_ps model \\\n", "0 control bus 2004 manuell 158 andere \n", "1 control limousine 1997 automatik 286 7er \n", "2 test limousine 2009 manuell 102 golf \n", "3 control kleinwagen 2007 automatik 71 fortwo \n", "4 test kombi 2003 manuell 0 focus \n", "\n", " odometer_km registration_month fuel_type brand unrepaired_damage \\\n", "0 150000 3 lpg peugeot nein \n", "1 150000 6 benzin bmw nein \n", "2 70000 7 benzin volkswagen nein \n", "3 70000 6 benzin smart nein \n", "4 150000 7 benzin ford nein \n", "\n", " ad_created postal_code last_seen \n", "0 20160326 79588 20160406 \n", "1 20160404 71034 20160406 \n", "2 20160326 35394 20160406 \n", "3 20160312 33729 20160315 \n", "4 20160401 39218 20160401 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['date_crawled'] = autos['date_crawled'].str[:10]\n", "autos['date_crawled'] = autos['date_crawled'].str.replace('-','').astype(int)\n", "\n", "autos['last_seen'] = autos['last_seen'].str[:10]\n", "autos['last_seen'] = autos['last_seen'].str.replace('-','').astype(int)\n", "\n", "autos['ad_created'] = autos['ad_created'].str[:10]\n", "autos['ad_created'] = autos['ad_created'].str.replace('-','').astype(int)\n", "\n", "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Analyzing `date_crawled`**" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[20160326 20160404 20160312 20160401 20160321 20160320 20160316 20160322\n", " 20160315 20160331 20160323 
20160329 20160317 20160305 20160306 20160328\n", " 20160310 20160403 20160319 20160402 20160314 20160405 20160311 20160307\n", " 20160308 20160327 20160309 20160325 20160318 20160330 20160324 20160313\n", " 20160406 20160407]\n" ] }, { "data": { "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "print(autos['date_crawled'].unique())\n", "autos['date_crawled'].groupby(autos['date_crawled']).count().plot(kind='bar', figsize=(14, 3))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***DATE CRAWLED - The above distribution look like a uniform ditribution.
Which means that the ads were crawled on a regular basis, with an average of 3% ads being crawled daily for 34 days.***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Analyzing `ad_created`**" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[20160326 20160404 20160312 20160401 20160321 20160320 20160316 20160322\n", " 20160314 20160331 20160323 20160329 20160317 20160305 20160306 20160328\n", " 20160310 20160403 20160319 20160402 20160315 20160405 20160311 20160307\n", " 20160308 20160327 20160309 20160325 20160318 20160330 20160324 20160313\n", " 20160406 20160304 20160407 20160224 20160302 20160229 20160103 20151110\n", " 20160301 20160303 20160228 20160127 20160219 20160225 20160223 20160214\n", " 20160212 20160129 20160205 20160122 20160201 20160202 20160217 20160221\n", " 20150810 20150611 20160211 20160114 20160110 20160208 20151205 20160227\n", " 20160222 20160113 20150909 20160220 20160116 20151230 20160207 20160107\n", " 20160226 20160218 20160209 20160216]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "print(autos['ad_created'].unique())\n", "autos['ad_created'].groupby(autos['ad_created']).count().plot(kind='bar', figsize=(16, 4))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***AD CREATED - The above distribbution look like a left skewed ditribution.Which means that most of the ads were created recently.
It could be beacuse the cars that were posted earlier (like a month or two ago) would have got sold, hence their ad was pulled down.
Leaving only a very small number of ads (of unsold cars) posted from a long period of time.***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Analyzing `last_seen`**" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[20160406 20160315 20160401 20160323 20160407 20160326 20160316 20160402\n", " 20160318 20160405 20160317 20160307 20160328 20160312 20160324 20160404\n", " 20160330 20160331 20160320 20160319 20160403 20160314 20160310 20160327\n", " 20160322 20160329 20160311 20160325 20160313 20160309 20160321 20160306\n", " 20160308 20160305]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "print(autos['last_seen'].unique())\n", "autos['last_seen'].groupby(autos['last_seen']).count().plot(kind='bar', figsize=(14, 3))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***LAST SEEN - Here the distribution is heavily left skewed, with almost 50% of the ads being seen in last 3 days of the dataset.
We can give a similar analogy to 'ad_created', older car ads are in less quantity in last seen, as they might be sold already.
And, as a consequence most of the last seens ads are of recent sellers, who are interested in selling their cars ASAP.***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning, Analyzing `registration_month` and `registration_year`\n", "\n", "Removing outliers (if any) from both the columns." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 3 6 7 4 8 12 10 0 9 11 5 2 1]\n" ] }, { "data": { "text/plain": [ "3 4981\n", "0 4313\n", "6 4255\n", "4 4020\n", "5 4016\n", "7 3842\n", "10 3577\n", "12 3359\n", "9 3318\n", "11 3305\n", "1 3199\n", "8 3115\n", "2 2927\n", "Name: registration_month, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(autos['registration_month'].unique())\n", "autos['registration_month'].value_counts() " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 3 6 7 4 8 12 10 9 11 5 2 1]\n" ] }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAfAAAADXCAYAAADlcgPcAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAFalJREFUeJzt3X+0XWV95/H3hx8iCPIzUCRgrMYRXI5oM8Aa7BSBwYiO0BlZoF0aKTXLJV04nc5qse0qjNZVmK7KlPpjSiUaHIUCVknRiimIVCtCgEjAYBPRQha/AgGsomjgO3+cJ/UQ7s09N5ycc3fu+7XWWWfvZz977+++XPI5z7777J2qQpIkdcsO4y5AkiRNnwEuSVIHGeCSJHWQAS5JUgcZ4JIkdZABLklSBxngkiR1kAEuSVIHGeCSJHXQTuMuYEv222+/mjdv3rjLkCRpZG655ZaHq2rOVP1mdIDPmzePFStWjLsMSZJGJsm/DNLPU+iSJHWQAS5JUgcNFOBJfpBkVZKVSVa0tn2SLE+ypr3v3dqT5MIka5PcnuS1fdtZ1PqvSbJo2xySJEnbv+mMwF9fVYdX1YI2fzZwbVXNB65t8wBvBOa312Lg49ALfOAc4EjgCOCcTaEvSZKm57mcQj8JWNqmlwIn97VfUj03AnslORB4A7C8qjZU1aPAcmDhc9i/JEmz1qABXsBXktySZHFrO6Cq7gdo7/u39oOAe/vWXdfaJmt/hiSLk6xIsmL9+vWDH4kkSbPIoF8jO7qq7kuyP7A8yV1b6JsJ2moL7c9sqLoIuAhgwYIFz1qurfeqpa8a2rZWLVo1tG1JkqZvoBF4Vd3X3h8CPk/vb9gPtlPjtPeHWvd1wMF9q88F7ttCuyRJmqYpAzzJC5LssWkaOAG4A1gGbLqSfBFwVZteBryzXY1+FPB4O8V+DXBCkr3bxWsntDZJkjRNg5xCPwD4fJJN/T9bVV9OcjNweZIzgHuAU1r/LwEnAmuBJ4DTAapqQ5IPAje3fh+oqg1DOxJJkmaRKQO8qu4GXj1B+yPAcRO0F3DmJNtaAiyZfpmSJKmfd2KTJKmDDHBJkjrIAJckqYMMcEmSOsgAlySpgwxwSZI6yACXJKmDDHBJkjrIAJckqYMMcEmSOsgAlySpgwxwSZI6yACXJKmDDHBJkjrIAJckqYMMcEmSOsgAlySpg3YadwGSBvPnp755KNv53b+5eijbkTRejsAlSeogA1ySpA4ywCVJ6iADXJKkDho4wJPsmOS2JFe3+Zck+VaSNUn+JsnzWvsubX5tWz6vbxvvb+3fTfKGYR+MJEmzxXRG4O8DVvfNnw9cUFXzgUeBM1r7GcCjVfUy4ILWjySHAacBrwQWAh9LsuNzK1+SpNlpoABPMhd4E/CJNh/gWODK1mUpcHKbPqnN05Yf1/qfBFxWVU9W1feBtcARwzgISZJmm0G/B/5/gN8D9mjz+wKPVdXGNr8OOKhNHwTcC1BVG5M83vofBNzYt83+df5NksXAYoBDDjlk4ANRN61+xaFD29ahd62eutMAPvqe64ayHYAz/++xQ9uWJPWbcgSe5M3AQ1V1S3/zBF1rimVbWucXDVUXVdWCqlowZ86cqcqTJGlWGmQEfjTwliQnAs8HXkhvRL5Xkp3aKHwucF/rvw44GFiXZCdgT2BDX/sm/etIkqRpmHIEXlXvr6q5VTWP3kVo11XVbwBfBd7aui0CrmrTy9o8bfl1VVWt/bR2lfpLgPnATUM7EkmSZpHnci/03wcuS/InwG3Axa39YuDTSdbSG3mfBlBVdya5HPgOsBE4s6qeeg77lyRp1ppWgFfV9cD1bfpuJriKvKp+CpwyyfofAj403SIlzUzrzv7HoWxn7nm/OpTtSLOJd2KTJKmDDHBJkjrIAJckqYMMcEmSOsgAlySpgwxwSZI6yACXJKmDDHBJkjroudyJTZLUUb/01ZVD29YDrz98aNvS4ByBS5L
UQQa4JEkdZIBLktRBBrgkSR1kgEuS1EFehS5J0hbMO/uLQ9nOD85701C2s4kjcEmSOsgRuCRtY9de99Khbeu4Y783tG3NRDN1tDsTOQKXJKmDDHBJkjpouziF7ikXSdJs4whckqQOMsAlSeqgKU+hJ3k+cAOwS+t/ZVWdk+QlwGXAPsCtwDuq6mdJdgEuAX4FeAQ4tap+0Lb1fuAM4CngrKq6ZviHJGk2O/fcc2fktqRhG2QE/iRwbFW9GjgcWJjkKOB84IKqmg88Si+Yae+PVtXLgAtaP5IcBpwGvBJYCHwsyY7DPBhJkmaLKQO8en7UZndurwKOBa5s7UuBk9v0SW2etvy4JGntl1XVk1X1fWAtcMRQjkKSpFlmoL+BJ9kxyUrgIWA58D3gsara2LqsAw5q0wcB9wK05Y8D+/a3T7COJEmahoECvKqeqqrDgbn0Rs2HTtStvWeSZZO1P0OSxUlWJFmxfv36QcqTJGnWmdZV6FX1GHA9cBSwV5JNF8HNBe5r0+uAgwHa8j2BDf3tE6zTv4+LqmpBVS2YM2fOdMqTJGnWGOQq9DnAz6vqsSS7AsfTuzDtq8Bb6V2Jvgi4qq2yrM1/sy2/rqoqyTLgs0k+DLwImA/cNOTjmTnO3XNI23l8ONuRJG1XBrkT24HA0nbF+A7A5VV1dZLvAJcl+RPgNuDi1v9i4NNJ1tIbeZ8GUFV3Jrkc+A6wETizqp4a7uFIkjQ7TBngVXU78JoJ2u9mgqvIq+qnwCmTbOtDwIemX6YkSernndgkSeogA1ySpA4ywCVJ6iADXJKkDjLAJUnqIANckqQOMsAlSeogA1ySpA4ywCVJ6iADXJKkDjLAJUnqIANckqQOMsAlSeogA1ySpA4ywCVJ6iADXJKkDjLAJUnqIANckqQOMsAlSeogA1ySpA4ywCVJ6iADXJKkDpoywJMcnOSrSVYnuTPJ+1r7PkmWJ1nT3vdu7UlyYZK1SW5P8tq+bS1q/dckWbTtDkuSpO3bICPwjcDvVtWhwFHAmUkOA84Grq2q+cC1bR7gjcD89loMfBx6gQ+cAxwJHAGcsyn0JUnS9EwZ4FV1f1Xd2qb/FVgNHAScBCxt3ZYCJ7fpk4BLqudGYK8kBwJvAJZX1YaqehRYDiwc6tFIkjRLTOtv4EnmAa8BvgUcUFX3Qy/kgf1bt4OAe/tWW9faJmuXJEnTNHCAJ9kd+Bzw36vqh1vqOkFbbaF98/0sTrIiyYr169cPWp4kSbPKQAGeZGd64f2Zqvrb1vxgOzVOe3+ota8DDu5bfS5w3xban6GqLqqqBVW1YM6cOdM5FkmSZo1BrkIPcDGwuqo+3LdoGbDpSvJFwFV97e9sV6MfBTzeTrFfA5yQZO928doJrU2SJE3TTgP0ORp4B7AqycrW9gfAecDlSc4A7gFOacu+BJwIrAWeAE4HqKoNST4I3Nz6faCqNgzlKCRJmmWmDPCq+joT//0a4LgJ+hdw5iTbWgIsmU6BkiTp2bwTmyRJHWSAS5LUQQa4JEkdZIBLktRBBrgkSR1kgEuS1EEGuCRJHWSAS5LUQQa4JEkdZIBLktRBBrgkSR1kgEuS1EEGuCRJHWSAS5LUQQa4JEkdZIBLktRBBrgkSR1kgEuS1EEGuCRJHWSAS5LUQQa4JEkdZIBLktRBBrgkSR00ZYAnWZLkoSR39LXtk2R5kjXtfe/WniQXJlmb5PYkr+1bZ1HrvybJom1zOJIkzQ6DjMA/BSzcrO1s4Nqqmg9c2+YB3gjMb6/FwMehF/jAOcCRwBHAOZtCX5IkTd+UAV5VNwAbNms+CVjappcCJ/e1X1I9NwJ7JTkQeAOwvKo2VNWjwHKe/aFAkiQNaGv/Bn5AVd0P0N73b+0HAff29VvX2iZrf5Yki5OsSLJi/fr1W1meJEnbt2FfxJYJ2moL7c9urLqoqhZU1YI5c+YMtThJkrYXWxvgD7ZT47T3h1r7OuDgvn5zgfu20C5JkrbC1gb4MmDTleS
LgKv62t/ZrkY/Cni8nWK/Bjghyd7t4rUTWpskSdoKO03VIcmlwDHAfknW0bua/Dzg8iRnAPcAp7TuXwJOBNYCTwCnA1TVhiQfBG5u/T5QVZtfGCdJkgY0ZYBX1dsmWXTcBH0LOHOS7SwBlkyrOkmSNCHvxCZJUgcZ4JIkdZABLklSBxngkiR1kAEuSVIHGeCSJHWQAS5JUgcZ4JIkdZABLklSBxngkiR1kAEuSVIHGeCSJHWQAS5JUgcZ4JIkdZABLklSBxngkiR1kAEuSVIHGeCSJHWQAS5JUgcZ4JIkdZABLklSBxngkiR10MgDPMnCJN9NsjbJ2aPevyRJ24ORBniSHYGPAm8EDgPeluSwUdYgSdL2YNQj8COAtVV1d1X9DLgMOGnENUiS1HmpqtHtLHkrsLCqfqvNvwM4sqp+u6/PYmBxm/13wHeHtPv9gIeHtK1hsabBzcS6rGkw1jS4mViXNQ1mmDW9uKrmTNVppyHtbFCZoO0ZnyCq6iLgoqHvOFlRVQuGvd3nwpoGNxPrsqbBWNPgZmJd1jSYcdQ06lPo64CD++bnAveNuAZJkjpv1AF+MzA/yUuSPA84DVg24hokSeq8kZ5Cr6qNSX4buAbYEVhSVXeOaPdDPy0/BNY0uJlYlzUNxpoGNxPrsqbBjLymkV7EJkmShsM7sUmS1EEGuCRJHWSAS5LUQQb4CCV5RZLjkuy+WfvCMdZ0RJL/0KYPS/I/kpw4rnomkuSScdewuSSvaz+rE8ZYw5FJXtimd03yv5L8XZLzk+w5pprOSnLw1D1HJ8nzkrwzyfFt/u1JPpLkzCQ7j7Gulyb5n0n+IsmfJ3nPuP67qZtm3UVsSU6vqk+OYb9nAWcCq4HDgfdV1VVt2a1V9dox1HQOvfvS7wQsB44ErgeOB66pqg+NoabNv1YY4PXAdQBV9ZZR1wSQ5KaqOqJNv5vef8vPAycAf1dV542hpjuBV7dvd1wEPAFcCRzX2v/rGGp6HPgx8D3gUuCKqlo/6jo2q+kz9H7HdwMeA3YH/pbezylVtWgMNZ0F/Bfga8CJwErgUeDXgfdW1fWjrkkdVFWz6gXcM6b9rgJ2b9PzgBX0QhzgtjHWtCO9f9h+CLywte8K3D6mmm4F/h9wDPBr7f3+Nv1rY/y9ua1v+mZgTpt+AbBqTDWt7v+5bbZs5bh+TvTO7J0AXAysB74MLAL2GFNNt7f3nYAHgR3bfMb4e76qr47dgOvb9CHj+veg7X9P4DzgLuCR9lrd2vYaV12+Jn5tl6fQk9w+yWsVcMCYytqxqn4EUFU/oBdMb0zyYSa+xewobKyqp6rqCeB7VfXDVt9PgKfHVNMC4BbgD4HHqzcS+UlVfa2qvjammgB2SLJ3kn3pjdrWA1TVj4GNY6rpjiSnt+lvJ1kAkOTlwM/HVFNV1dNV9ZWqOgN4EfAxYCFw95hq2qHdOGoPemG56TT1LsDYTqHzi/tw7EKvNqrqHsZb0+X0zgQcU1X7VtW+9M6APQpcMca6JpTk78e03xcm+dMkn07y9s2WfWxUdYz6XuijcgDwBnq/dP0C/NPoywHggSSHV9VKgKr6UZI3A0uAV42ppp8l2a0F+K9samx/hxtLgFfV08AFSa5o7w8yM35P96T3wSJAJfmlqnqgXc8wrg9gvwX8RZI/ovcQhW8muRe4ty0bh2f8LKrq5/Tutrgsya7jKYmL6Y0od6T3wfCKJHcDR9F7IuI4fAK4OcmNwH8CzgdIMgfYMKaaAOZV1fn9DVX1AHB+kt8cR0FJJvvzYuj9OXIcPgmsAT4H/GaS/wa8vaqepPd7NRLb5d/Ak1wMfLKqvj7Bss9W1dsnWG1b1zSX3oj3gQmWHV1V3xhDTbu0X7jN2/cDDqyqVaOuaYJa3gQcXVV/MO5aJpJkN+CAqvr+GGvYA/hleh901lXVg2Os5eVV9c/j2v9kkrwIoKruS7IXves87qmqm8ZY0yuBQ4E7ququcdXRL8lXgH8Alm76PUp
yAPAu4D9X1fFjqOkpetcKTPRB+aiqGvkHwyQrq+rwvvk/pHctw1uA5TWia5q2ywCXJE1fkr2Bs4GTgP1b84P0zqKcV1Wbn9UcRU13AL9eVWsmWHZvVY38Ww9JVgOvbGcMN7UtAn6P3rVOLx5JHQa4JGkqY/wGz1vpXSj63QmWnVxVXxhDTf8b+EpV/cNm7QuBv6yq+SOpwwCXJE0lyT1Vdci46+g3rg8VWzLKmgxwSRLQ+wbPZIuAl1fVLqOsZyoz9EPFyGqaCVf3SpJmhhn3DZ4pPlSM5WvBM6UmA1yStMnV9C7CWrn5giTXj74cYAZ+qGCG1GSAS5IAaDffmWzZyL9+28zEDxUzoib/Bi5JUgdtl7dSlSRpe2eAS5LUQQa4NIMleUuSs7ew/PCteX57knn9D2FIsiDJhVtb56gk2SvJe/vmj0ly9ThrksbFAJdGJD3T+n+uqpbVlp81fji9ezBPtL8tXaQ6D/i3AK+qFVV11nRqG5O9gPdO2UuaBQxwaRtqI93V7RGDtwLvSPLNJLcmuaI9zYwkJya5K8nXk1y4aVSZ5F1JPtKmT0lyR5JvJ7mhPSLzA8CpSVYmOTXJuUkuag+luKTt/x/b/m5N8h9baecBv9rW+53+kWySfZJ8oT2C98Yk/761n5tkSZLrk9ydZNLAb/u9K8knWs2fSXJ8km8kWZPkiK3c13nAS1vdf9badk9yZdvfZ5KM6+lw0miN+4Hkvnxtzy96I92n6T1icD/gBuAFbdnvA38MPJ/eI0Bf0tovBa5u0+8CPtKmVwEHtem9Nl/e5s+l99jTXdv8bsDz2/R8YEWbPmbTPjafB/4SOKdNHwus7Nv2P9F7fvV+wCPAzls47o30HpW7Q6tpCb3vyZ4EfGFr9tW2e8dmdT8OzG37+SbwunH/d/flaxQvR+DStvcvVXUjvRA/DPhGkpXAIuDFwCuAu+sXjyS9dJLtfAP4VJJ303u29WSWVdVP2vTOwF8nWQVc0fY/ldcBnwaoquuAfdN7RjzAF6vqyap6GHiILd916vtVtap6T2y6E7i2qoreB5F5Q9zXTVW1ru1nZd+2pe2aN3KRtr0ft/fQe1bw2/oXJnnNIBupqvckORJ4E7AyyeGTdP1x3/Tv0Hsc5KvpjVB/OsCuJjoFvemGEf3Pj3+KLf8b0t/36b75p/vWG8a+plOTtN1wBC6Nzo3A0UleBpBktyQvB+4CfjnJvNbv1IlWTvLSqvpWVf0x8DBwMPCvwB5b2OeewP1tdPoOfjFy39J6NwC/0fZ5DPBwVf1wkAPcCtPd11THK80aflKVRqSq1id5F3Bpkk1Pdfqjqvrn9tWoLyd5GLhpkk38WZL59Eat1wLfBu4Bzm6n5P90gnU+BnwuySnAV/nF6Px2YGOSbwOfAm7rW+dc4JPtgQ1P0DvVv61Ma19V9Ui7EO4O4O+BL27D2qQZzVupSjNAkt2r6kftCuqPAmuq6oJx1yVp5vIUujQzvLuNou+kd9r7r8Zcj6QZzhG4pK2WZF96p/M3d1xVPTLqeqTZxACXJKmDPIUuSVIHGeCSJHWQAS5JUgcZ4JIkdZABLklSB/1/kHKZANEL9TsAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Observed 0 as an outlier\n", "autos = autos[autos['registration_month']!=0]\n", "\n", "print(autos['registration_month'].unique())\n", "autos['registration_month'].groupby(autos['registration_month']).count().plot(kind='bar', figsize=(8, 3))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Observed outlier for 'Registration month' was 0. Removed it.\n", "And, as expected the ditribution is uniform, since people can buy cars any month of the year.***" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1800, 1927, 1929, 1931, 1934, 1937, 1938, 1939, 1941, 1943, 1948, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2800, 4800, 5000, 6200, 9000]\n" ] }, { "data": { "text/plain": [ "count 43914.000000\n", "mean 2004.060277\n", "std 44.351169\n", "min 1800.000000\n", "25% 1999.000000\n", "50% 2004.000000\n", "75% 2008.000000\n", "max 9000.000000\n", "Name: registration_year, dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(sorted(autos['registration_year'].unique()))\n", "autos['registration_year'].describe()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1927, 1929, 1931, 1934, 1937, 1938, 1939, 1941, 1943, 1948, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 
1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Here all the values for years before 1900 and after 2017 are invlaid or outliers\n", "\n", "autos = autos[autos['registration_year'].between(1900,2017)]\n", "\n", "print(sorted(autos['registration_year'].unique()))\n", "autos['registration_year'].groupby(autos['registration_year']).count().plot(kind='bar', figsize=(16, 3))\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Observed outliers for 'Registration year' were values below 1900 and above 2017. Removed those outliers.
We can observe that the distribution is left skewed.\n", "Most of the cars were registered between 1998-2008.
\n", "It can be explained from the fact that most cars are used for max. 10-15 years and then sell/discard them.***" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " date_crawled name price \\\n", "0 20160326 Peugeot_807_160_NAVTECH_ON_BOARD 5000 \n", "1 20160404 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500 \n", "2 20160326 Volkswagen_Golf_1.6_United 8990 \n", "3 20160312 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350 \n", "4 20160401 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350 \n", "\n", " abtest vehicle_type registration_year gearbox power_ps model \\\n", "0 control bus 2004 manuell 158 andere \n", "1 control limousine 1997 automatik 286 7er \n", "2 test limousine 2009 manuell 102 golf \n", "3 control kleinwagen 2007 automatik 71 fortwo \n", "4 test kombi 2003 manuell 0 focus \n", "\n", " odometer_km registration_month fuel_type brand unrepaired_damage \\\n", "0 150000 3 lpg peugeot nein \n", "1 150000 6 benzin bmw nein \n", "2 70000 7 benzin volkswagen nein \n", "3 70000 6 benzin smart nein \n", "4 150000 7 benzin ford nein \n", "\n", " ad_created postal_code last_seen \n", "0 20160326 79588 20160406 \n", "1 20160404 71034 20160406 \n", "2 20160326 35394 20160406 \n", "3 20160312 33729 20160315 \n", "4 20160401 39218 20160401 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploring Price by Brand\n", "\n", "***Selecting Top Brands***" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')\n" ] } ], "source": [ "brands = autos['brand'].value_counts(normalize=True)\n", "results = brands >= 0.05\n", "\n", "## Finding names of brands wit more then 5% share of ads\n", "top_brands_name = brands[results].index\n", "print(top_brands_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Mean Price***" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { 
"data": { "text/plain": [ "{'volkswagen': 5943.71,\n", " 'bmw': 8637.41,\n", " 'opel': 3173.85,\n", " 'mercedes_benz': 8846.68,\n", " 'audi': 9718.9,\n", " 'ford': 4289.74}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brand_mean_price = {}\n", "\n", "for car in top_brands_name:\n", " mean_price = autos.loc[autos['brand']== car,'price'].mean()\n", " brand_mean_price[car] = round(mean_price, 2)\n", " \n", "brand_mean_price" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Mean Mileage***" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'volkswagen': 128365.67,\n", " 'bmw': 132575.41,\n", " 'opel': 128805.64,\n", " 'mercedes_benz': 131034.56,\n", " 'audi': 128830.76,\n", " 'ford': 123839.23}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brand_mean_mileage = {}\n", "\n", "for car in top_brands_name:\n", " mean_mileage = autos.loc[autos['brand'] == car, 'odometer_km'].mean()\n", " brand_mean_mileage[car] = round(mean_mileage, 2)\n", "\n", "brand_mean_mileage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***The average prices are as follows:***\n", "\n", "|**Brand**|Average Price|\n", "|-|-|\n", "|**Volkswagen:** | 5943.71 |\n", "|**BMW:** | 8637.41|\n", "|**Opel:** | 3173.85|\n", "|**Mercedes Benz:** | 8846.68|\n", "|**Audi:** | 9718.9|\n", "|**Ford:** | 4289.74|\n", "\n", "***Here we may notice that the most expensive vehicles are from Audi, BMW and Mercedes.
Ford and Opel are the least expensive, with Volkswagen in between.***\n", "\n", "---\n", "\n", "***The Average Mileage is as follows***\n", "\n", "|**Brand**|Average Mileage (km)|\n", "|-|-|\n", "|**Volkswagen** | 128365.67|\n", "|**BMW**|132575.41|\n", "|**Opel** | 128805.64|\n", "|**Mercedes Benz**| 131034.56|\n", "|**Audi**|128830.76|\n", "|**Ford**| 123839.23|\n", "\n", "***Here, BMW listings have the highest mean mileage and Ford listings the lowest.***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Storing Aggregate Data in a DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**First, we convert each dictionary into a Series.**" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bmp_series\n", "\n", "audi 9718.90\n", "mercedes_benz 8846.68\n", "bmw 8637.41\n", "volkswagen 5943.71\n", "ford 4289.74\n", "opel 3173.85\n", "dtype: float64 \n", "\n", "bmm_series\n" ] }, { "data": { "text/plain": [ "bmw 132575.41\n", "mercedes_benz 131034.56\n", "audi 128830.76\n", "opel 128805.64\n", "volkswagen 128365.67\n", "ford 123839.23\n", "dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bmp_series = pd.Series(brand_mean_price).sort_values(ascending=False)\n", "bmm_series = pd.Series(brand_mean_mileage).sort_values(ascending=False)\n", "print('bmp_series\\n')\n", "print(bmp_series,'\\n')\n", "print('bmm_series')\n", "bmm_series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Create a dataframe from the first series object using the dataframe constructor.**" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean_price\n", "audi 9718.90\n", "mercedes_benz 8846.68\n", "bmw 8637.41\n", "volkswagen 5943.71\n", "ford 4289.74\n", "opel 3173.85" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_brands_df = pd.DataFrame(bmp_series, columns=['mean_price'])\n", "top_brands_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Assign the other series as a new column in this dataframe.**" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean_price mean_mileage\n", "audi 9718.90 128830.76\n", "mercedes_benz 8846.68 131034.56\n", "bmw 8637.41 132575.41\n", "volkswagen 5943.71 128365.67\n", "ford 4289.74 123839.23\n", "opel 3173.85 128805.64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_brands_df['mean_mileage'] = bmm_series\n", "top_brands_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***Here we can observe that car mileage/brand does not vary much as compared to the car prices/brand.

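The percentage gaps discussed in this cell can be checked with a few lines of arithmetic. This is a sketch that hard-codes the rounded means from the table above and measures each gap relative to the larger value; the helper `relative_gap` and the choice of denominator are assumptions made for illustration, not something the notebook defines:

```python
# Rounded brand means copied from the aggregate table in this notebook.
mean_price = {'audi': 9718.90, 'opel': 3173.85}
mean_mileage = {'bmw': 132575.41, 'ford': 123839.23}


def relative_gap(high, low):
    # Gap between two values as a percentage of the larger one.
    return (high - low) / high * 100


price_gap = relative_gap(mean_price['audi'], mean_price['opel'])
mileage_gap = relative_gap(mean_mileage['bmw'], mean_mileage['ford'])

print(round(price_gap, 1))    # 67.3
print(round(mileage_gap, 1))  # 6.6
```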
The mean price of the least expensive top brand (Opel, 3173.85) is about 67% lower than that of the most expensive one (Audi, 9718.90).
The same is not true for mean mileage: the gap between the highest-mileage brand (BMW) and the lowest-mileage brand (Ford) is only about 7%.***" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }