{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# eBay Kleinanzeigen: Analysis of used car listing\n", "\n", "## About:\n", "In this project our aim is to analyse the dataset of used cars from eBay Kleinanzeigen (A Classified section of the German eBay [website](https://en.wikipedia.org/wiki/Classified_advertising)).\n", "\n", "Initially, we will step through the data cleansing process as dataset itself needs extensive cleansing and is a significant part of the project.\n", "\n", "Finally, after our data cleansing, we will answer the question of top brands and their average prices and compare that alongside average mileage to draw inferences.\n", "\n", "## Dataset:\n", "For the purposes of this project, the dataset has been limited to 50,000 rows.\n", "\n", "The full dataset can be downloaded from [here](https://data.world/data-society/used-cars-data).\n", "\n", "Below we can find the metadata about this dataset:\n", "\n", "`dateCrawled` - When this ad was first crawled. All field-values are taken from this date.\n", " \n", "`name` - Name of the car.\n", "\n", "`seller` - Whether the seller is private or a dealer.\n", "\n", "`offerType` - The type of listing\n", "\n", "`price` - The price on the ad to sell the car.\n", "\n", "`abtest` - Whether the listing is included in an A/B test.\n", "\n", "`vehicleType` - The vehicle Type.\n", "\n", "`yearOfRegistration` - The year in which the car was first registered.\n", "\n", "`gearbox` - The transmission type.\n", "\n", "`powerPS` - The power of the car in PS.\n", "\n", "`model` - The car model name.\n", "\n", "`kilometer` - How many kilometers the car has driven.\n", "\n", "`monthOfRegistration` - The month in which the car was first registered.\n", "\n", "`fuelType` - What type of fuel the car uses.\n", "\n", "`brand` - The brand of the car.\n", "\n", "`notRepairedDamage` - If the car has a damage which is not yet repaired.\n", "\n", "`dateCreated` - The date on which the eBay listing was created.\n", "\n", "`nrOfPictures` - The number of pictures in the ad.\n", "\n", "`postalCode` - The postal code for the location of the vehicle.\n", "\n", "`lastSeenOnline` - When the crawler saw this ad last online.\n", "\n", "\n", "## Initial Data Exploration\n", "\n", "We will start by importing the `pandas` and `NumPy` libraries and read our csv data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "autos = pd.read_csv(\"autos.csv\", encoding=\"Latin-1\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Data columns (total 20 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 dateCrawled 50000 non-null object\n", " 1 name 50000 non-null object\n", " 2 seller 50000 non-null object\n", " 3 offerType 50000 non-null object\n", " 4 price 50000 non-null object\n", " 5 abtest 50000 non-null object\n", " 6 vehicleType 44905 non-null object\n", " 7 yearOfRegistration 50000 non-null int64 \n", " 8 gearbox 47320 non-null object\n", " 9 powerPS 50000 non-null int64 \n", " 10 model 47242 non-null object\n", " 11 odometer 50000 non-null object\n", " 12 monthOfRegistration 50000 non-null int64 \n", " 13 fuelType 45518 non-null object\n", " 14 brand 50000 non-null object\n", " 15 notRepairedDamage 40171 non-null object\n", " 16 dateCreated 50000 non-null object\n", " 17 nrOfPictures 50000 non-null int64 \n", " 18 postalCode 50000 non-null int64 \n", " 19 lastSeen 50000 non-null object\n", "dtypes: int64(5), object(15)\n", "memory usage: 7.6+ MB\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateCrawlednamesellerofferTypepriceabtestvehicleTypeyearOfRegistrationgearboxpowerPSmodelodometermonthOfRegistrationfuelTypebrandnotRepairedDamagedateCreatednrOfPicturespostalCodelastSeen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " dateCrawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offerType price abtest vehicleType yearOfRegistration \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox powerPS model odometer monthOfRegistration fuelType \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand notRepairedDamage dateCreated nrOfPictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postalCode lastSeen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.info()\n", "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `seller`, `no_of_pics` and `offer_type` column seems to have only one value for 99.99% of the data and hence can be ignored safely\n", "* `registration_year` seem to have some odd values suggested by the min being 1000 and max being 9999. This needs further investigation\n", "* `registration_month` as well have invalid values suggested by the min value being 0. This needs further investigation\n", "* `powerPS` column has **0** for about 11% of the data and that is invalid for that column\n", "* `fuel_type` column has **NaN** for about 10% of the data which we would want to exclude for any analysis based on that column\n", "* `unrepaired_damage` column has **NaN** for about 20% of the data which we want to exclude for any analysis based on that column\n", "* Columns `price`, `odometer` ideally should be numeric for our analysis\n", "* Columns `dateCrawled`,`dateCreated`, and `lastSeen` should be Date type\n", "\n", "Additionally, the column names use camelCase instead of the preferred snake_case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleansing\n", "\n", "\n", "### 1. Rename labels\n", "\n", "We will clean/rename the column names so that we have consistent snake_case and also reduce some lengthy column names to more apt shorter ones so that it will aid when we use those as labels in our data analysis." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_psmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_createdpictures_countpostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offer_type price abtest vehicle_type registration_year \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox power_ps model odometer registration_month fuel_type \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand unrepaired_damage ad_created pictures_count \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postal_code last_seen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',\n", " 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',\n", " 'odometer', 'registration_month', 'fuel_type', 'brand',\n", " 'unrepaired_damage', 'ad_created', 'pictures_count', 'postal_code',\n", " 'last_seen']\n", "autos.columns = new_columns\n", "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Datatype conversions\n", "\n", "We will describe the statistics of the data across the columns to:\n", " - Drop any text columns which have almost all values as the same\n", " - Convert the data on any columns which have numeric data which are stored as text" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_psmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_createdpictures_countpostal_codelast_seen
count5000050000500005000050000500004490550000.0000004732050000.000000472425000050000.0000004551850000401715000050000.050000.00000050000
unique482133875422235728NaN2NaN24513NaN740276NaNNaN39481
top2016-03-16 21:50:53Ford_FiestaprivatAngebot$0testlimousineNaNmanuellNaNgolf150,000kmNaNbenzinvolkswagennein2016-04-03 00:00:00NaNNaN2016-04-07 06:17:27
freq378499994999914212575612859NaN36993NaN402432424NaN3010710687352321946NaNNaN8
meanNaNNaNNaNNaNNaNNaNNaN2005.073280NaN116.355920NaNNaN5.723360NaNNaNNaNNaN0.050813.627300NaN
stdNaNNaNNaNNaNNaNNaNNaN105.712813NaN209.216627NaNNaN3.711984NaNNaNNaNNaN0.025779.747957NaN
minNaNNaNNaNNaNNaNNaNNaN1000.000000NaN0.000000NaNNaN0.000000NaNNaNNaNNaN0.01067.000000NaN
25%NaNNaNNaNNaNNaNNaNNaN1999.000000NaN70.000000NaNNaN3.000000NaNNaNNaNNaN0.030451.000000NaN
50%NaNNaNNaNNaNNaNNaNNaN2003.000000NaN105.000000NaNNaN6.000000NaNNaNNaNNaN0.049577.000000NaN
75%NaNNaNNaNNaNNaNNaNNaN2008.000000NaN150.000000NaNNaN9.000000NaNNaNNaNNaN0.071540.000000NaN
maxNaNNaNNaNNaNNaNNaNNaN9999.000000NaN17700.000000NaNNaN12.000000NaNNaNNaNNaN0.099998.000000NaN
\n", "
" ], "text/plain": [ " date_crawled name seller offer_type price abtest \\\n", "count 50000 50000 50000 50000 50000 50000 \n", "unique 48213 38754 2 2 2357 2 \n", "top 2016-03-16 21:50:53 Ford_Fiesta privat Angebot $0 test \n", "freq 3 78 49999 49999 1421 25756 \n", "mean NaN NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN NaN \n", "\n", " vehicle_type registration_year gearbox power_ps model \\\n", "count 44905 50000.000000 47320 50000.000000 47242 \n", "unique 8 NaN 2 NaN 245 \n", "top limousine NaN manuell NaN golf \n", "freq 12859 NaN 36993 NaN 4024 \n", "mean NaN 2005.073280 NaN 116.355920 NaN \n", "std NaN 105.712813 NaN 209.216627 NaN \n", "min NaN 1000.000000 NaN 0.000000 NaN \n", "25% NaN 1999.000000 NaN 70.000000 NaN \n", "50% NaN 2003.000000 NaN 105.000000 NaN \n", "75% NaN 2008.000000 NaN 150.000000 NaN \n", "max NaN 9999.000000 NaN 17700.000000 NaN \n", "\n", " odometer registration_month fuel_type brand unrepaired_damage \\\n", "count 50000 50000.000000 45518 50000 40171 \n", "unique 13 NaN 7 40 2 \n", "top 150,000km NaN benzin volkswagen nein \n", "freq 32424 NaN 30107 10687 35232 \n", "mean NaN 5.723360 NaN NaN NaN \n", "std NaN 3.711984 NaN NaN NaN \n", "min NaN 0.000000 NaN NaN NaN \n", "25% NaN 3.000000 NaN NaN NaN \n", "50% NaN 6.000000 NaN NaN NaN \n", "75% NaN 9.000000 NaN NaN NaN \n", "max NaN 12.000000 NaN NaN NaN \n", "\n", " ad_created pictures_count postal_code last_seen \n", "count 50000 50000.0 50000.000000 50000 \n", "unique 76 NaN NaN 39481 \n", "top 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27 \n", "freq 1946 NaN NaN 8 \n", "mean NaN 0.0 50813.627300 NaN \n", "std NaN 0.0 25779.747957 NaN \n", "min NaN 0.0 1067.000000 NaN \n", "25% NaN 0.0 30451.000000 NaN \n", "50% NaN 0.0 49577.000000 NaN \n", "75% NaN 0.0 71540.000000 NaN \n", "max NaN 0.0 99998.000000 NaN " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.describe(include=\"all\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the above table, we can infer the following:\n", "- Columns `seller`, `offer_type` and `pictures_count` have the same text data for almost the entire dataset and hence can safely be ignored\n", "- Columns `price`, `odometer` are numeric values stored as text with its metric\n", "\n", "Let's now look at converting the text to numeric on the columns `price` and `odometer`" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===== Odometer =====\n", "150,000km 32424\n", "125,000km 5170\n", "100,000km 2169\n", "90,000km 1757\n", "80,000km 1436\n", "70,000km 1230\n", "60,000km 1164\n", "50,000km 1027\n", "5,000km 967\n", "40,000km 819\n", "30,000km 789\n", "20,000km 784\n", "10,000km 264\n", "Name: odometer, dtype: int64\n", "\n", "\n", "===== Price =====\n", "$0 1421\n", "$500 781\n", "$1,500 734\n", "$2,500 643\n", "$1,000 639\n", " ... \n", "$23,790 1\n", "$9,655 1\n", "$29,445 1\n", "$29,699 1\n", "$35,800 1\n", "Name: price, Length: 2357, dtype: int64\n" ] } ], "source": [ "print(\"=\"*5,\"Odometer\",\"=\"*5)\n", "print(autos[\"odometer\"].value_counts())\n", "print('\\n')\n", "print(\"=\"*5,\"Price\",\"=\"*5)\n", "print(autos[\"price\"].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to remove the non-numeric characters as seen above on both of these columns and then we can convert to a numeric dtype." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
priceregistration_yearpower_psodometer_kmregistration_monthpictures_countpostal_code
count5.000000e+0450000.00000050000.00000050000.00000050000.00000050000.050000.000000
mean9.840044e+032005.073280116.355920125732.7000005.7233600.050813.627300
std4.811044e+05105.712813209.21662740042.2117063.7119840.025779.747957
min0.000000e+001000.0000000.0000005000.0000000.0000000.01067.000000
25%1.100000e+031999.00000070.000000125000.0000003.0000000.030451.000000
50%2.950000e+032003.000000105.000000150000.0000006.0000000.049577.000000
75%7.200000e+032008.000000150.000000150000.0000009.0000000.071540.000000
max1.000000e+089999.00000017700.000000150000.00000012.0000000.099998.000000
\n", "
" ], "text/plain": [ " price registration_year power_ps odometer_km \\\n", "count 5.000000e+04 50000.000000 50000.000000 50000.000000 \n", "mean 9.840044e+03 2005.073280 116.355920 125732.700000 \n", "std 4.811044e+05 105.712813 209.216627 40042.211706 \n", "min 0.000000e+00 1000.000000 0.000000 5000.000000 \n", "25% 1.100000e+03 1999.000000 70.000000 125000.000000 \n", "50% 2.950000e+03 2003.000000 105.000000 150000.000000 \n", "75% 7.200000e+03 2008.000000 150.000000 150000.000000 \n", "max 1.000000e+08 9999.000000 17700.000000 150000.000000 \n", "\n", " registration_month pictures_count postal_code \n", "count 50000.000000 50000.0 50000.000000 \n", "mean 5.723360 0.0 50813.627300 \n", "std 3.711984 0.0 25779.747957 \n", "min 0.000000 0.0 1067.000000 \n", "25% 3.000000 0.0 30451.000000 \n", "50% 6.000000 0.0 49577.000000 \n", "75% 9.000000 0.0 71540.000000 \n", "max 12.000000 0.0 99998.000000 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Remove all non-numeric characters and then convert to numeric\n", "autos[\"odometer\"] = autos[\"odometer\"].str.replace(\",\",\"\").str.replace(\"km\",\"\").astype(int)\n", "autos[\"price\"] = autos[\"price\"].str.replace(\"$\",\"\").str.replace(\",\",\"\").str.replace(\".\",\"\").astype(int)\n", "# Rename the column so that the metric is evident\n", "autos.rename(columns = {\"odometer\":\"odometer_km\"}, inplace = True)\n", "autos.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have ensured that odometer and price columns are numeric reflecting the values.\n", "\n", "### 3. Remove outliers\n", "\n", "Now we will look at the values in these columns a bit deeper to see if there are any outliers that skew the statistics and remove them" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===== Top 20 - High price listing =====\n", "99999999 1\n", "27322222 1\n", "12345678 3\n", "11111111 2\n", "10000000 1\n", "3890000 1\n", "1300000 1\n", "1234566 1\n", "999999 2\n", "999990 1\n", "350000 1\n", "345000 1\n", "299000 1\n", "295000 1\n", "265000 1\n", "259000 1\n", "250000 1\n", "220000 1\n", "198000 1\n", "197000 1\n", "Name: price, dtype: int64\n", "\n", "\n", "===== Bottom 20 - Low price listing =====\n", "0 1421\n", "1 156\n", "2 3\n", "3 1\n", "5 2\n", "8 1\n", "9 1\n", "10 7\n", "11 2\n", "12 3\n", "13 2\n", "14 1\n", "15 2\n", "17 3\n", "18 1\n", "20 4\n", "25 5\n", "29 1\n", "30 7\n", "35 1\n", "Name: price, dtype: int64\n", "\n", "\n", "===== Statistics of low prices =====\n", "count 1762.000000\n", "mean 5.632804\n", "std 18.374263\n", "min 0.000000\n", "25% 0.000000\n", "50% 0.000000\n", "75% 0.000000\n", "max 99.000000\n", "Name: price, dtype: float64\n", "\n", "\n", "===== Distribution % of high prices =====\n", "12345678 0.200000\n", "999999 0.133333\n", "11111111 0.133333\n", "999990 0.066667\n", "3890000 0.066667\n", "27322222 0.066667\n", "1234566 0.066667\n", "350000 0.066667\n", "99999999 0.066667\n", "10000000 0.066667\n", "1300000 0.066667\n", "Name: price, dtype: float64\n", "\n", "\n", "===== Listing 350,000 and above =====\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namebrandvehicle_typeprice
514Ford_Focus_Turnier_1.6_16V_Stylefordkombi999999
2897Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000fordlimousine11111111
7814Ferrari_F40sonstige_autoscoupe1300000
11137suche_maserati_3200_gt_Zustand_unwichtig_laufe...sonstige_autoscoupe10000000
22947Bmw_530d_zum_ausschlachtenbmwkombi1234566
24384Schlachte_Golf_3_gt_tdivolkswagenNaN11111111
27371Fiat_PuntofiatNaN12345678
36818Porsche_991porschecoupe350000
37585Volkswagen_Jetta_GTvolkswagenlimousine999990
39377Tausche_volvo_v40_gegen_vanvolvoNaN12345678
39705Tausch_gegen_gleichwertigesmercedes_benzlimousine99999999
42221Leasinguebernahmecitroenlimousine27322222
430492_VW_Busse_T3volkswagenbus999999
47598Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar...opellimousine12345678
47634Ferrari_FXXsonstige_autoscoupe3890000
\n", "
" ], "text/plain": [ " name brand \\\n", "514 Ford_Focus_Turnier_1.6_16V_Style ford \n", "2897 Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 ford \n", "7814 Ferrari_F40 sonstige_autos \n", "11137 suche_maserati_3200_gt_Zustand_unwichtig_laufe... sonstige_autos \n", "22947 Bmw_530d_zum_ausschlachten bmw \n", "24384 Schlachte_Golf_3_gt_tdi volkswagen \n", "27371 Fiat_Punto fiat \n", "36818 Porsche_991 porsche \n", "37585 Volkswagen_Jetta_GT volkswagen \n", "39377 Tausche_volvo_v40_gegen_van volvo \n", "39705 Tausch_gegen_gleichwertiges mercedes_benz \n", "42221 Leasinguebernahme citroen \n", "43049 2_VW_Busse_T3 volkswagen \n", "47598 Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... opel \n", "47634 Ferrari_FXX sonstige_autos \n", "\n", " vehicle_type price \n", "514 kombi 999999 \n", "2897 limousine 11111111 \n", "7814 coupe 1300000 \n", "11137 coupe 10000000 \n", "22947 kombi 1234566 \n", "24384 NaN 11111111 \n", "27371 NaN 12345678 \n", "36818 coupe 350000 \n", "37585 limousine 999990 \n", "39377 NaN 12345678 \n", "39705 limousine 99999999 \n", "42221 limousine 27322222 \n", "43049 bus 999999 \n", "47598 limousine 12345678 \n", "47634 coupe 3890000 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"=\"*5,\"Top 20 - High price listing\",\"=\"*5)\n", "print(autos[\"price\"].value_counts().sort_index(ascending = False).head(20))\n", "print('\\n')\n", "print(\"=\"*5,\"Bottom 20 - Low price listing\",\"=\"*5)\n", "print(autos[\"price\"].value_counts().sort_index(ascending = True).head(20))\n", "print('\\n')\n", "print(\"=\"*5,\"Statistics of low prices\",\"=\"*5)\n", "print(autos.loc[autos[\"price\"] < 100, \"price\"].describe())\n", "print('\\n')\n", "print(\"=\"*5,\"Distribution % of high prices\",\"=\"*5)\n", "print(autos.loc[autos[\"price\"].between(350000,99999999),\"price\"].value_counts(normalize=True, dropna=False))\n", "print('\\n')\n", "print(\"=\"*5,\"Listing 350,000 and above\",\"=\"*5)\n", "autos.loc[autos[\"price\"] >= 350000,[\"name\",\"brand\",\"vehicle_type\",\"price\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on above display of `price` column, we see few astronomically high prices and lot of listing with prices that are miniscule which would overall skew our results.\n", "\n", "On closer look at these high price listing, few are true vintage cars or high performance cars (such as Ferrari_F40) that demand these high prices and some are mistakes in the dataset (such as Ford Punto, Escort MK1 or Maserati 3200 GT).\n", "\n", "Same way if we look at miniscule prices (between 0 and 100) and potentially these might be scrap rather than any potential use of the car.\n", "\n", "But since large part of these are mistakes, we will remove any listing that has a price of above 350,000, and similarly remove any listing that has a price less than 100.\n", "\n", "Let's now look at the odometer values and see if there are any outliers." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of unique odometer values: 13\n" ] }, { "data": { "text/plain": [ "150000 32424\n", "125000 5170\n", "100000 2169\n", "90000 1757\n", "80000 1436\n", "70000 1230\n", "60000 1164\n", "50000 1027\n", "40000 819\n", "30000 789\n", "20000 784\n", "10000 264\n", "5000 967\n", "Name: odometer_km, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"Number of unique odometer values: \",autos[\"odometer_km\"].unique().shape[0])\n", "autos[\"odometer_km\"].value_counts().sort_index(ascending = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From above display of `odometer_km` data, we could infer two things:\n", "\n", "1. It seems based on number of unique values that it possibly is already bucketed and grouped on certain km ranges\n", "2. It looks normal for used car listing to have done these kilometer ranges\n", "\n", "Hence, we will keep these and only remove the outliers based on the price and then display the statistics of our cleaned dataset." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
priceregistration_yearpower_psodometer_kmregistration_monthpictures_countpostal_code
count48224.00000048224.00000048224.00000048224.00000048224.00000048224.048224.000000
mean5930.3714332004.730964117.677609125919.1481425.8016340.050987.919729
std9078.37276287.897388201.20630439543.3396403.6769760.025737.119986
min100.0000001000.0000000.0000005000.0000000.0000000.01067.000000
25%1250.0000001999.00000073.000000125000.0000003.0000000.030823.000000
50%3000.0000002004.000000107.000000150000.0000006.0000000.049716.000000
75%7499.0000002008.000000150.000000150000.0000009.0000000.071666.750000
max350000.0000009999.00000017700.000000150000.00000012.0000000.099998.000000
\n", "
" ], "text/plain": [ " price registration_year power_ps odometer_km \\\n", "count 48224.000000 48224.000000 48224.000000 48224.000000 \n", "mean 5930.371433 2004.730964 117.677609 125919.148142 \n", "std 9078.372762 87.897388 201.206304 39543.339640 \n", "min 100.000000 1000.000000 0.000000 5000.000000 \n", "25% 1250.000000 1999.000000 73.000000 125000.000000 \n", "50% 3000.000000 2004.000000 107.000000 150000.000000 \n", "75% 7499.000000 2008.000000 150.000000 150000.000000 \n", "max 350000.000000 9999.000000 17700.000000 150000.000000 \n", "\n", " registration_month pictures_count postal_code \n", "count 48224.000000 48224.0 48224.000000 \n", "mean 5.801634 0.0 50987.919729 \n", "std 3.676976 0.0 25737.119986 \n", "min 0.000000 0.0 1067.000000 \n", "25% 3.000000 0.0 30823.000000 \n", "50% 6.000000 0.0 49716.000000 \n", "75% 9.000000 0.0 71666.750000 \n", "max 12.000000 0.0 99998.000000 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos = autos.loc[autos[\"price\"].between(100,350000)]\n", "autos.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Date column cleansing: Part 1\n", "\n", "From our dataset, there are 5 columns that represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:\n", "\n", "- `date_crawled`: added by the crawler\n", "- `last_seen`: added by the crawler\n", "- `ad_created`: from the website\n", "- `registration_month`: from the website\n", "- `registration_year`: from the website\n", "\n", "We see that `date_crawled`, `last_seen` and `ad_created` columns are identified as string values by pandas. \n", "\n", "We will look at the string format on these columns." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawledad_createdlast_seen
02016-03-26 17:47:462016-03-26 00:00:002016-04-06 06:45:54
12016-04-04 13:38:562016-04-04 00:00:002016-04-06 14:45:08
22016-03-26 18:57:242016-03-26 00:00:002016-04-06 20:15:37
32016-03-12 16:58:102016-03-12 00:00:002016-03-15 03:16:28
42016-04-01 14:38:502016-04-01 00:00:002016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled ad_created last_seen\n", "0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54\n", "1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08\n", "2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37\n", "3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28\n", "4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[['date_crawled','ad_created','last_seen']][0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that these 3 columns have the initial 10 characters representing the date (In the format Year-Month-Day).\n", "\n", "Let's now understand the distribution of data using this date format across the columns." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===== Distribution of date_crawled column =====\n", "2016-03-05 0.025361\n", "2016-03-06 0.014039\n", "2016-03-07 0.036061\n", "2016-03-08 0.033179\n", "2016-03-09 0.033013\n", "2016-03-10 0.032287\n", "2016-03-11 0.032598\n", "2016-03-12 0.036911\n", "2016-03-13 0.015677\n", "2016-03-14 0.036662\n", "2016-03-15 0.034319\n", "2016-03-16 0.029467\n", "2016-03-17 0.031499\n", "2016-03-18 0.012898\n", "2016-03-19 0.034734\n", "2016-03-20 0.037803\n", "2016-03-21 0.037201\n", "2016-03-22 0.032888\n", "2016-03-23 0.032287\n", "2016-03-24 0.029446\n", "2016-03-25 0.031499\n", "2016-03-26 0.032308\n", "2016-03-27 0.031126\n", "2016-03-28 0.034962\n", "2016-03-29 0.034112\n", "2016-03-30 0.033738\n", "2016-03-31 0.031851\n", "2016-04-01 0.033697\n", "2016-04-02 0.035605\n", "2016-04-03 0.038611\n", "2016-04-04 0.036538\n", "2016-04-05 0.013064\n", "2016-04-06 0.003173\n", "2016-04-07 0.001389\n", "Name: date_crawled, dtype: float64\n" ] } ], "source": [ "print(\"=\"*5,\"Distribution of date_crawled column\",\"=\"*5)\n", "print(autos[\"date_crawled\"].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending = True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. The data was pulled from the eBay website over a month from 5th March to 7th April, 2016\n", "2. Amount of data crawled is evenly spread except the last two days when the data crawled was very less in proportion to others" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===== Distribution of Top 5 most recent ad_created =====\n", "2016-04-07 0.001244\n", "2016-04-06 0.003256\n", "2016-04-05 0.011799\n", "2016-04-04 0.036890\n", "2016-04-03 0.038860\n", "Name: ad_created, dtype: float64\n", "\n", "\n", "===== Distribution of Bottom 5 most old ad_created =====\n", "2015-06-11 0.000021\n", "2015-08-10 0.000021\n", "2015-09-09 0.000021\n", "2015-11-10 0.000021\n", "2015-12-05 0.000021\n", "Name: ad_created, dtype: float64\n" ] } ], "source": [ "print(\"=\"*5,\"Distribution of Top 5 most recent ad_created\",\"=\"*5)\n", "print(autos[\"ad_created\"].str[:10].value_counts(normalize=True).sort_index(ascending=False).head(5))\n", "print(\"\\n\")\n", "print(\"=\"*5,\"Distribution of Bottom 5 most old ad_created\",\"=\"*5)\n", "print(autos[\"ad_created\"].str[:10].value_counts(normalize=True).sort_index(ascending=True).head(5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the exploration of the `ad_created` column, we observe the following:\n", "1. It shows a pattern that we saw with `date_crawled` column that most of the data is distributed between 9th March to 5th April, 2016\n", "2. There are listings with `ad_created` up to an year before the `date_crawled` date distribution" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===== Distribution of last_seen column =====\n" ] }, { "data": { "text/plain": [ "2016-04-06 0.221964\n", "2016-04-07 0.132154\n", "2016-04-05 0.125062\n", "2016-03-17 0.028098\n", "2016-04-03 0.025133\n", "2016-04-02 0.024884\n", "2016-03-30 0.024697\n", "2016-04-04 0.024531\n", "2016-03-31 0.023826\n", "2016-03-12 0.023785\n", "2016-04-01 0.022852\n", "2016-03-29 0.022292\n", "2016-03-22 0.021359\n", "2016-03-28 0.020840\n", "2016-03-20 0.020654\n", "2016-03-21 0.020550\n", "2016-03-24 0.019762\n", "2016-03-25 0.019098\n", "2016-03-23 0.018580\n", "2016-03-26 0.016672\n", "2016-03-16 0.016444\n", "2016-03-15 0.015863\n", "2016-03-19 0.015760\n", "2016-03-27 0.015552\n", "2016-03-14 0.012629\n", "2016-03-11 0.012400\n", "2016-03-10 0.010638\n", "2016-03-09 0.009580\n", "2016-03-13 0.008875\n", "2016-03-18 0.007320\n", "2016-03-08 0.007320\n", "2016-03-07 0.005433\n", "2016-03-06 0.004313\n", "2016-03-05 0.001078\n", "Name: last_seen, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"=\"*5,\"Distribution of last_seen column\",\"=\"*5)\n", "autos[\"last_seen\"].str[:10].value_counts(normalize=True, ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the distribution on column `last_seen`:\n", "1. Concurs with what we saw with `date_crawled` column that distribution is between 5th March to 7th April, 2016\n", "2. Here purely based on distribution percentages and most data is distributed towards 6th and 7th of April and could suggest a good few days lag between when the ad was created when it was last seen - So might not be a high traffic consumption of listing being viewed immediately.\n", "\n", "### 5. Date column cleansing: Part 2\n", "\n", "Let's now look at the column `registration_year`. Here we are only dealing with the **year** part of the date. \n", "\n", "It's important to understand the data in this column and any outliers here as this column probably suggests the registration year of the car being listed and in turn useful in our analysis based on the car's age." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count 48224.000000\n", "mean 2004.730964\n", "std 87.897388\n", "min 1000.000000\n", "25% 1999.000000\n", "50% 2004.000000\n", "75% 2008.000000\n", "max 9999.000000\n", "Name: registration_year, dtype: float64 \n", "\n", "1000 0.000021\n", "1001 0.000021\n", "1111 0.000021\n", "1800 0.000041\n", "1910 0.000041\n", "Name: registration_year, dtype: float64\n", "5911 0.000021\n", "6200 0.000021\n", "8888 0.000021\n", "9000 0.000021\n", "9999 0.000062\n", "Name: registration_year, dtype: float64\n" ] } ], "source": [ "# To understand the statistics of the data behind the series\n", "print(autos[\"registration_year\"].describe(),'\\n')\n", "\n", "# To understand the data distribution percentage on the column\n", "print(autos[\"registration_year\"].value_counts(normalize=True).sort_index(ascending=True).head(5))\n", "print(autos[\"registration_year\"].value_counts(normalize=True).sort_index(ascending=True).tail(5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the statistics on the `registration_year`, it clear that some of the data is not a year evident from the min (1000) and max (9999) values.\n", "\n", "Also, through the distribution of data we see:\n", "1. Large part of the data is distributed between year 2000 and 2014 (~85%)\n", "2. Even if we factor some vintage cars in the listing with old registration years, still there are some incorrect values that might have to be corrected. e.g. 1000, 1001, 1111\n", "3. There are car registration year beyond when the ad the month/year the actual list was created. e.g. Mar-Apr 2016 being our ad listing date ranges and we can assume these are incorrect values\n", "\n", "We will make an assumption that infact some vintage cars are in our listing so we will use the range being 1900 to 2016 and anything outside of this range as incorrect.\n", "\n", "Note: We have ignored the car listings in 1800 as invalid for the following reasons:\n", "\n", "- The period of 1800 is too early as first automobiles came only about 1886\n", "- Brand of the car in these listings says Mitsubishi (they were not founded until 1960's)\n", "\n", "Let's look at the number of listings between our assumed valid registration year ranges and outside of this range. We will also remove the data outside our year ranges." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Listings within valid year range = 46352 \n", "\n", "Listings outside valid year range = 1872 \n", "\n", "1910 0.000043\n", "1927 0.000022\n", "1929 0.000022\n", "1931 0.000022\n", "1934 0.000043\n", "Name: registration_year, dtype: float64\n", "2012 0.028219\n", "2013 0.017281\n", "2014 0.014282\n", "2015 0.008198\n", "2016 0.025932\n", "Name: registration_year, dtype: float64\n" ] } ], "source": [ "# To see how many listings we have within our assumed valid registration year range.\n", "print(\"Listings within valid year range = \",autos[\"registration_year\"].between(1900,2016).sum(),\"\\n\")\n", "\n", "print(\"Listings outside valid year range = \",(~autos[\"registration_year\"].between(1900,2016)).sum(),\"\\n\")\n", "\n", "# To remove the values outside of our assumed valid registration year range\n", "autos = autos[ autos[\"registration_year\"].between(1900,2016) ]\n", "\n", "# To see the data distribtion now in our valid data set\n", "print(autos[\"registration_year\"].value_counts(normalize=True).sort_index(ascending=True).head(5))\n", "print(autos[\"registration_year\"].value_counts(normalize=True).sort_index(ascending=True).tail(5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Analysis and Inference\n", "### 1. Car brand variation\n", "\n", "Now we will take a look at the `brand` column to perform some analysis and find the mean prices based on the brand." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of car brands in the listing: 40\n" ] } ], "source": [ "print('Number of car brands in the listing: ',autos[\"brand\"].unique().shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that there are 40 unique car brands in our current cleansed dataset.\n", "\n", "### 2. Average Price and Mileage analysis\n", "\n", "For the purposes of this project, we are going to consider only the **top 10** car brands by distribution." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "audi 9380.72\n", "mercedes_benz 8672.65\n", "bmw 8381.68\n", "volkswagen 5436.95\n", "seat 4433.42\n", "ford 3779.27\n", "peugeot 3113.86\n", "opel 3005.50\n", "fiat 2836.87\n", "renault 2496.07\n", "dtype: float64\n" ] } ], "source": [ "# Select only top 10 brands based on the distribution in the car listing\n", "car_brands = autos[\"brand\"].value_counts(normalize = True, dropna = False).head(10).index\n", "brand_mean_price = {}\n", "brand_mean_mileage = {}\n", "# Loop through top 10 brands to calculate mean price of cars\n", "for brand in car_brands:\n", " mean_price = autos.loc[autos[\"brand\"] == brand, \"price\"].mean()\n", " mean_mileage = autos.loc[autos[\"brand\"] == brand, \"odometer_km\"].mean()\n", " brand_mean_price[brand] = round(mean_price,2)\n", " brand_mean_mileage[brand] = round(mean_mileage,2)\n", "# Convert the brand aggregate data into Panda series to sort by values\n", "brand_agg_mean_price = pd.Series(brand_mean_price)\n", "brand_agg_mean_mileage = pd.Series(brand_mean_mileage)\n", "print(brand_agg_mean_price.sort_values(ascending = False))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the above analysis, it seems among the top 10 car brands by distribution:\n", "1. **Audi**, **Mercedes Benz**, and **BMW** seems to have a higher mean price\n", "2. **Opel**, **Fiat**, and **Renault** are the lowest by significant margin\n", "3. **Volkswagen** and **Seat** are the sweet spot in the middle\n", "\n", "We will also compare this average price for the top 10 brands against it's average mileage.\n", "\n", "For this, we add our mean mileage as well and make a dataframe so that we can compare side by side." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_pricemean_mileage
volkswagen5436.95128799.88
bmw8381.68132695.32
opel3005.50129384.43
mercedes_benz8672.65131025.67
audi9380.72129245.40
ford3779.27124277.11
renault2496.07128281.39
peugeot3113.86127127.89
fiat2836.87116950.29
seat4433.42121536.64
\n", "
" ], "text/plain": [ " mean_price mean_mileage\n", "volkswagen 5436.95 128799.88\n", "bmw 8381.68 132695.32\n", "opel 3005.50 129384.43\n", "mercedes_benz 8672.65 131025.67\n", "audi 9380.72 129245.40\n", "ford 3779.27 124277.11\n", "renault 2496.07 128281.39\n", "peugeot 3113.86 127127.89\n", "fiat 2836.87 116950.29\n", "seat 4433.42 121536.64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a dataframe with a single column from a series\n", "brand_agg = pd.DataFrame(brand_agg_mean_price, columns=[\"mean_price\"])\n", "# Add a column to the dataframe with another series sharing same label\n", "brand_agg[\"mean_mileage\"] = brand_agg_mean_mileage\n", "brand_agg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now by comparing the average price alongside the average mileage, we see that the mileage range for **Audi**, **Mercedes Benz**, and **BMW** are all close in the same range.\n", "\n", "## Conclusion\n", "\n", "We started with data cleansing as a big part of this project and we did the following:\n", "\n", "1. Renamed column names to a consistent format\n", "2. Converted certain numeric data which were stored as text\n", "3. Removed outliers:\n", " - We removed listings that was outside of our price range we assumed was realistic (100 and 350000)\n", " - We removed listings that was outside of our registered_year range that was realistic (1900 to 2016)\n", " \n", "After cleansing the data, we started with our data analysis and listed the Top 10 car brands by price.\n", "\n", "Then upon analysing the average price and mileage of these top 10 car brands, we concluded that:\n", "\n", "1. **Audi**, **Mercedes Benz**, and **BMW** are the Top 3 brands per average car price and all of these are in the same mileage range\n", "2. **Volkswagen** and **Seat** are the best brands for value of money in terms of average price and mileage" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }