{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Guided Project: Exploring Ebay Car Sales Data\n", "\n", "In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.\n", "\n", "The dataset was originally scraped and uploaded to Kaggle. A few modifications have been made from the original dataset that was uploaded to Kaggle:\n", "\n", "- 50,000 data points from the full dataset have been sampled\n", "- The dataset has been dirtied a bit to more closely resemble what would be expected from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with)\n", "\n", "The file and relevant libraries have been uploaded below" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "autos = pd.read_csv('autos.csv', encoding = \"Latin-1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial inspection\n", "\n", "- Writing the name of a panda variable, will show the first few and last few rows\n", "- The file has been initally inspected, with info and head panda functions. The columns at index values 6, 8, 10, 13 and 15 all have null values. \n", "- There are also some numeric values, currently represented as a string e.g. `price`" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Data columns (total 20 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 dateCrawled 50000 non-null object\n", " 1 name 50000 non-null object\n", " 2 seller 50000 non-null object\n", " 3 offerType 50000 non-null object\n", " 4 price 50000 non-null object\n", " 5 abtest 50000 non-null object\n", " 6 vehicleType 44905 non-null object\n", " 7 yearOfRegistration 50000 non-null int64 \n", " 8 gearbox 47320 non-null object\n", " 9 powerPS 50000 non-null int64 \n", " 10 model 47242 non-null object\n", " 11 odometer 50000 non-null object\n", " 12 monthOfRegistration 50000 non-null int64 \n", " 13 fuelType 45518 non-null object\n", " 14 brand 50000 non-null object\n", " 15 notRepairedDamage 40171 non-null object\n", " 16 dateCreated 50000 non-null object\n", " 17 nrOfPictures 50000 non-null int64 \n", " 18 postalCode 50000 non-null int64 \n", " 19 lastSeen 50000 non-null object\n", "dtypes: int64(5), object(15)\n", "memory usage: 7.6+ MB\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateCrawlednamesellerofferTypepriceabtestvehicleTypeyearOfRegistrationgearboxpowerPSmodelodometermonthOfRegistrationfuelTypebrandnotRepairedDamagedateCreatednrOfPicturespostalCodelastSeen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " dateCrawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offerType price abtest vehicleType yearOfRegistration \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox powerPS model odometer monthOfRegistration fuelType \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand notRepairedDamage dateCreated nrOfPictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postalCode lastSeen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos\n", "autos.info()\n", "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The headings have been provided in CamelCase, rather than Snake_case\n", "- I will manually access and edit these, using the dataframe.column function" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',\n", " 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',\n", " 'odometer', 'monthOfRegistration', 'fuelType', 'brand',\n", " 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',\n", " 'lastSeen'],\n", " dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.columns" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "New_headings = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',\n", " 'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model',\n", " 'odometer', 'registration_month', 'fuel_type', 'brand',\n", " 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',\n", " 'last_seen']\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_PSmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_creatednr_of_picturespostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offer_type price abtest vehicle_type registration_year \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox power_PS model odometer registration_month fuel_type \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand unrepaired_damage ad_created nr_of_pictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postal_code last_seen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.columns = New_headings\n", "\n", "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Next we will do some basic data exploration to determine what other cleaning tasks need to be done. \n", "- Initially we will look for: - Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. - Examples of numeric data stored as text which can be cleaned and converted." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_PSmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_creatednr_of_picturespostal_codelast_seen
count5000050000500005000050000500004490550000.0000004732050000.000000472425000050000.0000004551850000401715000050000.050000.00000050000
unique482133875422235728NaN2NaN24513NaN740276NaNNaN39481
top2016-03-12 16:06:22Ford_FiestaprivatAngebot$0testlimousineNaNmanuellNaNgolf150,000kmNaNbenzinvolkswagennein2016-04-03 00:00:00NaNNaN2016-04-07 06:17:27
freq378499994999914212575612859NaN36993NaN402432424NaN3010710687352321946NaNNaN8
meanNaNNaNNaNNaNNaNNaNNaN2005.073280NaN116.355920NaNNaN5.723360NaNNaNNaNNaN0.050813.627300NaN
stdNaNNaNNaNNaNNaNNaNNaN105.712813NaN209.216627NaNNaN3.711984NaNNaNNaNNaN0.025779.747957NaN
minNaNNaNNaNNaNNaNNaNNaN1000.000000NaN0.000000NaNNaN0.000000NaNNaNNaNNaN0.01067.000000NaN
25%NaNNaNNaNNaNNaNNaNNaN1999.000000NaN70.000000NaNNaN3.000000NaNNaNNaNNaN0.030451.000000NaN
50%NaNNaNNaNNaNNaNNaNNaN2003.000000NaN105.000000NaNNaN6.000000NaNNaNNaNNaN0.049577.000000NaN
75%NaNNaNNaNNaNNaNNaNNaN2008.000000NaN150.000000NaNNaN9.000000NaNNaNNaNNaN0.071540.000000NaN
maxNaNNaNNaNNaNNaNNaNNaN9999.000000NaN17700.000000NaNNaN12.000000NaNNaNNaNNaN0.099998.000000NaN
\n", "
" ], "text/plain": [ " date_crawled name seller offer_type price abtest \\\n", "count 50000 50000 50000 50000 50000 50000 \n", "unique 48213 38754 2 2 2357 2 \n", "top 2016-03-12 16:06:22 Ford_Fiesta privat Angebot $0 test \n", "freq 3 78 49999 49999 1421 25756 \n", "mean NaN NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN NaN \n", "\n", " vehicle_type registration_year gearbox power_PS model \\\n", "count 44905 50000.000000 47320 50000.000000 47242 \n", "unique 8 NaN 2 NaN 245 \n", "top limousine NaN manuell NaN golf \n", "freq 12859 NaN 36993 NaN 4024 \n", "mean NaN 2005.073280 NaN 116.355920 NaN \n", "std NaN 105.712813 NaN 209.216627 NaN \n", "min NaN 1000.000000 NaN 0.000000 NaN \n", "25% NaN 1999.000000 NaN 70.000000 NaN \n", "50% NaN 2003.000000 NaN 105.000000 NaN \n", "75% NaN 2008.000000 NaN 150.000000 NaN \n", "max NaN 9999.000000 NaN 17700.000000 NaN \n", "\n", " odometer registration_month fuel_type brand unrepaired_damage \\\n", "count 50000 50000.000000 45518 50000 40171 \n", "unique 13 NaN 7 40 2 \n", "top 150,000km NaN benzin volkswagen nein \n", "freq 32424 NaN 30107 10687 35232 \n", "mean NaN 5.723360 NaN NaN NaN \n", "std NaN 3.711984 NaN NaN NaN \n", "min NaN 0.000000 NaN NaN NaN \n", "25% NaN 3.000000 NaN NaN NaN \n", "50% NaN 6.000000 NaN NaN NaN \n", "75% NaN 9.000000 NaN NaN NaN \n", "max NaN 12.000000 NaN NaN NaN \n", "\n", " ad_created nr_of_pictures postal_code last_seen \n", "count 50000 50000.0 50000.000000 50000 \n", "unique 76 NaN NaN 39481 \n", "top 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27 \n", "freq 1946 NaN NaN 8 \n", "mean NaN 0.0 50813.627300 NaN \n", "std NaN 0.0 25779.747957 NaN \n", "min NaN 0.0 1067.000000 NaN \n", "25% NaN 0.0 30451.000000 NaN \n", "50% NaN 0.0 49577.000000 NaN \n", "75% NaN 0.0 71540.000000 NaN \n", "max NaN 0.0 99998.000000 NaN " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.describe(include = 'all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The `seller` and `offer_type` columns each contain only 1 unique value, so we will consider dropping them" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The columns `price` and `odometer`, are objects, so we will consider converting them to numeric values by stripping out the non-numeric characters, then converting to int64 or float64 and finally renaming the column" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_PSmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_createdpostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:00795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:00710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:00353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:00337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:00392182016-04-01 14:38:50
............................................................
499952016-03-27 14:38:19Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__XenonprivatAngebot$24,900controllimousine2011automatik239q5100,000km1dieselaudinein2016-03-27 00:00:00821312016-04-01 13:47:40
499962016-03-28 10:50:25Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...privatAngebot$1,980controlcabrio1996manuell75astra150,000km5benzinopelnein2016-03-28 00:00:00448072016-04-02 14:18:02
499972016-04-02 14:44:48Fiat_500_C_1.2_Dualogic_LoungeprivatAngebot$13,200testcabrio2014automatik695005,000km11benzinfiatnein2016-04-02 00:00:00734302016-04-04 11:47:27
499982016-03-08 19:25:42Audi_A3_2.0_TDI_Sportback_AmbitionprivatAngebot$22,900controlkombi2013manuell150a340,000km11dieselaudinein2016-03-08 00:00:00356832016-04-05 16:45:07
499992016-03-14 00:42:12Opel_Vectra_1.6_16VprivatAngebot$1,250controllimousine1996manuell101vectra150,000km1benzinopelnein2016-03-13 00:00:00458972016-04-06 21:18:48
\n", "

50000 rows × 19 columns

\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "... ... ... \n", "49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon \n", "49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... \n", "49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge \n", "49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition \n", "49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V \n", "\n", " seller offer_type price abtest vehicle_type registration_year \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "... ... ... ... ... ... ... \n", "49995 privat Angebot $24,900 control limousine 2011 \n", "49996 privat Angebot $1,980 control cabrio 1996 \n", "49997 privat Angebot $13,200 test cabrio 2014 \n", "49998 privat Angebot $22,900 control kombi 2013 \n", "49999 privat Angebot $1,250 control limousine 1996 \n", "\n", " gearbox power_PS model odometer registration_month fuel_type \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "... ... ... ... ... ... ... \n", "49995 automatik 239 q5 100,000km 1 diesel \n", "49996 manuell 75 astra 150,000km 5 benzin \n", "49997 automatik 69 500 5,000km 11 benzin \n", "49998 manuell 150 a3 40,000km 11 diesel \n", "49999 manuell 101 vectra 150,000km 1 benzin \n", "\n", " brand unrepaired_damage ad_created postal_code \\\n", "0 peugeot nein 2016-03-26 00:00:00 79588 \n", "1 bmw nein 2016-04-04 00:00:00 71034 \n", "2 volkswagen nein 2016-03-26 00:00:00 35394 \n", "3 smart nein 2016-03-12 00:00:00 33729 \n", "4 ford nein 2016-04-01 00:00:00 39218 \n", "... ... ... ... ... \n", "49995 audi nein 2016-03-27 00:00:00 82131 \n", "49996 opel nein 2016-03-28 00:00:00 44807 \n", "49997 fiat nein 2016-04-02 00:00:00 73430 \n", "49998 audi nein 2016-03-08 00:00:00 35683 \n", "49999 opel nein 2016-03-13 00:00:00 45897 \n", "\n", " last_seen \n", "0 2016-04-06 06:45:54 \n", "1 2016-04-06 14:45:08 \n", "2 2016-04-06 20:15:37 \n", "3 2016-03-15 03:16:28 \n", "4 2016-04-01 14:38:50 \n", "... ... \n", "49995 2016-04-01 13:47:40 \n", "49996 2016-04-02 14:18:02 \n", "49997 2016-04-04 11:47:27 \n", "49998 2016-04-05 16:45:07 \n", "49999 2016-04-06 21:18:48 \n", "\n", "[50000 rows x 19 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.drop('nr_of_pictures', axis=1)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Data columns (total 20 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 date_crawled 50000 non-null object\n", " 1 name 50000 non-null object\n", " 2 seller 50000 non-null object\n", " 3 offer_type 50000 non-null object\n", " 4 price 50000 non-null object\n", " 5 abtest 50000 non-null object\n", " 6 vehicle_type 44905 non-null object\n", " 7 registration_year 50000 non-null int64 \n", " 8 gearbox 47320 non-null object\n", " 9 power_PS 50000 non-null int64 \n", " 10 model 47242 non-null object\n", " 11 odometer 50000 non-null object\n", " 12 registration_month 50000 non-null int64 \n", " 13 fuel_type 45518 non-null object\n", " 14 brand 50000 non-null object\n", " 15 unrepaired_damage 40171 non-null object\n", " 16 ad_created 50000 non-null object\n", " 17 nr_of_pictures 50000 non-null int64 \n", " 18 postal_code 50000 non-null int64 \n", " 19 last_seen 50000 non-null object\n", "dtypes: int64(5), object(15)\n", "memory usage: 7.6+ MB\n" ] } ], "source": [ "autos.info()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_PSmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_creatednr_of_picturespostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offer_type price abtest vehicle_type registration_year \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox power_PS model odometer registration_month fuel_type \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand unrepaired_damage ad_created nr_of_pictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postal_code last_seen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "autos.loc[:,\"odometer\"] = autos.loc[:,\"odometer\"].str.replace(\"km\",\"\").str.replace(\",\",\"\")\n", "autos.loc[:,\"odometer\"] = autos.loc[:,\"odometer\"].astype(int)\n", "autos.rename({\"odometer\":\"odometer_km\"}, axis = 1, inplace = True)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_PSmodelodometer_kmregistration_monthfuel_typebrandunrepaired_damagead_creatednr_of_picturespostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot5000controlbus2004manuell158andere1500003lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot8500controllimousine1997automatik2867er1500006benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot8990testlimousine2009manuell102golf700007benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot4350controlkleinwagen2007automatik71fortwo700006benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot1350testkombi2003manuell0focus1500007benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offer_type price abtest vehicle_type registration_year \\\n", "0 privat Angebot 5000 control bus 2004 \n", "1 privat Angebot 8500 control limousine 1997 \n", "2 privat Angebot 8990 test limousine 2009 \n", "3 privat Angebot 4350 control kleinwagen 2007 \n", "4 privat Angebot 1350 test kombi 2003 \n", "\n", " gearbox power_PS model odometer_km registration_month fuel_type \\\n", "0 manuell 158 andere 150000 3 lpg \n", "1 automatik 286 7er 150000 6 benzin \n", "2 manuell 102 golf 70000 7 benzin \n", "3 automatik 71 fortwo 70000 6 benzin \n", "4 manuell 0 focus 150000 7 benzin \n", "\n", " brand unrepaired_damage ad_created nr_of_pictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postal_code last_seen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[:,\"price\"] = autos.loc[:,\"price\"].str.replace(\"$\",\"\").str.replace(\",\",\"\")\n", "autos.loc[:,\"price\"] = autos.loc[:,\"price\"].astype(int)\n", "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Next, I will try to identify any outliers in the `odometer_km` and `price` columns. \n", "- I will use the .shape (unique values), .describe() (min, max, etc..) and .value_counts() (counts of unique values) functions." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(50000,)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[:, 'odometer_km'].shape" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 50000.000000\n", "mean 125732.700000\n", "std 40042.211706\n", "min 5000.000000\n", "25% 125000.000000\n", "50% 150000.000000\n", "75% 150000.000000\n", "max 150000.000000\n", "Name: odometer_km, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[:, 'odometer_km'].describe()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "150000 32424\n", "125000 5170\n", "100000 2169\n", "90000 1757\n", "80000 1436\n", "70000 1230\n", "60000 1164\n", "50000 1027\n", "40000 819\n", "30000 789\n", "20000 784\n", "10000 264\n", "5000 967\n", "Name: odometer_km, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[:, 'odometer_km'].value_counts().sort_index(ascending = False)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(50000,)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[:, 'price'].shape" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 5.000000e+04\n", "mean 9.840044e+03\n", "std 4.811044e+05\n", "min 0.000000e+00\n", "25% 1.100000e+03\n", "50% 2.950000e+03\n", "75% 7.200000e+03\n", "max 1.000000e+08\n", "Name: price, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[:, 'price'].describe()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "99999999 1\n", "27322222 1\n", "12345678 3\n", "11111111 2\n", "10000000 1\n", " ... \n", "5 2\n", "3 1\n", "2 3\n", "1 156\n", "0 1421\n", "Name: price, Length: 2357, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[:, 'price'].value_counts().sort_index(ascending = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The top and bottom five values, look out of place, so I will convert values >= 10000000 and <= 100, to Null and then drop the rows with Null values" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "autos.loc[autos.loc[:, 'price'] == 99999999, 'price'] = np.nan\n", "autos.loc[autos.loc[:, 'price'] < 50, 'price'] = np.nan\n", "autos = autos.dropna()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "12345678.0 1\n", "11111111.0 1\n", "999999.0 1\n", "999990.0 1\n", "350000.0 1\n", " ..\n", "70.0 1\n", "66.0 1\n", "65.0 1\n", "60.0 2\n", "50.0 15\n", "Name: price, Length: 2146, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# checking the new largest and smallest values in the price series\n", "autos.loc[:, 'price'].value_counts().sort_index(ascending = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The new largest and smallest values seem like sensible prices of cars, although you would have to be very rich to afford, the most expensive in the list!" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(34550, 20)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dates and date ranges\n", "\n", "- Next we will move on to the date columns and understand the date range the data covers.\n", "- The date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawledad_createdlast_seen
02016-03-26 17:47:462016-03-26 00:00:002016-04-06 06:45:54
12016-04-04 13:38:562016-04-04 00:00:002016-04-06 14:45:08
22016-03-26 18:57:242016-03-26 00:00:002016-04-06 20:15:37
32016-03-12 16:58:102016-03-12 00:00:002016-03-15 03:16:28
42016-04-01 14:38:502016-04-01 00:00:002016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled ad_created last_seen\n", "0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54\n", "1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08\n", "2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37\n", "3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28\n", "4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[['date_crawled','ad_created','last_seen']][0:5]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawledad_createdlast_seen
02016-03-262016-03-262016-04-06
12016-04-042016-04-042016-04-06
22016-03-262016-03-262016-04-06
32016-03-122016-03-122016-03-15
42016-04-012016-04-012016-04-01
\n", "
" ], "text/plain": [ " date_crawled ad_created last_seen\n", "0 2016-03-26 2016-03-26 2016-04-06\n", "1 2016-04-04 2016-04-04 2016-04-06\n", "2 2016-03-26 2016-03-26 2016-04-06\n", "3 2016-03-12 2016-03-12 2016-03-15\n", "4 2016-04-01 2016-04-01 2016-04-01" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[:, 'date_crawled'] = autos.loc[:, 'date_crawled'].str[:10]\n", "autos.loc[:, 'ad_created'] = autos.loc[:, 'ad_created'].str[:10]\n", "autos.loc[:, 'last_seen'] = autos.loc[:, 'last_seen'].str[:10]\n", "autos[['date_crawled','ad_created','last_seen']][0:5]" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2018 1\n", "2016 40\n", "2015 344\n", "2014 603\n", "2013 731\n", " ... \n", "1950 1\n", "1941 1\n", "1937 1\n", "1934 1\n", "1931 1\n", "Name: registration_year, Length: 68, dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[:, 'registration_year'].value_counts().sort_index(ascending = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- All the data from the date cells, looks sensible and there was no obvious pattern in the `date_crawled`,`ad_created` or `last_seen` cells. This may be because I dropped any \"problem rows\" when removing Null values. \n", "- However, to improve the quality of the data slightly, I've removed rows with registration years outside of 1950 - 2016" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "autos = autos.loc[autos.loc[:, 'registration_year'].between(1950,2016),:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We will now convert dates to uniform integers in the format YYYYMMDD" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "autos.loc[:, 'date_crawled'] = autos.loc[:, 'date_crawled'].str.replace(\"-\",\"\").astype(int)\n", "autos.loc[:, 'ad_created'] = autos.loc[:, 'ad_created'].str.replace(\"-\",\"\").astype(int)\n", "autos.loc[:, 'last_seen'] = autos.loc[:, 'last_seen'].str.replace(\"-\",\"\").astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Brand Aggregation\n", "\n", "- In this section we will explore the differences between brands\n", "- We will only be looking at brands that occupy >= 5% of the total\n", "- These brands have been highlighted below" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['volkswagen', 'bmw', 'mercedes_benz', 'opel', 'audi', 'ford'], dtype='object')" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brands = autos.loc[:,'brand'].value_counts(normalize = True, ascending = False)\n", "significant_brands = brands.loc[brands > 0.05].index\n", "significant_brands" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'volkswagen': 6445.090143218197,\n", " 'bmw': 9230.636386200049,\n", " 'mercedes_benz': 9543.794084507042,\n", " 'opel': 6953.577796706417,\n", " 'audi': 10581.772583701832,\n", " 'ford': 9577.086712683347}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_price = {}\n", "\n", "for b in significant_brands:\n", " mean = autos.loc[autos.loc[:,'brand']==b,'price'].mean()\n", " mean_price[b] = mean\n", " \n", "mean_price" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Next we will create a new dataframe comparing the mean price and mileage, for the top 6 brands, identified above.\n", "- This will allow us to identify any lin between price and mileage" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'volkswagen': 126191.37882617243,\n", " 'bmw': 131522.71034996276,\n", " 'mercedes_benz': 129816.9014084507,\n", " 'opel': 127535.49119818285,\n", " 'audi': 126366.0770688566,\n", " 'ford': 122918.46419327006}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_mileage = {}\n", "\n", "for b in significant_brands:\n", " mean = autos.loc[autos.loc[:,'brand']==b,'odometer_km'].mean()\n", " mean_mileage[b] = mean\n", " \n", "mean_mileage" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_pricemean_mileage
volkswagen6445.090143126191.378826
bmw9230.636386131522.710350
mercedes_benz9543.794085129816.901408
opel6953.577797127535.491198
audi10581.772584126366.077069
ford9577.086713122918.464193
\n", "
" ], "text/plain": [ " mean_price mean_mileage\n", "volkswagen 6445.090143 126191.378826\n", "bmw 9230.636386 131522.710350\n", "mercedes_benz 9543.794085 129816.901408\n", "opel 6953.577797 127535.491198\n", "audi 10581.772584 126366.077069\n", "ford 9577.086713 122918.464193" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "price_series = pd.Series(mean_price)\n", "mileage_series = pd.Series(mean_mileage)\n", "\n", "df = pd.DataFrame(price_series, columns =['mean_price'])\n", "df.loc[:,'mean_mileage'] = mileage_series\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_pricemean_mileageprice_per_mile
audi10581.772584126366.0770690.083739
ford9577.086713122918.4641930.077914
mercedes_benz9543.794085129816.9014080.073517
bmw9230.636386131522.7103500.070183
opel6953.577797127535.4911980.054523
volkswagen6445.090143126191.3788260.051074
\n", "
" ], "text/plain": [ " mean_price mean_mileage price_per_mile\n", "audi 10581.772584 126366.077069 0.083739\n", "ford 9577.086713 122918.464193 0.077914\n", "mercedes_benz 9543.794085 129816.901408 0.073517\n", "bmw 9230.636386 131522.710350 0.070183\n", "opel 6953.577797 127535.491198 0.054523\n", "volkswagen 6445.090143 126191.378826 0.051074" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[:,'price_per_mile'] = df.iloc[:,0] / df.iloc[:,1]\n", "\n", "df.sort_values(by = 'mean_price', ascending = False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- I have added in an extra column (\"price_per_mile\") to help with the analysis. \n", "- There does not appear to be any clear link between `mean_price` and `mean_mileage`\n", "- In the table above, as the mean_price decreases, the mean_mileage tends to stay fairly conisistent.\n", "- However, the results may indicate that the price is determined more so by the brand than the mileage." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional analysis\n", "\n", "1) Find the most common brand/model combinations\n", "\n", "2) Split the odometer_km into groups, and use aggregation to see if average prices follows any patterns based on the milage.\n", "\n", "3) How much cheaper are cars with damage than their non-damaged counterparts?" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "volkswagengolf 2748\n", "bmw3er 2077\n", "volkswagenpolo 1129\n", "Name: brand_model_comb, dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[:,'brand_model_comb'] = autos.loc[:,'brand'].str.cat(autos.loc[:,'model'])\n", "autos.loc[:,'brand_model_comb'].value_counts().head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- It appears that the 3 most popular brad/model combinations are the volkswagen gold, the BMW 3ER and the volskwagen polo" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20224.724452554743\n", "12408.266312921267\n", "5439.197750877979\n" ] } ], "source": [ "_0km_50km_mean = autos.loc[autos.loc[:,'odometer_km'].between(0,50000), 'price'].mean()\n", "_50km_100km_mean = autos.loc[autos.loc[:,'odometer_km'].between(50000,100000), 'price'].mean()\n", "_100km_150km_mean = autos.loc[autos.loc[:,'odometer_km'].between(100000,150000), 'price'].mean()\n", "print(_0km_50km_mean)\n", "print(_50km_100km_mean)\n", "print(_100km_150km_mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- It is apparent that the average price of a car, decreases as the mileage increases" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_PS...odometer_kmregistration_monthfuel_typebrandunrepaired_damagead_creatednr_of_picturespostal_codelast_seenbrand_model_comb
020160326Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot5000.0controlbus2004manuell158...1500003lpgpeugeotnein2016032607958820160406peugeotandere
120160404BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot8500.0controllimousine1997automatik286...1500006benzinbmwnein2016040407103420160406bmw7er
220160326Volkswagen_Golf_1.6_UnitedprivatAngebot8990.0testlimousine2009manuell102...700007benzinvolkswagennein2016032603539420160406volkswagengolf
320160312Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot4350.0controlkleinwagen2007automatik71...700006benzinsmartnein2016031203372920160315smartfortwo
420160401Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot1350.0testkombi2003manuell0...1500007benzinfordnein2016040103921820160401fordfocus
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ " date_crawled name seller \\\n", "0 20160326 Peugeot_807_160_NAVTECH_ON_BOARD privat \n", "1 20160404 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat \n", "2 20160326 Volkswagen_Golf_1.6_United privat \n", "3 20160312 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat \n", "4 20160401 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat \n", "\n", " offer_type price abtest vehicle_type registration_year gearbox \\\n", "0 Angebot 5000.0 control bus 2004 manuell \n", "1 Angebot 8500.0 control limousine 1997 automatik \n", "2 Angebot 8990.0 test limousine 2009 manuell \n", "3 Angebot 4350.0 control kleinwagen 2007 automatik \n", "4 Angebot 1350.0 test kombi 2003 manuell \n", "\n", " power_PS ... odometer_km registration_month fuel_type brand \\\n", "0 158 ... 150000 3 lpg peugeot \n", "1 286 ... 150000 6 benzin bmw \n", "2 102 ... 70000 7 benzin volkswagen \n", "3 71 ... 70000 6 benzin smart \n", "4 0 ... 150000 7 benzin ford \n", "\n", " unrepaired_damage ad_created nr_of_pictures postal_code last_seen \\\n", "0 nein 20160326 0 79588 20160406 \n", "1 nein 20160404 0 71034 20160406 \n", "2 nein 20160326 0 35394 20160406 \n", "3 nein 20160312 0 33729 20160315 \n", "4 nein 20160401 0 39218 20160401 \n", "\n", " brand_model_comb \n", "0 peugeotandere \n", "1 bmw7er \n", "2 volkswagengolf \n", "3 smartfortwo \n", "4 fordfocus \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.head()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['nein', 'ja'], dtype=object)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[:,'unrepaired_damage'].unique()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5421.324586578532" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "No_unrepaired_damage = autos.loc[autos.loc[:,'unrepaired_damage'] == 'nein', 'price'].mean()\n", "unrepaired_damage = autos.loc[autos.loc[:,'unrepaired_damage'] == 'ja', 'price'].mean()\n", "No_unrepaired_damage - unrepaired_damage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- On average, cars with no unrepaired damage, cost 5421.3 more than those with unrepaired damage" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }