{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "We're working with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.\n", "\n", "The dataset was originally scraped and uploaded to Kaggle by user orgesleka.,/a>\n", "The original dataset isn't available on Kaggle anymore, but you can find it here.\n", "\n", "Few modifications were made from the original dataset:\n", "\n", "\n", "The data dictionary provided with data is as follows:\n", "\n", " \n", " The aim of this project is to clean the data and analyze the included used car listings.\n", "\n", "Let's start by importing the libraries we need and reading the dataset into pandas." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "autos = pd.read_csv(\"autos.csv\", encoding='Latin-1')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateCrawlednamesellerofferTypepriceabtestvehicleTypeyearOfRegistrationgearboxpowerPSmodelodometermonthOfRegistrationfuelTypebrandnotRepairedDamagedateCreatednrOfPicturespostalCodelastSeen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
...............................................................
499952016-03-27 14:38:19Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__XenonprivatAngebot$24,900controllimousine2011automatik239q5100,000km1dieselaudinein2016-03-27 00:00:000821312016-04-01 13:47:40
499962016-03-28 10:50:25Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...privatAngebot$1,980controlcabrio1996manuell75astra150,000km5benzinopelnein2016-03-28 00:00:000448072016-04-02 14:18:02
499972016-04-02 14:44:48Fiat_500_C_1.2_Dualogic_LoungeprivatAngebot$13,200testcabrio2014automatik695005,000km11benzinfiatnein2016-04-02 00:00:000734302016-04-04 11:47:27
499982016-03-08 19:25:42Audi_A3_2.0_TDI_Sportback_AmbitionprivatAngebot$22,900controlkombi2013manuell150a340,000km11dieselaudinein2016-03-08 00:00:000356832016-04-05 16:45:07
499992016-03-14 00:42:12Opel_Vectra_1.6_16VprivatAngebot$1,250controllimousine1996manuell101vectra150,000km1benzinopelnein2016-03-13 00:00:000458972016-04-06 21:18:48
\n", "

50000 rows × 20 columns

\n", "
" ], "text/plain": [ " dateCrawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "... ... ... \n", "49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon \n", "49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... \n", "49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge \n", "49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition \n", "49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V \n", "\n", " seller offerType price abtest vehicleType yearOfRegistration \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "... ... ... ... ... ... ... \n", "49995 privat Angebot $24,900 control limousine 2011 \n", "49996 privat Angebot $1,980 control cabrio 1996 \n", "49997 privat Angebot $13,200 test cabrio 2014 \n", "49998 privat Angebot $22,900 control kombi 2013 \n", "49999 privat Angebot $1,250 control limousine 1996 \n", "\n", " gearbox powerPS model odometer monthOfRegistration fuelType \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "... ... ... ... ... ... ... \n", "49995 automatik 239 q5 100,000km 1 diesel \n", "49996 manuell 75 astra 150,000km 5 benzin \n", "49997 automatik 69 500 5,000km 11 benzin \n", "49998 manuell 150 a3 40,000km 11 diesel \n", "49999 manuell 101 vectra 150,000km 1 benzin \n", "\n", " brand notRepairedDamage dateCreated nrOfPictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "... ... ... ... ... \n", "49995 audi nein 2016-03-27 00:00:00 0 \n", "49996 opel nein 2016-03-28 00:00:00 0 \n", "49997 fiat nein 2016-04-02 00:00:00 0 \n", "49998 audi nein 2016-03-08 00:00:00 0 \n", "49999 opel nein 2016-03-13 00:00:00 0 \n", "\n", " postalCode lastSeen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 \n", "... ... ... \n", "49995 82131 2016-04-01 13:47:40 \n", "49996 44807 2016-04-02 14:18:02 \n", "49997 73430 2016-04-04 11:47:27 \n", "49998 35683 2016-04-05 16:45:07 \n", "49999 45897 2016-04-06 21:18:48 \n", "\n", "[50000 rows x 20 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the above information we can the see that our datasets contains 50000 rows and 20 columns " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Data columns (total 20 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 dateCrawled 50000 non-null object\n", " 1 name 50000 non-null object\n", " 2 seller 50000 non-null object\n", " 3 offerType 50000 non-null object\n", " 4 price 50000 non-null object\n", " 5 abtest 50000 non-null object\n", " 6 vehicleType 44905 non-null object\n", " 7 yearOfRegistration 50000 non-null int64 \n", " 8 gearbox 47320 non-null object\n", " 9 powerPS 50000 non-null int64 \n", " 10 model 47242 non-null object\n", " 11 odometer 50000 non-null object\n", " 12 monthOfRegistration 50000 non-null int64 \n", " 13 fuelType 45518 non-null object\n", " 14 brand 50000 non-null object\n", " 15 notRepairedDamage 40171 non-null object\n", " 16 dateCreated 50000 non-null object\n", " 17 nrOfPictures 50000 non-null int64 \n", " 18 postalCode 50000 non-null int64 \n", " 19 lastSeen 50000 non-null object\n", "dtypes: int64(5), object(15)\n", "memory usage: 7.6+ MB\n" ] } ], "source": [ "# Enquiring further by checking the info() and head \n", "\n", "autos.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "most of our columns are objects while year of regristration, powerps, monthofregistration, nrofpictures and postalcode are int64" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can make the following observations:\n", "\n", "\n", " \n", "Let's convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',\n", " 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',\n", " 'odometer', 'monthOfRegistration', 'fuelType', 'brand',\n", " 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',\n", " 'lastSeen'],\n", " dtype='object')\n" ] } ], "source": [ "print(autos.columns)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "autos.columns = ['datecrawled', 'name', 'seller', 'offertype', 'price', 'abtest',\n", " 'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model',\n", " 'odometer', 'registration_month', 'fueltype', 'brand',\n", " 'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode',\n", " 'lastseen']" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datecrawlednameselleroffertypepriceabtestvehicletyperegistration_yeargearboxpowerpsmodelodometerregistration_monthfueltypebrandunrepaired_damagead_creatednrofpicturespostalcodelastseen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " datecrawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offertype price abtest vehicletype registration_year \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox powerps model odometer registration_month fueltype \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand unrepaired_damage ad_created nrofpictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postalcode lastseen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:\n", "\n", "

The following methods are helpful for exploring the data:

\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datecrawlednameselleroffertypepriceabtestvehicletyperegistration_yeargearboxpowerpsmodelodometerregistration_monthfueltypebrandunrepaired_damagead_creatednrofpicturespostalcodelastseen
count5000050000500005000050000500004490550000.0000004732050000.000000472425000050000.0000004551850000401715000050000.050000.00000050000
unique482133875422235728NaN2NaN24513NaN740276NaNNaN39481
top2016-03-11 22:38:16Ford_FiestaprivatAngebot$0testlimousineNaNmanuellNaNgolf150,000kmNaNbenzinvolkswagennein2016-04-03 00:00:00NaNNaN2016-04-07 06:17:27
freq378499994999914212575612859NaN36993NaN402432424NaN3010710687352321946NaNNaN8
meanNaNNaNNaNNaNNaNNaNNaN2005.073280NaN116.355920NaNNaN5.723360NaNNaNNaNNaN0.050813.627300NaN
stdNaNNaNNaNNaNNaNNaNNaN105.712813NaN209.216627NaNNaN3.711984NaNNaNNaNNaN0.025779.747957NaN
minNaNNaNNaNNaNNaNNaNNaN1000.000000NaN0.000000NaNNaN0.000000NaNNaNNaNNaN0.01067.000000NaN
25%NaNNaNNaNNaNNaNNaNNaN1999.000000NaN70.000000NaNNaN3.000000NaNNaNNaNNaN0.030451.000000NaN
50%NaNNaNNaNNaNNaNNaNNaN2003.000000NaN105.000000NaNNaN6.000000NaNNaNNaNNaN0.049577.000000NaN
75%NaNNaNNaNNaNNaNNaNNaN2008.000000NaN150.000000NaNNaN9.000000NaNNaNNaNNaN0.071540.000000NaN
maxNaNNaNNaNNaNNaNNaNNaN9999.000000NaN17700.000000NaNNaN12.000000NaNNaNNaNNaN0.099998.000000NaN
\n", "
" ], "text/plain": [ " datecrawled name seller offertype price abtest \\\n", "count 50000 50000 50000 50000 50000 50000 \n", "unique 48213 38754 2 2 2357 2 \n", "top 2016-03-11 22:38:16 Ford_Fiesta privat Angebot $0 test \n", "freq 3 78 49999 49999 1421 25756 \n", "mean NaN NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN NaN \n", "\n", " vehicletype registration_year gearbox powerps model \\\n", "count 44905 50000.000000 47320 50000.000000 47242 \n", "unique 8 NaN 2 NaN 245 \n", "top limousine NaN manuell NaN golf \n", "freq 12859 NaN 36993 NaN 4024 \n", "mean NaN 2005.073280 NaN 116.355920 NaN \n", "std NaN 105.712813 NaN 209.216627 NaN \n", "min NaN 1000.000000 NaN 0.000000 NaN \n", "25% NaN 1999.000000 NaN 70.000000 NaN \n", "50% NaN 2003.000000 NaN 105.000000 NaN \n", "75% NaN 2008.000000 NaN 150.000000 NaN \n", "max NaN 9999.000000 NaN 17700.000000 NaN \n", "\n", " odometer registration_month fueltype brand unrepaired_damage \\\n", "count 50000 50000.000000 45518 50000 40171 \n", "unique 13 NaN 7 40 2 \n", "top 150,000km NaN benzin volkswagen nein \n", "freq 32424 NaN 30107 10687 35232 \n", "mean NaN 5.723360 NaN NaN NaN \n", "std NaN 3.711984 NaN NaN NaN \n", "min NaN 0.000000 NaN NaN NaN \n", "25% NaN 3.000000 NaN NaN NaN \n", "50% NaN 6.000000 NaN NaN NaN \n", "75% NaN 9.000000 NaN NaN NaN \n", "max NaN 12.000000 NaN NaN NaN \n", "\n", " ad_created nrofpictures postalcode lastseen \n", "count 50000 50000.0 50000.000000 50000 \n", "unique 76 NaN NaN 39481 \n", "top 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27 \n", "freq 1946 NaN NaN 8 \n", "mean NaN 0.0 50813.627300 NaN \n", "std NaN 0.0 25779.747957 NaN \n", "min NaN 0.0 1067.000000 NaN \n", "25% NaN 0.0 30451.000000 NaN \n", "50% NaN 0.0 49577.000000 NaN \n", "75% NaN 0.0 71540.000000 NaN \n", "max NaN 0.0 99998.000000 NaN " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.describe(include='all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What can be seen is that columns \"seller\" and \"offer_type\" consist of almost all the same values.\\ Therefore, these columns are deemed unuseful for further analysis and will be dropped from the dataframe." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "150,000km 32424\n", "125,000km 5170\n", "100,000km 2169\n", "90,000km 1757\n", "80,000km 1436\n", "70,000km 1230\n", "60,000km 1164\n", "50,000km 1027\n", "5,000km 967\n", "40,000km 819\n", "30,000km 789\n", "20,000km 784\n", "10,000km 264\n", "Name: odometer, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Taking a quick look at individual columns\n", "autos['odometer'].value_counts(dropna = False)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "#remove columns that provide unuseful data\n", "autos.drop(['seller','offertype'],axis=1,inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Exploration of price and odometer_km

\n", "As mentioned before, there are two columns which contain numeric data that need to be reformatted.\\ This will be done below. Afterwards they will be investigated for the presence of outliers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

From this brief exploration we can see that the 'price' and 'odemeter'

columns are numeric values stored as text\n", "\n", "For each column we will:\n", "\n", "Remove all non-numeric characters\n", "Convert the column to a numeric type\n", "Rename the 'odometer' column to 'odometer_km'" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "#price\n", "autos['price'] = (autos['price']\n", " .str.replace(\"$\",\"\")\n", " .str.replace(\",\",\"\")\n", " .astype(float)\n", " )\n", "\n", "\n", "#odometer\n", "autos['odometer'] = (autos['odometer'].\n", " str.replace(\"km\",\"\").\n", " str.replace(\",\",\"\").\n", " astype(int)\n", " )\n", "\n", "autos.rename({'odometer': 'odometer_km'},axis=1,inplace=True)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 5000.0\n", "1 8500.0\n", "2 8990.0\n", "3 4350.0\n", "4 1350.0\n", "Name: price, dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# autos['price'] = autos['price'].str.replace('$','').str.replace(',', '').astype(int)\n", "autos['price'].head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "150000 32424\n", "125000 5170\n", "100000 2169\n", "90000 1757\n", "80000 1436\n", "70000 1230\n", "60000 1164\n", "50000 1027\n", "5000 967\n", "40000 819\n", "30000 789\n", "20000 784\n", "10000 264\n", "Name: odometer_km, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# autos['odometer'] = autos['odometer'].str.replace(',', '').str.replace('km', '').astype(int)\n", "# autos.rename({'odometer':\"odometer_km\"}, axis = 1, inplace = True)\n", "autos['odometer_km'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns 'price' and 'odometer_km' have now succesfully been transformed to integer values.\\ They will now be further analysed in order to check for outliers that might need to be removed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the odometer_km and price columns" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2357" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].nunique()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 5.000000e+04\n", "mean 9.840044e+03\n", "std 4.811044e+05\n", "min 0.000000e+00\n", "25% 1.100000e+03\n", "50% 2.950000e+03\n", "75% 7.200000e+03\n", "max 1.000000e+08\n", "Name: price, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see that the price column appears to have a 'Positive or Right Skew' with its maximum value being of the magnitude 10^5 times greater than values within 75% of the dataset." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2357,)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].unique().shape" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0 1421\n", "500.0 781\n", "1500.0 734\n", "2500.0 643\n", "1200.0 639\n", " ... \n", "6202.0 1\n", "18310.0 1\n", "898.0 1\n", "11240.0 1\n", "789.0 1\n", "Name: price, Length: 2357, dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].value_counts()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 50000.000000\n", "mean 125732.700000\n", "std 40042.211706\n", "min 5000.000000\n", "25% 125000.000000\n", "50% 150000.000000\n", "75% 150000.000000\n", "max 150000.000000\n", "Name: odometer_km, dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see that the 'odometer_km' column appears to have a 'Negative or Left Skew' with its minimum value being of the magnitude 10^2 times less than values within 75% of the dataset." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(50000,)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].shape" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(13,)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].unique().shape" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(13,)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].unique().shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Removing outliers in the 'Price' & 'odometer_km' columns

" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Removing outliers in the Price column and checking the new statistical distribution\n", "autos['price'] = autos.loc[autos['price'].between(0,30000), 'price']" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 49206.000000\n", "mean 5025.773483\n", "std 5679.154441\n", "min 0.000000\n", "25% 1100.000000\n", "50% 2850.000000\n", "75% 6900.000000\n", "max 30000.000000\n", "Name: price, dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lowest price is zero, which has 1421 occurences in the dataset.\\ Price represents the starting value of an action. Since it is only the starting price it does not mean that a car will be sold for that price. Therefore anything below 500 euro's (which seems like a realistic lower bound for used car prices) is not removed from the dataset.\n", "\n", "There are also outliers at the top. When investigating the unique values with their respective counts the starting price increases slowly until 350,000.\\ Onwards the price increases rapidly. Occurences with a starting price higher than 350,000 will be investigated further." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1985" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].nunique()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Removing outliers in the odometer_km column and checking its statistical distribution\n", "autos['odometer_km'] = autos.loc[autos['odometer_km'].between(10000,150000), 'odometer_km']" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "12" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].nunique()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
priceregistration_yearpowerpsodometer_kmregistration_monthnrofpicturespostalcode
count49206.00000050000.00000050000.00000049033.00000050000.00000050000.050000.000000
mean5025.7734832005.073280116.355920128113.7193325.7233600.050813.627300
std5679.154441105.712813209.21662736631.3817853.7119840.025779.747957
min0.0000001000.0000000.00000010000.0000000.0000000.01067.000000
25%1100.0000001999.00000070.000000125000.0000003.0000000.030451.000000
50%2850.0000002003.000000105.000000150000.0000006.0000000.049577.000000
75%6900.0000002008.000000150.000000150000.0000009.0000000.071540.000000
max30000.0000009999.00000017700.000000150000.00000012.0000000.099998.000000
\n", "
" ], "text/plain": [ " price registration_year powerps odometer_km \\\n", "count 49206.000000 50000.000000 50000.000000 49033.000000 \n", "mean 5025.773483 2005.073280 116.355920 128113.719332 \n", "std 5679.154441 105.712813 209.216627 36631.381785 \n", "min 0.000000 1000.000000 0.000000 10000.000000 \n", "25% 1100.000000 1999.000000 70.000000 125000.000000 \n", "50% 2850.000000 2003.000000 105.000000 150000.000000 \n", "75% 6900.000000 2008.000000 150.000000 150000.000000 \n", "max 30000.000000 9999.000000 17700.000000 150000.000000 \n", "\n", " registration_month nrofpictures postalcode \n", "count 50000.000000 50000.0 50000.000000 \n", "mean 5.723360 0.0 50813.627300 \n", "std 3.711984 0.0 25779.747957 \n", "min 0.000000 0.0 1067.000000 \n", "25% 3.000000 0.0 30451.000000 \n", "50% 6.000000 0.0 49577.000000 \n", "75% 9.000000 0.0 71540.000000 \n", "max 12.000000 0.0 99998.000000 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The new Dataframe\n", "autos.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Exploring the Date Columns

\n", "With the initial read in of the data, columns - 'date_crawled', 'last_seen' and 'ad_created' are all identified as string vlaues by pandas.\n", "\n", "Because these 3 columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively.\n", "\n", "As the other two time columns - 'registration_month' & 'registration_year' are already in a numerical format, their distribution can be understood using the Series.describe() method" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['datecrawled', 'name', 'price', 'abtest', 'vehicletype',\n", " 'registration_year', 'gearbox', 'powerps', 'model', 'odometer_km',\n", " 'registration_month', 'fueltype', 'brand', 'unrepaired_damage',\n", " 'ad_created', 'nrofpictures', 'postalcode', 'lastseen'],\n", " dtype='object')" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.columns" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datecrawledad_createdlastseen
02016-03-26 17:47:462016-03-26 00:00:002016-04-06 06:45:54
12016-04-04 13:38:562016-04-04 00:00:002016-04-06 14:45:08
22016-03-26 18:57:242016-03-26 00:00:002016-04-06 20:15:37
32016-03-12 16:58:102016-03-12 00:00:002016-03-15 03:16:28
42016-04-01 14:38:502016-04-01 00:00:002016-04-01 14:38:50
52016-03-21 13:47:452016-03-21 00:00:002016-04-06 09:45:21
\n", "
" ], "text/plain": [ " datecrawled ad_created lastseen\n", "0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54\n", "1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08\n", "2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37\n", "3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28\n", "4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50\n", "5 2016-03-21 13:47:45 2016-03-21 00:00:00 2016-04-06 09:45:21" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[0:5, ['datecrawled', 'ad_created', 'lastseen']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extracting the date only and calculating the distribution of values in these 3 columns" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "2016-03-05 2.5%\n", "2016-03-06 1.4%\n", "2016-03-07 3.6%\n", "2016-03-08 3.3%\n", "2016-03-09 3.3%\n", "2016-03-10 3.2%\n", "2016-03-11 3.2%\n", "2016-03-12 3.7%\n", "2016-03-13 1.6%\n", "2016-03-14 3.7%\n", "2016-03-15 3.4%\n", "2016-03-16 2.9%\n", "2016-03-17 3.2%\n", "2016-03-18 1.3%\n", "2016-03-19 3.5%\n", "2016-03-20 3.8%\n", "2016-03-21 3.8%\n", "2016-03-22 3.3%\n", "2016-03-23 3.2%\n", "2016-03-24 2.9%\n", "2016-03-25 3.2%\n", "2016-03-26 3.2%\n", "2016-03-27 3.1%\n", "2016-03-28 3.5%\n", "2016-03-29 3.4%\n", "2016-03-30 3.4%\n", "2016-03-31 3.2%\n", "2016-04-01 3.4%\n", "2016-04-02 3.5%\n", "2016-04-03 3.9%\n", "2016-04-04 3.7%\n", "2016-04-05 1.3%\n", "2016-04-06 0.3%\n", "2016-04-07 0.1%\n", "Name: datecrawled, dtype: object" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# INVESTIGATING THE DATE_CRAWLED COLUMN\n", "date_crawled = autos['datecrawled'].str[:10].value_counts(normalize= True, dropna = False).mul(100).round(1).astype(str) + '%'\n", "date_crawled.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the data above it seems like the period over which the data has been crawled covers roughly one month (March-April 2016).\\ The distribution is more or less uniform." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2015-06-11 0.0%\n", "2015-08-10 0.0%\n", "2015-09-09 0.0%\n", "2015-11-10 0.0%\n", "2015-12-05 0.0%\n", " ... \n", "2016-04-03 3.9%\n", "2016-04-04 3.7%\n", "2016-04-05 1.2%\n", "2016-04-06 0.3%\n", "2016-04-07 0.1%\n", "Name: ad_created, Length: 76, dtype: object" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# INVESTIGATING THE ad_created COLUMN\n", "ad_created = autos['ad_created'].str[:10].value_counts(normalize= True, dropna = False).mul(100).round(1).astype(str) + '%'\n", "ad_created.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dates ads were created range from June 2015 until April of 2016. The majority (+- 97%) of ads in the dataset were created after the date on which data was crawled for the first time.\n", "\n", "This make sense as most auctions are only 'live' for a short period of time." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2016-03-05 0.1%\n", "2016-03-06 0.4%\n", "2016-03-07 0.5%\n", "2016-03-08 0.8%\n", "2016-03-09 1.0%\n", "2016-03-10 1.1%\n", "2016-03-11 1.3%\n", "2016-03-12 2.4%\n", "2016-03-13 0.9%\n", "2016-03-14 1.3%\n", "2016-03-15 1.6%\n", "2016-03-16 1.6%\n", "2016-03-17 2.8%\n", "2016-03-18 0.7%\n", "2016-03-19 1.6%\n", "2016-03-20 2.1%\n", "2016-03-21 2.1%\n", "2016-03-22 2.2%\n", "2016-03-23 1.9%\n", "2016-03-24 2.0%\n", "2016-03-25 1.9%\n", "2016-03-26 1.7%\n", "2016-03-27 1.6%\n", "2016-03-28 2.1%\n", "2016-03-29 2.2%\n", "2016-03-30 2.5%\n", "2016-03-31 2.4%\n", "2016-04-01 2.3%\n", "2016-04-02 2.5%\n", "2016-04-03 2.5%\n", "2016-04-04 2.5%\n", "2016-04-05 12.4%\n", "2016-04-06 22.1%\n", "2016-04-07 13.1%\n", "Name: lastseen, dtype: object" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# INVESTIGATING THE lastseen COLUMN\n", "last_seen = autos['lastseen'].str[:10].value_counts(normalize= True, dropna = False).mul(100).round(1).astype(str) + '%'\n", "last_seen.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last seen dates looks a bit uniform, but the last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales. It's more likely that these values are to do with the crawling period ending and don't indicate car sales." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 50000.000000\n", "mean 2005.073280\n", "std 105.712813\n", "min 1000.000000\n", "25% 1999.000000\n", "50% 2003.000000\n", "75% 2008.000000\n", "max 9999.000000\n", "Name: registration_year, dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "registration_year = autos[\"registration_year\"].describe()\n", "registration_year" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2004\n", "1 1997\n", "2 2009\n", "3 2007\n", "4 2003\n", "5 2006\n", "6 1995\n", "7 1998\n", "8 2000\n", "9 1997\n", "Name: registration_year, dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"registration_year\"].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The minimum and maximum value of registration year seems strange.\n", "\n", "The lowest registration year is 1,000 which must be incorrect as cars only started appearing in the late 1800's, Also the max year is 9999 which is talking about the future, Due to this all occurences with a registration year before 1885 (first patented practical automobile) will be removed. \n", "\n", "All registration years after 2016 must be incorrect as ads were created in 2015 & 2016.\\ These will be removed from the dataset as well\n", "\n", "A manufactured car can't be registered after its listing was seen but before, any vehicle with a registration year above 2016 we can deduce as inaccurate as our ad_created column which represents date_of_listing only has data on years up to 2016.\n", "\n", "Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1800s. In an attempt to validate this we can count the number of listings that fall outside the 1900-2016 interval and see if it is safe to remove those rows entirely or if they require custom logic." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
priceregistration_yearpowerpsodometer_kmregistration_monthnrofpicturespostalcode
count47245.00000048030.00000048030.00000047117.00000048030.00000048030.048030.000000
mean5094.4830142002.795066117.140496127874.8646995.7676040.050936.383094
std5729.1836737.426905195.44914936786.7146943.6968050.025791.666655
min0.0000001800.0000000.00000010000.0000000.0000000.01067.000000
25%1100.0000001999.00000071.000000125000.0000003.0000000.030459.000000
50%2900.0000002003.000000107.000000150000.0000006.0000000.049696.000000
75%6999.0000002008.000000150.000000150000.0000009.0000000.071665.000000
max30000.0000002016.00000017700.000000150000.00000012.0000000.099998.000000
\n", "
" ], "text/plain": [ " price registration_year powerps odometer_km \\\n", "count 47245.000000 48030.000000 48030.000000 47117.000000 \n", "mean 5094.483014 2002.795066 117.140496 127874.864699 \n", "std 5729.183673 7.426905 195.449149 36786.714694 \n", "min 0.000000 1800.000000 0.000000 10000.000000 \n", "25% 1100.000000 1999.000000 71.000000 125000.000000 \n", "50% 2900.000000 2003.000000 107.000000 150000.000000 \n", "75% 6999.000000 2008.000000 150.000000 150000.000000 \n", "max 30000.000000 2016.000000 17700.000000 150000.000000 \n", "\n", " registration_month nrofpictures postalcode \n", "count 48030.000000 48030.0 48030.000000 \n", "mean 5.767604 0.0 50936.383094 \n", "std 3.696805 0.0 25791.666655 \n", "min 0.000000 0.0 1067.000000 \n", "25% 3.000000 0.0 30459.000000 \n", "50% 6.000000 0.0 49696.000000 \n", "75% 9.000000 0.0 71665.000000 \n", "max 12.000000 0.0 99998.000000 " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[autos['registration_year'].between(1800,2016)].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is still a large variety in registration years of the cars.The mean of apprpximately 2002 with a small standard deviation indicate that most cars are approximately between 7 and 21 years old." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Exploring Price by brand

" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "volkswagen 10687\n", "opel 5461\n", "bmw 5429\n", "mercedes_benz 4734\n", "audi 4283\n", "ford 3479\n", "renault 2404\n", "peugeot 1456\n", "fiat 1308\n", "seat 941\n", "skoda 786\n", "mazda 757\n", "nissan 754\n", "smart 701\n", "citroen 701\n", "toyota 617\n", "sonstige_autos 546\n", "hyundai 488\n", "volvo 457\n", "mini 424\n", "mitsubishi 406\n", "honda 399\n", "kia 356\n", "alfa_romeo 329\n", "porsche 294\n", "suzuki 293\n", "chevrolet 283\n", "chrysler 181\n", "dacia 129\n", "daihatsu 128\n", "jeep 110\n", "subaru 109\n", "land_rover 99\n", "saab 80\n", "daewoo 79\n", "trabant 78\n", "jaguar 77\n", "rover 69\n", "lancia 57\n", "lada 31\n", "Name: brand, dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['brand'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "volkswagen is the most used brand of the vehicle which is responsible for over 20% of the population of the brands of the vehicle\n", "followed by opel, bmw, mercedes_benz, audi, ford, renault, peugeot and fiat " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "40" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['brand'].nunique()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "volkswagen 21.4%\n", "opel 10.9%\n", "bmw 10.9%\n", "mercedes_benz 9.5%\n", "audi 8.6%\n", "ford 7.0%\n", "renault 4.8%\n", "peugeot 2.9%\n", "fiat 2.6%\n", "seat 1.9%\n", "skoda 1.6%\n", "mazda 1.5%\n", "nissan 1.5%\n", "smart 1.4%\n", "citroen 1.4%\n", "toyota 1.2%\n", "sonstige_autos 1.1%\n", "hyundai 1.0%\n", "volvo 0.9%\n", "mini 0.8%\n", "mitsubishi 0.8%\n", "honda 0.8%\n", "kia 0.7%\n", "alfa_romeo 0.7%\n", "porsche 0.6%\n", "suzuki 0.6%\n", "chevrolet 0.6%\n", "chrysler 0.4%\n", "dacia 0.3%\n", "daihatsu 0.3%\n", "jeep 0.2%\n", "subaru 0.2%\n", "land_rover 0.2%\n", "saab 0.2%\n", "daewoo 0.2%\n", "trabant 0.2%\n", "jaguar 0.2%\n", "rover 0.1%\n", "lancia 0.1%\n", "lada 0.1%\n", "Name: brand, dtype: object" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['brand'].value_counts(normalize = True).mul(100).round(1).astype(str) + '%'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When working with data on cars, it's good to explore variations across different car brands. Aggregation can be used to understand the brand column.\n", "\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault',\n", " 'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart',\n", " 'citroen', 'toyota', 'sonstige_autos', 'hyundai', 'volvo', 'mini'],\n", " dtype='object')" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#selecting the top 20 brands \n", "auto_brands_20 = autos['brand'].value_counts().head(20).index\n", "auto_brands_20" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "brand_mean_prices = {}\n", "for brand in auto_brands_20:\n", " mean_price = round(autos.loc[autos['brand'] == brand, 'price'].mean())\n", " # Creating a new key_value pair with this mean price\n", " brand_mean_prices[brand] = mean_price" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'volkswagen': 4945.0,\n", " 'opel': 2834.0,\n", " 'bmw': 7142.0,\n", " 'mercedes_benz': 7209.0,\n", " 'audi': 7640.0,\n", " 'ford': 3421.0,\n", " 'renault': 2314.0,\n", " 'peugeot': 3011.0,\n", " 'fiat': 2698.0,\n", " 'seat': 4188.0,\n", " 'skoda': 6271.0,\n", " 'mazda': 3773.0,\n", " 'nissan': 4511.0,\n", " 'smart': 3483.0,\n", " 'citroen': 3645.0,\n", " 'toyota': 4984.0,\n", " 'sonstige_autos': 6692.0,\n", " 'hyundai': 5317.0,\n", " 'volvo': 4686.0,\n", " 'mini': 10281.0}" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brand_mean_prices" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'mini': 10281.0,\n", " 'audi': 7640.0,\n", " 'mercedes_benz': 7209.0,\n", " 'bmw': 7142.0,\n", " 'sonstige_autos': 6692.0,\n", " 'skoda': 6271.0,\n", " 'hyundai': 5317.0,\n", " 'toyota': 4984.0,\n", " 'volkswagen': 4945.0,\n", " 'volvo': 4686.0,\n", " 'nissan': 4511.0,\n", " 'seat': 4188.0,\n", " 'mazda': 3773.0,\n", " 'citroen': 3645.0,\n", " 'smart': 3483.0,\n", " 'ford': 3421.0,\n", " 'peugeot': 3011.0,\n", " 'opel': 2834.0,\n", " 'fiat': 2698.0,\n", " 'renault': 2314.0}" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 5 - Sorting the values in the brand dictionary according to the highest mean price\n", "brand_mean_prices = dict(sorted(brand_mean_prices.items(), key = lambda data: data[1], reverse = True ))\n", "brand_mean_prices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "we aggregated across brands to understand mean price. We observed that in the top 6 brands, there's a distinct price gap.\n", "\n", "

Audi, BMW and Mercedes Benz are more expensive

\n", "

Ford and Opel are less expensive

\n", "

Volkswagen is in between 4945.0

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the exploration and brief statistical aggregation conducted, it can be seen that the brand mini has the highest mean selling price among the top 20 German car brands in our dataset, followed by audi, sostige , mercedes and bmw" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Discovering more Insights through Aggregation

\n", "For the top 20 brands, we would like to use yet another aggregation process to understand the average mileage for those cars and if there's any visible link with its mean price.\n", "\n", "Using these two aggregated series objects we can then combine then into a single subset dataframe (with a shared index)." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "brand_mean_mileage = {}\n", "for brand in auto_brands_20:\n", " mean_mileage = round(autos.loc[autos['brand'] == brand, 'odometer_km'].mean())\n", " # Creating a new key_value pair with this mean price\n", " brand_mean_mileage[brand] = mean_mileage" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'volkswagen': 131187.0,\n", " 'opel': 132091.0,\n", " 'bmw': 134817.0,\n", " 'mercedes_benz': 132721.0,\n", " 'audi': 130967.0,\n", " 'ford': 126436.0,\n", " 'renault': 130734.0,\n", " 'peugeot': 128540.0,\n", " 'fiat': 120208.0,\n", " 'seat': 123958.0,\n", " 'skoda': 113009.0,\n", " 'mazda': 127067.0,\n", " 'nissan': 121135.0,\n", " 'smart': 101444.0,\n", " 'citroen': 121257.0,\n", " 'toyota': 117262.0,\n", " 'sonstige_autos': 101298.0,\n", " 'hyundai': 107836.0,\n", " 'volvo': 139812.0,\n", " 'mini': 91205.0}" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brand_mean_mileage" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'volvo': 139812.0,\n", " 'bmw': 134817.0,\n", " 'mercedes_benz': 132721.0,\n", " 'opel': 132091.0,\n", " 'volkswagen': 131187.0,\n", " 'audi': 130967.0,\n", " 'renault': 130734.0,\n", " 'peugeot': 128540.0,\n", " 'mazda': 127067.0,\n", " 'ford': 126436.0,\n", " 'seat': 123958.0,\n", " 'citroen': 121257.0,\n", " 'nissan': 121135.0,\n", " 'fiat': 120208.0,\n", " 'toyota': 117262.0,\n", " 'skoda': 113009.0,\n", " 'hyundai': 107836.0,\n", " 'smart': 101444.0,\n", " 'sonstige_autos': 101298.0,\n", " 'mini': 91205.0}" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 5 - Sorting the values in the brand dictionary according to the highest mean mileage\n", "brand_mean_mileage = dict(sorted(brand_mean_mileage.items(), key = lambda data: data[1], reverse = True ))\n", "brand_mean_mileage" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "#Converting both dictionaries to series objects, using the series constructor.\n", "mean_prices_top20 = pd.Series(brand_mean_prices)\n", "mean_milieage_top20 = pd.Series(brand_mean_mileage)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_price
mini10281.0
audi7640.0
mercedes_benz7209.0
bmw7142.0
sonstige_autos6692.0
skoda6271.0
hyundai5317.0
toyota4984.0
volkswagen4945.0
volvo4686.0
nissan4511.0
seat4188.0
mazda3773.0
citroen3645.0
smart3483.0
ford3421.0
peugeot3011.0
opel2834.0
fiat2698.0
renault2314.0
\n", "
" ], "text/plain": [ " mean_price\n", "mini 10281.0\n", "audi 7640.0\n", "mercedes_benz 7209.0\n", "bmw 7142.0\n", "sonstige_autos 6692.0\n", "skoda 6271.0\n", "hyundai 5317.0\n", "toyota 4984.0\n", "volkswagen 4945.0\n", "volvo 4686.0\n", "nissan 4511.0\n", "seat 4188.0\n", "mazda 3773.0\n", "citroen 3645.0\n", "smart 3483.0\n", "ford 3421.0\n", "peugeot 3011.0\n", "opel 2834.0\n", "fiat 2698.0\n", "renault 2314.0" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Step 4 - Creating a dataframe using the mean_prices series\n", "top_20_brands = pd.DataFrame(mean_prices_top20, columns=['mean_price'])\n", "top_20_brands" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "# Appending the mean_mileage data in the other series as a column of this new dataframe to see if there is any possible link \n", "top_20_brands['mean_milieage'] = mean_milieage_top20" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_pricemean_milieage
volvo4686.0139812.0
bmw7142.0134817.0
mercedes_benz7209.0132721.0
opel2834.0132091.0
volkswagen4945.0131187.0
audi7640.0130967.0
renault2314.0130734.0
peugeot3011.0128540.0
mazda3773.0127067.0
ford3421.0126436.0
seat4188.0123958.0
citroen3645.0121257.0
nissan4511.0121135.0
fiat2698.0120208.0
toyota4984.0117262.0
skoda6271.0113009.0
hyundai5317.0107836.0
smart3483.0101444.0
sonstige_autos6692.0101298.0
mini10281.091205.0
\n", "
" ], "text/plain": [ " mean_price mean_milieage\n", "volvo 4686.0 139812.0\n", "bmw 7142.0 134817.0\n", "mercedes_benz 7209.0 132721.0\n", "opel 2834.0 132091.0\n", "volkswagen 4945.0 131187.0\n", "audi 7640.0 130967.0\n", "renault 2314.0 130734.0\n", "peugeot 3011.0 128540.0\n", "mazda 3773.0 127067.0\n", "ford 3421.0 126436.0\n", "seat 4188.0 123958.0\n", "citroen 3645.0 121257.0\n", "nissan 4511.0 121135.0\n", "fiat 2698.0 120208.0\n", "toyota 4984.0 117262.0\n", "skoda 6271.0 113009.0\n", "hyundai 5317.0 107836.0\n", "smart 3483.0 101444.0\n", "sonstige_autos 6692.0 101298.0\n", "mini 10281.0 91205.0" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_20_brands.sort_values('mean_milieage', ascending = False )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both mean price and mean mileage can be seen and from the table we can see that brands with higer mean mileage have low mean price, therefore It is impossible to conclude whether higher mileage is affecting the price. This is due to the fact that within a brand there are a lot of other variables affecting price (such as car type, engine type, registration year etc). In order to confirm whether mileage affects the price a slice of the dataset is necessary where all those variables are kept the same as much as possible." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datecrawlednamepriceabtestvehicletyperegistration_yeargearboxpowerpsmodelodometer_kmregistration_monthfueltypebrandunrepaired_damagead_creatednrofpicturespostalcodelastseen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARD5000.0controlbus2004manuell158andere150000.03lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik8500.0controllimousine1997automatik2867er150000.06benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_United8990.0testlimousine2009manuell102golf70000.07benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...4350.0controlkleinwagen2007automatik71fortwo70000.06benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...1350.0testkombi2003manuell0focus150000.07benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
52016-03-21 13:47:45Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...7900.0testbus2006automatik150voyager150000.04dieselchryslerNaN2016-03-21 00:00:000229622016-04-06 09:45:21
62016-03-20 17:55:21VW_Golf_III_GT_Special_Electronic_Green_Metall...300.0testlimousine1995manuell90golf150000.08benzinvolkswagenNaN2016-03-20 00:00:000315352016-03-23 02:48:59
72016-03-16 18:55:19Golf_IV_1.9_TDI_90PS1990.0controllimousine1998manuell90golf150000.012dieselvolkswagennein2016-03-16 00:00:000534742016-04-07 03:17:32
82016-03-22 16:51:34Seat_Arosa250.0testNaN2000manuell0arosa150000.010NaNseatnein2016-03-22 00:00:00074262016-03-26 18:18:10
92016-03-16 13:47:02Renault_Megane_Scenic_1.6e_RT_Klimaanlage590.0controlbus1997manuell90megane150000.07benzinrenaultnein2016-03-16 00:00:000157492016-04-06 10:46:35
\n", "
" ], "text/plain": [ " datecrawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... \n", "6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... \n", "7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS \n", "8 2016-03-22 16:51:34 Seat_Arosa \n", "9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage \n", "\n", " price abtest vehicletype registration_year gearbox powerps \\\n", "0 5000.0 control bus 2004 manuell 158 \n", "1 8500.0 control limousine 1997 automatik 286 \n", "2 8990.0 test limousine 2009 manuell 102 \n", "3 4350.0 control kleinwagen 2007 automatik 71 \n", "4 1350.0 test kombi 2003 manuell 0 \n", "5 7900.0 test bus 2006 automatik 150 \n", "6 300.0 test limousine 1995 manuell 90 \n", "7 1990.0 control limousine 1998 manuell 90 \n", "8 250.0 test NaN 2000 manuell 0 \n", "9 590.0 control bus 1997 manuell 90 \n", "\n", " model odometer_km registration_month fueltype brand \\\n", "0 andere 150000.0 3 lpg peugeot \n", "1 7er 150000.0 6 benzin bmw \n", "2 golf 70000.0 7 benzin volkswagen \n", "3 fortwo 70000.0 6 benzin smart \n", "4 focus 150000.0 7 benzin ford \n", "5 voyager 150000.0 4 diesel chrysler \n", "6 golf 150000.0 8 benzin volkswagen \n", "7 golf 150000.0 12 diesel volkswagen \n", "8 arosa 150000.0 10 NaN seat \n", "9 megane 150000.0 7 benzin renault \n", "\n", " unrepaired_damage ad_created nrofpictures postalcode \\\n", "0 nein 2016-03-26 00:00:00 0 79588 \n", "1 nein 2016-04-04 00:00:00 0 71034 \n", "2 nein 2016-03-26 00:00:00 0 35394 \n", "3 nein 2016-03-12 00:00:00 0 33729 \n", "4 nein 2016-04-01 00:00:00 0 39218 \n", "5 NaN 2016-03-21 00:00:00 0 22962 \n", "6 NaN 2016-03-20 00:00:00 0 31535 \n", "7 nein 2016-03-16 00:00:00 0 53474 \n", "8 nein 2016-03-22 00:00:00 0 7426 \n", "9 nein 2016-03-16 00:00:00 0 15749 \n", "\n", " lastseen \n", "0 2016-04-06 06:45:54 \n", "1 2016-04-06 14:45:08 \n", "2 2016-04-06 20:15:37 \n", "3 2016-03-15 03:16:28 \n", "4 2016-04-01 14:38:50 \n", "5 2016-04-06 09:45:21 \n", "6 2016-03-23 02:48:59 \n", "7 2016-04-07 03:17:32 \n", "8 2016-03-26 18:18:10 \n", "9 2016-04-06 10:46:35 " ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "vehicletype, gearbox and fueltype are three columns which have values in german." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',\n", " 'cabrio', 'andere'], dtype=object)" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"vehicletype\"].unique()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['manuell', 'automatik', nan], dtype=object)" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"gearbox\"].unique()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',\n", " 'andere'], dtype=object)" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"fueltype\"].unique()" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "#Identify categorical data that uses german words, translate them and map the values to their english counterparts\n", "translate_map = { 'kleinwagen' : \"small car\", \n", " 'kombi' : \"estate car\" ,\n", " 'cabrio' : \"convertible\", \n", " 'andere' : \"other\", \n", " 'elektro' : \"electric\",\n", " 'benzin' : 'petrol',\n", " 'manuell' : \"manual\", \n", " 'automatik' : \"automatic\",\n", " \"bus\" : \"bus\",\n", " \"limousine\" : \"limousine\",\n", " \"coupe\":\"coupe\",\n", " \"suv\" : \"suv\",\n", " \"lpg\" : \"lpg\",\n", " \"diesel\" : \"diesel\",\n", " \"cng\" : \"cng\",\n", " \"hybrid\": \"hybrid\",\n", " \"Unknown\" : \"Unknown\",\n", " \"nein\" : \"no\",\n", " \"ja\" : \"yes\", }\n", "\n", "\n", "categorical = [\"fueltype\", \"gearbox\", \"vehicletype\", \n", " \"unrepaired_damage\"]" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "for word in categorical:\n", " autos[word] = autos[word].map(translate_map)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['manual', 'automatic', 'Unknown'], dtype=object)" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#replace nan values with \"Unknown\"\n", "for c in categorical:\n", " autos.loc[ autos[c].isnull() , c ] = \"Unknown\"\n", "\n", "autos[\"gearbox\"].unique()" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['bus', 'limousine', 'small car', 'estate car', 'Unknown', 'coupe',\n", " 'suv', 'convertible', 'other'], dtype=object)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"vehicletype\"].unique()" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['lpg', 'petrol', 'diesel', 'Unknown', 'cng', 'hybrid', 'electric',\n", " 'other'], dtype=object)" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"fueltype\"].unique()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['no', 'Unknown', 'yes'], dtype=object)" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"unrepaired_damage\"].unique()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "#Data cleaning of the dates, Convert the dates to be uniform numeric data, so \"2016-03-21\" becomes the integer 20160321\n", "\n", "date_cols = [\"datecrawled\", \"ad_created\", \"lastseen\"]\n", "\n", "for c in date_cols:\n", " temp = autos[c].str[:10].str.replace(\"-\",\"\").astype(int)\n", " autos[c] = temp" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datecrawlednamepriceabtestvehicletyperegistration_yeargearboxpowerpsmodelodometer_kmregistration_monthfueltypebrandunrepaired_damagead_creatednrofpicturespostalcodelastseen
020160326Peugeot_807_160_NAVTECH_ON_BOARD5000.0controlbus2004manual158andere150000.03lpgpeugeotno2016032607958820160406
120160404BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik8500.0controllimousine1997automatic2867er150000.06petrolbmwno2016040407103420160406
220160326Volkswagen_Golf_1.6_United8990.0testlimousine2009manual102golf70000.07petrolvolkswagenno2016032603539420160406
320160312Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...4350.0controlsmall car2007automatic71fortwo70000.06petrolsmartno2016031203372920160315
420160401Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...1350.0testestate car2003manual0focus150000.07petrolfordno2016040103921820160401
520160321Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...7900.0testbus2006automatic150voyager150000.04dieselchryslerUnknown2016032102296220160406
620160320VW_Golf_III_GT_Special_Electronic_Green_Metall...300.0testlimousine1995manual90golf150000.08petrolvolkswagenUnknown2016032003153520160323
720160316Golf_IV_1.9_TDI_90PS1990.0controllimousine1998manual90golf150000.012dieselvolkswagenno2016031605347420160407
820160322Seat_Arosa250.0testUnknown2000manual0arosa150000.010Unknownseatno201603220742620160326
920160316Renault_Megane_Scenic_1.6e_RT_Klimaanlage590.0controlbus1997manual90megane150000.07petrolrenaultno2016031601574920160406
1020160315VW_Golf_Tuning_in_siber/grau999.0testUnknown2017manual90NaN150000.04petrolvolkswagenno2016031408615720160407
1120160316Mercedes_A140_Motorschaden350.0controlUnknown2000Unknown0NaN150000.00petrolmercedes_benzUnknown2016031601749820160316
1220160331Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan...5299.0controlsmall car2010automatic71fortwo50000.09petrolsmartno2016033103459020160406
1320160323Audi_A3_1.6_tuning1350.0controllimousine1999manual101a3150000.011petrolaudino2016032301204320160401
1420160323Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver...3999.0testsmall car2007manual75clio150000.09petrolrenaultUnknown2016032308173720160401
1520160401Corvette_C3_Coupe_T_Top_Crossfire_Injection18900.0testcoupe1982automatic203NaN80000.06petrolsonstige_autosno2016040106127620160402
1620160316Opel_Vectra_B_Kombi350.0testestate car1999manual101vectra150000.05petrolopelno2016031605729920160318
1720160329Volkswagen_Scirocco_2_G605500.0testcoupe1990manual205scirocco150000.06petrolvolkswagenno2016032907482120160405
1820160326Verkaufen_mein_bmw_e36_320_i_touring300.0controlbus1995manual1503er150000.00petrolbmwUnknown2016032605432920160402
1920160317mazda_tribute_2.0_mit_gas_und_tuev_neu_20184150.0controlsuv2004manual124andere150000.02lpgmazdano2016031704087820160317
\n", "
" ], "text/plain": [ " datecrawled name price \\\n", "0 20160326 Peugeot_807_160_NAVTECH_ON_BOARD 5000.0 \n", "1 20160404 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500.0 \n", "2 20160326 Volkswagen_Golf_1.6_United 8990.0 \n", "3 20160312 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350.0 \n", "4 20160401 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350.0 \n", "5 20160321 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... 7900.0 \n", "6 20160320 VW_Golf_III_GT_Special_Electronic_Green_Metall... 300.0 \n", "7 20160316 Golf_IV_1.9_TDI_90PS 1990.0 \n", "8 20160322 Seat_Arosa 250.0 \n", "9 20160316 Renault_Megane_Scenic_1.6e_RT_Klimaanlage 590.0 \n", "10 20160315 VW_Golf_Tuning_in_siber/grau 999.0 \n", "11 20160316 Mercedes_A140_Motorschaden 350.0 \n", "12 20160331 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... 5299.0 \n", "13 20160323 Audi_A3_1.6_tuning 1350.0 \n", "14 20160323 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... 3999.0 \n", "15 20160401 Corvette_C3_Coupe_T_Top_Crossfire_Injection 18900.0 \n", "16 20160316 Opel_Vectra_B_Kombi 350.0 \n", "17 20160329 Volkswagen_Scirocco_2_G60 5500.0 \n", "18 20160326 Verkaufen_mein_bmw_e36_320_i_touring 300.0 \n", "19 20160317 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 4150.0 \n", "\n", " abtest vehicletype registration_year gearbox powerps model \\\n", "0 control bus 2004 manual 158 andere \n", "1 control limousine 1997 automatic 286 7er \n", "2 test limousine 2009 manual 102 golf \n", "3 control small car 2007 automatic 71 fortwo \n", "4 test estate car 2003 manual 0 focus \n", "5 test bus 2006 automatic 150 voyager \n", "6 test limousine 1995 manual 90 golf \n", "7 control limousine 1998 manual 90 golf \n", "8 test Unknown 2000 manual 0 arosa \n", "9 control bus 1997 manual 90 megane \n", "10 test Unknown 2017 manual 90 NaN \n", "11 control Unknown 2000 Unknown 0 NaN \n", "12 control small car 2010 automatic 71 fortwo \n", "13 control limousine 1999 manual 101 a3 \n", "14 test small car 2007 manual 75 clio \n", "15 test coupe 1982 automatic 203 NaN \n", "16 test estate car 1999 manual 101 vectra \n", "17 test coupe 1990 manual 205 scirocco \n", "18 control bus 1995 manual 150 3er \n", "19 control suv 2004 manual 124 andere \n", "\n", " odometer_km registration_month fueltype brand \\\n", "0 150000.0 3 lpg peugeot \n", "1 150000.0 6 petrol bmw \n", "2 70000.0 7 petrol volkswagen \n", "3 70000.0 6 petrol smart \n", "4 150000.0 7 petrol ford \n", "5 150000.0 4 diesel chrysler \n", "6 150000.0 8 petrol volkswagen \n", "7 150000.0 12 diesel volkswagen \n", "8 150000.0 10 Unknown seat \n", "9 150000.0 7 petrol renault \n", "10 150000.0 4 petrol volkswagen \n", "11 150000.0 0 petrol mercedes_benz \n", "12 50000.0 9 petrol smart \n", "13 150000.0 11 petrol audi \n", "14 150000.0 9 petrol renault \n", "15 80000.0 6 petrol sonstige_autos \n", "16 150000.0 5 petrol opel \n", "17 150000.0 6 petrol volkswagen \n", "18 150000.0 0 petrol bmw \n", "19 150000.0 2 lpg mazda \n", "\n", " unrepaired_damage ad_created nrofpictures postalcode lastseen \n", "0 no 20160326 0 79588 20160406 \n", "1 no 20160404 0 71034 20160406 \n", "2 no 20160326 0 35394 20160406 \n", "3 no 20160312 0 33729 20160315 \n", "4 no 20160401 0 39218 20160401 \n", "5 Unknown 20160321 0 22962 20160406 \n", "6 Unknown 20160320 0 31535 20160323 \n", "7 no 20160316 0 53474 20160407 \n", "8 no 20160322 0 7426 20160326 \n", "9 no 20160316 0 15749 20160406 \n", "10 no 20160314 0 86157 20160407 \n", "11 Unknown 20160316 0 17498 20160316 \n", "12 no 20160331 0 34590 20160406 \n", "13 no 20160323 0 12043 20160401 \n", "14 Unknown 20160323 0 81737 20160401 \n", "15 no 20160401 0 61276 20160402 \n", "16 no 20160316 0 57299 20160318 \n", "17 no 20160329 0 74821 20160405 \n", "18 Unknown 20160326 0 54329 20160402 \n", "19 no 20160317 0 40878 20160317 " ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "finding the most common brand/model combinations; this can be done using the aggregation method" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',\n", " 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',\n", " 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',\n", " 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',\n", " 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',\n", " 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',\n", " 'rover', 'daihatsu', 'lancia'], dtype=object)" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unique_brands = autos['brand'].unique()\n", "\n", "unique_brands" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'peugeot': 'andere',\n", " 'bmw': 'z_reihe',\n", " 'volkswagen': 'up',\n", " 'smart': 'roadster',\n", " 'ford': 'transit',\n", " 'chrysler': 'voyager',\n", " 'seat': 'toledo',\n", " 'renault': 'twingo',\n", " 'mercedes_benz': 'vito',\n", " 'audi': 'tt',\n", " 'sonstige_autos': nan,\n", " 'opel': 'zafira',\n", " 'mazda': 'rx_reihe',\n", " 'porsche': 'cayenne',\n", " 'mini': 'one',\n", " 'toyota': 'yaris',\n", " 'dacia': 'sandero',\n", " 'nissan': 'x_trail',\n", " 'jeep': 'wrangler',\n", " 'saab': 'andere',\n", " 'volvo': 'xc_reihe',\n", " 'mitsubishi': 'pajero',\n", " 'jaguar': 'x_type',\n", " 'fiat': 'stilo',\n", " 'skoda': 'yeti',\n", " 'subaru': 'legacy',\n", " 'kia': 'sportage',\n", " 'citroen': 'c5',\n", " 'chevrolet': 'spark',\n", " 'hyundai': 'tucson',\n", " 'honda': 'jazz',\n", " 'daewoo': 'nubira',\n", " 'suzuki': 'swift',\n", " 'trabant': 'andere',\n", " 'land_rover': 'range_rover_sport',\n", " 'alfa_romeo': 'spider',\n", " 'lada': 'samara',\n", " 'rover': 'rangerover',\n", " 'daihatsu': 'terios',\n", " 'lancia': 'ypsilon'}" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict_model_brand = {}\n", "\n", "for ub in unique_brands:\n", " brands_tem = autos.loc[ autos[\"brand\"] == ub , \"model\" ].value_counts().index.max()\n", " dict_model_brand[ub] = brands_tem\n", " \n", "dict_model_brand" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "using aggregation to see if average prices follows any patterns based on the mileage." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[10000.0,\n", " 20000.0,\n", " 30000.0,\n", " 40000.0,\n", " 50000.0,\n", " 60000.0,\n", " 70000.0,\n", " 80000.0,\n", " 90000.0,\n", " 100000.0,\n", " 125000.0,\n", " 150000.0,\n", " nan]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "groups_odometer = autos[\"odometer_km\"].unique().tolist()\n", "\n", "groups_odometer = sorted(groups_odometer)\n", "\n", "groups_odometer" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{10000.0: 12652.79,\n", " 20000.0: 12008.03,\n", " 30000.0: 12416.99,\n", " 40000.0: 12365.93,\n", " 50000.0: 11109.58,\n", " 60000.0: 10098.24,\n", " 70000.0: 9450.56,\n", " 80000.0: 8523.17,\n", " 90000.0: 7448.37,\n", " 100000.0: 6919.56,\n", " 125000.0: 5637.48,\n", " 150000.0: 3519.07,\n", " nan: nan}" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dict_avg_price_mil = {}\n", "\n", "for kilom in groups_odometer:\n", " temp = autos.loc[autos[\"odometer_km\"] == kilom,\n", " \"price\" ].mean()\n", " temp = round(temp,2)\n", " dict_avg_price_mil[kilom] = temp\n", " \n", "dict_avg_price_mil" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The average car price is decreasing with increase in mileage. The reason for this could be cars with damage but further analysis is needed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking how cheaper are cars with damage than their non-damaged counterparts" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['no', 'Unknown', 'yes'], dtype=object)" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['unrepaired_damage'].unique()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "no 35232\n", "Unknown 9829\n", "yes 4939\n", "Name: unrepaired_damage, dtype: int64" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['unrepaired_damage'].value_counts()" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "damage_not_damaged = ['no', 'Unknown', 'yes']" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'no': 6091.86273770872,\n", " 'Unknown': 2768.6617932726717,\n", " 'yes': 2043.6258876039765}" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "damage_not_damaged_counterparts = {}\n", "\n", "for up in damage_not_damaged:\n", " val = autos.loc[autos[\"unrepaired_damage\"] == up, \"price\" ].mean()\n", " damage_not_damaged_counterparts[up] = val\n", " \n", "damage_not_damaged_counterparts" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4048.236850104743" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diff_in_price = damage_not_damaged_counterparts[\"no\"] - damage_not_damaged_counterparts[\"yes\"]\n", "\n", "diff_in_price" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a difference of about 4000 dollars in average price of damaged and non damaged cars" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The damaged price is more expensive than the non-damaged car" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 2 }