{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Exploring Ebay Car Sales Data\n", "\n", "we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The aim of this project is to clean the data and analyze the included used car listings." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#we import the NumPy and Pandas library and read the data set\n", "import numpy as np\n", "import pandas as pd\n", "\n", "autos = pd.read_csv(\"autos.csv\", encoding = \"Windows-1252\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateCrawlednamesellerofferTypepriceabtestvehicleTypeyearOfRegistrationgearboxpowerPSmodelodometermonthOfRegistrationfuelTypebrandnotRepairedDamagedateCreatednrOfPicturespostalCodelastSeen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
52016-03-21 13:47:45Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...privatAngebot$7,900testbus2006automatik150voyager150,000km4dieselchryslerNaN2016-03-21 00:00:000229622016-04-06 09:45:21
62016-03-20 17:55:21VW_Golf_III_GT_Special_Electronic_Green_Metall...privatAngebot$300testlimousine1995manuell90golf150,000km8benzinvolkswagenNaN2016-03-20 00:00:000315352016-03-23 02:48:59
72016-03-16 18:55:19Golf_IV_1.9_TDI_90PSprivatAngebot$1,990controllimousine1998manuell90golf150,000km12dieselvolkswagennein2016-03-16 00:00:000534742016-04-07 03:17:32
82016-03-22 16:51:34Seat_ArosaprivatAngebot$250testNaN2000manuell0arosa150,000km10NaNseatnein2016-03-22 00:00:00074262016-03-26 18:18:10
92016-03-16 13:47:02Renault_Megane_Scenic_1.6e_RT_KlimaanlageprivatAngebot$590controlbus1997manuell90megane150,000km7benzinrenaultnein2016-03-16 00:00:000157492016-04-06 10:46:35
102016-03-15 01:41:36VW_Golf_Tuning_in_siber/grauprivatAngebot$999testNaN2017manuell90NaN150,000km4benzinvolkswagennein2016-03-14 00:00:000861572016-04-07 03:16:21
112016-03-16 18:45:34Mercedes_A140_MotorschadenprivatAngebot$350controlNaN2000NaN0NaN150,000km0benzinmercedes_benzNaN2016-03-16 00:00:000174982016-03-16 18:45:34
122016-03-31 19:48:22Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan...privatAngebot$5,299controlkleinwagen2010automatik71fortwo50,000km9benzinsmartnein2016-03-31 00:00:000345902016-04-06 14:17:52
132016-03-23 10:48:32Audi_A3_1.6_tuningprivatAngebot$1,350controllimousine1999manuell101a3150,000km11benzinaudinein2016-03-23 00:00:000120432016-04-01 14:17:13
142016-03-23 11:50:46Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver...privatAngebot$3,999testkleinwagen2007manuell75clio150,000km9benzinrenaultNaN2016-03-23 00:00:000817372016-04-01 15:46:47
152016-04-01 12:06:20Corvette_C3_Coupe_T_Top_Crossfire_InjectionprivatAngebot$18,900testcoupe1982automatik203NaN80,000km6benzinsonstige_autosnein2016-04-01 00:00:000612762016-04-02 21:10:48
162016-03-16 14:59:02Opel_Vectra_B_KombiprivatAngebot$350testkombi1999manuell101vectra150,000km5benzinopelnein2016-03-16 00:00:000572992016-03-18 05:29:37
172016-03-29 11:46:22Volkswagen_Scirocco_2_G60privatAngebot$5,500testcoupe1990manuell205scirocco150,000km6benzinvolkswagennein2016-03-29 00:00:000748212016-04-05 20:46:26
182016-03-26 19:57:44Verkaufen_mein_bmw_e36_320_i_touringprivatAngebot$300controlbus1995manuell1503er150,000km0benzinbmwNaN2016-03-26 00:00:000543292016-04-02 12:16:41
192016-03-17 13:36:21mazda_tribute_2.0_mit_gas_und_tuev_neu_2018privatAngebot$4,150controlsuv2004manuell124andere150,000km2lpgmazdanein2016-03-17 00:00:000408782016-03-17 14:45:58
202016-03-05 19:57:31Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*...privatAngebot$3,500testkombi2003manuell131a4150,000km5dieselaudiNaN2016-03-05 00:00:000539132016-03-07 05:46:46
212016-03-06 19:07:10Porsche_911_Carrera_4S_CabrioprivatAngebot$41,500testcabrio2004manuell320911150,000km4benzinporschenein2016-03-06 00:00:000654282016-04-05 23:46:19
222016-03-28 20:50:54MINI_Cooper_S_CabrioprivatAngebot$25,450controlcabrio2015manuell184cooper10,000km1benzinmininein2016-03-28 00:00:000447892016-04-01 06:45:30
232016-03-10 19:55:34Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_KlimaprivatAngebot$7,999controlbus2010manuell120NaN150,000km2dieselpeugeotnein2016-03-10 00:00:000309002016-03-17 08:45:17
242016-04-03 11:57:02BMW_535i_xDrive_Sport_Aut.privatAngebot$48,500controllimousine2014automatik3065er30,000km12benzinbmwnein2016-04-03 00:00:000225472016-04-07 13:16:50
252016-03-21 21:56:18Ford_escort_kombi_an_bastler_mit_ghia_ausstattungprivatAngebot$90controlkombi1996manuell116NaN150,000km4benzinfordja2016-03-21 00:00:000275742016-04-01 05:16:49
262016-04-03 22:46:28Volkswagen_Polo_FoxprivatAngebot$777controlkleinwagen1992manuell54polo125,000km2benzinvolkswagennein2016-04-03 00:00:000381102016-04-05 23:46:48
272016-03-27 18:45:01Hat_einer_Ahnung_mit_Ford_Galaxy_HILFEprivatAngebot$0controlNaN2005NaN0NaN150,000km0NaNfordNaN2016-03-27 00:00:000667012016-03-27 18:45:01
282016-03-19 21:56:19MINI_Cooper_DprivatAngebot$5,250controlkleinwagen2007manuell110cooper150,000km7dieselminija2016-03-19 00:00:000157452016-04-07 14:58:48
292016-04-02 12:45:44Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz...privatAngebot$4,999testkombi2004automatik204e_klasse150,000km10dieselmercedes_benznein2016-04-02 00:00:000476382016-04-02 12:45:44
...............................................................
499702016-03-21 22:47:37c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp...privatAngebot$15,800controlbus2010automatik136c460,000km4dieselcitroennein2016-03-21 00:00:000149472016-04-07 04:17:34
499712016-03-29 14:54:12W.Lupo_1.0privatAngebot$950testkleinwagen2001manuell50lupo150,000km4benzinvolkswagennein2016-03-29 00:00:000651972016-03-29 20:41:51
499722016-03-26 22:25:23Mercedes_Benz_Vito_115_CDI_Extralang_Aut.privatAngebot$3,300controlbus2004automatik150vito150,000km10dieselmercedes_benzja2016-03-26 00:00:000653262016-03-28 11:28:18
499732016-03-27 05:32:39Mercedes_Benz_SLK_200_KompressorprivatAngebot$6,000controlcabrio2004manuell163slk150,000km11benzinmercedes_benznein2016-03-27 00:00:000535672016-03-27 08:25:24
499742016-03-20 10:52:31Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing...privatAngebot$0controlcabrio1983manuell70golf150,000km2benzinvolkswagennein2016-03-20 00:00:00082092016-03-27 19:48:16
499752016-03-27 20:51:39Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_ComfortprivatAngebot$9,700controlkleinwagen2012automatik88jazz100,000km11hybridhondanein2016-03-27 00:00:000843852016-04-05 19:45:34
499762016-03-19 18:56:05Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga...privatAngebot$5,900testkombi1992automatik15080150,000km12benzinaudinein2016-03-19 00:00:000361002016-04-07 06:16:44
499772016-03-31 18:37:18Mercedes_Benz_C200_Cdi_W203privatAngebot$5,500controllimousine2003manuell116c_klasse150,000km2dieselmercedes_benznein2016-03-31 00:00:000337392016-04-06 12:16:11
499782016-04-04 10:37:14Mercedes_Benz_E_200_ClassicprivatAngebot$900controllimousine1996automatik136e_klasse150,000km9benzinmercedes_benzja2016-04-04 00:00:000244052016-04-06 12:44:20
499792016-03-20 18:38:40Volkswagen_Polo_1.6_TDI_StyleprivatAngebot$11,000testkleinwagen2011manuell90polo70,000km11dieselvolkswagennein2016-03-20 00:00:000484552016-04-07 01:45:12
499802016-03-12 10:55:54Ford_Escort_Turnier_16VprivatAngebot$400controlkombi1995manuell105escort125,000km3benzinfordNaN2016-03-12 00:00:000562182016-04-06 17:16:49
499812016-03-15 09:38:21Opel_Astra_Kombi_mit_AnhaengerkupplungprivatAngebot$2,000controlkombi1998manuell115astra150,000km12benzinopelnein2016-03-15 00:00:000868592016-04-05 17:21:46
499822016-03-29 18:51:08Skoda_Fabia_4_Tuerer_Bj:2004__85.000TkmprivatAngebot$1,950controlkleinwagen2004manuell0fabia90,000km7benzinskodaNaN2016-03-29 00:00:000458842016-03-29 18:51:08
499832016-03-06 12:43:04Ford_focus_99privatAngebot$600testkleinwagen1999manuell101focus150,000km4benzinfordNaN2016-03-06 00:00:000524772016-03-09 06:16:08
499842016-03-31 22:48:48Student_sucht_ein__Anfaengerauto___ab_2000_BJ_...privatAngebot$0testNaN2000NaN0NaN150,000km0NaNsonstige_autosNaN2016-03-31 00:00:000121032016-04-02 19:44:53
499852016-04-02 16:38:23Verkaufe_meinen_vw_vento!privatAngebot$1,000controlNaN1995automatik0NaN150,000km0benzinvolkswagenNaN2016-04-02 00:00:000309002016-04-06 15:17:52
499862016-04-04 20:46:02Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst...privatAngebot$15,900controllimousine2010automatik218300c125,000km11dieselchryslernein2016-04-04 00:00:000735272016-04-06 23:16:00
499872016-03-22 20:47:27Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__...privatAngebot$21,990controllimousine2013manuell150a350,000km11dieselaudinein2016-03-22 00:00:000943622016-03-26 22:46:06
499882016-03-28 19:49:51BMW_330_CiprivatAngebot$9,550controlcoupe2001manuell2313er150,000km10benzinbmwnein2016-03-28 00:00:000836462016-04-07 02:17:40
499892016-03-11 19:50:37VW_Polo_zum_Ausschlachten_oder_WiederaufbauprivatAngebot$150testkleinwagen1997manuell0polo150,000km5benzinvolkswagenja2016-03-11 00:00:000212442016-03-12 10:17:55
499902016-03-21 19:54:19Mercedes_Benz_A_200__BlueEFFICIENCY__UrbanprivatAngebot$17,500testlimousine2012manuell156a_klasse30,000km12benzinmercedes_benznein2016-03-21 00:00:000582392016-04-06 22:46:57
499912016-03-06 15:25:19KleinwagenprivatAngebot$500controlNaN2016manuell0twingo150,000km0benzinrenaultNaN2016-03-06 00:00:000613502016-03-06 18:24:19
499922016-03-10 19:37:38Fiat_Grande_Punto_1.4_T_Jet_16V_SportprivatAngebot$4,800controlkleinwagen2009manuell120andere125,000km9lpgfiatnein2016-03-10 00:00:000686422016-03-13 01:44:51
499932016-03-15 18:47:35Audi_A3__1_8l__Silber;_schoenes_FahrzeugprivatAngebot$1,650controlkleinwagen1997manuell0NaN150,000km7benzinaudiNaN2016-03-15 00:00:000652032016-04-06 19:46:53
499942016-03-22 17:36:42Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc...privatAngebot$5,000controlkombi2001automatik299a6150,000km1benzinaudinein2016-03-22 00:00:000465372016-04-06 08:16:39
499952016-03-27 14:38:19Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__XenonprivatAngebot$24,900controllimousine2011automatik239q5100,000km1dieselaudinein2016-03-27 00:00:000821312016-04-01 13:47:40
499962016-03-28 10:50:25Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...privatAngebot$1,980controlcabrio1996manuell75astra150,000km5benzinopelnein2016-03-28 00:00:000448072016-04-02 14:18:02
499972016-04-02 14:44:48Fiat_500_C_1.2_Dualogic_LoungeprivatAngebot$13,200testcabrio2014automatik695005,000km11benzinfiatnein2016-04-02 00:00:000734302016-04-04 11:47:27
499982016-03-08 19:25:42Audi_A3_2.0_TDI_Sportback_AmbitionprivatAngebot$22,900controlkombi2013manuell150a340,000km11dieselaudinein2016-03-08 00:00:000356832016-04-05 16:45:07
499992016-03-14 00:42:12Opel_Vectra_1.6_16VprivatAngebot$1,250controllimousine1996manuell101vectra150,000km1benzinopelnein2016-03-13 00:00:000458972016-04-06 21:18:48
\n", "

50000 rows × 20 columns

\n", "
" ], "text/plain": [ " dateCrawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... \n", "6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... \n", "7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS \n", "8 2016-03-22 16:51:34 Seat_Arosa \n", "9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage \n", "10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau \n", "11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden \n", "12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... \n", "13 2016-03-23 10:48:32 Audi_A3_1.6_tuning \n", "14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... \n", "15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection \n", "16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi \n", "17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 \n", "18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring \n", "19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 \n", "20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... \n", "21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio \n", "22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio \n", "23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima \n", "24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. \n", "25 2016-03-21 21:56:18 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung \n", "26 2016-04-03 22:46:28 Volkswagen_Polo_Fox \n", "27 2016-03-27 18:45:01 Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE \n", "28 2016-03-19 21:56:19 MINI_Cooper_D \n", "29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... \n", "... ... ... \n", "49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... \n", "49971 2016-03-29 14:54:12 W.Lupo_1.0 \n", "49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. \n", "49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor \n", "49974 2016-03-20 10:52:31 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... \n", "49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort \n", "49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... \n", "49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 \n", "49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic \n", "49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style \n", "49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V \n", "49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung \n", "49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm \n", "49983 2016-03-06 12:43:04 Ford_focus_99 \n", "49984 2016-03-31 22:48:48 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... \n", "49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! \n", "49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... \n", "49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... \n", "49988 2016-03-28 19:49:51 BMW_330_Ci \n", "49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau \n", "49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban \n", "49991 2016-03-06 15:25:19 Kleinwagen \n", "49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport \n", "49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug \n", "49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... \n", "49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon \n", "49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... \n", "49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge \n", "49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition \n", "49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V \n", "\n", " seller offerType price abtest vehicleType yearOfRegistration \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "5 privat Angebot $7,900 test bus 2006 \n", "6 privat Angebot $300 test limousine 1995 \n", "7 privat Angebot $1,990 control limousine 1998 \n", "8 privat Angebot $250 test NaN 2000 \n", "9 privat Angebot $590 control bus 1997 \n", "10 privat Angebot $999 test NaN 2017 \n", "11 privat Angebot $350 control NaN 2000 \n", "12 privat Angebot $5,299 control kleinwagen 2010 \n", "13 privat Angebot $1,350 control limousine 1999 \n", "14 privat Angebot $3,999 test kleinwagen 2007 \n", "15 privat Angebot $18,900 test coupe 1982 \n", "16 privat Angebot $350 test kombi 1999 \n", "17 privat Angebot $5,500 test coupe 1990 \n", "18 privat Angebot $300 control bus 1995 \n", "19 privat Angebot $4,150 control suv 2004 \n", "20 privat Angebot $3,500 test kombi 2003 \n", "21 privat Angebot $41,500 test cabrio 2004 \n", "22 privat Angebot $25,450 control cabrio 2015 \n", "23 privat Angebot $7,999 control bus 2010 \n", "24 privat Angebot $48,500 control limousine 2014 \n", "25 privat Angebot $90 control kombi 1996 \n", "26 privat Angebot $777 control kleinwagen 1992 \n", "27 privat Angebot $0 control NaN 2005 \n", "28 privat Angebot $5,250 control kleinwagen 2007 \n", "29 privat Angebot $4,999 test kombi 2004 \n", "... ... ... ... ... ... ... \n", "49970 privat Angebot $15,800 control bus 2010 \n", "49971 privat Angebot $950 test kleinwagen 2001 \n", "49972 privat Angebot $3,300 control bus 2004 \n", "49973 privat Angebot $6,000 control cabrio 2004 \n", "49974 privat Angebot $0 control cabrio 1983 \n", "49975 privat Angebot $9,700 control kleinwagen 2012 \n", "49976 privat Angebot $5,900 test kombi 1992 \n", "49977 privat Angebot $5,500 control limousine 2003 \n", "49978 privat Angebot $900 control limousine 1996 \n", "49979 privat Angebot $11,000 test kleinwagen 2011 \n", "49980 privat Angebot $400 control kombi 1995 \n", "49981 privat Angebot $2,000 control kombi 1998 \n", "49982 privat Angebot $1,950 control kleinwagen 2004 \n", "49983 privat Angebot $600 test kleinwagen 1999 \n", "49984 privat Angebot $0 test NaN 2000 \n", "49985 privat Angebot $1,000 control NaN 1995 \n", "49986 privat Angebot $15,900 control limousine 2010 \n", "49987 privat Angebot $21,990 control limousine 2013 \n", "49988 privat Angebot $9,550 control coupe 2001 \n", "49989 privat Angebot $150 test kleinwagen 1997 \n", "49990 privat Angebot $17,500 test limousine 2012 \n", "49991 privat Angebot $500 control NaN 2016 \n", "49992 privat Angebot $4,800 control kleinwagen 2009 \n", "49993 privat Angebot $1,650 control kleinwagen 1997 \n", "49994 privat Angebot $5,000 control kombi 2001 \n", "49995 privat Angebot $24,900 control limousine 2011 \n", "49996 privat Angebot $1,980 control cabrio 1996 \n", "49997 privat Angebot $13,200 test cabrio 2014 \n", "49998 privat Angebot $22,900 control kombi 2013 \n", "49999 privat Angebot $1,250 control limousine 1996 \n", "\n", " gearbox powerPS model odometer monthOfRegistration fuelType \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "5 automatik 150 voyager 150,000km 4 diesel \n", "6 manuell 90 golf 150,000km 8 benzin \n", "7 manuell 90 golf 150,000km 12 diesel \n", "8 manuell 0 arosa 150,000km 10 NaN \n", "9 manuell 90 megane 150,000km 7 benzin \n", "10 manuell 90 NaN 150,000km 4 benzin \n", "11 NaN 0 NaN 150,000km 0 benzin \n", "12 automatik 71 fortwo 50,000km 9 benzin \n", "13 manuell 101 a3 150,000km 11 benzin \n", "14 manuell 75 clio 150,000km 9 benzin \n", "15 automatik 203 NaN 80,000km 6 benzin \n", "16 manuell 101 vectra 150,000km 5 benzin \n", "17 manuell 205 scirocco 150,000km 6 benzin \n", "18 manuell 150 3er 150,000km 0 benzin \n", "19 manuell 124 andere 150,000km 2 lpg \n", "20 manuell 131 a4 150,000km 5 diesel \n", "21 manuell 320 911 150,000km 4 benzin \n", "22 manuell 184 cooper 10,000km 1 benzin \n", "23 manuell 120 NaN 150,000km 2 diesel \n", "24 automatik 306 5er 30,000km 12 benzin \n", "25 manuell 116 NaN 150,000km 4 benzin \n", "26 manuell 54 polo 125,000km 2 benzin \n", "27 NaN 0 NaN 150,000km 0 NaN \n", "28 manuell 110 cooper 150,000km 7 diesel \n", "29 automatik 204 e_klasse 150,000km 10 diesel \n", "... ... ... ... ... ... ... \n", "49970 automatik 136 c4 60,000km 4 diesel \n", "49971 manuell 50 lupo 150,000km 4 benzin \n", "49972 automatik 150 vito 150,000km 10 diesel \n", "49973 manuell 163 slk 150,000km 11 benzin \n", "49974 manuell 70 golf 150,000km 2 benzin \n", "49975 automatik 88 jazz 100,000km 11 hybrid \n", "49976 automatik 150 80 150,000km 12 benzin \n", "49977 manuell 116 c_klasse 150,000km 2 diesel \n", "49978 automatik 136 e_klasse 150,000km 9 benzin \n", "49979 manuell 90 polo 70,000km 11 diesel \n", "49980 manuell 105 escort 125,000km 3 benzin \n", "49981 manuell 115 astra 150,000km 12 benzin \n", "49982 manuell 0 fabia 90,000km 7 benzin \n", "49983 manuell 101 focus 150,000km 4 benzin \n", "49984 NaN 0 NaN 150,000km 0 NaN \n", "49985 automatik 0 NaN 150,000km 0 benzin \n", "49986 automatik 218 300c 125,000km 11 diesel \n", "49987 manuell 150 a3 50,000km 11 diesel \n", "49988 manuell 231 3er 150,000km 10 benzin \n", "49989 manuell 0 polo 150,000km 5 benzin \n", "49990 manuell 156 a_klasse 30,000km 12 benzin \n", "49991 manuell 0 twingo 150,000km 0 benzin \n", "49992 manuell 120 andere 125,000km 9 lpg \n", "49993 manuell 0 NaN 150,000km 7 benzin \n", "49994 automatik 299 a6 150,000km 1 benzin \n", "49995 automatik 239 q5 100,000km 1 diesel \n", "49996 manuell 75 astra 150,000km 5 benzin \n", "49997 automatik 69 500 5,000km 11 benzin \n", "49998 manuell 150 a3 40,000km 11 diesel \n", "49999 manuell 101 vectra 150,000km 1 benzin \n", "\n", " brand notRepairedDamage dateCreated nrOfPictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "5 chrysler NaN 2016-03-21 00:00:00 0 \n", "6 volkswagen NaN 2016-03-20 00:00:00 0 \n", "7 volkswagen nein 2016-03-16 00:00:00 0 \n", "8 seat nein 2016-03-22 00:00:00 0 \n", "9 renault nein 2016-03-16 00:00:00 0 \n", "10 volkswagen nein 2016-03-14 00:00:00 0 \n", "11 mercedes_benz NaN 2016-03-16 00:00:00 0 \n", "12 smart nein 2016-03-31 00:00:00 0 \n", "13 audi nein 2016-03-23 00:00:00 0 \n", "14 renault NaN 2016-03-23 00:00:00 0 \n", "15 sonstige_autos nein 2016-04-01 00:00:00 0 \n", "16 opel nein 2016-03-16 00:00:00 0 \n", "17 volkswagen nein 2016-03-29 00:00:00 0 \n", "18 bmw NaN 2016-03-26 00:00:00 0 \n", "19 mazda nein 2016-03-17 00:00:00 0 \n", "20 audi NaN 2016-03-05 00:00:00 0 \n", "21 porsche nein 2016-03-06 00:00:00 0 \n", "22 mini nein 2016-03-28 00:00:00 0 \n", "23 peugeot nein 2016-03-10 00:00:00 0 \n", "24 bmw nein 2016-04-03 00:00:00 0 \n", "25 ford ja 2016-03-21 00:00:00 0 \n", "26 volkswagen nein 2016-04-03 00:00:00 0 \n", "27 ford NaN 2016-03-27 00:00:00 0 \n", "28 mini ja 2016-03-19 00:00:00 0 \n", "29 mercedes_benz nein 2016-04-02 00:00:00 0 \n", "... ... ... ... ... \n", "49970 citroen nein 2016-03-21 00:00:00 0 \n", "49971 volkswagen nein 2016-03-29 00:00:00 0 \n", "49972 mercedes_benz ja 2016-03-26 00:00:00 0 \n", "49973 mercedes_benz nein 2016-03-27 00:00:00 0 \n", "49974 volkswagen nein 2016-03-20 00:00:00 0 \n", "49975 honda nein 2016-03-27 00:00:00 0 \n", "49976 audi nein 2016-03-19 00:00:00 0 \n", "49977 mercedes_benz nein 2016-03-31 00:00:00 0 \n", "49978 mercedes_benz ja 2016-04-04 00:00:00 0 \n", "49979 volkswagen nein 2016-03-20 00:00:00 0 \n", "49980 ford NaN 2016-03-12 00:00:00 0 \n", "49981 opel nein 2016-03-15 00:00:00 0 \n", "49982 skoda NaN 2016-03-29 00:00:00 0 \n", "49983 ford NaN 2016-03-06 00:00:00 0 \n", "49984 sonstige_autos NaN 2016-03-31 00:00:00 0 \n", "49985 volkswagen NaN 2016-04-02 00:00:00 0 \n", "49986 chrysler nein 2016-04-04 00:00:00 0 \n", "49987 audi nein 2016-03-22 00:00:00 0 \n", "49988 bmw nein 2016-03-28 00:00:00 0 \n", "49989 volkswagen ja 2016-03-11 00:00:00 0 \n", "49990 mercedes_benz nein 2016-03-21 00:00:00 0 \n", "49991 renault NaN 2016-03-06 00:00:00 0 \n", "49992 fiat nein 2016-03-10 00:00:00 0 \n", "49993 audi NaN 2016-03-15 00:00:00 0 \n", "49994 audi nein 2016-03-22 00:00:00 0 \n", "49995 audi nein 2016-03-27 00:00:00 0 \n", "49996 opel nein 2016-03-28 00:00:00 0 \n", "49997 fiat nein 2016-04-02 00:00:00 0 \n", "49998 audi nein 2016-03-08 00:00:00 0 \n", "49999 opel nein 2016-03-13 00:00:00 0 \n", "\n", " postalCode lastSeen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 \n", "5 22962 2016-04-06 09:45:21 \n", "6 31535 2016-03-23 02:48:59 \n", "7 53474 2016-04-07 03:17:32 \n", "8 7426 2016-03-26 18:18:10 \n", "9 15749 2016-04-06 10:46:35 \n", "10 86157 2016-04-07 03:16:21 \n", "11 17498 2016-03-16 18:45:34 \n", "12 34590 2016-04-06 14:17:52 \n", "13 12043 2016-04-01 14:17:13 \n", "14 81737 2016-04-01 15:46:47 \n", "15 61276 2016-04-02 21:10:48 \n", "16 57299 2016-03-18 05:29:37 \n", "17 74821 2016-04-05 20:46:26 \n", "18 54329 2016-04-02 12:16:41 \n", "19 40878 2016-03-17 14:45:58 \n", "20 53913 2016-03-07 05:46:46 \n", "21 65428 2016-04-05 23:46:19 \n", "22 44789 2016-04-01 06:45:30 \n", "23 30900 2016-03-17 08:45:17 \n", "24 22547 2016-04-07 13:16:50 \n", "25 27574 2016-04-01 05:16:49 \n", "26 38110 2016-04-05 23:46:48 \n", "27 66701 2016-03-27 18:45:01 \n", "28 15745 2016-04-07 14:58:48 \n", "29 47638 2016-04-02 12:45:44 \n", "... ... ... \n", "49970 14947 2016-04-07 04:17:34 \n", "49971 65197 2016-03-29 20:41:51 \n", "49972 65326 2016-03-28 11:28:18 \n", "49973 53567 2016-03-27 08:25:24 \n", "49974 8209 2016-03-27 19:48:16 \n", "49975 84385 2016-04-05 19:45:34 \n", "49976 36100 2016-04-07 06:16:44 \n", "49977 33739 2016-04-06 12:16:11 \n", "49978 24405 2016-04-06 12:44:20 \n", "49979 48455 2016-04-07 01:45:12 \n", "49980 56218 2016-04-06 17:16:49 \n", "49981 86859 2016-04-05 17:21:46 \n", "49982 45884 2016-03-29 18:51:08 \n", "49983 52477 2016-03-09 06:16:08 \n", "49984 12103 2016-04-02 19:44:53 \n", "49985 30900 2016-04-06 15:17:52 \n", "49986 73527 2016-04-06 23:16:00 \n", "49987 94362 2016-03-26 22:46:06 \n", "49988 83646 2016-04-07 02:17:40 \n", "49989 21244 2016-03-12 10:17:55 \n", "49990 58239 2016-04-06 22:46:57 \n", "49991 61350 2016-03-06 18:24:19 \n", "49992 68642 2016-03-13 01:44:51 \n", "49993 65203 2016-04-06 19:46:53 \n", "49994 46537 2016-04-06 08:16:39 \n", "49995 82131 2016-04-01 13:47:40 \n", "49996 44807 2016-04-02 14:18:02 \n", "49997 73430 2016-04-04 11:47:27 \n", "49998 35683 2016-04-05 16:45:07 \n", "49999 45897 2016-04-06 21:18:48 \n", "\n", "[50000 rows x 20 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Data columns (total 20 columns):\n", "dateCrawled 50000 non-null object\n", "name 50000 non-null object\n", "seller 50000 non-null object\n", "offerType 50000 non-null object\n", "price 50000 non-null object\n", "abtest 50000 non-null object\n", "vehicleType 44905 non-null object\n", "yearOfRegistration 50000 non-null int64\n", "gearbox 47320 non-null object\n", "powerPS 50000 non-null int64\n", "model 47242 non-null object\n", "odometer 50000 non-null object\n", "monthOfRegistration 50000 non-null int64\n", "fuelType 45518 non-null object\n", "brand 50000 non-null object\n", "notRepairedDamage 40171 non-null object\n", "dateCreated 50000 non-null object\n", "nrOfPictures 50000 non-null int64\n", "postalCode 50000 non-null int64\n", "lastSeen 50000 non-null object\n", "dtypes: int64(5), object(15)\n", "memory usage: 7.6+ MB\n" ] } ], "source": [ "autos.info()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateCrawlednamesellerofferTypepriceabtestvehicleTypeyearOfRegistrationgearboxpowerPSmodelodometermonthOfRegistrationfuelTypebrandnotRepairedDamagedateCreatednrOfPicturespostalCodelastSeen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " dateCrawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offerType price abtest vehicleType yearOfRegistration \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox powerPS model odometer monthOfRegistration fuelType \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand notRepairedDamage dateCreated nrOfPictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postalCode lastSeen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Quick observations:**\n", "\n", " - The dataset contains 20 columns, most of which are strings.\n", " - The columns 'vehicleType', 'gearbox', 'gearbox', 'model', and 'notRepairedDamage' have some null objects. But none of them have more than ~20% null values.\n", " - The data in the rows is in German. We will have to translate it.\n", " - The columns 'price' and 'odometer' contain non-numeric values. We will clean that\n", " - The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',\n", " 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',\n", " 'odometer', 'monthOfRegistration', 'fuelType', 'brand',\n", " 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',\n", " 'lastSeen'],\n", " dtype='object')\n" ] } ], "source": [ "print(autos.columns)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# rename the columns\n", "autos.rename({\"yearOfRegistration\":\"registration_year\"}, axis=1, inplace=True)\n", "autos.rename({\"registration_year\":\"registration_year\"}, axis=1, inplace=True)\n", "autos.rename({\"notRepairedDamage\":\"unrepaired_damage\"}, axis=1, inplace = True)\n", "autos.rename({\"monthOfRegistration\":\"registration_month\"}, axis=1, inplace=True)\n", "autos.rename({\"dateCreated\":\"ad_created\"}, axis=1, inplace=True)\n", "autos.rename({'dateCrawled':\"date_crawled\"}, axis=1, inplace=True)\n", "autos.rename({\"offerType\":\"offer_type\"}, axis=1, inplace=True)\n", "autos.rename({\"vehicleType\":\"vehicle_type\"}, axis=1, inplace=True)\n", "autos.rename({\"powerPS\":\"power_ps\"}, axis=1, inplace=True)\n", "autos.rename({\"fuelType\":\"fuel_type\"}, axis=1, inplace=True)\n", "autos.rename({\"nrOfPictures\":\"number_of_pictures\"}, axis=1, inplace=True)\n", "autos.rename({\"postalCode\":\"postal_code\"}, axis=1, inplace=True)\n", "autos.rename({\"lastSeen\":\"last_seen\"}, axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_psmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_creatednumber_of_picturespostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offer_type price abtest vehicle_type registration_year \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox power_ps model odometer registration_month fuel_type \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand unrepaired_damage ad_created number_of_pictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postal_code last_seen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We renamed the columns names using snakecase instead of camelcase to make them more readable.\n", "\n", "Now we'll explore the columns to see if we need to clean data." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_psmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_creatednumber_of_picturespostal_codelast_seen
count5000050000500005000050000500004490550000.0000004732050000.000000472425000050000.0000004551850000401715000050000.050000.00000050000
unique482133875422235728NaN2NaN24513NaN740276NaNNaN39481
top2016-03-23 19:38:20Ford_FiestaprivatAngebot$0testlimousineNaNmanuellNaNgolf150,000kmNaNbenzinvolkswagennein2016-04-03 00:00:00NaNNaN2016-04-07 06:17:27
freq378499994999914212575612859NaN36993NaN402432424NaN3010710687352321946NaNNaN8
meanNaNNaNNaNNaNNaNNaNNaN2005.073280NaN116.355920NaNNaN5.723360NaNNaNNaNNaN0.050813.627300NaN
stdNaNNaNNaNNaNNaNNaNNaN105.712813NaN209.216627NaNNaN3.711984NaNNaNNaNNaN0.025779.747957NaN
minNaNNaNNaNNaNNaNNaNNaN1000.000000NaN0.000000NaNNaN0.000000NaNNaNNaNNaN0.01067.000000NaN
25%NaNNaNNaNNaNNaNNaNNaN1999.000000NaN70.000000NaNNaN3.000000NaNNaNNaNNaN0.030451.000000NaN
50%NaNNaNNaNNaNNaNNaNNaN2003.000000NaN105.000000NaNNaN6.000000NaNNaNNaNNaN0.049577.000000NaN
75%NaNNaNNaNNaNNaNNaNNaN2008.000000NaN150.000000NaNNaN9.000000NaNNaNNaNNaN0.071540.000000NaN
maxNaNNaNNaNNaNNaNNaNNaN9999.000000NaN17700.000000NaNNaN12.000000NaNNaNNaNNaN0.099998.000000NaN
\n", "
" ], "text/plain": [ " date_crawled name seller offer_type price abtest \\\n", "count 50000 50000 50000 50000 50000 50000 \n", "unique 48213 38754 2 2 2357 2 \n", "top 2016-03-23 19:38:20 Ford_Fiesta privat Angebot $0 test \n", "freq 3 78 49999 49999 1421 25756 \n", "mean NaN NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN NaN \n", "\n", " vehicle_type registration_year gearbox power_ps model \\\n", "count 44905 50000.000000 47320 50000.000000 47242 \n", "unique 8 NaN 2 NaN 245 \n", "top limousine NaN manuell NaN golf \n", "freq 12859 NaN 36993 NaN 4024 \n", "mean NaN 2005.073280 NaN 116.355920 NaN \n", "std NaN 105.712813 NaN 209.216627 NaN \n", "min NaN 1000.000000 NaN 0.000000 NaN \n", "25% NaN 1999.000000 NaN 70.000000 NaN \n", "50% NaN 2003.000000 NaN 105.000000 NaN \n", "75% NaN 2008.000000 NaN 150.000000 NaN \n", "max NaN 9999.000000 NaN 17700.000000 NaN \n", "\n", " odometer registration_month fuel_type brand unrepaired_damage \\\n", "count 50000 50000.000000 45518 50000 40171 \n", "unique 13 NaN 7 40 2 \n", "top 150,000km NaN benzin volkswagen nein \n", "freq 32424 NaN 30107 10687 35232 \n", "mean NaN 5.723360 NaN NaN NaN \n", "std NaN 3.711984 NaN NaN NaN \n", "min NaN 0.000000 NaN NaN NaN \n", "25% NaN 3.000000 NaN NaN NaN \n", "50% NaN 6.000000 NaN NaN NaN \n", "75% NaN 9.000000 NaN NaN NaN \n", "max NaN 12.000000 NaN NaN NaN \n", "\n", " ad_created number_of_pictures postal_code \\\n", "count 50000 50000.0 50000.000000 \n", "unique 76 NaN NaN \n", "top 2016-04-03 00:00:00 NaN NaN \n", "freq 1946 NaN NaN \n", "mean NaN 0.0 50813.627300 \n", "std NaN 0.0 25779.747957 \n", "min NaN 0.0 1067.000000 \n", "25% NaN 0.0 30451.000000 \n", "50% NaN 0.0 49577.000000 \n", "75% NaN 0.0 71540.000000 \n", "max NaN 0.0 99998.000000 \n", "\n", " last_seen \n", "count 50000 \n", "unique 39481 \n", "top 2016-04-07 06:17:27 \n", "freq 8 \n", "mean NaN \n", "std NaN \n", "min NaN \n", "25% NaN \n", "50% NaN \n", "75% NaN \n", "max NaN " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.describe(include='all')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "privat 49999\n", "gewerblich 1\n", "Name: seller, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"seller\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Angebot 49999\n", "Gesuch 1\n", "Name: offer_type, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"offer_type\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 50000\n", "Name: number_of_pictures, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['number_of_pictures'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have discovered, that:\n", "\n", "- We can see that almost all values in the columns 'seller' and \"offer_type\" are the same. The column 'number_of_pictures' have only one value, which is '0'. Hence we are going to drop them.\n", "\n", "- What is more, the columns 'price' and 'odometer' cointain numeric values stored as text. And the column 'data_crawled' containd time and date. \n", "\n", "- Columns 'vehicle_type', 'gearbox', 'model', 'fuel_type', 'unrepaired_damage' have missing values.\t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll start by converting the columns 'price' and 'odometer' to a numeric type." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "150,000km 32424\n", "125,000km 5170\n", "100,000km 2169\n", "90,000km 1757\n", "80,000km 1436\n", "70,000km 1230\n", "60,000km 1164\n", "50,000km 1027\n", "5,000km 967\n", "40,000km 819\n", "30,000km 789\n", "20,000km 784\n", "10,000km 264\n", "Name: odometer, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"odometer\"].head()\n", "autos[\"odometer\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# # removing non-numeric characters\n", "autos['odometer'] = autos['odometer'].astype(str)\n", "autos[\"odometer\"] = autos['odometer'].str.replace(\"km\",'').str.replace(',','').astype(int)\n", "autos.rename(columns={'odometer':'odometer_km'}, inplace=True)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "autos['price'] = autos['price'].astype(str)\n", "autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(int)\n", "autos.rename(columns={'price':\"price_dollars\"}, inplace=True)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000,\n", " 20000, 60000, 5000, 100000, 40000])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# analysing the 'odometer_km' column\n", "autos['odometer_km'].unique()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 50000.000000\n", "mean 125732.700000\n", "std 40042.211706\n", "min 5000.000000\n", "25% 125000.000000\n", "50% 150000.000000\n", "75% 150000.000000\n", "max 150000.000000\n", "Name: odometer_km, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].describe()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "150000 32424\n", "125000 5170\n", "100000 2169\n", "90000 1757\n", "80000 1436\n", "70000 1230\n", "60000 1164\n", "50000 1027\n", "5000 967\n", "40000 819\n", "30000 789\n", "20000 784\n", "10000 264\n", "Name: odometer_km, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].value_counts()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(2357,)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"price_dollars\"].unique().shape" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 5.000000e+04\n", "mean 9.840044e+03\n", "std 4.811044e+05\n", "min 0.000000e+00\n", "25% 1.100000e+03\n", "50% 2.950000e+03\n", "75% 7.200000e+03\n", "max 1.000000e+08\n", "Name: price_dollars, dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"price_dollars\"].describe()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1421\n", "500 781\n", "1500 734\n", "2500 643\n", "1000 639\n", "Name: price_dollars, dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"price_dollars\"].value_counts().head()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "99999999 1\n", "27322222 1\n", "12345678 3\n", "11111111 2\n", "10000000 1\n", "3890000 1\n", "1300000 1\n", "1234566 1\n", "999999 2\n", "999990 1\n", "Name: price_dollars, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"price_dollars\"].value_counts().sort_index(ascending=False).head(10)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 1421\n", "1 156\n", "2 3\n", "3 1\n", "5 2\n", "8 1\n", "9 1\n", "10 7\n", "11 2\n", "12 3\n", "Name: price_dollars, dtype: int64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"price_dollars\"].value_counts().sort_index(ascending=True).head(10)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "197000 1\n", "198000 1\n", "220000 1\n", "250000 1\n", "259000 1\n", "265000 1\n", "295000 1\n", "299000 1\n", "345000 1\n", "350000 1\n", "999990 1\n", "999999 2\n", "1234566 1\n", "1300000 1\n", "3890000 1\n", "10000000 1\n", "11111111 2\n", "12345678 3\n", "27322222 1\n", "99999999 1\n", "Name: price_dollars, dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price_dollars'].value_counts().sort_index(ascending=True).tail(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see, that the price rises drastically from 350,000 to 999,990. We are going to eliminate the rows where the price equals to '0' or is higher them 350,000 as this price doesn't seem reasonable. However, we'll keep the 1$ price as it may indicate the startinf point for bidding." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#removing outliers\n", "autos = autos[autos['price_dollars'].between(1,350000)]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "151990 1\n", "155000 1\n", "163500 1\n", "163991 1\n", "169000 1\n", "169999 1\n", "175000 1\n", "180000 1\n", "190000 1\n", "194000 1\n", "197000 1\n", "198000 1\n", "220000 1\n", "250000 1\n", "259000 1\n", "265000 1\n", "295000 1\n", "299000 1\n", "345000 1\n", "350000 1\n", "Name: price_dollars, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price_dollars'].value_counts().sort_index(ascending=True).tail(20)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1 156\n", "2 3\n", "3 1\n", "5 2\n", "8 1\n", "9 1\n", "10 7\n", "11 2\n", "12 3\n", "13 2\n", "14 1\n", "15 2\n", "17 3\n", "18 1\n", "20 4\n", "25 5\n", "29 1\n", "30 7\n", "35 1\n", "40 6\n", "Name: price_dollars, dtype: int64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price_dollars'].value_counts().sort_index(ascending=True).head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the rows with outliners are gone and the price range seem reasonable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exploring the date columns\n", "\n", "There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself:\n", "\n", "- `date_crawled`: added by the crawler\n", "- `last_seen`: added by the crawler\n", "- `ad_created`: from the website\n", "- `registration_month`: from the website\n", "- `registration_year`: from the website\n", "\n", "Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively.\n", "\n", "Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawledad_createdlast_seen
02016-03-26 17:47:462016-03-26 00:00:002016-04-06 06:45:54
12016-04-04 13:38:562016-04-04 00:00:002016-04-06 14:45:08
22016-03-26 18:57:242016-03-26 00:00:002016-04-06 20:15:37
32016-03-12 16:58:102016-03-12 00:00:002016-03-15 03:16:28
42016-04-01 14:38:502016-04-01 00:00:002016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled ad_created last_seen\n", "0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54\n", "1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08\n", "2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37\n", "3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28\n", "4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[['date_crawled','ad_created','last_seen']][0:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To understand the date range, we can extract just the date values, generate a distribution, and then sort by the index." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 2016-03-26\n", "1 2016-04-04\n", "2 2016-03-26\n", "3 2016-03-12\n", "4 2016-04-01\n", "Name: date_crawled, dtype: object\n" ] } ], "source": [ "print(autos['date_crawled'].str[:10].head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Exploring date_crawled **" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2016-03-05 0.025327\n", "2016-03-06 0.014043\n", "2016-03-07 0.036014\n", "2016-03-08 0.033296\n", "2016-03-09 0.033090\n", "2016-03-10 0.032184\n", "2016-03-11 0.032575\n", "2016-03-12 0.036920\n", "2016-03-13 0.015670\n", "2016-03-14 0.036549\n", "2016-03-15 0.034284\n", "2016-03-16 0.029610\n", "2016-03-17 0.031628\n", "2016-03-18 0.012911\n", "2016-03-19 0.034778\n", "2016-03-20 0.037887\n", "2016-03-21 0.037373\n", "2016-03-22 0.032987\n", "2016-03-23 0.032225\n", "2016-03-24 0.029342\n", "2016-03-25 0.031607\n", "2016-03-26 0.032204\n", "2016-03-27 0.031092\n", "2016-03-28 0.034860\n", "2016-03-29 0.034099\n", "2016-03-30 0.033687\n", "2016-03-31 0.031834\n", "2016-04-01 0.033687\n", "2016-04-02 0.035478\n", "2016-04-03 0.038608\n", "2016-04-04 0.036487\n", "2016-04-05 0.013096\n", "2016-04-06 0.003171\n", "2016-04-07 0.001400\n", "Name: date_crawled, dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"date_crawled\"].str[:10].value_counts(normalize=True, dropna=False).sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the data was crawled since the 3d of March,2016 to the 7th of April,2016. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Exploring the ad_created column**" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2015-06-11 0.000021\n", "2015-08-10 0.000021\n", "2015-09-09 0.000021\n", "2015-11-10 0.000021\n", "2015-12-05 0.000021\n", "2015-12-30 0.000021\n", "2016-01-03 0.000021\n", "2016-01-07 0.000021\n", "2016-01-10 0.000041\n", "2016-01-13 0.000021\n", "2016-01-14 0.000021\n", "2016-01-16 0.000021\n", "2016-01-22 0.000021\n", "2016-01-27 0.000062\n", "2016-01-29 0.000021\n", "2016-02-01 0.000021\n", "2016-02-02 0.000041\n", "2016-02-05 0.000041\n", "2016-02-07 0.000021\n", "2016-02-08 0.000021\n", "2016-02-09 0.000021\n", "2016-02-11 0.000021\n", "2016-02-12 0.000041\n", "2016-02-14 0.000041\n", "2016-02-16 0.000021\n", "2016-02-17 0.000021\n", "2016-02-18 0.000041\n", "2016-02-19 0.000062\n", "2016-02-20 0.000041\n", "2016-02-21 0.000062\n", " ... \n", "2016-03-09 0.033151\n", "2016-03-10 0.031895\n", "2016-03-11 0.032904\n", "2016-03-12 0.036755\n", "2016-03-13 0.017008\n", "2016-03-14 0.035190\n", "2016-03-15 0.034016\n", "2016-03-16 0.030125\n", "2016-03-17 0.031278\n", "2016-03-18 0.013590\n", "2016-03-19 0.033687\n", "2016-03-20 0.037949\n", "2016-03-21 0.037579\n", "2016-03-22 0.032801\n", "2016-03-23 0.032060\n", "2016-03-24 0.029280\n", "2016-03-25 0.031751\n", "2016-03-26 0.032266\n", "2016-03-27 0.030989\n", "2016-03-28 0.034984\n", "2016-03-29 0.034037\n", "2016-03-30 0.033501\n", "2016-03-31 0.031875\n", "2016-04-01 0.033687\n", "2016-04-02 0.035149\n", "2016-04-03 0.038855\n", "2016-04-04 0.036858\n", "2016-04-05 0.011819\n", "2016-04-06 0.003253\n", "2016-04-07 0.001256\n", "Name: ad_created, Length: 76, dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2016-03-05 0.001071\n", "2016-03-06 0.004324\n", "2016-03-07 0.005395\n", "2016-03-08 0.007413\n", "2016-03-09 0.009595\n", "2016-03-10 0.010666\n", "2016-03-11 0.012375\n", "2016-03-12 0.023783\n", "2016-03-13 0.008895\n", "2016-03-14 0.012602\n", "2016-03-15 0.015876\n", "2016-03-16 0.016452\n", "2016-03-17 0.028086\n", "2016-03-18 0.007351\n", "2016-03-19 0.015834\n", "2016-03-20 0.020653\n", "2016-03-21 0.020632\n", "2016-03-22 0.021373\n", "2016-03-23 0.018532\n", "2016-03-24 0.019767\n", "2016-03-25 0.019211\n", "2016-03-26 0.016802\n", "2016-03-27 0.015649\n", "2016-03-28 0.020859\n", "2016-03-29 0.022341\n", "2016-03-30 0.024771\n", "2016-03-31 0.023783\n", "2016-04-01 0.022794\n", "2016-04-02 0.024915\n", "2016-04-03 0.025203\n", "2016-04-04 0.024483\n", "2016-04-05 0.124761\n", "2016-04-06 0.221806\n", "2016-04-07 0.131947\n", "Name: last_seen, dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After examining these three columns, we can say that this website ads work well, car might has been sold because the ads are no longer on site." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "count 48565.000000\n", "mean 2004.755421\n", "std 88.643887\n", "min 1000.000000\n", "25% 1999.000000\n", "50% 2004.000000\n", "75% 2008.000000\n", "max 9999.000000\n", "Name: registration_year, dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['registration_year'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This column clearly has outliers, because th registration year cannot be higher than 2016 and less than 1950. We will only count the rows with registration_year values between 1950 and 2016, because we don't know for sure whether the cars registered earlier are antique or there is just a mistake. What is more, we the date was crowled in 2016, so the year of registration cannot be higher." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1000 1\n", "1001 1\n", "1111 1\n", "1800 2\n", "1910 5\n", "1927 1\n", "1929 1\n", "1931 1\n", "1934 2\n", "1937 4\n", "1938 1\n", "1939 1\n", "1941 2\n", "1943 1\n", "1948 1\n", "1950 3\n", "1951 2\n", "1952 1\n", "1953 1\n", "1954 2\n", "Name: registration_year, dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['registration_year'].value_counts().sort_index(ascending=True).head(20)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "9999 3\n", "9000 1\n", "8888 1\n", "6200 1\n", "5911 1\n", "5000 4\n", "4800 1\n", "4500 1\n", "4100 1\n", "2800 1\n", "2019 2\n", "2018 470\n", "2017 1392\n", "2016 1220\n", "2015 392\n", "2014 663\n", "2013 803\n", "2012 1310\n", "2011 1623\n", "2010 1589\n", "Name: registration_year, dtype: int64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['registration_year'].value_counts().sort_index(ascending=False).head(20)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#removing outliers\n", "autos = autos[autos['registration_year'].between(1950,2016)]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2000 0.067637\n", "2005 0.062922\n", "1999 0.062086\n", "2004 0.057928\n", "2003 0.057843\n", "2006 0.057221\n", "2001 0.056493\n", "2002 0.053278\n", "1998 0.050642\n", "2007 0.048799\n", "2008 0.047470\n", "2009 0.044684\n", "1997 0.041812\n", "2011 0.034783\n", "2010 0.034054\n", "1996 0.029425\n", "2012 0.028075\n", "1995 0.026296\n", "2016 0.026146\n", "2013 0.017209\n", "2014 0.014209\n", "1994 0.013480\n", "1993 0.009108\n", "2015 0.008401\n", "1992 0.007930\n", "1990 0.007437\n", "1991 0.007265\n", "1989 0.003729\n", "1988 0.002893\n", "1985 0.002036\n", " ... \n", "1970 0.000814\n", "1979 0.000729\n", "1972 0.000707\n", "1981 0.000600\n", "1968 0.000557\n", "1967 0.000557\n", "1971 0.000557\n", "1974 0.000514\n", "1973 0.000493\n", "1960 0.000493\n", "1977 0.000471\n", "1966 0.000471\n", "1976 0.000450\n", "1969 0.000407\n", "1975 0.000386\n", "1965 0.000364\n", "1964 0.000257\n", "1963 0.000171\n", "1959 0.000129\n", "1961 0.000129\n", "1958 0.000086\n", "1956 0.000086\n", "1962 0.000086\n", "1950 0.000064\n", "1957 0.000043\n", "1955 0.000043\n", "1954 0.000043\n", "1951 0.000043\n", "1953 0.000021\n", "1952 0.000021\n", "Name: registration_year, Length: 67, dtype: float64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['registration_year'].value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have removed the rows with the outliers. Now we'll have a closer look at the 'brand' columns." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',\n", " 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',\n", " 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',\n", " 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',\n", " 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',\n", " 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',\n", " 'rover', 'daihatsu', 'lancia'], dtype=object)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['brand'].unique()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
sonstige_autos12336.588889
mini10613.459658
audi9336.687454
mercedes_benz8625.241502
bmw8332.203855
skoda6368.000000
volkswagen5398.709056
hyundai5365.254274
toyota5167.091062
volvo4946.501171
nissan4743.402525
seat4397.230950
mazda4112.596615
citroen3779.139144
ford3737.275767
smart3580.223903
peugeot3094.017229
opel2976.247260
fiat2813.748538
renault2475.717273
\n", "
" ], "text/plain": [ " 0\n", "sonstige_autos 12336.588889\n", "mini 10613.459658\n", "audi 9336.687454\n", "mercedes_benz 8625.241502\n", "bmw 8332.203855\n", "skoda 6368.000000\n", "volkswagen 5398.709056\n", "hyundai 5365.254274\n", "toyota 5167.091062\n", "volvo 4946.501171\n", "nissan 4743.402525\n", "seat 4397.230950\n", "mazda 4112.596615\n", "citroen 3779.139144\n", "ford 3737.275767\n", "smart 3580.223903\n", "peugeot 3094.017229\n", "opel 2976.247260\n", "fiat 2813.748538\n", "renault 2475.717273" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_autos = autos[\"brand\"].value_counts().sort_values(ascending=False).head(20).index\n", "\n", "\n", "top_brands = {}\n", "\n", "for c in top_autos:\n", " selected_rows = autos[autos['brand'] == c]\n", " mean_price = selected_rows['price_dollars'].mean() \n", " top_brands[c] = mean_price\n", " \n", "top_brands_df = pd.DataFrame.from_dict(top_brands, orient='index')\n", "top_brands_df.sort_values(by=0, ascending=False)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The brands with the **highest** mean price:\n", "\n", "- Sonstige Autos \\(\\$12336)\n", "- Mini (\\$10613)\n", "- Audi (\\$9336)\n", "\n", "The brands with the **lowest** mean price:\n", "\n", "- Opel (\\$2976)\n", "- Fiat (\\$2814)\n", "- Renault (\\$2476)\n", " \n", "Now we'll calculate the mean mileage for these brands and for a couple of in-between ones." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mean_pricemean_mileage
sonstige_autos12336.58888990655.555556
mini10613.45965888105.134474
audi9336.687454129157.386785
mercedes_benz8625.241502130832.037325
bmw8332.203855132597.352025
skoda6368.000000110848.563969
volkswagen5398.709056128710.069973
hyundai5365.254274106442.307692
toyota5167.091062115944.350759
volvo4946.501171138067.915691
nissan4743.402525118330.995792
seat4397.230950121131.301290
mazda4112.596615124464.033850
citroen3779.139144119694.189602
ford3737.275767124374.233129
smart3580.22390399326.777610
peugeot3094.017229127153.625269
opel2976.247260129336.521219
fiat2813.748538117121.971596
renault2475.717273128127.272727
\n", "
" ], "text/plain": [ " mean_price mean_mileage\n", "sonstige_autos 12336.588889 90655.555556\n", "mini 10613.459658 88105.134474\n", "audi 9336.687454 129157.386785\n", "mercedes_benz 8625.241502 130832.037325\n", "bmw 8332.203855 132597.352025\n", "skoda 6368.000000 110848.563969\n", "volkswagen 5398.709056 128710.069973\n", "hyundai 5365.254274 106442.307692\n", "toyota 5167.091062 115944.350759\n", "volvo 4946.501171 138067.915691\n", "nissan 4743.402525 118330.995792\n", "seat 4397.230950 121131.301290\n", "mazda 4112.596615 124464.033850\n", "citroen 3779.139144 119694.189602\n", "ford 3737.275767 124374.233129\n", "smart 3580.223903 99326.777610\n", "peugeot 3094.017229 127153.625269\n", "opel 2976.247260 129336.521219\n", "fiat 2813.748538 117121.971596\n", "renault 2475.717273 128127.272727" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_mileage = {}\n", "\n", "for c in top_autos:\n", " selected_rows = autos[autos['brand'] == c]\n", " mileage = selected_rows['odometer_km'].mean()\n", " mean_mileage[c] = mileage\n", " \n", "tb_series = pd.Series(top_brands)\n", "mm_series = pd.Series(mean_mileage)\n", "\n", "mean_data = pd.DataFrame(tb_series, columns=[\"mean_price\"])\n", "mean_data['mean_mileage'] = mm_series\n", "\n", "mean_data.sort_values(by=[\"mean_price\"], ascending=False)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Next Steps ###\n", "\n", "Data cleaning next steps:\n", "* Identify categorical data that uses german words, translate them and map the values to their english counterparts\n", "* Convert the dates to be uniform numeric data, so \"2016-03-21\" becomes the integer 20160321.\n", "* See if there are particular keywords in the name column that you can extract as new columns\n", "\n", "Analysis next steps:\n", "* Find the most common brand/model combinations\n", "* Split the odometer_km into groups, and use aggregation to see if average prices follows any patterns based on the milage.\n", "* How much cheaper are cars with damage than their non-damaged counterparts?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning Continued ###\n", "\n", "### Translating into English ###\n", "\n", "Some data in the dataset is given in German. We will start by finding out which columns cintain Geraman data." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typeprice_dollarsabtestvehicle_typeregistration_yeargearboxpower_psmodelodometer_kmregistration_monthfuel_typebrandunrepaired_damagead_creatednumber_of_picturespostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot5000controlbus2004manuell158andere1500003lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot8500controllimousine1997automatik2867er1500006benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
\n", "
" ], "text/plain": [ " date_crawled name seller \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat \n", "\n", " offer_type price_dollars abtest vehicle_type registration_year \\\n", "0 Angebot 5000 control bus 2004 \n", "1 Angebot 8500 control limousine 1997 \n", "\n", " gearbox power_ps model odometer_km registration_month fuel_type \\\n", "0 manuell 158 andere 150000 3 lpg \n", "1 automatik 286 7er 150000 6 benzin \n", "\n", " brand unrepaired_damage ad_created number_of_pictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "\n", " postal_code last_seen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 " ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll translate the next columns:\n", "\n", "* seller\n", "* offer_type\n", "* vehicle_type\n", "* gearbox\n", "* model\n", "* fuel_type\n", "* unrepaired_damage\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "privat 46660\n", "gewerblich 1\n", "Name: seller, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['seller'].value_counts()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "private 46660\n", "commercial 1\n", "Name: seller, dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['seller'] = autos['seller'].str.replace('privat', \"private\").str.replace('gewerblich','commercial')\n", "autos['seller'].value_counts()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Angebot 46661\n", "Name: offer_type, dtype: int64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['offer_type'].value_counts()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "offer 46661\n", "Name: offer_type, dtype: int64" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['offer_type'] = autos['offer_type'].str.replace('Angebot','offer')\n", "autos['offer_type'].value_counts()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "limousine 12596\n", "kleinwagen 10584\n", "kombi 8929\n", "bus 4031\n", "cabrio 3009\n", "coupe 2461\n", "suv 1964\n", "andere 387\n", "Name: vehicle_type, dtype: int64" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['vehicle_type'].value_counts()" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "sedan 12596\n", "small car 10584\n", "station wagon 8929\n", "bus 4031\n", "convertible 3009\n", "coupe 2461\n", "suv 1964\n", "other 387\n", "Name: vehicle_type, dtype: int64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['vehicle_type'] = autos['vehicle_type'].str.replace(\n", " 'limousine','sedan').str.replace('kleinwagen','small car').str.replace('kombi', 'station wagon').str.replace(\n", " 'cabrio','convertible').str.replace('andere','other')\n", "autos['vehicle_type'].value_counts()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "manuell 34703\n", "automatik 9856\n", "Name: gearbox, dtype: int64" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['gearbox'].value_counts()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "manual 34703\n", "automatic 9856\n", "Name: gearbox, dtype: int64" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['gearbox'] = autos['gearbox'].str.replace('manuell','manual').str.replace('automatik','automatic')\n", "autos['gearbox'].value_counts()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "golf 3707\n", "andere 3365\n", "3er 2615\n", "polo 1609\n", "corsa 1591\n", "passat 1349\n", "astra 1348\n", "a4 1231\n", "c_klasse 1136\n", "5er 1132\n", "e_klasse 958\n", "a3 825\n", "a6 797\n", "focus 762\n", "fiesta 722\n", "transporter 674\n", "twingo 615\n", "2_reihe 600\n", "fortwo 550\n", "vectra 544\n", "a_klasse 539\n", "1er 521\n", "3_reihe 486\n", "mondeo 479\n", "clio 473\n", "touran 433\n", "punto 415\n", "zafira 394\n", "ka 349\n", "megane 335\n", " ... \n", "sirion 11\n", "lanos 10\n", "300c 10\n", "terios 10\n", "mii 10\n", "gl 10\n", "lybra 10\n", "move 9\n", "r19 9\n", "range_rover 9\n", "90 8\n", "nubira 8\n", "amarok 6\n", "crossfire 6\n", "exeo 6\n", "range_rover_evoque 5\n", "delta 5\n", "9000 5\n", "lodgy 5\n", "145 4\n", "croma 4\n", "materia 4\n", "charade 3\n", "samara 3\n", "v60 3\n", "kappa 2\n", "200 1\n", "rangerover 1\n", "b_max 1\n", "i3 1\n", "Name: model, Length: 244, dtype: int64" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['model'].value_counts()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": true }, "outputs": [], "source": [ "autos['model'] = autos['model'].str.replace('andere', 'other')" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "benzin 28527\n", "diesel 14032\n", "lpg 649\n", "cng 71\n", "hybrid 37\n", "elektro 19\n", "andere 15\n", "Name: fuel_type, dtype: int64" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['fuel_type'].value_counts()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "gasoline 28527\n", "diesel 14032\n", "lpg 649\n", "cng 71\n", "hybrid 37\n", "electric 19\n", "other 15\n", "Name: fuel_type, dtype: int64" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['fuel_type'] = autos['fuel_type'].str.replace('benzin','gasoline').str.replace(\n", " 'elektro','electric').str.replace('andere','other')\n", "autos['fuel_type'].value_counts()" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "nein 33828\n", "ja 4534\n", "Name: unrepaired_damage, dtype: int64" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['unrepaired_damage'].value_counts()" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "no 33828\n", "yes 4534\n", "Name: unrepaired_damage, dtype: int64" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('nein', 'no').str.replace('ja', 'yes')\n", "autos['unrepaired_damage'].value_counts()" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typeprice_dollarsabtestvehicle_typeregistration_yeargearboxpower_psmodelodometer_kmregistration_monthfuel_typebrandunrepaired_damagead_creatednumber_of_picturespostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivateoffer5000controlbus2004manual158other1500003lpgpeugeotno2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optikprivateoffer8500controlsedan1997automatic2867er1500006gasolinebmwno2016-04-04 00:00:000710342016-04-06 14:45:08
\n", "
" ], "text/plain": [ " date_crawled name seller \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD private \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik private \n", "\n", " offer_type price_dollars abtest vehicle_type registration_year \\\n", "0 offer 5000 control bus 2004 \n", "1 offer 8500 control sedan 1997 \n", "\n", " gearbox power_ps model odometer_km registration_month fuel_type \\\n", "0 manual 158 other 150000 3 lpg \n", "1 automatic 286 7er 150000 6 gasoline \n", "\n", " brand unrepaired_damage ad_created number_of_pictures \\\n", "0 peugeot no 2016-03-26 00:00:00 0 \n", "1 bmw no 2016-04-04 00:00:00 0 \n", "\n", " postal_code last_seen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have successfully translated all German data into English. \n", "Now we will convert the dates to the uniform numeric data.\n", "\n", "### Converting the dates ###\n", "\n", "The columns containing dates are:\n", "\n", "* date_crawled\n", "* ad_created\n", "* last_seen\n", "\n", "We'll convert the dates to be uniform numeric data, so \"2016-03-21\" becomes the integer 20160321." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "20160403.0 1810\n", "20160320.0 1774\n", "20160321.0 1741\n", "20160312.0 1718\n", "20160404.0 1708\n", "Name: date_crawled, dtype: int64" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['date_crawled'] = autos['date_crawled'].str[:10]\n", "autos['date_crawled'] = autos['date_crawled'].str.replace('-','')\n", "autos['date_crawled'] = autos['date_crawled'].astype(float)\n", "\n", "autos['date_crawled'].value_counts().head()\n" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "20160403 1821\n", "20160320 1776\n", "20160321 1751\n", "20160404 1724\n", "20160312 1710\n", "Name: ad_created, dtype: int64" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['ad_created'] = autos['ad_created'].str[:10]\n", "autos['ad_created'] = autos['ad_created'].str.replace('-','')\n", "autos['ad_created'] = autos['ad_created'].astype(int)\n", "\n", "autos['ad_created'].value_counts().head()" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "20160406 10420\n", "20160407 6194\n", "20160405 5852\n", "20160317 1311\n", "20160403 1173\n", "Name: last_seen, dtype: int64" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['last_seen'] = autos['last_seen'].str[:10]\n", "autos['last_seen'] = autos['last_seen'].str.replace('-','')\n", "autos['last_seen'] = autos['last_seen'].astype(int)\n", "\n", "autos['last_seen'].value_counts().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now when we converted text and numeric values, the dataset is clen and looks like this:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typeprice_dollarsabtestvehicle_typeregistration_yeargearboxpower_psmodelodometer_kmregistration_monthfuel_typebrandunrepaired_damagead_creatednumber_of_picturespostal_codelast_seen
020160326.0Peugeot_807_160_NAVTECH_ON_BOARDprivateoffer5000controlbus2004manual158other1500003lpgpeugeotno2016032607958820160406
120160404.0BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optikprivateoffer8500controlsedan1997automatic2867er1500006gasolinebmwno2016040407103420160406
220160326.0Volkswagen_Golf_1.6_Unitedprivateoffer8990testsedan2009manual102golf700007gasolinevolkswagenno2016032603539420160406
320160312.0Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privateoffer4350controlsmall car2007automatic71fortwo700006gasolinesmartno2016031203372920160315
420160401.0Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privateoffer1350teststation wagon2003manual0focus1500007gasolinefordno2016040103921820160401
\n", "
" ], "text/plain": [ " date_crawled name seller \\\n", "0 20160326.0 Peugeot_807_160_NAVTECH_ON_BOARD private \n", "1 20160404.0 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik private \n", "2 20160326.0 Volkswagen_Golf_1.6_United private \n", "3 20160312.0 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... private \n", "4 20160401.0 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... private \n", "\n", " offer_type price_dollars abtest vehicle_type registration_year \\\n", "0 offer 5000 control bus 2004 \n", "1 offer 8500 control sedan 1997 \n", "2 offer 8990 test sedan 2009 \n", "3 offer 4350 control small car 2007 \n", "4 offer 1350 test station wagon 2003 \n", "\n", " gearbox power_ps model odometer_km registration_month fuel_type \\\n", "0 manual 158 other 150000 3 lpg \n", "1 automatic 286 7er 150000 6 gasoline \n", "2 manual 102 golf 70000 7 gasoline \n", "3 automatic 71 fortwo 70000 6 gasoline \n", "4 manual 0 focus 150000 7 gasoline \n", "\n", " brand unrepaired_damage ad_created number_of_pictures postal_code \\\n", "0 peugeot no 20160326 0 79588 \n", "1 bmw no 20160404 0 71034 \n", "2 volkswagen no 20160326 0 35394 \n", "3 smart no 20160312 0 33729 \n", "4 ford no 20160401 0 39218 \n", "\n", " last_seen \n", "0 20160406 \n", "1 20160406 \n", "2 20160406 \n", "3 20160315 \n", "4 20160401 " ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis Continued ##\n", "\n", "First we will find the most common brand/model combinations." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brand_counts = autos['brand'].value_counts(normalize=True)*100\n", "top_brand = brand_counts[brand_counts > 5].index\n", " \n", "top_brand\n", "\n", " " ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'audi': 'a4',\n", " 'bmw': '3er',\n", " 'ford': 'focus',\n", " 'mercedes_benz': 'c_klasse',\n", " 'opel': 'corsa',\n", " 'volkswagen': 'golf'}" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brand_model = {}\n", "\n", "for c in top_brand:\n", " a_brand = autos[autos['brand'] == c]\n", " a_model = a_brand['model'].describe()['top']\n", " brand_model[c] = a_model\n", " \n", "brand_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll see if average prices follows any patterns based on the milage." ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "IntervalIndex([(121000.0, 150000.0], (63000.0, 92000.0], (34000.0, 63000.0], (4854.999, 34000.0], (92000.0, 121000.0]]\n", " closed='right',\n", " dtype='interval[float64]')" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# we split the 'odometer_km' values in five bins (ranges)\n", "ranges = autos['odometer_km'].value_counts(bins=5).index\n", "ranges" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{(4854.999, 34000.0): 15146,\n", " (34000.0, 63000.0): 13712,\n", " (63000.0, 92000.0): 9564,\n", " (92000.0, 121000.0): 8120,\n", " (121000.0, 150000.0): 4107}" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ranges_tuple = ranges.to_tuples()\n", "\n", "km_price = {}\n", "\n", "#we calculated the mean price for each odometer bin\n", "for r in ranges_tuple:\n", " by_km = autos[autos['odometer_km'].between(r[0], r[1])]\n", " price = by_km['price_dollars'].mean()\n", " km_price[r] = int(price)\n", " \n", "km_price" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can conclude, that the the higher the mileage is, the cheaper the price gets.\n", "\n", "Now we'll check How much cheaper cars are with damage than their non-damaged counterparts." ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{nan: nan, 'no': 7162.054984036892, 'yes': 2222.3098808998675}" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "damage = autos['unrepaired_damage'].unique()\n", "\n", "damaged_price = {}\n", "\n", "for d in damage:\n", " by_d = autos[autos['unrepaired_damage'] == d]\n", " price = by_d['price_dollars'].mean()\n", " damaged_price[d] = price \n", "\n", "damaged_price" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the price for cars with unrepaired damage is 30% lower than for the undamaged cars." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 2 }