{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# eBay Used Car Sales Data Analysis\n", "\n", "In this project I will explore the vast used car data spanning from June 2015 to April 2016, available at [eBay Kleinanzeigen](https://www.kaggle.com/orgesleka/used-cars-database/data), to answer the following questions:\n", "\n", "### Which car brands are the most expensive listings on average?\n", "\n", "### What is the average milage for those brands?\n", "\n", "### What does this say about our assumptions about this data?\n", "\n", "### Does this data provide enough information to make an educated car purchase on eBay? \n", "\n", "\n", "To accomplish this I will do the following:\n", "\n", "1. Explore the Data\n", "2. Clean the Data\n", "3. Analyze the Data\n", "4. Make Conclusions from my analysis\n", "\n", "### 1. Explore the Data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Data columns (total 20 columns):\n", "dateCrawled 50000 non-null object\n", "name 50000 non-null object\n", "seller 50000 non-null object\n", "offerType 50000 non-null object\n", "price 50000 non-null object\n", "abtest 50000 non-null object\n", "vehicleType 44905 non-null object\n", "yearOfRegistration 50000 non-null int64\n", "gearbox 47320 non-null object\n", "powerPS 50000 non-null int64\n", "model 47242 non-null object\n", "odometer 50000 non-null object\n", "monthOfRegistration 50000 non-null int64\n", "fuelType 45518 non-null object\n", "brand 50000 non-null object\n", "notRepairedDamage 40171 non-null object\n", "dateCreated 50000 non-null object\n", "nrOfPictures 50000 non-null int64\n", "postalCode 50000 non-null int64\n", "lastSeen 50000 non-null object\n", "dtypes: int64(5), object(15)\n", "memory usage: 7.6+ MB\n", "None\n", " dateCrawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offerType price abtest vehicleType yearOfRegistration \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox powerPS model odometer monthOfRegistration fuelType \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand notRepairedDamage dateCreated nrOfPictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postalCode lastSeen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 \n", " dateCrawled name \\\n", "49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon \n", "49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... \n", "49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge \n", "49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition \n", "49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V \n", "\n", " seller offerType price abtest vehicleType yearOfRegistration \\\n", "49995 privat Angebot $24,900 control limousine 2011 \n", "49996 privat Angebot $1,980 control cabrio 1996 \n", "49997 privat Angebot $13,200 test cabrio 2014 \n", "49998 privat Angebot $22,900 control kombi 2013 \n", "49999 privat Angebot $1,250 control limousine 1996 \n", "\n", " gearbox powerPS model odometer monthOfRegistration fuelType \\\n", "49995 automatik 239 q5 100,000km 1 diesel \n", "49996 manuell 75 astra 150,000km 5 benzin \n", "49997 automatik 69 500 5,000km 11 benzin \n", "49998 manuell 150 a3 40,000km 11 diesel \n", "49999 manuell 101 vectra 150,000km 1 benzin \n", "\n", " brand notRepairedDamage dateCreated nrOfPictures postalCode \\\n", "49995 audi nein 2016-03-27 00:00:00 0 82131 \n", "49996 opel nein 2016-03-28 00:00:00 0 44807 \n", "49997 fiat nein 2016-04-02 00:00:00 0 73430 \n", "49998 audi nein 2016-03-08 00:00:00 0 35683 \n", "49999 opel nein 2016-03-13 00:00:00 0 45897 \n", "\n", " lastSeen \n", "49995 2016-04-01 13:47:40 \n", "49996 2016-04-02 14:18:02 \n", "49997 2016-04-04 11:47:27 \n", "49998 2016-04-05 16:45:07 \n", "49999 2016-04-06 21:18:48 \n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "autos = pd.read_csv('autos.csv', encoding='Latin-1')\n", "\n", "print(autos.info())\n", "print(autos.head())\n", "print(autos.tail())" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yearOfRegistrationpowerPSmonthOfRegistrationnrOfPicturespostalCode
count50000.00000050000.00000050000.00000050000.050000.000000
mean2005.073280116.3559205.7233600.050813.627300
std105.712813209.2166273.7119840.025779.747957
min1000.0000000.0000000.0000000.01067.000000
25%1999.00000070.0000003.0000000.030451.000000
50%2003.000000105.0000006.0000000.049577.000000
75%2008.000000150.0000009.0000000.071540.000000
max9999.00000017700.00000012.0000000.099998.000000
\n", "
" ], "text/plain": [ " yearOfRegistration powerPS monthOfRegistration nrOfPictures \\\n", "count 50000.000000 50000.000000 50000.000000 50000.0 \n", "mean 2005.073280 116.355920 5.723360 0.0 \n", "std 105.712813 209.216627 3.711984 0.0 \n", "min 1000.000000 0.000000 0.000000 0.0 \n", "25% 1999.000000 70.000000 3.000000 0.0 \n", "50% 2003.000000 105.000000 6.000000 0.0 \n", "75% 2008.000000 150.000000 9.000000 0.0 \n", "max 9999.000000 17700.000000 12.000000 0.0 \n", "\n", " postalCode \n", "count 50000.000000 \n", "mean 50813.627300 \n", "std 25779.747957 \n", "min 1067.000000 \n", "25% 30451.000000 \n", "50% 49577.000000 \n", "75% 71540.000000 \n", "max 99998.000000 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this brief exploration of the data I can see the following: \n", "\n", "\n", "- The dataset contains 20 columns, most of which are strings\n", "- Some columns have null values, but none have more than ~20% null values (`notRepairedDamage` has 40171 non-null)\n", "- The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores\n", "\n", "### 2. Clean the Data\n", "\n", "Below I will do the following: \n", "\n", "1. convert the column names from camelcase to snakecase \n", "2. reword some of the column names based on the data dictionary to be more descriptive" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',\n", " 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',\n", " 'odometer', 'monthOfRegistration', 'fuelType', 'brand',\n", " 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',\n", " 'lastSeen'],\n", " dtype='object')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.columns" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_psmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_creatednr_of_picturespostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offer_type price abtest vehicle_type registration_year \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox power_ps model odometer registration_month fuel_type \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand unrepaired_damage ad_created nr_of_pictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postal_code last_seen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos_copy = autos.copy()\n", "columns = autos_copy.columns\n", "\n", "mapping_columns = {\n", " 'dateCrawled':'date_crawled',\n", " 'name':'name',\n", " 'seller':'seller',\n", " 'offerType':'offer_type',\n", " 'price':'price',\n", " 'abtest':'abtest',\n", " 'vehicleType':'vehicle_type',\n", " 'yearOfRegistration':'registration_year',\n", " 'gearbox':'gearbox',\n", " 'powerPS':'power_ps',\n", " 'model':'model',\n", " 'odometer':'odometer',\n", " 'monthOfRegistration':'registration_month',\n", " 'fuelType':'fuel_type',\n", " 'brand':'brand',\n", " 'notRepairedDamage':'unrepaired_damage',\n", " 'dateCreated':'ad_created',\n", " 'nrOfPictures':'nr_of_pictures',\n", " 'postalCode':'postal_code',\n", " 'lastSeen':'last_seen'\n", "}\n", "\n", "autos_copy.columns = pd.Series(columns).map(mapping_columns)\n", "autos_copy.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_psmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_creatednr_of_picturespostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offer_type price abtest vehicle_type registration_year \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox power_ps model odometer registration_month fuel_type \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand unrepaired_damage ad_created nr_of_pictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postal_code last_seen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.columns = autos_copy.columns\n", "autos.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_psmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_creatednr_of_picturespostal_codelast_seen
count5000050000500005000050000500004490550000.0000004732050000.000000472425000050000.0000004551850000401715000050000.050000.00000050000
unique482133875422235728NaN2NaN24513NaN740276NaNNaN39481
top2016-03-09 11:54:38Ford_FiestaprivatAngebot$0testlimousineNaNmanuellNaNgolf150,000kmNaNbenzinvolkswagennein2016-04-03 00:00:00NaNNaN2016-04-07 06:17:27
freq378499994999914212575612859NaN36993NaN402432424NaN3010710687352321946NaNNaN8
meanNaNNaNNaNNaNNaNNaNNaN2005.073280NaN116.355920NaNNaN5.723360NaNNaNNaNNaN0.050813.627300NaN
stdNaNNaNNaNNaNNaNNaNNaN105.712813NaN209.216627NaNNaN3.711984NaNNaNNaNNaN0.025779.747957NaN
minNaNNaNNaNNaNNaNNaNNaN1000.000000NaN0.000000NaNNaN0.000000NaNNaNNaNNaN0.01067.000000NaN
25%NaNNaNNaNNaNNaNNaNNaN1999.000000NaN70.000000NaNNaN3.000000NaNNaNNaNNaN0.030451.000000NaN
50%NaNNaNNaNNaNNaNNaNNaN2003.000000NaN105.000000NaNNaN6.000000NaNNaNNaNNaN0.049577.000000NaN
75%NaNNaNNaNNaNNaNNaNNaN2008.000000NaN150.000000NaNNaN9.000000NaNNaNNaNNaN0.071540.000000NaN
maxNaNNaNNaNNaNNaNNaNNaN9999.000000NaN17700.000000NaNNaN12.000000NaNNaNNaNNaN0.099998.000000NaN
\n", "
" ], "text/plain": [ " date_crawled name seller offer_type price abtest \\\n", "count 50000 50000 50000 50000 50000 50000 \n", "unique 48213 38754 2 2 2357 2 \n", "top 2016-03-09 11:54:38 Ford_Fiesta privat Angebot $0 test \n", "freq 3 78 49999 49999 1421 25756 \n", "mean NaN NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN NaN \n", "\n", " vehicle_type registration_year gearbox power_ps model \\\n", "count 44905 50000.000000 47320 50000.000000 47242 \n", "unique 8 NaN 2 NaN 245 \n", "top limousine NaN manuell NaN golf \n", "freq 12859 NaN 36993 NaN 4024 \n", "mean NaN 2005.073280 NaN 116.355920 NaN \n", "std NaN 105.712813 NaN 209.216627 NaN \n", "min NaN 1000.000000 NaN 0.000000 NaN \n", "25% NaN 1999.000000 NaN 70.000000 NaN \n", "50% NaN 2003.000000 NaN 105.000000 NaN \n", "75% NaN 2008.000000 NaN 150.000000 NaN \n", "max NaN 9999.000000 NaN 17700.000000 NaN \n", "\n", " odometer registration_month fuel_type brand unrepaired_damage \\\n", "count 50000 50000.000000 45518 50000 40171 \n", "unique 13 NaN 7 40 2 \n", "top 150,000km NaN benzin volkswagen nein \n", "freq 32424 NaN 30107 10687 35232 \n", "mean NaN 5.723360 NaN NaN NaN \n", "std NaN 3.711984 NaN NaN NaN \n", "min NaN 0.000000 NaN NaN NaN \n", "25% NaN 3.000000 NaN NaN NaN \n", "50% NaN 6.000000 NaN NaN NaN \n", "75% NaN 9.000000 NaN NaN NaN \n", "max NaN 12.000000 NaN NaN NaN \n", "\n", " ad_created nr_of_pictures postal_code last_seen \n", "count 50000 50000.0 50000.000000 50000 \n", "unique 76 NaN NaN 39481 \n", "top 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27 \n", "freq 1946 NaN NaN 8 \n", "mean NaN 0.0 50813.627300 NaN \n", "std NaN 0.0 25779.747957 NaN \n", "min NaN 0.0 1067.000000 NaN \n", "25% NaN 0.0 30451.000000 NaN \n", "50% NaN 0.0 49577.000000 NaN \n", "75% NaN 0.0 71540.000000 NaN \n", "max NaN 0.0 99998.000000 NaN " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.describe(include='all')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Angebot 49999\n", "Gesuch 1\n", "Name: offer_type, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['offer_type'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Clean the Price Data\n", "\n", "From my previous exploration of the `price` column as described by the `autos.info()` output, I can see that the data is expressed as strings. \n", "\n", "I will further explore the `price` data looking for the following:\n", "\n", "- patterns, outliers, unrealistic values or other discrepancies in the data\n", "\n", "I will clean that data by doing the following:\n", "\n", "- renaming the column, and/or dropping values as `nan`, if necessary\n", "\n", "- It will also be important to determine if the data uses one unit of measure to convert the values into integers" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 50000\n", "unique 2357\n", "top $0\n", "freq 1421\n", "Name: price, dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].describe()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "object\n", "['$5,000' '$8,500' '$8,990' ... '$385' '$22,200' '$16,995']\n", "['$5,000' '$8,500' '$8,990' ... '$385' '$22,200' '$16,995']\n" ] } ], "source": [ "print(autos['price'].dtype)\n", "print(autos['price'].unique())\n", "\n", "unique_price = autos['price'].unique()\n", "print(unique_price)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['5000' '8500' '8990' ... '385' '22200' '16995']\n" ] } ], "source": [ "autos['price'] = autos[\"price\"].str.replace('$','')\n", "autos['price'] = autos[\"price\"].str.replace(',','')\n", "unique_price = autos[\"price\"].unique()\n", "print(unique_price)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 5000 8500 8990 ... 385 22200 16995]\n" ] } ], "source": [ "autos['price'] = autos['price'].astype(int)\n", "dtypes = autos.dtypes\n", "price_unique = autos['price'].unique()\n", "print(price_unique)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2357,)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].unique().shape" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 5.000000e+04\n", "mean 9.840044e+03\n", "std 4.811044e+05\n", "min 0.000000e+00\n", "25% 1.100000e+03\n", "50% 2.950000e+03\n", "75% 7.200000e+03\n", "max 1.000000e+08\n", "Name: price, dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].describe()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1421\n", "1 156\n", "2 3\n", "3 1\n", "5 2\n", "8 1\n", "9 1\n", "10 7\n", "11 2\n", "12 3\n", "Name: price, dtype: int64\n", "\n", "\n", "999990 1\n", "999999 2\n", "1234566 1\n", "1300000 1\n", "3890000 1\n", "10000000 1\n", "11111111 2\n", "12345678 3\n", "27322222 1\n", "99999999 1\n", "Name: price, dtype: int64\n" ] } ], "source": [ "print(autos[\"price\"].value_counts().sort_index().head(10))\n", "print(\"\\n\")\n", "print(autos[\"price\"].value_counts().sort_index().tail(10))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "price_bool = (autos[\"price\"] < 500 ) | (autos[\"price\"] > 250000)\n", "autos.loc[price_bool,\"price\"] = np.nan" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2203,)\n" ] }, { "data": { "text/plain": [ "count 45091.000000\n", "mean 6281.293074\n", "std 8601.170170\n", "min 500.000000\n", "25% 1500.000000\n", "50% 3500.000000\n", "75% 7900.000000\n", "max 250000.000000\n", "Name: price, dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(autos[\"price\"].unique().shape)\n", "autos[\"price\"].describe()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "500.0 781\n", "1500.0 734\n", "2500.0 643\n", "1200.0 639\n", "1000.0 639\n", "Name: price, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].value_counts().head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 5000.0\n", "1 8500.0\n", "2 8990.0\n", "3 4350.0\n", "4 1350.0\n", "Name: price, dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].sort_index(ascending=True).head()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "49999 1250.0\n", "49998 22900.0\n", "49997 13200.0\n", "49996 1980.0\n", "49995 24900.0\n", "Name: price, dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['price'].sort_index(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From my exploration of the `price` column, I can see the following:\n", "\n", "- There were outliers representing unrealistic prices for cars, and cleaned that data by dropping those values seeling items for less than 500 and more than 250,000.\n", "- 2357-2203=154 prices were removed from the dataset, because they fell outside the more acceptable range of 500 to 250,000 dollars" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Clean the Odometer Data\n", "\n", "From my previous exploration of the `odometer` column as described by the `autos.info()` output, I can see that the data is expressed as strings as was the `price` data above. \n", "\n", "I will further explore the `odometer` data looking for the following:\n", "\n", "- patterns, outliers, unrealistic values or other discrepancies in the data\n", "\n", "I will clean that data by doing the following:\n", "\n", "- renaming the column, and/or dropping values, if necessary\n", "\n", "- It will also be important to determine if the data uses one unit of measure to convert the values into integers" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 50000\n", "unique 13\n", "top 150,000km\n", "freq 32424\n", "Name: odometer, dtype: object" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer'].describe()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "object\n", "['150,000km' '70,000km' '50,000km' '80,000km' '10,000km' '30,000km'\n", " '125,000km' '90,000km' '20,000km' '60,000km' '5,000km' '100,000km'\n", " '40,000km']\n", "['150,000km' '70,000km' '50,000km' '80,000km' '10,000km' '30,000km'\n", " '125,000km' '90,000km' '20,000km' '60,000km' '5,000km' '100,000km'\n", " '40,000km']\n" ] } ], "source": [ "print(autos['odometer'].dtype)\n", "print(autos['odometer'].unique())\n", "\n", "unique_price = autos['odometer'].unique()\n", "print(unique_price)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['150000' '70000' '50000' '80000' '10000' '30000' '125000' '90000' '20000'\n", " '60000' '5000' '100000' '40000']\n" ] } ], "source": [ "autos['odometer'] = autos[\"odometer\"].str.replace('km','')\n", "autos['odometer'] = autos[\"odometer\"].str.replace(',','')\n", "unique_odometer = autos[\"odometer\"].unique()\n", "print(unique_odometer)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[150000 70000 50000 80000 10000 30000 125000 90000 20000 60000\n", " 5000 100000 40000]\n" ] } ], "source": [ "autos['odometer'] = autos['odometer'].astype(int)\n", "dtypes = autos.dtypes\n", "odometer_unique = autos['odometer'].unique()\n", "print(odometer_unique)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "autos_copy2 = autos.rename(columns={\"odometer\": \"odometer_km\"})" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceabtestvehicle_typeregistration_yeargearboxpower_psmodelodometer_kmregistration_monthfuel_typebrandunrepaired_damagead_creatednr_of_picturespostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot5000.0controlbus2004manuell158andere1500003lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot8500.0controllimousine1997automatik2867er1500006benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot8990.0testlimousine2009manuell102golf700007benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot4350.0controlkleinwagen2007automatik71fortwo700006benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot1350.0testkombi2003manuell0focus1500007benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offer_type price abtest vehicle_type registration_year \\\n", "0 privat Angebot 5000.0 control bus 2004 \n", "1 privat Angebot 8500.0 control limousine 1997 \n", "2 privat Angebot 8990.0 test limousine 2009 \n", "3 privat Angebot 4350.0 control kleinwagen 2007 \n", "4 privat Angebot 1350.0 test kombi 2003 \n", "\n", " gearbox power_ps model odometer_km registration_month fuel_type \\\n", "0 manuell 158 andere 150000 3 lpg \n", "1 automatik 286 7er 150000 6 benzin \n", "2 manuell 102 golf 70000 7 benzin \n", "3 automatik 71 fortwo 70000 6 benzin \n", "4 manuell 0 focus 150000 7 benzin \n", "\n", " brand unrepaired_damage ad_created nr_of_pictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postal_code last_seen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos = autos_copy2\n", "autos.head()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(13,)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].unique().shape" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 50000.000000\n", "mean 125732.700000\n", "std 40042.211706\n", "min 5000.000000\n", "25% 125000.000000\n", "50% 150000.000000\n", "75% 150000.000000\n", "max 150000.000000\n", "Name: odometer_km, dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].describe()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "150000 32424\n", "125000 5170\n", "100000 2169\n", "90000 1757\n", "80000 1436\n", "70000 1230\n", "60000 1164\n", "50000 1027\n", "5000 967\n", "40000 819\n", "30000 789\n", "20000 784\n", "10000 264\n", "Name: odometer_km, dtype: int64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].value_counts().head(13)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 150000\n", "1 150000\n", "2 70000\n", "3 70000\n", "4 150000\n", "Name: odometer_km, dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].sort_index(ascending=True).head()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "49999 150000\n", "49998 40000\n", "49997 5000\n", "49996 150000\n", "49995 100000\n", "Name: odometer_km, dtype: int64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['odometer_km'].sort_index(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From my exploration of the `odometer_km` column, I can see the following:\n", "\n", "- The common unit measure was km. As a result, I extracted that string data, converted it to `int` information and renamed the column to specify the unit measure\n", "- There are only 13 unique values ranging from 150,000 to 5000\n", "- There is no need to remove any values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Clean the Date Data\n", "\n", "From my previous exploration of the `date_crawled` `ad_created` `last_seen` columns as described by the `autos.info()` output, I can see that the data is expressed as strings as was the `odometer_km` data above. \n", "\n", "I will further explore the date data looking for the following:\n", "\n", "- patterns, outliers, unrealistic values or other discrepancies in the data\n", "\n", "I will clean that data by doing the following:\n", "\n", "- renaming the column, and/or dropping values, if necessary\n", "\n", "- It will also be important to find a way to represent the dara quatitatively to better be able to explore and analyze it" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawledad_createdlast_seen
02016-03-26 17:47:462016-03-26 00:00:002016-04-06 06:45:54
12016-04-04 13:38:562016-04-04 00:00:002016-04-06 14:45:08
22016-03-26 18:57:242016-03-26 00:00:002016-04-06 20:15:37
32016-03-12 16:58:102016-03-12 00:00:002016-03-15 03:16:28
42016-04-01 14:38:502016-04-01 00:00:002016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled ad_created last_seen\n", "0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54\n", "1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08\n", "2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37\n", "3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28\n", "4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[['date_crawled','ad_created','last_seen']][0:5]" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2016-04-03 0.03868\n", "2016-03-20 0.03782\n", "2016-03-21 0.03752\n", "2016-03-12 0.03678\n", "2016-03-14 0.03662\n", "2016-04-04 0.03652\n", "2016-03-07 0.03596\n", "2016-04-02 0.03540\n", "2016-03-19 0.03490\n", "2016-03-28 0.03484\n", "Name: date_crawled, dtype: float64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).head(10)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2016-03-05 0.02538\n", "2016-03-06 0.01394\n", "2016-03-07 0.03596\n", "2016-03-08 0.03330\n", "2016-03-09 0.03322\n", "Name: date_crawled, dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2016-04-07 0.00142\n", "2016-04-06 0.00318\n", "2016-04-05 0.01310\n", "2016-04-04 0.03652\n", "2016-04-03 0.03868\n", "Name: date_crawled, dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2016-03-26\n", "1 2016-04-04\n", "2 2016-03-26\n", "3 2016-03-12\n", "4 2016-04-01\n", "Name: date_crawled, dtype: object" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['date_crawled'].str[:10].sort_index(ascending=True).head()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "49999 2016-03-14\n", "49998 2016-03-08\n", "49997 2016-04-02\n", "49996 2016-03-28\n", "49995 2016-03-27\n", "Name: date_crawled, dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['date_crawled'].str[:10].sort_index(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From my exploration of the `date_crawled` column, I can see the following:\n", "\n", "- The crawler updates multiple car listings on a single day, in a consistent manner (~3%)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2016-04-03 0.03892\n", "2016-03-20 0.03786\n", "2016-03-21 0.03772\n", "2016-04-04 0.03688\n", "2016-03-12 0.03662\n", "2016-03-14 0.03522\n", "2016-04-02 0.03508\n", "2016-03-28 0.03496\n", "2016-03-07 0.03474\n", "2016-03-29 0.03414\n", "Name: ad_created, dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).head(10)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2015-06-11 0.00002\n", "2015-08-10 0.00002\n", "2015-09-09 0.00002\n", "2015-11-10 0.00002\n", "2015-12-05 0.00002\n", "Name: ad_created, dtype: float64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2016-04-07 0.00128\n", "2016-04-06 0.00326\n", "2016-04-05 0.01184\n", "2016-04-04 0.03688\n", "2016-04-03 0.03892\n", "Name: ad_created, dtype: float64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2016-03-26\n", "1 2016-04-04\n", "2 2016-03-26\n", "3 2016-03-12\n", "4 2016-04-01\n", "Name: ad_created, dtype: object" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['ad_created'].str[:10].sort_index(ascending=True).head()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "49999 2016-03-13\n", "49998 2016-03-08\n", "49997 2016-04-02\n", "49996 2016-03-28\n", "49995 2016-03-27\n", "Name: ad_created, dtype: object" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['ad_created'].str[:10].sort_index(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From my exploration of the `ad_created` column, I can see the following:\n", "\n", "- unlike the crawling data, the ads appear to be created throughout the dates in this data on a less seemingly consistent manner" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2016-04-06 0.22100\n", "2016-04-07 0.13092\n", "2016-04-05 0.12428\n", "2016-03-17 0.02792\n", "2016-04-03 0.02536\n", "2016-04-02 0.02490\n", "2016-03-30 0.02484\n", "2016-04-04 0.02462\n", "2016-03-31 0.02384\n", "2016-03-12 0.02382\n", "Name: last_seen, dtype: float64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).head(10)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2016-03-05 0.00108\n", "2016-03-06 0.00442\n", "2016-03-07 0.00536\n", "2016-03-08 0.00760\n", "2016-03-09 0.00986\n", "Name: last_seen, dtype: float64" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2016-04-07 0.13092\n", "2016-04-06 0.22100\n", "2016-04-05 0.12428\n", "2016-04-04 0.02462\n", "2016-04-03 0.02536\n", "Name: last_seen, dtype: float64" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head()" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2016-04-06\n", "1 2016-04-06\n", "2 2016-04-06\n", "3 2016-03-15\n", "4 2016-04-01\n", "Name: last_seen, dtype: object" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['last_seen'].str[:10].sort_index(ascending=True).head()" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "49999 2016-04-06\n", "49998 2016-04-05\n", "49997 2016-04-04\n", "49996 2016-04-02\n", "49995 2016-04-01\n", "Name: last_seen, dtype: object" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['last_seen'].str[:10].sort_index(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From my exploration of the `last_seen` column, I can see the following:\n", "\n", "- the information is consistent with the assumption that ads have experienced traffic after being listed on the site, and as a result, the `last_seen` dates hold more proportion of the last views in more recent dates." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 50000.000000\n", "mean 2005.073280\n", "std 105.712813\n", "min 1000.000000\n", "25% 1999.000000\n", "50% 2003.000000\n", "75% 2008.000000\n", "max 9999.000000\n", "Name: registration_year, dtype: float64" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['registration_year'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Clean the Registration Year Data\n", "\n", "From my exploration of the `registration_year` column, I can see the following:\n", "\n", "- The minimum value is 1000\n", "- The maximum value is 9999\n", "\n", "Both are unrealistic numbers. I will limit registration to the time where cars were invented in the early 1900s, to 2016, the latest a car must've been registered to have made it accurately listed on the ad." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 2017\n", "55 2017\n", "65 2017\n", "68 2017\n", "84 2018\n", "113 2017\n", "164 2018\n", "197 2017\n", "253 2017\n", "348 2017\n", "390 2018\n", "438 2017\n", "453 4500\n", "454 2017\n", "457 2017\n", "477 2017\n", "545 2017\n", "548 2017\n", "557 2017\n", "568 2017\n", "577 2017\n", "582 2017\n", "654 2017\n", "740 2017\n", "754 2017\n", "758 2017\n", "765 2017\n", "802 2018\n", "871 2017\n", "889 2017\n", " ... \n", "49154 2017\n", "49178 2018\n", "49185 2019\n", "49218 2018\n", "49245 2017\n", "49259 2017\n", "49261 2017\n", "49262 2018\n", "49266 2017\n", "49283 1001\n", "49342 2017\n", "49343 2017\n", "49347 2017\n", "49354 2018\n", "49389 2017\n", "49411 2018\n", "49522 2017\n", "49557 2017\n", "49561 2017\n", "49653 2017\n", "49662 2017\n", "49689 2017\n", "49696 2017\n", "49731 2017\n", "49770 2018\n", "49796 2017\n", "49841 2017\n", "49880 2017\n", "49910 9000\n", "49935 2017\n", "Name: registration_year, Length: 1972, dtype: int64\n" ] } ], "source": [ "print(autos.loc[(autos[\"registration_year\"] > 2016 ) | (autos[\"registration_year\"] < 1900) , \"registration_year\"])" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000.0 0.069834\n", "2005.0 0.062776\n", "1999.0 0.062464\n", "2004.0 0.056988\n", "2003.0 0.056779\n", "2006.0 0.056384\n", "2001.0 0.056280\n", "2002.0 0.052740\n", "1998.0 0.051074\n", "2007.0 0.047972\n", "Name: registration_year, dtype: float64" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.loc[(autos[\"registration_year\"] > 2016 ) | (autos[\"registration_year\"] < 1900) , \"registration_year\"] = np.nan\n", "\n", "autos[\"registration_year\"].value_counts(normalize=True,dropna=True).head(10)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000.0 0.06708\n", "2005.0 0.06030\n", "1999.0 0.06000\n", "2004.0 0.05474\n", "2003.0 0.05454\n", "2006.0 0.05416\n", "2001.0 0.05406\n", "2002.0 0.05066\n", "1998.0 0.04906\n", "2007.0 0.04608\n", "2008.0 0.04462\n", "2009.0 0.04196\n", "1997.0 0.04056\n", "NaN 0.03944\n", "2011.0 0.03268\n", "2010.0 0.03194\n", "1996.0 0.02888\n", "2012.0 0.02646\n", "2016.0 0.02632\n", "1995.0 0.02626\n", "2013.0 0.01612\n", "2014.0 0.01332\n", "1994.0 0.01320\n", "1993.0 0.00890\n", "2015.0 0.00798\n", "1990.0 0.00790\n", "1992.0 0.00782\n", "1991.0 0.00712\n", "1989.0 0.00362\n", "1988.0 0.00284\n", " ... \n", "1977.0 0.00044\n", "1966.0 0.00044\n", "1975.0 0.00038\n", "1969.0 0.00038\n", "1965.0 0.00034\n", "1964.0 0.00024\n", "1910.0 0.00018\n", "1963.0 0.00018\n", "1959.0 0.00014\n", "1961.0 0.00012\n", "1956.0 0.00010\n", "1958.0 0.00008\n", "1962.0 0.00008\n", "1937.0 0.00008\n", "1950.0 0.00006\n", "1951.0 0.00004\n", "1954.0 0.00004\n", "1957.0 0.00004\n", "1941.0 0.00004\n", "1955.0 0.00004\n", "1934.0 0.00004\n", "1953.0 0.00002\n", "1952.0 0.00002\n", "1948.0 0.00002\n", "1939.0 0.00002\n", "1938.0 0.00002\n", "1931.0 0.00002\n", "1929.0 0.00002\n", "1927.0 0.00002\n", "1943.0 0.00002\n", "Name: registration_year, Length: 79, dtype: float64" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['registration_year'].value_counts(normalize=True, dropna=False)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 48028.00000\n", "mean 2002.80351\n", "std 7.31085\n", "min 1910.00000\n", "25% 1999.00000\n", "50% 2003.00000\n", "75% 2008.00000\n", "max 2016.00000\n", "Name: registration_year, dtype: float64" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['registration_year'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From my new exploration of the `registration_year` column, I can see the following:\n", "\n", "- almost 2,000 entries fall outside of the registration date boundaries I placed (1900-2016)\n", "- the mean year is now lower, closer to 2003 versus 2005\n", "\n", "Given that the statistics for this data wasn't affected significantly and represents a more realistic presence of cars in history, I will continue my analysis given the boudaries I've set for registration year.\n", "\n", "### 3. Analyze the Data\n", "\n", "I will begin my analysis of this data by aggregating the `brand`, `price` and `milage` columns to explore the kinds of cars that are available, their prices relative to each other, and their usage to date. Analyzing this information will better inform any potential purchase." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
brandpriceodometer_km
0peugeot5000.0150000
1bmw8500.0150000
2volkswagen8990.070000
3smart4350.070000
4ford1350.0150000
5chrysler7900.0150000
6volkswagenNaN150000
7volkswagen1990.0150000
8seatNaN150000
9renault590.0150000
\n", "
" ], "text/plain": [ " brand price odometer_km\n", "0 peugeot 5000.0 150000\n", "1 bmw 8500.0 150000\n", "2 volkswagen 8990.0 70000\n", "3 smart 4350.0 70000\n", "4 ford 1350.0 150000\n", "5 chrysler 7900.0 150000\n", "6 volkswagen NaN 150000\n", "7 volkswagen 1990.0 150000\n", "8 seat NaN 150000\n", "9 renault 590.0 150000" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[['brand','price','odometer_km']][0:10]" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 50000\n", "unique 40\n", "top volkswagen\n", "freq 10687\n", "Name: brand, dtype: object" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"brand\"].describe()" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True 50000\n", "Name: brand, dtype: int64" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"brand\"].notnull().value_counts()" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',\n", " 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',\n", " 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',\n", " 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',\n", " 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',\n", " 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',\n", " 'rover', 'daihatsu', 'lancia'], dtype=object)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['brand'].unique()" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "volkswagen 10687\n", "opel 5461\n", "bmw 5429\n", "mercedes_benz 4734\n", "audi 4283\n", "ford 3479\n", "renault 2404\n", "peugeot 1456\n", "fiat 1308\n", "seat 941\n", "skoda 786\n", "mazda 757\n", "nissan 754\n", "smart 701\n", "citroen 701\n", "toyota 617\n", "sonstige_autos 546\n", "hyundai 488\n", "volvo 457\n", "mini 424\n", "mitsubishi 406\n", "honda 399\n", "kia 356\n", "alfa_romeo 329\n", "porsche 294\n", "suzuki 293\n", "chevrolet 283\n", "chrysler 181\n", "dacia 129\n", "daihatsu 128\n", "jeep 110\n", "subaru 109\n", "land_rover 99\n", "saab 80\n", "daewoo 79\n", "trabant 78\n", "jaguar 77\n", "rover 69\n", "lancia 57\n", "lada 31\n", "Name: brand, dtype: int64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['brand'].value_counts()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "volkswagen 0.21374\n", "opel 0.10922\n", "bmw 0.10858\n", "mercedes_benz 0.09468\n", "audi 0.08566\n", "ford 0.06958\n", "renault 0.04808\n", "peugeot 0.02912\n", "fiat 0.02616\n", "seat 0.01882\n", "Name: brand, dtype: float64" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"brand\"].value_counts(normalize = True).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From my new exploration of the `brand` column, I can see the following:\n", "\n", "- there are 40 unique brands\n", "- there are no null brand values\n", "- about 75% of the total listings are made up of the top 10 brands\n", "\n", "To further explore the top 10 brands, I will aggregate their `price` data averages." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault',\n", " 'peugeot', 'fiat', 'seat'],\n", " dtype='object')" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"brand\"].value_counts().index[0:10]" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "audi : 9484\n", "mercedes_benz : 8670\n", "bmw : 8465\n", "volkswagen : 5701\n", "seat : 4709\n", "ford : 4267\n", "opel : 3348\n", "peugeot : 3329\n", "fiat : 3211\n", "renault : 2760\n" ] } ], "source": [ "brand_10_price = {}\n", "\n", "for index in autos[\"brand\"].value_counts().index[0:10]:\n", " subset = autos.loc[autos[\"brand\"] == index,\"price\"]\n", " brand_price = subset.sum()/subset.count()\n", " brand_10_price[index] = int(brand_price)\n", "\n", "for index in sorted(brand_10_price,key=brand_10_price.get, reverse = True): \n", " print(index,\":\",brand_10_price[index])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From my new exploration of the top 10 `brand` and `price` aggregation, I can see the following:\n", "\n", "- there is a 2,764 price gap between the top 3 brands are , at 9,484 and 8,670 and 8,465 followed by the fourth brand at 5,701 and the rest of the brands\n", "- 9 out of 10 highest priced listings are from international, and more specifically, European brands\n", "- the Amrican company Ford makes the top 10 highest priced listings in 6th place, with an average 4,267\n", "\n", "Below, I will explore the `brand` and `odometer_km` averages in aggregate." ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bmw : 132521\n", "mercedes_benz : 130886\n", "audi : 129643\n", "opel : 129298\n", "volkswagen : 128955\n", "renault : 128223\n", "peugeot : 127352\n", "ford : 124131\n", "seat : 122061\n", "fiat : 117037\n" ] } ], "source": [ "brand_10_milage = {}\n", "\n", "for index in autos[\"brand\"].value_counts().index[0:10]:\n", " subset = autos.loc[autos[\"brand\"] == index,\"odometer_km\"]\n", " brand_milage = subset.sum()/subset.count()\n", " brand_10_milage[index] = int(brand_milage)\n", "\n", "for index in sorted(brand_10_milage,key=brand_10_milage.get, reverse = True): \n", " print(index,\":\",brand_10_milage[index])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This display of both aggregated series objects has a few limitations:\n", "\n", "- it makes it difficult to compare more than two aggregate series objects if we want to extend to more columns\n", "- I can't easily compare more than a few rows from each series object\n", "- I can only sort by the index, in this case, `brand`, of both series objects so we can easily make visual comparisons\n", "\n", "As a result, I will combine the data from both series objects into a single dataframe, with a shared index, and display the dataframe directly." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "volkswagen 5701\n", "opel 3348\n", "bmw 8465\n", "mercedes_benz 8670\n", "audi 9484\n", "ford 4267\n", "renault 2760\n", "peugeot 3329\n", "fiat 3211\n", "seat 4709\n", "dtype: int64\n" ] } ], "source": [ "bmp_series = pd.Series(brand_10_price)\n", "print(bmp_series)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " mean_price\n", "volkswagen 5701\n", "opel 3348\n", "bmw 8465\n", "mercedes_benz 8670\n", "audi 9484\n", "ford 4267\n", "renault 2760\n", "peugeot 3329\n", "fiat 3211\n", "seat 4709\n" ] } ], "source": [ "df = pd.DataFrame(bmp_series, columns=['mean_price'])\n", "print(df)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "volkswagen 128955\n", "opel 129298\n", "bmw 132521\n", "mercedes_benz 130886\n", "audi 129643\n", "ford 124131\n", "renault 128223\n", "peugeot 127352\n", "fiat 117037\n", "seat 122061\n", "dtype: int64\n" ] } ], "source": [ "bmd_km_series = pd.Series(brand_10_milage)\n", "print(bmd_km_series)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " mean_price mean_km\n", "volkswagen 5701 128955\n", "opel 3348 129298\n", "bmw 8465 132521\n", "mercedes_benz 8670 130886\n", "audi 9484 129643\n", "ford 4267 124131\n", "renault 2760 128223\n", "peugeot 3329 127352\n", "fiat 3211 117037\n", "seat 4709 122061\n" ] } ], "source": [ "df['mean_km'] = bmd_km_series\n", "print(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://user-images.githubusercontent.com/50001708/85905681-e6ae2e00-b7c0-11ea-8f78-d7f180ea5e70.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Analyze the Brand, Price and Milage Data\n", "\n", "Interestingly, the top 3 most expensive cars on average are also those with the most amount of milage on average. \n", "\n", "| Brand | Price (dollars) | Milage (km) |\n", "|:------------:|:-----------:|:-----------:|\n", "| Audi | 9484 | 129643 |\n", "| Mercedes | 8670 | 130886 |\n", "| BMW | 8465 | 132521 |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Conclusion\n", "\n", "I have briefly analyzed the vast eBay car listing data available at [eBay Kleinanzeigen](https://www.kaggle.com/orgesleka/used-cars-database/data). My analysis is the follwing:\n", "\n", "\n", "#### Which car brands are the most expensive listings on average?\n", "\n", "On average, the top 10 most expensive car listings are from the following brands:\n", "\n", "| Brand | Price (dollars) |\n", "|:------------:|:-----------:|\n", "| audi | 9484 |\n", "| mercedes_benz | 8670 |\n", "| bmw | 8465 |\n", "| volkswagen | 5701 |\n", "| seat | 4709 |\n", "| ford | 4267 |\n", "| opel | 3348 |\n", "| peugeot | 3329 |\n", "| fiat | 3211 |\n", "| renault | 2760 |\n", "\n", "It is important to note that 9 out of the top 10, are European brands. The American brand Ford joins the top 10 most expensive listings at number 6 with an average of 4,267 dollars. This data appears to correlate with the cost of new cars for these brands. As a result, some important questions to ask are the following:\n", "\n", "- Is the distribution of prices of new cars for the brands listed in this data, similar to what we see in the used car listings data?\n", "- If so, does this information provide a better picture of what the potential return on investment may be after purchasing a car from one of these brands?\n", "- What other factors could determine the potential return on investment?\n", "\n", "#### What is the average milage for those brands?\n", "\n", "On average, the top 10 most expensive car listings had the following average milage per brand:\n", "\n", "| Brand | Milage (km) |\n", "|:------------:|:-----------:|\n", "| bmw | 132521 |\n", "| mercedes_benz | 130886 |\n", "| audi | 129643 |\n", "| opel | 129298 |\n", "| volkswagen | 128955 |\n", "| renault | 128223 |\n", "| peugeot | 127352 |\n", "| ford | 124131 |\n", "| seat | 122061 |\n", "| fiat | 117037 |\n", "\n", "As I mentioned above, the top 3 of those expenive car brands with most milage on average also listed the highest prices. Overall, the range of milage of these 10 brands was 15,484 km, ranging from 117,037 km to 132,521 km. The cars listing the most milage on average don't list lower prices. The oposite is true. As a result, there is no clear correlation between car milage and prices from my analysis of the top 10 most expensive brands. This provides an opportunity for further analysis to ask the following: \n", "\n", "- If we analyze the least expensive brands, do we begin to see a correlation with our assumption that more car milage on average will result in less expensive car listings? \n", "- If so, what does this say about the return on investment of buying cars from Audi, Mercedes Benz, and BMW, and that of less expensive brands?\n", "- Why is this important, and why is this so?\n", "- Do the most expensive brands listed in this data tend to last longer than those that are less expensive on average?\n", "- How much can you expect to spend on insurance, gas, repairs, maintenance for both expensive and inexpensive clusters of brands?\n", "- How can this data help a prospective buyer, and how does it fall short?\n", "\n", "#### What does this say about our assumptions about this data?\n", "\n", "This data correlates with the assumption that typically expensive car brands will sell at higher prices than less expensive brands, and in this case despite small variations in milage averages. This data does not showcase a correlation with higher car milage averages and lower car prices. This may be because the milage averages in my analysis are not significant enough to showcase differences in price listings. As a reault, it asks for further analysis to potentially find correlations in this assumption. \n", "\n", "#### Does this data provide enough information to make an educated car purchase on eBay? \n", "\n", "This data does provide a potential buyer with the opportunity to make an educated purchase on eBay, assuming that the data is fully accurate. A potential buyer may want to further explore the questions I have provided in this conclusion as well as look further into the repair data available, to make a more educated decision." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }