{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analyzing Used Car Listings on eBay Kleinanzeigen\n", "We will be working on a dataset of used cars from eBay Kleinanzeigen, a [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.\n", "\n", "The dataset was originally [scraped](https://en.wikipedia.org/wiki/Web_scraping) and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). The version of the dataset we are working with is a sample of 50,000 data points that was prepared by Dataquest including simulating a less-cleaned version of the data.\n", "\n", "The data dictionary provided with data is as follows:\n", "- **dateCrawled** - When this ad was first crawled. All field-values are taken from this date.\n", "- **name** - Name of the car.\n", "- **seller** - Whether the seller is private or a dealer.\n", "- **offerType** - The type of listing\n", "- **price** - The price on the ad to sell the car.\n", "* **abtest** - Whether the listing is included in an A/B test.\n", "* **vehicleType** - The vehicle Type.\n", "* **yearOfRegistration** - The year in which the car was first registered.\n", "* **gearbox** - The transmission type.\n", "* **powerPS** - The power of the car in PS.\n", "* **model** - The car model name.\n", "* **kilometer** - How many kilometers the car has driven.\n", "* **monthOfRegistration** - The month in which the car was first registered.\n", "* **fuelType** - What type of fuel the car uses.\n", "* **brand** - The brand of the car.\n", "* **notRepairedDamage** - If the car has a damage which is not yet repaired.\n", "* **dateCreated** - The date on which the eBay listing was created.\n", "* **nrOfPictures** - The number of pictures in the ad.\n", "* **postalCode** - The postal code for the location of the vehicle.\n", "* **lastSeenOnline** - When the crawler saw this ad last online.\n", "\n", "The aim of this project is to clean the data and analyze the included used car listings." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "#autos = pd.read_csv(\"autos.csv\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 50000 entries, 0 to 49999\n", "Data columns (total 20 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 dateCrawled 50000 non-null object\n", " 1 name 50000 non-null object\n", " 2 seller 50000 non-null object\n", " 3 offerType 50000 non-null object\n", " 4 price 50000 non-null object\n", " 5 abtest 50000 non-null object\n", " 6 vehicleType 44905 non-null object\n", " 7 yearOfRegistration 50000 non-null int64 \n", " 8 gearbox 47320 non-null object\n", " 9 powerPS 50000 non-null int64 \n", " 10 model 47242 non-null object\n", " 11 odometer 50000 non-null object\n", " 12 monthOfRegistration 50000 non-null int64 \n", " 13 fuelType 45518 non-null object\n", " 14 brand 50000 non-null object\n", " 15 notRepairedDamage 40171 non-null object\n", " 16 dateCreated 50000 non-null object\n", " 17 nrOfPictures 50000 non-null int64 \n", " 18 postalCode 50000 non-null int64 \n", " 19 lastSeen 50000 non-null object\n", "dtypes: int64(5), object(15)\n", "memory usage: 7.6+ MB\n", "None\n" ] } ], "source": [ "autos = pd.read_csv(\"autos.csv\",encoding = \"Latin-1\")\n", "print(autos.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our dataset we have *50000 entries* , *20 columns* specially 5 columns of *int64* and 15 columns of *object* (they are string) and the size is *7.6 MB* . " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " dateCrawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offerType price abtest vehicleType yearOfRegistration \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox powerPS model odometer monthOfRegistration fuelType \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand notRepairedDamage dateCreated nrOfPictures \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 \n", "1 bmw nein 2016-04-04 00:00:00 0 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 \n", "3 smart nein 2016-03-12 00:00:00 0 \n", "4 ford nein 2016-04-01 00:00:00 0 \n", "\n", " postalCode lastSeen \n", "0 79588 2016-04-06 06:45:54 \n", "1 71034 2016-04-06 14:45:08 \n", "2 35394 2016-04-06 20:15:37 \n", "3 33729 2016-03-15 03:16:28 \n", "4 39218 2016-04-01 14:38:50 \n" ] } ], "source": [ "print(autos.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.\n", "\n", "## Cleaning data\n", "### 1. Rename columns \n", "- Convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.\n", "- Change a few wordings to more accurately describe the columns." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',\n", " 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',\n", " 'odometer', 'monthOfRegistration', 'fuelType', 'brand',\n", " 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',\n", " 'lastSeen'],\n", " dtype='object')\n" ] } ], "source": [ "print(autos.columns)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceab_testvehicle_typeregistration_yeargearboxpower_psmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_creatednum_photospostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offer_type price ab_test vehicle_type registration_year \\\n", "0 privat Angebot $5,000 control bus 2004 \n", "1 privat Angebot $8,500 control limousine 1997 \n", "2 privat Angebot $8,990 test limousine 2009 \n", "3 privat Angebot $4,350 control kleinwagen 2007 \n", "4 privat Angebot $1,350 test kombi 2003 \n", "\n", " gearbox power_ps model odometer registration_month fuel_type \\\n", "0 manuell 158 andere 150,000km 3 lpg \n", "1 automatik 286 7er 150,000km 6 benzin \n", "2 manuell 102 golf 70,000km 7 benzin \n", "3 automatik 71 fortwo 70,000km 6 benzin \n", "4 manuell 0 focus 150,000km 7 benzin \n", "\n", " brand unrepaired_damage ad_created num_photos postal_code \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 79588 \n", "1 bmw nein 2016-04-04 00:00:00 0 71034 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 35394 \n", "3 smart nein 2016-03-12 00:00:00 0 33729 \n", "4 ford nein 2016-04-01 00:00:00 0 39218 \n", "\n", " last_seen \n", "0 2016-04-06 06:45:54 \n", "1 2016-04-06 14:45:08 \n", "2 2016-04-06 20:15:37 \n", "3 2016-03-15 03:16:28 \n", "4 2016-04-01 14:38:50 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',\n", " 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',\n", " 'odometer', 'registration_month', 'fuel_type', 'brand',\n", " 'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',\n", " 'last_seen']\n", "autos.columns = new_columns\n", "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We do that modifications for the understanding of our data set by the columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Data exploration & Cleaning\n", "Now let's do some basic data exploration to determine what other cleaning tasks need to be done." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceab_testvehicle_typeregistration_yeargearboxpower_psmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_creatednum_photospostal_codelast_seen
count5000050000500005000050000500004490550000.0000004732050000.000000472425000050000.0000004551850000401715000050000.050000.00000050000
unique482133875422235728NaN2NaN24513NaN740276NaNNaN39481
top2016-03-19 17:36:18Ford_FiestaprivatAngebot$0testlimousineNaNmanuellNaNgolf150,000kmNaNbenzinvolkswagennein2016-04-03 00:00:00NaNNaN2016-04-07 06:17:27
freq378499994999914212575612859NaN36993NaN402432424NaN3010710687352321946NaNNaN8
meanNaNNaNNaNNaNNaNNaNNaN2005.073280NaN116.355920NaNNaN5.723360NaNNaNNaNNaN0.050813.627300NaN
stdNaNNaNNaNNaNNaNNaNNaN105.712813NaN209.216627NaNNaN3.711984NaNNaNNaNNaN0.025779.747957NaN
minNaNNaNNaNNaNNaNNaNNaN1000.000000NaN0.000000NaNNaN0.000000NaNNaNNaNNaN0.01067.000000NaN
25%NaNNaNNaNNaNNaNNaNNaN1999.000000NaN70.000000NaNNaN3.000000NaNNaNNaNNaN0.030451.000000NaN
50%NaNNaNNaNNaNNaNNaNNaN2003.000000NaN105.000000NaNNaN6.000000NaNNaNNaNNaN0.049577.000000NaN
75%NaNNaNNaNNaNNaNNaNNaN2008.000000NaN150.000000NaNNaN9.000000NaNNaNNaNNaN0.071540.000000NaN
maxNaNNaNNaNNaNNaNNaNNaN9999.000000NaN17700.000000NaNNaN12.000000NaNNaNNaNNaN0.099998.000000NaN
\n", "
" ], "text/plain": [ " date_crawled name seller offer_type price ab_test \\\n", "count 50000 50000 50000 50000 50000 50000 \n", "unique 48213 38754 2 2 2357 2 \n", "top 2016-03-19 17:36:18 Ford_Fiesta privat Angebot $0 test \n", "freq 3 78 49999 49999 1421 25756 \n", "mean NaN NaN NaN NaN NaN NaN \n", "std NaN NaN NaN NaN NaN NaN \n", "min NaN NaN NaN NaN NaN NaN \n", "25% NaN NaN NaN NaN NaN NaN \n", "50% NaN NaN NaN NaN NaN NaN \n", "75% NaN NaN NaN NaN NaN NaN \n", "max NaN NaN NaN NaN NaN NaN \n", "\n", " vehicle_type registration_year gearbox power_ps model \\\n", "count 44905 50000.000000 47320 50000.000000 47242 \n", "unique 8 NaN 2 NaN 245 \n", "top limousine NaN manuell NaN golf \n", "freq 12859 NaN 36993 NaN 4024 \n", "mean NaN 2005.073280 NaN 116.355920 NaN \n", "std NaN 105.712813 NaN 209.216627 NaN \n", "min NaN 1000.000000 NaN 0.000000 NaN \n", "25% NaN 1999.000000 NaN 70.000000 NaN \n", "50% NaN 2003.000000 NaN 105.000000 NaN \n", "75% NaN 2008.000000 NaN 150.000000 NaN \n", "max NaN 9999.000000 NaN 17700.000000 NaN \n", "\n", " odometer registration_month fuel_type brand unrepaired_damage \\\n", "count 50000 50000.000000 45518 50000 40171 \n", "unique 13 NaN 7 40 2 \n", "top 150,000km NaN benzin volkswagen nein \n", "freq 32424 NaN 30107 10687 35232 \n", "mean NaN 5.723360 NaN NaN NaN \n", "std NaN 3.711984 NaN NaN NaN \n", "min NaN 0.000000 NaN NaN NaN \n", "25% NaN 3.000000 NaN NaN NaN \n", "50% NaN 6.000000 NaN NaN NaN \n", "75% NaN 9.000000 NaN NaN NaN \n", "max NaN 12.000000 NaN NaN NaN \n", "\n", " ad_created num_photos postal_code last_seen \n", "count 50000 50000.0 50000.000000 50000 \n", "unique 76 NaN NaN 39481 \n", "top 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27 \n", "freq 1946 NaN NaN 8 \n", "mean NaN 0.0 50813.627300 NaN \n", "std NaN 0.0 25779.747957 NaN \n", "min NaN 0.0 1067.000000 NaN \n", "25% NaN 0.0 30451.000000 NaN \n", "50% NaN 0.0 49577.000000 NaN \n", "75% NaN 0.0 71540.000000 NaN \n", "max NaN 0.0 99998.000000 NaN " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.describe(include = 'all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After exploring the data set we see :\n", "- some columns are stored as text but they are numeric data (example: *price , odometer*)\n", "- almost all the data of *name column and seller column* are same\n", "- The num_photos column looks odd, we'll need to investigate this further" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 50000\n", "Name: num_photos, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"num_photos\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "privat 49999\n", "gewerblich 1\n", "Name: seller, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"seller\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's drop those columns because they have mostly one value." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawledoffer_typepriceab_testvehicle_typeregistration_yeargearboxpower_psmodelodometerregistration_monthfuel_typebrandunrepaired_damagead_createdpostal_codelast_seen
02016-03-26 17:47:46Angebot$5,000controlbus2004manuell158andere150,000km3lpgpeugeotnein2016-03-26 00:00:00795882016-04-06 06:45:54
12016-04-04 13:38:56Angebot$8,500controllimousine1997automatik2867er150,000km6benzinbmwnein2016-04-04 00:00:00710342016-04-06 14:45:08
22016-03-26 18:57:24Angebot$8,990testlimousine2009manuell102golf70,000km7benzinvolkswagennein2016-03-26 00:00:00353942016-04-06 20:15:37
32016-03-12 16:58:10Angebot$4,350controlkleinwagen2007automatik71fortwo70,000km6benzinsmartnein2016-03-12 00:00:00337292016-03-15 03:16:28
42016-04-01 14:38:50Angebot$1,350testkombi2003manuell0focus150,000km7benzinfordnein2016-04-01 00:00:00392182016-04-01 14:38:50
......................................................
499952016-03-27 14:38:19Angebot$24,900controllimousine2011automatik239q5100,000km1dieselaudinein2016-03-27 00:00:00821312016-04-01 13:47:40
499962016-03-28 10:50:25Angebot$1,980controlcabrio1996manuell75astra150,000km5benzinopelnein2016-03-28 00:00:00448072016-04-02 14:18:02
499972016-04-02 14:44:48Angebot$13,200testcabrio2014automatik695005,000km11benzinfiatnein2016-04-02 00:00:00734302016-04-04 11:47:27
499982016-03-08 19:25:42Angebot$22,900controlkombi2013manuell150a340,000km11dieselaudinein2016-03-08 00:00:00356832016-04-05 16:45:07
499992016-03-14 00:42:12Angebot$1,250controllimousine1996manuell101vectra150,000km1benzinopelnein2016-03-13 00:00:00458972016-04-06 21:18:48
\n", "

50000 rows × 17 columns

\n", "
" ], "text/plain": [ " date_crawled offer_type price ab_test vehicle_type \\\n", "0 2016-03-26 17:47:46 Angebot $5,000 control bus \n", "1 2016-04-04 13:38:56 Angebot $8,500 control limousine \n", "2 2016-03-26 18:57:24 Angebot $8,990 test limousine \n", "3 2016-03-12 16:58:10 Angebot $4,350 control kleinwagen \n", "4 2016-04-01 14:38:50 Angebot $1,350 test kombi \n", "... ... ... ... ... ... \n", "49995 2016-03-27 14:38:19 Angebot $24,900 control limousine \n", "49996 2016-03-28 10:50:25 Angebot $1,980 control cabrio \n", "49997 2016-04-02 14:44:48 Angebot $13,200 test cabrio \n", "49998 2016-03-08 19:25:42 Angebot $22,900 control kombi \n", "49999 2016-03-14 00:42:12 Angebot $1,250 control limousine \n", "\n", " registration_year gearbox power_ps model odometer \\\n", "0 2004 manuell 158 andere 150,000km \n", "1 1997 automatik 286 7er 150,000km \n", "2 2009 manuell 102 golf 70,000km \n", "3 2007 automatik 71 fortwo 70,000km \n", "4 2003 manuell 0 focus 150,000km \n", "... ... ... ... ... ... \n", "49995 2011 automatik 239 q5 100,000km \n", "49996 1996 manuell 75 astra 150,000km \n", "49997 2014 automatik 69 500 5,000km \n", "49998 2013 manuell 150 a3 40,000km \n", "49999 1996 manuell 101 vectra 150,000km \n", "\n", " registration_month fuel_type brand unrepaired_damage \\\n", "0 3 lpg peugeot nein \n", "1 6 benzin bmw nein \n", "2 7 benzin volkswagen nein \n", "3 6 benzin smart nein \n", "4 7 benzin ford nein \n", "... ... ... ... ... \n", "49995 1 diesel audi nein \n", "49996 5 benzin opel nein \n", "49997 11 benzin fiat nein \n", "49998 11 diesel audi nein \n", "49999 1 benzin opel nein \n", "\n", " ad_created postal_code last_seen \n", "0 2016-03-26 00:00:00 79588 2016-04-06 06:45:54 \n", "1 2016-04-04 00:00:00 71034 2016-04-06 14:45:08 \n", "2 2016-03-26 00:00:00 35394 2016-04-06 20:15:37 \n", "3 2016-03-12 00:00:00 33729 2016-03-15 03:16:28 \n", "4 2016-04-01 00:00:00 39218 2016-04-01 14:38:50 \n", "... ... ... ... \n", "49995 2016-03-27 00:00:00 82131 2016-04-01 13:47:40 \n", "49996 2016-03-28 00:00:00 44807 2016-04-02 14:18:02 \n", "49997 2016-04-02 00:00:00 73430 2016-04-04 11:47:27 \n", "49998 2016-03-08 00:00:00 35683 2016-04-05 16:45:07 \n", "49999 2016-03-13 00:00:00 45897 2016-04-06 21:18:48 \n", "\n", "[50000 rows x 17 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos.drop([\"num_photos\",\"seller\",\"name\"],axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Columns price and odometer\n", "Let's clean and convert price and odometer" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date_crawlednameselleroffer_typepriceab_testvehicle_typeregistration_yeargearboxpower_psmodelodometer_kmregistration_monthfuel_typebrandunrepaired_damagead_creatednum_photospostal_codelast_seen
02016-03-26 17:47:46Peugeot_807_160_NAVTECH_ON_BOARDprivatAngebot5000controlbus2004manuell158andere150000.03lpgpeugeotnein2016-03-26 00:00:000795882016-04-06 06:45:54
12016-04-04 13:38:56BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_OptikprivatAngebot8500controllimousine1997automatik2867er150000.06benzinbmwnein2016-04-04 00:00:000710342016-04-06 14:45:08
22016-03-26 18:57:24Volkswagen_Golf_1.6_UnitedprivatAngebot8990testlimousine2009manuell102golf70000.07benzinvolkswagennein2016-03-26 00:00:000353942016-04-06 20:15:37
32016-03-12 16:58:10Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...privatAngebot4350controlkleinwagen2007automatik71fortwo70000.06benzinsmartnein2016-03-12 00:00:000337292016-03-15 03:16:28
42016-04-01 14:38:50Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...privatAngebot1350testkombi2003manuell0focus150000.07benzinfordnein2016-04-01 00:00:000392182016-04-01 14:38:50
\n", "
" ], "text/plain": [ " date_crawled name \\\n", "0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n", "1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n", "2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n", "3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n", "4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n", "\n", " seller offer_type price ab_test vehicle_type registration_year \\\n", "0 privat Angebot 5000 control bus 2004 \n", "1 privat Angebot 8500 control limousine 1997 \n", "2 privat Angebot 8990 test limousine 2009 \n", "3 privat Angebot 4350 control kleinwagen 2007 \n", "4 privat Angebot 1350 test kombi 2003 \n", "\n", " gearbox power_ps model odometer_km registration_month fuel_type \\\n", "0 manuell 158 andere 150000.0 3 lpg \n", "1 automatik 286 7er 150000.0 6 benzin \n", "2 manuell 102 golf 70000.0 7 benzin \n", "3 automatik 71 fortwo 70000.0 6 benzin \n", "4 manuell 0 focus 150000.0 7 benzin \n", "\n", " brand unrepaired_damage ad_created num_photos postal_code \\\n", "0 peugeot nein 2016-03-26 00:00:00 0 79588 \n", "1 bmw nein 2016-04-04 00:00:00 0 71034 \n", "2 volkswagen nein 2016-03-26 00:00:00 0 35394 \n", "3 smart nein 2016-03-12 00:00:00 0 33729 \n", "4 ford nein 2016-04-01 00:00:00 0 39218 \n", "\n", " last_seen \n", "0 2016-04-06 06:45:54 \n", "1 2016-04-06 14:45:08 \n", "2 2016-04-06 20:15:37 \n", "3 2016-03-15 03:16:28 \n", "4 2016-04-01 14:38:50 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos[\"price\"] = (autos[\"price\"].str.replace(\"$\",\"\")\n", " .str.replace(\",\",\"\")\n", " .astype(int) ) \n", "autos[\"odometer\"] = (autos[\"odometer\"].str.replace(\"km\",\"\")\n", " .str.replace(\",\",\"\")\n", " .astype(float) ) \n", "#autos[\"odometer\"] = autos[\"odometer\"].str.replace(\"km\",\"\").astype(float)\n", "autos.rename(columns= {\"odometer\":\"odometer_km\"}, inplace = True)\n", "autos.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's do a deep analysis on *price* and *odometer_km*" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2357,)\n", "count 5.000000e+04\n", "mean 9.840044e+03\n", "std 4.811044e+05\n", "min 0.000000e+00\n", "25% 1.100000e+03\n", "50% 2.950000e+03\n", "75% 7.200000e+03\n", "max 1.000000e+08\n", "Name: price, dtype: float64\n", "\n", "\n", "0 1421\n", "500 781\n", "1500 734\n", "2500 643\n", "1000 639\n", " ... \n", "20790 1\n", "8970 1\n", "846 1\n", "2895 1\n", "33980 1\n", "Name: price, Length: 2357, dtype: int64\n" ] } ], "source": [ "print(autos[\"price\"].unique().shape) #to see how many unique values\n", "print(autos[\"price\"].describe() ) #to view min/max/median/mean etc \n", "print(\"\\n\")\n", "print(autos[\"price\"].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We noticed there are 1421 cars with 0 that is error data so we should remove these data. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1421\n", "1 156\n", "2 3\n", "3 1\n", "5 2\n", "8 1\n", "9 1\n", "10 7\n", "11 2\n", "12 3\n", "13 2\n", "14 1\n", "15 2\n", "17 3\n", "18 1\n", "20 4\n", "25 5\n", "29 1\n", "30 7\n", "35 1\n", "Name: price, dtype: int64\n", "\n", "\n", "99999999 1\n", "27322222 1\n", "12345678 3\n", "11111111 2\n", "10000000 1\n", "3890000 1\n", "1300000 1\n", "1234566 1\n", "999999 2\n", "999990 1\n", "350000 1\n", "345000 1\n", "299000 1\n", "295000 1\n", "265000 1\n", "259000 1\n", "250000 1\n", "220000 1\n", "198000 1\n", "197000 1\n", "Name: price, dtype: int64\n" ] } ], "source": [ "print(autos[\"price\"].value_counts().sort_index(ascending = True).head(20))\n", "print(\"\\n\")\n", "print(autos[\"price\"].value_counts().sort_index(ascending = False).head(20))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We noticed there are very high price who reach over the 1 million $. \n", "\n", "So will keep the price between 1$ and 350000$ " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(13,)\n", "count 48565.000000\n", "mean 125770.101925\n", "std 39788.636804\n", "min 5000.000000\n", "25% 125000.000000\n", "50% 150000.000000\n", "75% 150000.000000\n", "max 150000.000000\n", "Name: odometer_km, dtype: float64\n", "\n", "\n", "150000.0 31414\n", "125000.0 5057\n", "100000.0 2115\n", "90000.0 1734\n", "80000.0 1415\n", "70000.0 1217\n", "60000.0 1155\n", "50000.0 1012\n", "5000.0 836\n", "40000.0 815\n", "30000.0 780\n", "20000.0 762\n", "10000.0 253\n", "Name: odometer_km, dtype: int64\n" ] } ], "source": [ "autos = autos[autos[\"price\"].between(1,350000)]\n", "print(autos[\"odometer_km\"].unique().shape) #to see how many unique values\n", "print(autos[\"odometer_km\"].describe() ) #to view min/max/median/mean etc\n", "print(\"\\n\")\n", "print(autos[\"odometer_km\"].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For *odometer_km* we didn't find there are outlier just there are more high milieage than low milieage .\n", "\n", "#### Date columns\n", "There are 5 columns that should represent date values:\n", "- `date_crawled`: added by the crawler\n", "- `last_seen`: added by the crawler\n", "- `ad_created`: from the website\n", "- `registration_month`: from the website\n", "- `registration_year`: from the website" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2016-03-05 0.025327\n", "2016-03-06 0.014043\n", "2016-03-07 0.036014\n", "2016-03-08 0.033296\n", "2016-03-09 0.033090\n", "2016-03-10 0.032184\n", "2016-03-11 0.032575\n", "2016-03-12 0.036920\n", "2016-03-13 0.015670\n", "2016-03-14 0.036549\n", "2016-03-15 0.034284\n", "2016-03-16 0.029610\n", "2016-03-17 0.031628\n", "2016-03-18 0.012911\n", "2016-03-19 0.034778\n", "2016-03-20 0.037887\n", "2016-03-21 0.037373\n", "2016-03-22 0.032987\n", "2016-03-23 0.032225\n", "2016-03-24 0.029342\n", "2016-03-25 0.031607\n", "2016-03-26 0.032204\n", "2016-03-27 0.031092\n", "2016-03-28 0.034860\n", "2016-03-29 0.034099\n", "2016-03-30 0.033687\n", "2016-03-31 0.031834\n", "2016-04-01 0.033687\n", "2016-04-02 0.035478\n", "2016-04-03 0.038608\n", "2016-04-04 0.036487\n", "2016-04-05 0.013096\n", "2016-04-06 0.003171\n", "2016-04-07 0.001400\n", "Name: date_crawled, dtype: float64\n" ] } ], "source": [ "print(autos['date_crawled'].str[:10].value_counts(normalize= True , dropna = False).sort_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The distribution of values in the `date_crawled` for one month (From March to April), show the add on each day is almost same." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2015-06-11 0.000021\n", "2015-08-10 0.000021\n", "2015-09-09 0.000021\n", "2015-11-10 0.000021\n", "2015-12-05 0.000021\n", " ... \n", "2016-04-03 0.038855\n", "2016-04-04 0.036858\n", "2016-04-05 0.011819\n", "2016-04-06 0.003253\n", "2016-04-07 0.001256\n", "Name: ad_created, Length: 76, dtype: float64\n" ] } ], "source": [ "print(autos['ad_created'].str[:10].value_counts(normalize= True , dropna = False).sort_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a lot of date (76 different dates),most of them have one or two months . But since 2016 ,the ad_create became more frequent." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2016-03-05 0.001071\n", "2016-03-06 0.004324\n", "2016-03-07 0.005395\n", "2016-03-08 0.007413\n", "2016-03-09 0.009595\n", "2016-03-10 0.010666\n", "2016-03-11 0.012375\n", "2016-03-12 0.023783\n", "2016-03-13 0.008895\n", "2016-03-14 0.012602\n", "2016-03-15 0.015876\n", "2016-03-16 0.016452\n", "2016-03-17 0.028086\n", "2016-03-18 0.007351\n", "2016-03-19 0.015834\n", "2016-03-20 0.020653\n", "2016-03-21 0.020632\n", "2016-03-22 0.021373\n", "2016-03-23 0.018532\n", "2016-03-24 0.019767\n", "2016-03-25 0.019211\n", "2016-03-26 0.016802\n", "2016-03-27 0.015649\n", "2016-03-28 0.020859\n", "2016-03-29 0.022341\n", "2016-03-30 0.024771\n", "2016-03-31 0.023783\n", "2016-04-01 0.022794\n", "2016-04-02 0.024915\n", "2016-04-03 0.025203\n", "2016-04-04 0.024483\n", "2016-04-05 0.124761\n", "2016-04-06 0.221806\n", "2016-04-07 0.131947\n", "Name: last_seen, dtype: float64\n" ] } ], "source": [ "print(autos['last_seen'].str[:10].value_counts(normalize= True , dropna = False).sort_index())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The crawler recorded the date it last saw any listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.\n", "The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "count 48565.000000\n", "mean 2004.755421\n", "std 88.643887\n", "min 1000.000000\n", "25% 1999.000000\n", "50% 2004.000000\n", "75% 2008.000000\n", "max 9999.000000\n", "Name: registration_year, dtype: float64\n" ] } ], "source": [ "print(autos['registration_year'].describe())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can noticed, there is some rows who don't have a `registration_year` and have error values because the min is 1000 and max 9999. \n", "#### Exploring `registration_year`\n", "First, let's choose the period we'll use to analyse `registration_year`. Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.\n", "\n", "Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.038793369710697" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(~ autos['registration_year'].between(1900,2016)).sum() / autos.shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see it's less than 4% of our data so we can remove them." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2000 0.067608\n", "2005 0.062895\n", "1999 0.062060\n", "2004 0.057904\n", "2003 0.057818\n", " ... \n", "1938 0.000021\n", "1948 0.000021\n", "1927 0.000021\n", "1931 0.000021\n", "1952 0.000021\n", "Name: registration_year, Length: 78, dtype: float64\n" ] } ], "source": [ "autos = autos[autos['registration_year'].between(1900,2016)]\n", "print(autos['registration_year'].value_counts(normalize = True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It show , most of the cars were registered in 2000 . \n", "\n", "### Exploring `brand` column " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "volkswagen 0.211264\n", "bmw 0.110045\n", "opel 0.107581\n", "mercedes_benz 0.096463\n", "audi 0.086566\n", "ford 0.069900\n", "renault 0.047150\n", "peugeot 0.029841\n", "fiat 0.025642\n", "seat 0.018273\n", "skoda 0.016409\n", "nissan 0.015274\n", "mazda 0.015188\n", "smart 0.014160\n", "citroen 0.014010\n", "toyota 0.012703\n", "hyundai 0.010025\n", "sonstige_autos 0.009811\n", "volvo 0.009147\n", "mini 0.008762\n", "mitsubishi 0.008226\n", "honda 0.007840\n", "kia 0.007069\n", "alfa_romeo 0.006641\n", "porsche 0.006127\n", "suzuki 0.005934\n", "chevrolet 0.005698\n", "chrysler 0.003513\n", "dacia 0.002635\n", "daihatsu 0.002506\n", "jeep 0.002271\n", "subaru 0.002142\n", "land_rover 0.002099\n", "saab 0.001649\n", "jaguar 0.001564\n", "daewoo 0.001500\n", "trabant 0.001392\n", "rover 0.001328\n", "lancia 0.001071\n", "lada 0.000578\n", "Name: brand, dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "autos['brand'].value_counts(normalize = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the result, volkswagen is the most popular brand without any discussion. We have a lot of brand in our result let's limit to know the top 10 . " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',\n", " 'peugeot', 'fiat', 'seat'],\n", " dtype='object')\n", "{'volkswagen': 5402, 'bmw': 8332, 'opel': 2975, 'mercedes_benz': 8628, 'audi': 9336, 'ford': 3749, 'renault': 2474, 'peugeot': 3094, 'fiat': 2813, 'seat': 4397}\n" ] } ], "source": [ "top_10_brand_mean = {}\n", "brands = autos['brand'].value_counts(normalize = True).head(10).index\n", "##brands = autos['brand'].sort_index()\n", "print(brands)\n", "for brand in brands :\n", " only_b = autos[autos['brand'] == brand]\n", " mean = only_b['price'].mean()\n", " top_10_brand_mean[brand] = int(mean)\n", "print(top_10_brand_mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the Top 10 we can see , **audi** it's the more expensive after came **mercedes_benz and bmw** . volskswagen is the fourth one so not too expensine and not very cheap, that is can be the reason for his *popularity* .\n", "\n", "### Exploring mileage" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'volkswagen': 128707, 'bmw': 132572, 'opel': 129310, 'mercedes_benz': 130788, 'audi': 129157, 'ford': 124266, 'renault': 128071, 'peugeot': 127153, 'fiat': 117121, 'seat': 121131}\n" ] } ], "source": [ "top_10_brand_mileage_mean = {}\n", "for brand in brands :\n", " only_b = autos[autos['brand'] == brand]\n", " mean = only_b['odometer_km'].mean()\n", " top_10_brand_mileage_mean[brand] = int(mean)\n", "print(top_10_brand_mileage_mean)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " mean_mileage mean_price\n", "bmw 132572 8332\n", "mercedes_benz 130788 8628\n", "opel 129310 2975\n", "audi 129157 9336\n", "volkswagen 128707 5402\n", "renault 128071 2474\n", "peugeot 127153 3094\n", "ford 124266 3749\n", "seat 121131 4397\n", "fiat 117121 2813\n" ] } ], "source": [ "serie_price = pd.Series(top_10_brand_mean).sort_values()\n", "serie_mileage = pd.Series(top_10_brand_mileage_mean).sort_values(ascending= False)\n", "price_mileage = pd.DataFrame(serie_mileage,columns=['mean_mileage'])\n", "price_mileage['mean_price'] = serie_price\n", "print(price_mileage)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Analyzing the aggregate data, we can conclure :\n", "- for the expensive cars **bmw and mercedes_benz** are a good choice \n", "- for the cheap cars it's **opel** because it have a good mileage even more than *audi*\n", "- for the cars between cheap and expensive it's still **volkswagen**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "data_science", "language": "python", "name": "data_science" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }