{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Exploring E-bay Car Sales Data\n",
"- This Data is taken from [Kaggle Competitions](https://www.kaggle.com/orgesleka/used-cars-database/data)\n",
"\n",
"### Data Dictionary\n",
"\n",
"> dateCrawled - When this ad was first crawled. All field-values are taken from this date. \n",
"> name - Name of the car. \n",
"> seller - Whether the seller is private or a dealer. \n",
"> offerType - The type of listing \n",
"> price - The price on the ad to sell the car. \n",
"> abtest - Whether the listing is included in an A/B test. \n",
"> vehicleType - The vehicle Type. \n",
"> yearOfRegistration - The year in which which year the car was first registered. \n",
"> gearbox - The transmission type. \n",
"> powerPS - The power of the car in PS. \n",
"> model - The car model name. \n",
"> kilometer - How many kilometers the car has driven. \n",
"> monthOfRegistration - The month in which which year the car was first registered. \n",
"> fuelType - What type of fuel the car uses.\n",
"> brand- The brand of the car. \n",
"> notRepairedDamage - If the car has a damage which is not yet repaired. \n",
"> dateCreated - The date on which the eBay listing was created. \n",
"> nrOfPictures - The number of pictures in the ad. \n",
"> postalCode - The postal code for the location of the vehicle. \n",
"> lastSeenOnline - When the crawler saw this ad last online. \n",
"\n",
"### Aim\n",
"We aim to clean the data and analyze the included used car listings usinfg `pandas` and `matplotlib`\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Introduction to Data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" date_crawled name \\\n",
"0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD \n",
"1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik \n",
"2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United \n",
"3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... \n",
"4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... \n",
"\n",
" seller offer_type price abtest vehicle_type registration_year \\\n",
"0 privat Angebot $5,000 control bus 2004 \n",
"1 privat Angebot $8,500 control limousine 1997 \n",
"2 privat Angebot $8,990 test limousine 2009 \n",
"3 privat Angebot $4,350 control kleinwagen 2007 \n",
"4 privat Angebot $1,350 test kombi 2003 \n",
"\n",
" gearbox power_ps model odometer registration_month fuel_type \\\n",
"0 manuell 158 andere 150,000km 3 lpg \n",
"1 automatik 286 7er 150,000km 6 benzin \n",
"2 manuell 102 golf 70,000km 7 benzin \n",
"3 automatik 71 fortwo 70,000km 6 benzin \n",
"4 manuell 0 focus 150,000km 7 benzin \n",
"\n",
" brand unrepaired_damage ad_created nr_of_pictures \\\n",
"0 peugeot nein 2016-03-26 00:00:00 0 \n",
"1 bmw nein 2016-04-04 00:00:00 0 \n",
"2 volkswagen nein 2016-03-26 00:00:00 0 \n",
"3 smart nein 2016-03-12 00:00:00 0 \n",
"4 ford nein 2016-04-01 00:00:00 0 \n",
"\n",
" postal_code last_seen \n",
"0 79588 2016-04-06 06:45:54 \n",
"1 71034 2016-04-06 14:45:08 \n",
"2 35394 2016-04-06 20:15:37 \n",
"3 33729 2016-03-15 03:16:28 \n",
"4 39218 2016-04-01 14:38:50 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"\n",
"def clean_col(col):\n",
" col.strip()\n",
" col = col.replace(\"yearOfRegistration\",\n",
" \"registration_year\")\n",
" col = col.replace(\"monthOfRegistration\", \n",
" \"registration_month\")\n",
" col = col.replace(\"notRepairedDamage\", \n",
" \"unrepaired_damage\")\n",
" col = col.replace(\"dateCreated\", \n",
" \"ad_created\")\n",
" return re.sub('([a-z0-9])([A-Z])', r'\\1_\\2',col).lower()\n",
" \n",
"autos.columns = [clean_col(c) for c in autos.columns]\n",
"print(autos.columns)\n",
"autos.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***Changed Names for all the columnms from camelCase to snake_case. For e.g. 'nrOfPictures' became 'nr_of_pictures'***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Intitial Exploration and Cleaning\n",
"\n",
"Some other cleaning tasks could be:\n",
"- We will look for Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. \n",
"- Examples of numeric data stored as text which can be cleaned and converted."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
date_crawled
\n",
"
name
\n",
"
seller
\n",
"
offer_type
\n",
"
price
\n",
"
abtest
\n",
"
vehicle_type
\n",
"
registration_year
\n",
"
gearbox
\n",
"
power_ps
\n",
"
model
\n",
"
odometer
\n",
"
registration_month
\n",
"
fuel_type
\n",
"
brand
\n",
"
unrepaired_damage
\n",
"
ad_created
\n",
"
nr_of_pictures
\n",
"
postal_code
\n",
"
last_seen
\n",
"
\n",
" \n",
" \n",
"
\n",
"
count
\n",
"
50000
\n",
"
50000
\n",
"
50000
\n",
"
50000
\n",
"
50000
\n",
"
50000
\n",
"
44905
\n",
"
50000.000000
\n",
"
47320
\n",
"
50000.000000
\n",
"
47242
\n",
"
50000
\n",
"
50000.000000
\n",
"
45518
\n",
"
50000
\n",
"
40171
\n",
"
50000
\n",
"
50000.0
\n",
"
50000.000000
\n",
"
50000
\n",
"
\n",
"
\n",
"
unique
\n",
"
48213
\n",
"
38754
\n",
"
2
\n",
"
2
\n",
"
2357
\n",
"
2
\n",
"
8
\n",
"
NaN
\n",
"
2
\n",
"
NaN
\n",
"
245
\n",
"
13
\n",
"
NaN
\n",
"
7
\n",
"
40
\n",
"
2
\n",
"
76
\n",
"
NaN
\n",
"
NaN
\n",
"
39481
\n",
"
\n",
"
\n",
"
top
\n",
"
2016-03-08 10:40:35
\n",
"
Ford_Fiesta
\n",
"
privat
\n",
"
Angebot
\n",
"
$0
\n",
"
test
\n",
"
limousine
\n",
"
NaN
\n",
"
manuell
\n",
"
NaN
\n",
"
golf
\n",
"
150,000km
\n",
"
NaN
\n",
"
benzin
\n",
"
volkswagen
\n",
"
nein
\n",
"
2016-04-03 00:00:00
\n",
"
NaN
\n",
"
NaN
\n",
"
2016-04-07 06:17:27
\n",
"
\n",
"
\n",
"
freq
\n",
"
3
\n",
"
78
\n",
"
49999
\n",
"
49999
\n",
"
1421
\n",
"
25756
\n",
"
12859
\n",
"
NaN
\n",
"
36993
\n",
"
NaN
\n",
"
4024
\n",
"
32424
\n",
"
NaN
\n",
"
30107
\n",
"
10687
\n",
"
35232
\n",
"
1946
\n",
"
NaN
\n",
"
NaN
\n",
"
8
\n",
"
\n",
"
\n",
"
mean
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
2005.073280
\n",
"
NaN
\n",
"
116.355920
\n",
"
NaN
\n",
"
NaN
\n",
"
5.723360
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
0.0
\n",
"
50813.627300
\n",
"
NaN
\n",
"
\n",
"
\n",
"
std
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
105.712813
\n",
"
NaN
\n",
"
209.216627
\n",
"
NaN
\n",
"
NaN
\n",
"
3.711984
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
0.0
\n",
"
25779.747957
\n",
"
NaN
\n",
"
\n",
"
\n",
"
min
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
1000.000000
\n",
"
NaN
\n",
"
0.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
0.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
0.0
\n",
"
1067.000000
\n",
"
NaN
\n",
"
\n",
"
\n",
"
25%
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
1999.000000
\n",
"
NaN
\n",
"
70.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
3.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
0.0
\n",
"
30451.000000
\n",
"
NaN
\n",
"
\n",
"
\n",
"
50%
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
2003.000000
\n",
"
NaN
\n",
"
105.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
6.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
0.0
\n",
"
49577.000000
\n",
"
NaN
\n",
"
\n",
"
\n",
"
75%
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
2008.000000
\n",
"
NaN
\n",
"
150.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
9.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
0.0
\n",
"
71540.000000
\n",
"
NaN
\n",
"
\n",
"
\n",
"
max
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
9999.000000
\n",
"
NaN
\n",
"
17700.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
12.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
0.0
\n",
"
99998.000000
\n",
"
NaN
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date_crawled name seller offer_type price abtest \\\n",
"count 50000 50000 50000 50000 50000 50000 \n",
"unique 48213 38754 2 2 2357 2 \n",
"top 2016-03-08 10:40:35 Ford_Fiesta privat Angebot $0 test \n",
"freq 3 78 49999 49999 1421 25756 \n",
"mean NaN NaN NaN NaN NaN NaN \n",
"std NaN NaN NaN NaN NaN NaN \n",
"min NaN NaN NaN NaN NaN NaN \n",
"25% NaN NaN NaN NaN NaN NaN \n",
"50% NaN NaN NaN NaN NaN NaN \n",
"75% NaN NaN NaN NaN NaN NaN \n",
"max NaN NaN NaN NaN NaN NaN \n",
"\n",
" vehicle_type registration_year gearbox power_ps model \\\n",
"count 44905 50000.000000 47320 50000.000000 47242 \n",
"unique 8 NaN 2 NaN 245 \n",
"top limousine NaN manuell NaN golf \n",
"freq 12859 NaN 36993 NaN 4024 \n",
"mean NaN 2005.073280 NaN 116.355920 NaN \n",
"std NaN 105.712813 NaN 209.216627 NaN \n",
"min NaN 1000.000000 NaN 0.000000 NaN \n",
"25% NaN 1999.000000 NaN 70.000000 NaN \n",
"50% NaN 2003.000000 NaN 105.000000 NaN \n",
"75% NaN 2008.000000 NaN 150.000000 NaN \n",
"max NaN 9999.000000 NaN 17700.000000 NaN \n",
"\n",
" odometer registration_month fuel_type brand unrepaired_damage \\\n",
"count 50000 50000.000000 45518 50000 40171 \n",
"unique 13 NaN 7 40 2 \n",
"top 150,000km NaN benzin volkswagen nein \n",
"freq 32424 NaN 30107 10687 35232 \n",
"mean NaN 5.723360 NaN NaN NaN \n",
"std NaN 3.711984 NaN NaN NaN \n",
"min NaN 0.000000 NaN NaN NaN \n",
"25% NaN 3.000000 NaN NaN NaN \n",
"50% NaN 6.000000 NaN NaN NaN \n",
"75% NaN 9.000000 NaN NaN NaN \n",
"max NaN 12.000000 NaN NaN NaN \n",
"\n",
" ad_created nr_of_pictures postal_code last_seen \n",
"count 50000 50000.0 50000.000000 50000 \n",
"unique 76 NaN NaN 39481 \n",
"top 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27 \n",
"freq 1946 NaN NaN 8 \n",
"mean NaN 0.0 50813.627300 NaN \n",
"std NaN 0.0 25779.747957 NaN \n",
"min NaN 0.0 1067.000000 NaN \n",
"25% NaN 0.0 30451.000000 NaN \n",
"50% NaN 0.0 49577.000000 NaN \n",
"75% NaN 0.0 71540.000000 NaN \n",
"max NaN 0.0 99998.000000 NaN "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos.describe(include='all')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### _Key Observations:_\n",
"- **nr_of_pictures**, **seller** and **offer_type** columns have mostly a single value. And they should be dropped.\n",
"- Columns **price** and **odometer** are shown as text (Object) type, whereas they should be numeric (int) type and should be converted."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 5000\n",
"1 8500\n",
"2 8990\n",
"3 4350\n",
"4 1350\n",
"Name: price, dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos[\"price\"] = autos[\"price\"].replace({'\\$':'',',':''}, regex=True).astype(int)\n",
"autos[\"price\"].head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 150000\n",
"1 150000\n",
"2 70000\n",
"3 70000\n",
"4 150000\n",
"Name: odometer_km, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos[\"odometer\"] = autos[\"odometer\"].replace({'km':'',',':''}, regex=True).astype(int)\n",
"autos.rename(columns={'odometer':'odometer_km'}, inplace=True)\n",
"autos['odometer_km'].head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
date_crawled
\n",
"
name
\n",
"
price
\n",
"
abtest
\n",
"
vehicle_type
\n",
"
registration_year
\n",
"
gearbox
\n",
"
power_ps
\n",
"
model
\n",
"
odometer_km
\n",
"
registration_month
\n",
"
fuel_type
\n",
"
brand
\n",
"
unrepaired_damage
\n",
"
ad_created
\n",
"
postal_code
\n",
"
last_seen
\n",
"
\n",
" \n",
" \n",
"
\n",
"
count
\n",
"
50000
\n",
"
50000
\n",
"
5.000000e+04
\n",
"
50000
\n",
"
44905
\n",
"
50000.000000
\n",
"
47320
\n",
"
50000.000000
\n",
"
47242
\n",
"
50000.000000
\n",
"
50000.000000
\n",
"
45518
\n",
"
50000
\n",
"
40171
\n",
"
50000
\n",
"
50000.000000
\n",
"
50000
\n",
"
\n",
"
\n",
"
unique
\n",
"
48213
\n",
"
38754
\n",
"
NaN
\n",
"
2
\n",
"
8
\n",
"
NaN
\n",
"
2
\n",
"
NaN
\n",
"
245
\n",
"
NaN
\n",
"
NaN
\n",
"
7
\n",
"
40
\n",
"
2
\n",
"
76
\n",
"
NaN
\n",
"
39481
\n",
"
\n",
"
\n",
"
top
\n",
"
2016-03-08 10:40:35
\n",
"
Ford_Fiesta
\n",
"
NaN
\n",
"
test
\n",
"
limousine
\n",
"
NaN
\n",
"
manuell
\n",
"
NaN
\n",
"
golf
\n",
"
NaN
\n",
"
NaN
\n",
"
benzin
\n",
"
volkswagen
\n",
"
nein
\n",
"
2016-04-03 00:00:00
\n",
"
NaN
\n",
"
2016-04-07 06:17:27
\n",
"
\n",
"
\n",
"
freq
\n",
"
3
\n",
"
78
\n",
"
NaN
\n",
"
25756
\n",
"
12859
\n",
"
NaN
\n",
"
36993
\n",
"
NaN
\n",
"
4024
\n",
"
NaN
\n",
"
NaN
\n",
"
30107
\n",
"
10687
\n",
"
35232
\n",
"
1946
\n",
"
NaN
\n",
"
8
\n",
"
\n",
"
\n",
"
mean
\n",
"
NaN
\n",
"
NaN
\n",
"
9.840044e+03
\n",
"
NaN
\n",
"
NaN
\n",
"
2005.073280
\n",
"
NaN
\n",
"
116.355920
\n",
"
NaN
\n",
"
125732.700000
\n",
"
5.723360
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
50813.627300
\n",
"
NaN
\n",
"
\n",
"
\n",
"
std
\n",
"
NaN
\n",
"
NaN
\n",
"
4.811044e+05
\n",
"
NaN
\n",
"
NaN
\n",
"
105.712813
\n",
"
NaN
\n",
"
209.216627
\n",
"
NaN
\n",
"
40042.211706
\n",
"
3.711984
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
25779.747957
\n",
"
NaN
\n",
"
\n",
"
\n",
"
min
\n",
"
NaN
\n",
"
NaN
\n",
"
0.000000e+00
\n",
"
NaN
\n",
"
NaN
\n",
"
1000.000000
\n",
"
NaN
\n",
"
0.000000
\n",
"
NaN
\n",
"
5000.000000
\n",
"
0.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
1067.000000
\n",
"
NaN
\n",
"
\n",
"
\n",
"
25%
\n",
"
NaN
\n",
"
NaN
\n",
"
1.100000e+03
\n",
"
NaN
\n",
"
NaN
\n",
"
1999.000000
\n",
"
NaN
\n",
"
70.000000
\n",
"
NaN
\n",
"
125000.000000
\n",
"
3.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
30451.000000
\n",
"
NaN
\n",
"
\n",
"
\n",
"
50%
\n",
"
NaN
\n",
"
NaN
\n",
"
2.950000e+03
\n",
"
NaN
\n",
"
NaN
\n",
"
2003.000000
\n",
"
NaN
\n",
"
105.000000
\n",
"
NaN
\n",
"
150000.000000
\n",
"
6.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
49577.000000
\n",
"
NaN
\n",
"
\n",
"
\n",
"
75%
\n",
"
NaN
\n",
"
NaN
\n",
"
7.200000e+03
\n",
"
NaN
\n",
"
NaN
\n",
"
2008.000000
\n",
"
NaN
\n",
"
150.000000
\n",
"
NaN
\n",
"
150000.000000
\n",
"
9.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
71540.000000
\n",
"
NaN
\n",
"
\n",
"
\n",
"
max
\n",
"
NaN
\n",
"
NaN
\n",
"
1.000000e+08
\n",
"
NaN
\n",
"
NaN
\n",
"
9999.000000
\n",
"
NaN
\n",
"
17700.000000
\n",
"
NaN
\n",
"
150000.000000
\n",
"
12.000000
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
NaN
\n",
"
99998.000000
\n",
"
NaN
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date_crawled name price abtest vehicle_type \\\n",
"count 50000 50000 5.000000e+04 50000 44905 \n",
"unique 48213 38754 NaN 2 8 \n",
"top 2016-03-08 10:40:35 Ford_Fiesta NaN test limousine \n",
"freq 3 78 NaN 25756 12859 \n",
"mean NaN NaN 9.840044e+03 NaN NaN \n",
"std NaN NaN 4.811044e+05 NaN NaN \n",
"min NaN NaN 0.000000e+00 NaN NaN \n",
"25% NaN NaN 1.100000e+03 NaN NaN \n",
"50% NaN NaN 2.950000e+03 NaN NaN \n",
"75% NaN NaN 7.200000e+03 NaN NaN \n",
"max NaN NaN 1.000000e+08 NaN NaN \n",
"\n",
" registration_year gearbox power_ps model odometer_km \\\n",
"count 50000.000000 47320 50000.000000 47242 50000.000000 \n",
"unique NaN 2 NaN 245 NaN \n",
"top NaN manuell NaN golf NaN \n",
"freq NaN 36993 NaN 4024 NaN \n",
"mean 2005.073280 NaN 116.355920 NaN 125732.700000 \n",
"std 105.712813 NaN 209.216627 NaN 40042.211706 \n",
"min 1000.000000 NaN 0.000000 NaN 5000.000000 \n",
"25% 1999.000000 NaN 70.000000 NaN 125000.000000 \n",
"50% 2003.000000 NaN 105.000000 NaN 150000.000000 \n",
"75% 2008.000000 NaN 150.000000 NaN 150000.000000 \n",
"max 9999.000000 NaN 17700.000000 NaN 150000.000000 \n",
"\n",
" registration_month fuel_type brand unrepaired_damage \\\n",
"count 50000.000000 45518 50000 40171 \n",
"unique NaN 7 40 2 \n",
"top NaN benzin volkswagen nein \n",
"freq NaN 30107 10687 35232 \n",
"mean 5.723360 NaN NaN NaN \n",
"std 3.711984 NaN NaN NaN \n",
"min 0.000000 NaN NaN NaN \n",
"25% 3.000000 NaN NaN NaN \n",
"50% 6.000000 NaN NaN NaN \n",
"75% 9.000000 NaN NaN NaN \n",
"max 12.000000 NaN NaN NaN \n",
"\n",
" ad_created postal_code last_seen \n",
"count 50000 50000.000000 50000 \n",
"unique 76 NaN 39481 \n",
"top 2016-04-03 00:00:00 NaN 2016-04-07 06:17:27 \n",
"freq 1946 NaN 8 \n",
"mean NaN 50813.627300 NaN \n",
"std NaN 25779.747957 NaN \n",
"min NaN 1067.000000 NaN \n",
"25% NaN 30451.000000 NaN \n",
"50% NaN 49577.000000 NaN \n",
"75% NaN 71540.000000 NaN \n",
"max NaN 99998.000000 NaN "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos.drop(columns = ['nr_of_pictures','seller','offer_type'], inplace=True)\n",
"autos.describe(include='all')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exploring the Odometer and Price Columns\n",
"\n",
"- We continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the odometer_km and price columns. \n",
"- We will analyze the columns using minimum and maximum values. And, look for any values that look unrealistically high or low (outliers) that we might want to remove."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2357,)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos['price'].unique().shape"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 5.000000e+04\n",
"mean 9.840044e+03\n",
"std 4.811044e+05\n",
"min 0.000000e+00\n",
"25% 1.100000e+03\n",
"50% 2.950000e+03\n",
"75% 7.200000e+03\n",
"max 1.000000e+08\n",
"Name: price, dtype: float64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos['price'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1421\n",
"1 156\n",
"2 3\n",
"3 1\n",
"5 2\n",
"8 1\n",
"9 1\n",
"10 7\n",
"11 2\n",
"12 3\n",
"Name: price, dtype: int64\n"
]
},
{
"data": {
"text/plain": [
"99999999 1\n",
"27322222 1\n",
"12345678 3\n",
"11111111 2\n",
"10000000 1\n",
"3890000 1\n",
"1300000 1\n",
"1234566 1\n",
"999999 2\n",
"999990 1\n",
"Name: price, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(autos['price'].value_counts().sort_index().head(10))\n",
"autos['price'].value_counts().sort_index(ascending=False).head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*We can observe outliers (below 100 or above 1000000), in `price` column and we will remove them.*"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(2312,)\n"
]
},
{
"data": {
"text/plain": [
"100 134\n",
"110 3\n",
"111 2\n",
"115 2\n",
"117 1\n",
"120 39\n",
"122 1\n",
"125 8\n",
"129 1\n",
"130 15\n",
"135 1\n",
"139 1\n",
"140 9\n",
"145 2\n",
"149 7\n",
"150 224\n",
"156 2\n",
"160 8\n",
"170 7\n",
"173 1\n",
"175 12\n",
"179 1\n",
"180 35\n",
"185 1\n",
"188 1\n",
"190 16\n",
"193 1\n",
"195 2\n",
"198 1\n",
"199 41\n",
" ... \n",
"120000 2\n",
"128000 1\n",
"129000 1\n",
"130000 1\n",
"135000 1\n",
"137999 1\n",
"139997 1\n",
"145000 1\n",
"151990 1\n",
"155000 1\n",
"163500 1\n",
"163991 1\n",
"169000 1\n",
"169999 1\n",
"175000 1\n",
"180000 1\n",
"190000 1\n",
"194000 1\n",
"197000 1\n",
"198000 1\n",
"220000 1\n",
"250000 1\n",
"259000 1\n",
"265000 1\n",
"295000 1\n",
"299000 1\n",
"345000 1\n",
"350000 1\n",
"999990 1\n",
"999999 2\n",
"Name: price, Length: 2312, dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Removing Outliers\n",
"autos = autos[autos['price'].between(100,1000000)]\n",
"\n",
"print(autos['price'].unique().shape)\n",
"autos['price'].value_counts().sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(13,)\n"
]
},
{
"data": {
"text/plain": [
"count 48227.000000\n",
"mean 125920.127729\n",
"std 39542.413981\n",
"min 5000.000000\n",
"25% 125000.000000\n",
"50% 150000.000000\n",
"75% 150000.000000\n",
"max 150000.000000\n",
"Name: odometer_km, dtype: float64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(autos['odometer_km'].unique().shape)\n",
"autos['odometer_km'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5000 760\n",
"10000 245\n",
"20000 757\n",
"30000 777\n",
"40000 814\n",
"50000 1009\n",
"60000 1153\n",
"70000 1214\n",
"80000 1412\n",
"90000 1733\n",
"100000 2101\n",
"125000 5038\n",
"150000 31214\n",
"Name: odometer_km, dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos['odometer_km'].value_counts().sort_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***We removed outliers from price column and found no outliers in odometer_km column.***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exploring the Date Columns\n",
"\n",
"There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
date_crawled
\n",
"
ad_created
\n",
"
last_seen
\n",
"
registration_month
\n",
"
registration_year
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
2016-03-26 17:47:46
\n",
"
2016-03-26 00:00:00
\n",
"
2016-04-06 06:45:54
\n",
"
3
\n",
"
2004
\n",
"
\n",
"
\n",
"
1
\n",
"
2016-04-04 13:38:56
\n",
"
2016-04-04 00:00:00
\n",
"
2016-04-06 14:45:08
\n",
"
6
\n",
"
1997
\n",
"
\n",
"
\n",
"
2
\n",
"
2016-03-26 18:57:24
\n",
"
2016-03-26 00:00:00
\n",
"
2016-04-06 20:15:37
\n",
"
7
\n",
"
2009
\n",
"
\n",
"
\n",
"
3
\n",
"
2016-03-12 16:58:10
\n",
"
2016-03-12 00:00:00
\n",
"
2016-03-15 03:16:28
\n",
"
6
\n",
"
2007
\n",
"
\n",
"
\n",
"
4
\n",
"
2016-04-01 14:38:50
\n",
"
2016-04-01 00:00:00
\n",
"
2016-04-01 14:38:50
\n",
"
7
\n",
"
2003
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date_crawled ad_created last_seen \\\n",
"0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54 \n",
"1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08 \n",
"2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37 \n",
"3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28 \n",
"4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50 \n",
"\n",
" registration_month registration_year \n",
"0 3 2004 \n",
"1 6 1997 \n",
"2 7 2009 \n",
"3 6 2007 \n",
"4 7 2003 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos[['date_crawled','ad_created','last_seen', 'registration_month', 'registration_year']][0:5]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"date_crawled object\n",
"ad_created object\n",
"last_seen object\n",
"registration_month int64\n",
"registration_year int64\n",
"dtype: object"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"autos[['date_crawled','ad_created','last_seen', 'registration_month', 'registration_year']].dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`date_crawled`, `last_seen`, and `ad_created` columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. \n",
"\n",
"`registration_month` and `registration_year` are represented as numeric values, so we can use methods like Series.describe() and draw graphs to understand the distribution without any extra data processing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cleaning `date_crawled`, `last_seen`, and `ad_created`\n",
"We will clean and analyze the date columns first. We can notice that the first 10 characters represent the day (e.g. 2016-03-12). To understand the date range, we can extract just the date value by doing some string formatting"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"print(autos['ad_created'].unique())\n",
"autos['ad_created'].groupby(autos['ad_created']).count().plot(kind='bar', figsize=(16, 4))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***AD CREATED - The above distribbution look like a left skewed ditribution.Which means that most of the ads were created recently. It could be beacuse the cars that were posted earlier (like a month or two ago) would have got sold, hence their ad was pulled down. Leaving only a very small number of ads (of unsold cars) posted from a long period of time.***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Analyzing `last_seen`**"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[20160406 20160315 20160401 20160323 20160407 20160326 20160316 20160402\n",
" 20160318 20160405 20160317 20160307 20160328 20160312 20160324 20160404\n",
" 20160330 20160331 20160320 20160319 20160403 20160314 20160310 20160327\n",
" 20160322 20160329 20160311 20160325 20160313 20160309 20160321 20160306\n",
" 20160308 20160305]\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA0UAAAD9CAYAAABz9twwAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3XuU5GV95/H3RwZUQO4j6AAOKF4wBsQJaEx2uRhAcIVs5JJjhBhyiAajJjmrmDWB9baYm9FsJIcVFBOjIphARIMIeEk2oMNFRxiRERRGbqMzIIaoIN/9o56Wnp5uprur+lL1e7/OqdNVT/3qXc/0TPXU07+qX6WqkCRJkqSuetxCT0CSJEmSFpKLIkmSJEmd5qJIkiRJUqe5KJIkSZLUaS6KJEmSJHWaiyJJkiRJneaiSJIkSVKnuSiSJEmS1GkuiiRJkiR12pKFnsBs7bLLLrV8+fKFnoYkSZKkReraa6/9XlUt3dx2Q7soWr58OStXrlzoaUiSJElapJJ8Zzrb+fI5SZIkSZ3mokiSJElSp7kokiRJktRpLookSZIkdZqLIkmSJEmd5qJIkiRJUqcN7SG5JUmSJC2s5adfOu1tv33W0XM4k/64p0iSJElSp7kokiRJktRpLookSZIkdZqLIkmSJEmd5qJIkiRJUqe5KJIkSZLUaZtdFCU5L8m9Sb4+bmynJJcnuaV93bGNJ8n7kqxJ8rUkB4y7zclt+1uSnDxu/AVJVrXbvC9JBv2HlCRJkqSpTGdP0YeAIyeMnQ5cUVX7AFe0ywAvBfZpp1OBs6G3iALOAA4CDgTOGFtItW1OHXe7ifclSZIkSXNms4uiqvoisH7C8DHA+e38+cCx48Y/XD1XAzskeQpwBHB5Va2vqg3A5cCR7brtqurfq6qAD49rSZIkSdKcm+17inatqrsA2tcnt/FlwB3jtlvbxh5rfO0k45NKcmqSlUlWrlu3bpZTlyRJkqRHDfpAC5O9H6hmMT6pqjqnqlZU1YqlS5fOcoqSJEmS9KjZLoruaS99o329t42vBfYYt93uwJ2bGd99knFJkiRJmhezXRRdAowdQe5k4OJx4ye1o9C9ELi/vbzuMuDwJDu2AywcDlzWrnsgyQvbUedOGteSJEmSpDm3ZHMbJPkocDCwS5K19I4idxZwQZJTgNuB49rmnwaOAtYADwKvBqiq9UneDnylbfe2qho7eMNr6R3h7onAZ9pJkiRJkubFZhdFVfXrU1x12CTbFnDaFJ3zgPMmGV8J/Nzm5iFJkiRJc2HQB1qQJEmSpKHiokiSJElSp7kokiRJktRpLookSZIkdZqLIkmSJEmd5qJIkiRJUqe5KJIkSZLUaS6KJEmSJHWaiyJJkiRJneaiSJIkSVKnuSiSJEmS1GkuiiRJkiR1mosiSZIkSZ3mokiSJElSp7kokiRJktRpLookSZIkdZqLIkmSJEmd5qJIkiRJUqe5KJIkSZLUaS6KJEmSJHVaX4uiJL+f5MYkX0/y0SRPSLJXkmuS3JLk40m2ats+vl1e065fPq7zljZ+c5Ij+vsjSZIkSdL0zXpRlGQZ8HpgRVX9HLAFcCLwbuA9VbUPsAE4pd3kFGBDVT0DeE/bjiT7tts9FzgSeH+SLWY7L0mSJEmaiX5fPrcEeGKSJcDWwF3AocCF7frzgWPb+WPaZdr1hyVJG/9YVf24qm4D1gAH9jkvSZIkSZqWWS+Kquq7wJ8Dt9NbDN0PXAvcV1UPt83WAsva+WXAHe22D7ftdx4/PsltNpLk1CQrk6xct27dbKcuSZIkST/Tz8vndqS3l2cv4KnANsBLJ9m0xm4yxXVTjW86WHVOVa2oqhVLly6d+aQlSZIkaYJ+Xj73EuC2qlpXVQ8BnwR+EdihvZwOYHfgznZ+LbAHQLt+e2D9+PFJbiNJkiRJc6qfRdHtwAuTbN3eG3QYcBNwFfCKts3JwMXt/CXtMu36K6uq2viJ7eh0ewH7AF/uY16SJEmSNG1LNr/J5KrqmiQXAtcBDwPXA+cAlwIfS/KONnZuu8m5wN8lWUNvD9GJrXNjkgvoLageBk6rqp/Odl6SJEmSNBOzXhQBVNUZwBkThm9lkqPHVdWPgOOm6LwTeGc/c5EkSZKk2ej3kNySJEmSNNRcFEmSJEnqNBdFkiRJkjrNRZEkSZKkTnNRJEmSJKnTXBRJkiRJ6jQXRZIkSZI6zUWRJEmSpE5zUSRJkiSp01wUSZIkSeo0F0WSJEmSOs1FkSRJkqROc1EkSZIkqdNcFEmSJEnqNBdFkiRJkjrNRZEkSZKkTnNRJEmSJKnTXBRJkiRJ6rQlCz0BSZIkSZpo+emXTnvbb591dF/35Z4iSZIkSZ3W16IoyQ5JLkzyjSSrk7woyU5JLk9yS/u6Y9s2Sd6XZE2SryU5YFzn5Lb9LUlO7vcPJUmSJEnT1e+eovcC/1JVzwb2A1YDpwNXVNU+wBXtMsBLgX3a6VTgbIAkOwFnAAcBBwJnjC2kJEmSJGmuzXpRlGQ74L8A5wJU1U+q6j7gGOD8ttn5wLHt/DHAh6vnamCHJE8BjgAur6r1VbUBuBw4crbzkiRJkqSZ6OdAC3sD64APJtkPuBZ4A7BrVd0FUFV3JXly234ZcMe4269tY1ONbyLJqfT2MrHnnnv2MXVJkiSpG3a76oZpb3v3IfvP4UwWr35ePrcEOAA4u6qeD/wHj75UbjKZZKweY3zTwapzqmpFVa1YunTpTOcrSZIkSZvoZ1G0FlhbVde0yxfSWyTd014WR/t677jt9xh3+92BOx9jXJIkSZLm3KwXRVV1N3BHkme1ocOAm4BLgLEjyJ0MXNzOXwKc1I5C90Lg/vYyu8uAw5Ps2A6wcHgbkyRJkqQ51++Ht/4e8JEkWwG3Aq+mt9C6IMkpwO3AcW3bTwNHAWuAB9u2VNX6JG8HvtK2e1tVre9zXpIkSZI0LX0tiqrqBmDFJFcdNsm2BZw2Rec84Lx+5iJJkiRJs9Hv5xRJkiRJ0lBzUSRJkiSp01wUSZIkSeo0F0WSJEmSOs1FkSRJkqROc1EkSZIkqdNcFEmSJEnqNBdFkiRJkjrNRZEkSZKkTnNRJEmSJKnTXBRJkiRJ6jQXRZIkSZI6zUWRJEmSpE5zUSRJkiSp01wUSZIkSeo0F0WSJEmSOs1FkSRJkqROc1EkSZIkqdNcFEmSJEnqNBdFkiRJkjqt70VRki2SXJ/kU+3yXkmuSXJLko8n2aqNP75dXtOuXz6u8ZY2fnOSI/qdkyRJkiRN1yD2FL0BWD3u8ruB91TVPsAG4JQ2fgqwoaqeAbynbUeSfYETgecCRwLvT7LFAOYlSZIkSZvV16Ioye7A0cAH2uUAhwIXtk3OB45t549pl2nXH9a2Pwb4WFX9uKpuA9YAB/YzL0mSJEmarn73FP0V8CbgkXZ5Z+C+qnq4XV4LLGvnlwF3ALTr72/b/2x8kttsJMmpSVYmWblu3bo+py5JkiRJfSyKkrwMuLeqrh0/PMmmtZnrHus2Gw9WnVNVK6pqxdKlS2c0X0mSJEmazJI+bvti4OVJjgKeAGxHb8/RDkmWtL1BuwN3tu3XAnsAa5MsAbYH1o8bHzP+NpIkSZI0p2a9p6iq3lJVu1fVcnoHSriyql4JXAW8om12MnBxO39Ju0y7/sqqqjZ+Yjs63V7APsCXZzsvSZIkSZqJfvYUTeXNwMeSvAO4Hji3jZ8L/F2SNfT2EJ0IUFU3JrkAuAl4GDitqn46B/OSJEmSpE0MZFFUVZ8HPt/O38okR4+rqh8Bx01x+3cC7xzEXCRJkiRpJgbxOUWSJEmSNLRcFEmSJEnqNBdFkiRJkjrNRZEkSZKkTnNRJEmSJKnTXBRJkiRJ6jQXRZIkSZI6zUWRJEmSpE5zUSRJkiSp01wUSZIkSeo0F0WSJEmSOs1FkSRJkqROc1EkSZIkqdNcFEmSJEnqNBdFkiRJkjrNRZEkSZKkTnNRJEmSJKnTXBRJkiRJ6jQXRZIkSZI6zUWRJEmSpE6b9aIoyR5JrkqyOsmNSd7QxndKcnmSW9rXHdt4krwvyZokX0tywLjWyW37W5Kc3P8fS5IkSZKmp589RQ8Df1hVzwFeCJyWZF/gdOCKqtoHuKJdBngpsE87nQqcDb1FFHAGcBBwIHDG2EJKkiRJkubarBdFVXVXVV3Xzj8ArAaWAccA57fNzgeObeePAT5cPVcDOyR5CnAEcHlVra+qDcDlwJGznZckSZIkzcRA3lOUZDnwfOAaYNequgt6CyfgyW2zZcAd4262to1NNT7Z/ZyaZGWSlevWrRvE1CVJkiR1XN+LoiTbAhcBb6yqHzzWppOM1WOMbzpYdU5VraiqFUuXLp35ZCVJkiRpgr4WRUm2pLcg+khVfbIN39NeFkf7em8bXwvsMe7muwN3Psa4JEmSJM25fo4+F+BcYHVV/eW4qy4Bxo4gdzJw8bjxk9pR6F4I3N9eXncZcHiSHdsBFg5vY5IkSZI055b0cdsXA68CViW5oY39EXAWcEGSU4DbgePadZ8GjgLWAA8CrwaoqvVJ3g58pW33tqpa38e8JEmSJGnaZr0oqqp/ZfL3AwEcNsn2BZw2Res84LzZzkWSJEmSZmsgR5+TJEmSpGHlokiSJElSp7kokiRJktRpLookSZIkdVo/R5+TpNF15vYz3P7+uZmHJEmac+4pkiRJktRp7imSJEkaUquf/Zxpb/ucb6yew5ksHmtP/9K0t939rF+eUfvMM8+ck20Brrjy6dPe9rBDvzWjtjbPRZE0QDP9ATjT7SVJmi9/85orp73taX976Izaf3HCy6a97R9+/FMzakuz4aJIkjTS5vK3xtJ0PO/8501721Unr5rDmUiaiosiSdK0+Fvj+eVLaebZTA6u4oFVpJHjokiSpFmay/cXzKXdrrph2tvefcj+czgTSVocPPqcJEmSpE5zT5GkObX89EtntP23zzp6UbSHlUei0kKbyeNypo/JuWxL6jb3FEmSJEnqNPcUSZrR+wvA9xhIkqTR4qJInTOTw/OCh+iVJEkadS6KtCjN5PC80J1D9EqSJGnwXBRp1mbymSUw888t0cZm8pkl4OeWSJIkTZeLohE3kyNRgUejkubDTD7dHvyEe0mS5pqLokXAJ0iSJEnSwlk0i6IkRwLvBbYAPlBVZy3wlDZ25vYz3P7+uZmHJEmSpIFaFJ9TlGQL4G+AlwL7Ar+eZN+FnZUkSZKkLlgUiyLgQGBNVd1aVT8BPgYcs8BzkiRJktQBqaqFngNJXgEcWVW/3S6/Cjioql43YbtTgVPbxWcBN0/zLnYBvjeg6dq2bdu2bdu2bS/Wrm3btjf2tKpaurmNFst7ijLJ2Cartao6BzhnxvFkZVWtmM3EbNu2bdu2bdu2h6Vr27bt2VksL59bC+wx7vLuwJ0LNBdJkiRJHbJYFkVfAfZJsleSrYATgUsWeE6SJEmSOmBRvHyuqh5O8jrgMnqH5D6vqm4c4F3M+CV3tm3btm3btm3bc9gexjnbtj2y7UVxoAVJkiRJWiiL5eVzkiRJkrQgXBRJkiRJ6jQXRZIkSZI6zUWRJEmSpE4buUVRkl0mXP6NJO9LcmqSyT4kdibtX02yUzu/NMmHk6xK8vEku/fZ/sskL+6n8RjtnZL8SZLfTs//TPKpJH+WZMcB9A9J8n+SXJzkoiRnJXnGALpLkvxOkn9J8rUkX03ymSSvSbJlv/3HuN++jmqSZIs277dP/DtN8tY+21sneVOS/5HkCUl+M8klSf40ybb9tKe4v28OqPPz485vmeStbd7vSrJ1n+3XjT3ukzwjyReT3JfkmiTP67P9yfYzZC6+t3snOS/JO5Jsm+T/Jvl6kk8kWd5H93FJfivJpe0xc22SjyU5eABz9jG5advH5KZtH5Obtufscdn6RyQ5u/0dXtzOHzmI9mPc558s5nb7npwy8e8uyW/12U2S45Mc184flt7zzN9NMvDn1UmuHFDH58ebu69RO/pckuuq6oB2/q3ALwP/ALwMWFtVv99H+6aq2red/zhwNfAJ4CXAK6vqV/porwO+AywFPg58tKqun21vQvvTwCpgO+A57fwFwK8A+1XVMX20zwJ2Ba4AjgVuA74J/C7wrqr6RB/tjwL3AefT+4Bf6H2w78nATlV1Qh/tnaa6CvhqVc36QZzkA8DWwJeBVwFfqKo/aNf97N/nLNsXAHcATwSeBaym93f534DdqupVfbQfAMZ+IIz9gNwaeBCoqtquj/b4x+VfADsDH6T3b2bnqjqpj/aNVfXcdv5S4ANV9Y/tycY7q2rWP0yTfBf4d+BQ4HPAR4FLq+ons22Oa3+x9bYHfoPe9+MC4HB6P08OnWX3g/R+lnwOeAXwA+BLwJuBi6vqr/uYs4/JTds+Jjdt+5jctD2Xj8u/Ap4JfJiNH5cnAbdU1Rtm297M/d5eVXsuxnaSdwG/BFxH77H4V2Pf4wE85t8PPBnYit7f4+OBfwaOAu7p5/ud5GsTh+j93d4MUFU/v8mNpt/2+fHmVNVInYDrx52/Dtimnd8SWNVn++Zx56+dcN0Ng5g3sA/wx8CNwDeAM4Bn9tm+oX0N8N0Bz3vVuPNLgH9r53cEvj6o7/ck132zz/ZPgVvpLeLGTmOXf9Jn+2sTvifnAJ+k94Pz+j7b4/8u7+bRX2xk/P3Osv3X9P5T3XXc2G39NMd1xj8ubwC2HOC8xz8uvzLV30U/8waeRO/J9KeBdfSeLB0+wO/J7VNd18+/v3b56vb18cDqQX2vJ7nOx6SPyU3+nfiYnPzPPeDH5aSPvfZ3eUuf7R9McXoAeHgRt1cBS9r5Hdq/k/cM6O9yVfu6JfB9YKt2eQn9P8+8BPh74NnA04Dl9H7p8jTgaX22fX68mdPIvXwOeGKS5yd5AbBFVf0HQFU9RO8/3X58PsnbkjyxnT8Wei8fA+7vs10AVXVLVb29er9lOx54Ar0Hcz8el97L5PYAth3blZxkZ3q/6ejHI+N+w/tUeh++S1Vt4NHfbM7WhrZ7+mf/TttLEE4ANvTZvhU4uKr2Gnfau6r2Au7ps/2z72lVPVxVp9J70nElMJCXfFTvJ8Sn29exy33t9q2q3wPeC3w0yevb931Qu5K3b7vXfw14fHs8DmTewIVJPpRkb+Afk7wxyZ5JXg3c3md77Pv7QFX9XVUdRW9vwDXA6X22H0nyzCS/AGydZAX0Xm5EexzN0kNJnt5aBwA/AaiqH9P/99rH5BR8TG7Ex+Sm5vJx+aMkB04y/gvAj/ps3wfsU1XbTTg9CbhrEbeXVNXDAFV1H729Rdsl+QT9P+8Z6z5Eb9E/9nf5MH0+z6yqlwMX0fvFzX5V9W3goar6TlV9p69Z+/x4Gvc0wBXWYjgBV004PaWN7wys7LO9JXAmvR/qtwOP0PuNxj8Ae/bZ7us3F5tp/zq9JxX3AL9Gb/f954DvAqf22T6B3m7Nz7bvydFtfCnwD322l9PbVbqO3kvyvgnc28b26rN9Gr0fOJNd93t9tv8eOHKS8d+m98Otn/YHgG0nGX868K8D+vfyOOD19F7aceeAmh+ccNq1je8GXDGA/m/Se1L0vfaYvAl4F7B9n90vDuLPP0X7MHoviVhN72UeFwFr2r/xY/roHtoei9+kt5floDa+FPjTPufsY3LTho/Jyfs+Jjduz+Xj8oD2vb6J3v/Fn21/hmuAF/TZfgdw4BTXvXsRtz8F/Ncp7vORPtufmeIxvxvw5QH9W9wG+Et6e47WDqjp8+PNnEbuPUVTSbIFvd+GPTig3vb0fhPx/QH1tq2qHw6iNUV/C3ov63g4yRJgf3ovpev3tzFj7wXYG1hTvd/IDFzbq5Wq+t5c9EdBktQAH9BJngI8v6oG+5sYTam9EXZDVfX1W7v2ptmd5/Lx4mNy83xMDr9BPSZba04fl0l2A5bRe5XG2qq6ey7uZxi0PRZU1X9Oct2yqvruHNznNvReknbvAJv7AS+qqr8dVHOS++j08+ON7msUF0XtL+RIej8cCrgTuGwQT9htz2/7Me7zV6rqctu2bc9vN8l2wNKq+taE8Z+vqolvErZt2/b8tHcDqKq7kyyl9yb6b1TVTf10H6N9c1XdOID2UD43sT067fFG7j1FSU6i9wayg+kdpWcb4BDg2nad7SFpb8a5tm3bnt9ukuPpvcH1oiQ3tvdejPmQbdu2F6T9O/SOyHd1ktfSe+nYy+i9n+uUOWp/cgDtoXxuYnt02pvc16jtKUpyM73X6t43YXxH4JqqeqbtoWlfMtVVwKFVtY1t27bnp9vaNwAvraq70ntj94eBP6qqTya5vqqeb9u27XlvrwIOondY+O8Az2h7dXYErqqq/Rdpe1ifm9gekfZESwYVWkTC5EdyeaRdZ3t42r9M77MiJr6WNMBkR9qxbdv23M55i2rvQ6yqL6d3ZKFPpffhfP3+hs22bduz81D13g/yYJJvVXsvUVVtSLKY28P63MT26LQ3MoqLoncC1yX5LL1juwPsSe+DSt9ue6jaVwMPVtUXJl7RfnNg27bt+esCPJDk6WPviWi/9T4Y+CfgubZt216Q9iNJtqzeoZWPHhtM8gT6f5vEXLaH9bmJ7dFpb2TkXj4HP9uldgTjjsJC7w1Z/X6Ohu15bktaPNI7EtKDVXXLhPEtgeOr6iO2bdue9/aewF1t4TJ+fBnwnKr63GJst85QPjexPTrtje5nFBdFEyXZqarW27Zt23YX28M4Z9u2bXey/fKqmup9kbZtz2275ukDkebrBLyY3oeW3UjvzYGX0/uk9DvoHevdtm3btke2PYxztm3bdifb/32S091j523bnuv2Jvc1yNhiOAFfBp4HvIjeJ2n/Uhs/APg327Zt2x7l9jDO2bZt251sP0zvEN/nAR9spwfa1/Ns257r9ib3NcjYYjgB1487v3rCddfZtm3b9ii3h3HOtm3b7mT7F4ArgNfy6Ns5buunadt2P6eR+/BWNj4aylsmXLeVbdu2bY94exjnbNu27Y61q+or9I4gthVwZXqf31T9NG3b7vfORuoEvBzYepLxpwNvsm3btu1Rbg/jnG3btt299oTeMuAC4NZBNW3bnumpE0efkyRJkqSpjNzL55Jsn+SsJN9I8v12Wt3GdrBt27btUW4P45xt27Zt27bt+W5PNHKLInq71TYAB1fVzlW1M3BIG/uEbdu2bY94exjnbNu2bdu2bc93e2Nz8Zq8hTwBN8/mOtu2bdsehfYwztm2bdu2bdue7/bE0yjuKfpOkjcl2XVsIMmuSd5M78PGbNu2bXuU28M4Z9u2bdu2bXu+2xsZxUXRCcDOwBeSrE+yAfg8sBNwvG3btm2PeHsY52zbtm3btm3Pd3sjHn1OkiRJUqeN3J6iJAcl2a6df2KS/5Xkn5O8O8n2tm3btj3K7WGcs23btm3btj3f7YlGblEEnAc82M6/F9geeHcb+6Bt27Ztj3h7GOds27Zt27Ztz3d7Y4M8asNiOAGrx52/bsJ1N9i2bdv2KLeHcc62bdu2bdv2fLcnnkZxT9HXk7y6nf9qkhUASZ4JPGTbtm3bI94exjnbtm3btm3b893e2CBXWIvhRG+32oeAbwHXtG/YrcAXgP1s27Zte5Tbwzhn27Zt27Zte77bE08je/S5JE8C9gaWAGur6h7btm3b7kp7GOds27Zt27Ztz3f7Z/cxqouiySTZtqp+aNu2bdtdbA/jnG3btm3btu35aI/ie4oey022bdu23eH2MM7Ztm3btm3bnvP2kkHGFoMkfzDVVcC2tm3btj3K7WGcs23btm3btj3f7YlGcU/Ru4AdgSdNOG1L/39e27Zt217s7WGcs23btm3btj3f7Y0N8qgNi+EE/D/gBVNcd4dt27Ztj3J7GOds27Zt27Ztz3d74mnkDrSQ5FnA+qpaN8l1u1YfR6uwbdu27cXeHsY527Zt27Zt2/Pd3qQ3aosiSZIkSZqJkXtPUZLtk5yV5BtJvt9Oq9vYDrZt27Y9yu1hnLNt27Zt27Y93+2JRm5RBFwAbAAOrqqdq2pn4JA29gnbtm3bHvH2MM7Ztm3btm3bnu/2xmqAb1BaDCfg5tlcZ9u2bduj0B7GOdu2bdu2bdvz3Z54GsU9Rd9J8qYku44NJNk1yZuBO2zbtm17xNvDOGfbtm3btm17vtsbGcVF0QnAzsAXkmxIsh74PLATcLxt27Ztj3h7GOds27Zt27Ztz3d7Y4Pc7bRYTsCzgZcA204YP9K2bdu2R709jHO2bdu2bdu257u9UW+QscVwAl4P3Az8E/Bt4Jhx111n27Zt26PcHsY527Zt27Zt2/Pd3uS+BhlbDCdgFW0lCSwHVgJvaJevt23btu1Rbg/jnG3btm3btu35bk88LWH0bFFVPwSoqm8nORi4MMnTgNi2bdv2iLeHcc62bdu2bdv2fLc3MooHWrg7yf5jF9o38mXALsDzbNu2bXvE28M4Z9u2bdu2bXu+2xsb5G6nxXACdgd2m+K6F9u2bdv2KLeHcc62bdu2bdv2fLcnntKikiRJktRJo/jyOUmSJEmaNhdFkiRJkjrNRZEkSZKkTnNRJEmaV0l+OMvbvTHJ1oOejyRJLookScPijYCLIknSwLkokiQtiCTbJrkiyXVJViU5po1vk+TSJF9N8vUkJyR5PfBU4KokV03R2yLJh9ptViX5/Tb+9CT/kuTaJF9K8uw2vjTJRUm+0k4vbuNnJjkvyeeT3NruW5I0wpYs9AQkSZ31I+BXq+oHSXYBrk5yCXAkcGdVHQ2QZPuquj/JHwCHVNX3pujtDyyrqp9rt9uhjZ8DvKaqbklyEPB+4FDgvcB7qupfk+wJXAY8p93m2cAhwJOAm5OcXVUPDfjPL0laJFwUSZIWSoB3JfkvwCPAMmBXYBXw50neDXyqqr40zd6twN5J/hq4FPhskm2BXwQ+kWRsu8e3ry8B9h03vl2SJ7Xzl1bVj4EfJ7m3zWvtLP+ckqRFzkWRJGmhvBJYCrygqh5K8m3gCVX1zSQvAI4C/neSz1bV2zYXq6oNSfYDjgBOA46n9z6k+6pq/0lu8jjgRVX1n+MH2yLpx+OGfor/X0rSSPM9RZKkhbI9cG9bEB0CPA0gyVOBB6vq74E/Bw5o2z9A7+Vsk2rEjt0+AAAA2ElEQVQvwXtcVV0E/DFwQFX9ALgtyXFtm7SFE8BngdeNu/1kCydJUgf4my9J0kL5CPDPSVYCNwDfaOPPA/4sySPAQ8Br2/g5wGeS3FVVh0zSWwZ8MMnYL/ze0r6+Ejg7yVuBLYGPAV8FXg/8TZKv0fv/8IvAawb5B5QkDYdU1ULPQZIkSZIWjC+fkyRJktRpvnxOkjR0klzDo0eRG/Oqqlq1EPORJA03Xz4nSZIkqdN8+ZwkSZKkTnNRJEmSJKnTXBRJkiRJ6jQXRZIkSZI6zUWRJEmSpE77/01HKNp4iXbNAAAAAElFTkSuQmCC\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"print(autos['last_seen'].unique())\n",
"autos['last_seen'].groupby(autos['last_seen']).count().plot(kind='bar', figsize=(14, 3))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***LAST SEEN - Here the distribution is heavily left skewed, with almost 50% of the ads being seen in last 3 days of the dataset. We can give a similar analogy to 'ad_created', older car ads are in less quantity in last seen, as they might be sold already. And, as a consequence most of the last seens ads are of recent sellers, who are interested in selling their cars ASAP.***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cleaning, Analyzing `registration_month` and `registration_year`\n",
"\n",
"Removing outliers (if any) from both the columns."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 3 6 7 4 8 12 10 0 9 11 5 2 1]\n"
]
},
{
"data": {
"text/plain": [
"3 4981\n",
"0 4313\n",
"6 4255\n",
"4 4020\n",
"5 4016\n",
"7 3842\n",
"10 3577\n",
"12 3359\n",
"9 3318\n",
"11 3305\n",
"1 3199\n",
"8 3115\n",
"2 2927\n",
"Name: registration_month, dtype: int64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(autos['registration_month'].unique())\n",
"autos['registration_month'].value_counts() "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 3 6 7 4 8 12 10 9 11 5 2 1]\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAfAAAADXCAYAAADlcgPcAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAFalJREFUeJzt3X+0XWV95/H3hx8iCPIzUCRgrMYRXI5oM8Aa7BSBwYiO0BlZoF0aKTXLJV04nc5qse0qjNZVmK7KlPpjSiUaHIUCVknRiimIVCtCgEjAYBPRQha/AgGsomjgO3+cJ/UQ7s09N5ycc3fu+7XWWWfvZz977+++XPI5z7777J2qQpIkdcsO4y5AkiRNnwEuSVIHGeCSJHWQAS5JUgcZ4JIkdZABLklSBxngkiR1kAEuSVIHGeCSJHXQTuMuYEv222+/mjdv3rjLkCRpZG655ZaHq2rOVP1mdIDPmzePFStWjLsMSZJGJsm/DNLPU+iSJHWQAS5JUgcNFOBJfpBkVZKVSVa0tn2SLE+ypr3v3dqT5MIka5PcnuS1fdtZ1PqvSbJo2xySJEnbv+mMwF9fVYdX1YI2fzZwbVXNB65t8wBvBOa312Lg49ALfOAc4EjgCOCcTaEvSZKm57mcQj8JWNqmlwIn97VfUj03AnslORB4A7C8qjZU1aPAcmDhc9i/JEmz1qABXsBXktySZHFrO6Cq7gdo7/u39oOAe/vWXdfaJmt/hiSLk6xIsmL9+vWDH4kkSbPIoF8jO7qq7kuyP7A8yV1b6JsJ2moL7c9sqLoIuAhgwYIFz1qurfeqpa8a2rZWLVo1tG1JkqZvoBF4Vd3X3h8CPk/vb9gPtlPjtPeHWvd1wMF9q88F7ttCuyRJmqYpAzzJC5LssWkaOAG4A1gGbLqSfBFwVZteBryzXY1+FPB4O8V+DXBCkr3bxWsntDZJkjRNg5xCPwD4fJJN/T9bVV9OcjNweZIzgHuAU1r/LwEnAmuBJ4DTAapqQ5IPAje3fh+oqg1DOxJJkmaRKQO8qu4GXj1B+yPAcRO0F3DmJNtaAiyZfpmSJKmfd2KTJKmDDHBJkjrIAJckqYMMcEmSOsgAlySpgwxwSZI6yACXJKmDDHBJkjrIAJckqYMMcEmSOsgAlySpgwxwSZI6yACXJKmDDHBJkjrIAJckqYMMcEmSOsgAlySpg3YadwGSBvPnp755KNv53b+5eijbkTRejsAlSeogA1ySpA4ywCVJ6iADXJKkDho4wJPsmOS2JFe3+Zck+VaSNUn+JsnzWvsubX5tWz6vbxvvb+3fTfKGYR+MJEmzxXRG4O8DVvfNnw9cUFXzgUeBM1r7GcCjVfUy4ILWjySHAacBrwQWAh9LsuNzK1+SpNlpoABPMhd4E/CJNh/gWODK1mUpcHKbPqnN05Yf1/qfBFxWVU9W1feBtcARwzgISZJmm0G/B/5/gN8D9mjz+wKPVdXGNr8OOKhNHwTcC1BVG5M83vofBNzYt83+df5NksXAYoBDDjlk4ANRN61+xaFD29ahd62eutMAPvqe64ayHYAz/++xQ9uWJPWbcgSe5M3AQ1V1S3/zBF1rimVbWucXDVUXVdWCqlowZ86cqcqTJGlWGmQEfjTwliQnAs8HXkhvRL5Xkp3aKHwucF/rvw44GFiXZCdgT2BDX/sm/etIkqRpmHIEXlXvr6q5VTWP3kVo11XVbwBfBd7aui0CrmrTy9o8bfl1VVWt/bR2lfpLgPnATUM7EkmSZpHnci/03wcuS/InwG3Axa39YuDTSdbSG3mfBlBVdya5HPgOsBE4s6qeeg77lyRp1ppWgFfV9cD1bfpuJriKvKp+CpwyyfofAj403SIlzUzrzv7HoWxn7nm/OpTtSLOJd2KTJKmDDHBJkjrIAJckqYMMcEmSOsgAlySpgwxwSZI6yACXJKmDDHBJkjroudyJTZLUUb/01ZVD29YDrz98aNvS4ByBS5LUQQa4JEkdZIBLktRBBrgkSR1kgEuS1EFehS5J0hbMO/uLQ9nOD85701C2s4kjcEmSOsgRuCRtY9de99Khbeu4Y783tG3NRDN1tDsTOQKXJKmDDHBJkjpouziF7ikXSdJs4whckqQOMsAlSeqgKU+hJ3k+cAOwS+t/ZVWdk+QlwGXAPsCtwDuq6mdJdgEuAX4FeAQ4tap+0Lb1fuAM4CngrKq6ZviHJGk2O/fcc2fktqRhG2QE/iRwbFW9GjgcWJjkKOB84IKqmg88Si+Yae+PVtXLgAtaP5IcBpwGvBJYCHwsyY7DPBhJkmaLKQO8en7UZndurwKOBa5s7UuBk9v0SW2etvy4JGntl1XVk1X1fWAtcMRQjkKSpFlmoL+BJ9kxyUrgIWA58D3gsara2LqsAw5q0wcB9wK05Y8D+/a3T7COJEmahoECvKqeqqrDgbn0Rs2HTtStvWeSZZO1P0OSxUlWJFmxfv36QcqTJGnWmdZV6FX1GHA9cBSwV5JNF8HNBe5r0+uAgwHa8j2BDf3tE6zTv4+LqmpBVS2YM2fOdMqTJGnWGOQq9DnAz6vqsSS7AsfTuzDtq8Bb6V2Jvgi4qq2yrM1/sy2/rqoqyTLgs0k+DLwImA/cNOTjmTnO3XNI23l8ONuRJG1XBrkT24HA0nbF+A7A5VV1dZLvAJcl+RPgNuDi1v9i4NNJ1tIbeZ8GUFV3Jrkc+A6wETizqp4a7uFIkjQ7TBngVXU78JoJ2u9mgqvIq+qnwCmTbOtDwIemX6YkSernndgkSeogA1ySpA4ywCVJ6iADXJKkDjLAJUnqIANckqQOMsAlSeogA1ySpA4ywCVJ6iADXJKkDjLAJUnqIANckqQOMsAlSeogA1ySpA4ywCVJ6iADXJKkDjLAJUnqIANckqQOMsAlSeogA1ySpA4ywCVJ6iADXJKkDpoywJMcnOSrSVYnuTPJ+1r7PkmWJ1nT3vdu7UlyYZK1SW5P8tq+bS1q/dckWbTtDkuSpO3bICPwjcDvVtWhwFHAmUkOA84Grq2q+cC1bR7gjcD89loMfBx6gQ+cAxwJHAGcsyn0JUnS9EwZ4FV1f1Xd2qb/FVgNHAScBCxt3ZYCJ7fpk4BLqudGYK8kBwJvAJZX1YaqehRYDiwc6tFIkjRLTOtv4EnmAa8BvgUcUFX3Qy/kgf1bt4OAe/tWW9faJmuXJEnTNHCAJ9kd+Bzw36vqh1vqOkFbbaF98/0sTrIiyYr169cPWp4kSbPKQAGeZGd64f2Zqvrb1vxgOzVOe3+ota8DDu5bfS5w3xban6GqLqqqBVW1YM6cOdM5FkmSZo1BrkIPcDGwuqo+3LdoGbDpSvJFwFV97e9sV6MfBTzeTrFfA5yQZO928doJrU2SJE3TTgP0ORp4B7AqycrW9gfAecDlSc4A7gFOacu+BJwIrAWeAE4HqKoNST4I3Nz6faCqNgzlKCRJmmWmDPCq+joT//0a4LgJ+hdw5iTbWgIsmU6BkiTp2bwTmyRJHWSAS5LUQQa4JEkdZIBLktRBBrgkSR1kgEuS1EEGuCRJHWSAS5LUQQa4JEkdZIBLktRBBrgkSR1kgEuS1EEGuCRJHWSAS5LUQQa4JEkdZIBLktRBBrgkSR1kgEuS1EEGuCRJHWSAS5LUQQa4JEkdZIBLktRBBrgkSR00ZYAnWZLkoSR39LXtk2R5kjXtfe/WniQXJlmb5PYkr+1bZ1HrvybJom1zOJIkzQ6DjMA/BSzcrO1s4Nqqmg9c2+YB3gjMb6/FwMehF/jAOcCRwBHAOZtCX5IkTd+UAV5VNwAbNms+CVjappcCJ/e1X1I9NwJ7JTkQeAOwvKo2VNWjwHKe/aFAkiQNaGv/Bn5AVd0P0N73b+0HAff29VvX2iZrf5Yki5OsSLJi/fr1W1meJEnbt2FfxJYJ2moL7c9urLqoqhZU1YI5c+YMtThJkrYXWxvgD7ZT47T3h1r7OuDgvn5zgfu20C5JkrbC1gb4MmDTleSLgKv62t/ZrkY/Cni8nWK/Bjghyd7t4rUTWpskSdoKO03VIcmlwDHAfknW0bua/Dzg8iRnAPcAp7TuXwJOBNYCTwCnA1TVhiQfBG5u/T5QVZtfGCdJkgY0ZYBX1dsmWXTcBH0LOHOS7SwBlkyrOkmSNCHvxCZJUgcZ4JIkdZABLklSBxngkiR1kAEuSVIHGeCSJHWQAS5JUgcZ4JIkdZABLklSBxngkiR1kAEuSVIHGeCSJHWQAS5JUgcZ4JIkdZABLklSBxngkiR1kAEuSVIHGeCSJHWQAS5JUgcZ4JIkdZABLklSBxngkiR10MgDPMnCJN9NsjbJ2aPevyRJ24ORBniSHYGPAm8EDgPeluSwUdYgSdL2YNQj8COAtVV1d1X9DLgMOGnENUiS1HmpqtHtLHkrsLCqfqvNvwM4sqp+u6/PYmBxm/13wHeHtPv9gIeHtK1hsabBzcS6rGkw1jS4mViXNQ1mmDW9uKrmTNVppyHtbFCZoO0ZnyCq6iLgoqHvOFlRVQuGvd3nwpoGNxPrsqbBWNPgZmJd1jSYcdQ06lPo64CD++bnAveNuAZJkjpv1AF+MzA/yUuSPA84DVg24hokSeq8kZ5Cr6qNSX4buAbYEVhSVXeOaPdDPy0/BNY0uJlYlzUNxpoGNxPrsqbBjLymkV7EJkmShsM7sUmS1EEGuCRJHWSAS5LUQQb4CCV5RZLjkuy+WfvCMdZ0RJL/0KYPS/I/kpw4rnomkuSScdewuSSvaz+rE8ZYw5FJXtimd03yv5L8XZLzk+w5pprOSnLw1D1HJ8nzkrwzyfFt/u1JPpLkzCQ7j7Gulyb5n0n+IsmfJ3nPuP67qZtm3UVsSU6vqk+OYb9nAWcCq4HDgfdV1VVt2a1V9dox1HQOvfvS7wQsB44ErgeOB66pqg+NoabNv1YY4PXAdQBV9ZZR1wSQ5KaqOqJNv5vef8vPAycAf1dV542hpjuBV7dvd1wEPAFcCRzX2v/rGGp6HPgx8D3gUuCKqlo/6jo2q+kz9H7HdwMeA3YH/pbezylVtWgMNZ0F/Bfga8CJwErgUeDXgfdW1fWjrkkdVFWz6gXcM6b9rgJ2b9PzgBX0QhzgtjHWtCO9f9h+CLywte8K3D6mmm4F/h9wDPBr7f3+Nv1rY/y9ua1v+mZgTpt+AbBqTDWt7v+5bbZs5bh+TvTO7J0AXAysB74MLAL2GFNNt7f3nYAHgR3bfMb4e76qr47dgOvb9CHj+veg7X9P4DzgLuCR9lrd2vYaV12+Jn5tl6fQk9w+yWsVcMCYytqxqn4EUFU/oBdMb0zyYSa+xewobKyqp6rqCeB7VfXDVt9PgKfHVNMC4BbgD4HHqzcS+UlVfa2qvjammgB2SLJ3kn3pjdrWA1TVj4GNY6rpjiSnt+lvJ1kAkOTlwM/HVFNV1dNV9ZWqOgN4EfAxYCFw95hq2qHdOGoPemG56TT1LsDYTqHzi/tw7EKvNqrqHsZb0+X0zgQcU1X7VtW+9M6APQpcMca6JpTk78e03xcm+dMkn07y9s2WfWxUdYz6XuijcgDwBnq/dP0C/NPoywHggSSHV9VKgKr6UZI3A0uAV42ppp8l2a0F+K9samx/hxtLgFfV08AFSa5o7w8yM35P96T3wSJAJfmlqnqgXc8wrg9gvwX8RZI/ovcQhW8muRe4ty0bh2f8LKrq5/Tutrgsya7jKYmL6Y0od6T3wfCKJHcDR9F7IuI4fAK4OcmNwH8CzgdIMgfYMKaaAOZV1fn9DVX1AHB+kt8cR0FJJvvzYuj9OXIcPgmsAT4H/GaS/wa8vaqepPd7NRLb5d/Ak1wMfLKqvj7Bss9W1dsnWG1b1zSX3oj3gQmWHV1V3xhDTbu0X7jN2/cDDqyqVaOuaYJa3gQcXVV/MO5aJpJkN+CAqvr+GGvYA/hleh901lXVg2Os5eVV9c/j2v9kkrwIoKruS7IXves87qmqm8ZY0yuBQ4E7ququcdXRL8lXgH8Alm76PUpyAPAu4D9X1fFjqOkpetcKTPRB+aiqGvkHwyQrq+rwvvk/pHctw1uA5TWia5q2ywCXJE1fkr2Bs4GTgP1b84P0zqKcV1Wbn9UcRU13AL9eVWsmWHZvVY38Ww9JVgOvbGcMN7UtAn6P3rVOLx5JHQa4JGkqY/wGz1vpXSj63QmWnVxVXxhDTf8b+EpV/cNm7QuBv6yq+SOpwwCXJE0lyT1Vdci46+g3rg8VWzLKmgxwSRLQ+wbPZIuAl1fVLqOsZyoz9EPFyGqaCVf3SpJmhhn3DZ4pPlSM5WvBM6UmA1yStMnV9C7CWrn5giTXj74cYAZ+qGCG1GSAS5IAaDffmWzZyL9+28zEDxUzoib/Bi5JUgdtl7dSlSRpe2eAS5LUQQa4NIMleUuSs7ew/PCteX57knn9D2FIsiDJhVtb56gk2SvJe/vmj0ly9ThrksbFAJdGJD3T+n+uqpbVlp81fji9ezBPtL8tXaQ6D/i3AK+qFVV11nRqG5O9gPdO2UuaBQxwaRtqI93V7RGDtwLvSPLNJLcmuaI9zYwkJya5K8nXk1y4aVSZ5F1JPtKmT0lyR5JvJ7mhPSLzA8CpSVYmOTXJuUkuag+luKTt/x/b/m5N8h9baecBv9rW+53+kWySfZJ8oT2C98Yk/761n5tkSZLrk9ydZNLAb/u9K8knWs2fSXJ8km8kWZPkiK3c13nAS1vdf9badk9yZdvfZ5KM6+lw0miN+4Hkvnxtzy96I92n6T1icD/gBuAFbdnvA38MPJ/eI0Bf0tovBa5u0+8CPtKmVwEHtem9Nl/e5s+l99jTXdv8bsDz2/R8YEWbPmbTPjafB/4SOKdNHwus7Nv2P9F7fvV+wCPAzls47o30HpW7Q6tpCb3vyZ4EfGFr9tW2e8dmdT8OzG37+SbwunH/d/flaxQvR+DStvcvVXUjvRA/DPhGkpXAIuDFwCuAu+sXjyS9dJLtfAP4VJJ303u29WSWVdVP2vTOwF8nWQVc0fY/ldcBnwaoquuAfdN7RjzAF6vqyap6GHiILd916vtVtap6T2y6E7i2qoreB5F5Q9zXTVW1ru1nZd+2pe2aN3KRtr0ft/fQe1bw2/oXJnnNIBupqvckORJ4E7AyyeGTdP1x3/Tv0Hsc5KvpjVB/OsCuJjoFvemGEf3Pj3+KLf8b0t/36b75p/vWG8a+plOTtN1wBC6Nzo3A0UleBpBktyQvB+4CfjnJvNbv1IlWTvLSqvpWVf0x8DBwMPCvwB5b2OeewP1tdPoOfjFy39J6NwC/0fZ5DPBwVf1wkAPcCtPd11THK80aflKVRqSq1id5F3Bpkk1Pdfqjqvrn9tWoLyd5GLhpkk38WZL59Eat1wLfBu4Bzm6n5P90gnU+BnwuySnAV/nF6Px2YGOSbwOfAm7rW+dc4JPtgQ1P0DvVv61Ma19V9Ui7EO4O4O+BL27D2qQZzVupSjNAkt2r6kftCuqPAmuq6oJx1yVp5vIUujQzvLuNou+kd9r7r8Zcj6QZzhG4pK2WZF96p/M3d1xVPTLqeqTZxACXJKmDPIUuSVIHGeCSJHWQAS5JUgcZ4JIkdZABLklSB/1/kHKZANEL9TsAAAAASUVORK5CYII=\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Here all the values for years before 1900 and after 2017 are invlaid or outliers\n",
"\n",
"autos = autos[autos['registration_year'].between(1900,2017)]\n",
"\n",
"print(sorted(autos['registration_year'].unique()))\n",
"autos['registration_year'].groupby(autos['registration_year']).count().plot(kind='bar', figsize=(16, 3))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***Observed outliers for 'Registration year' were values below 1900 and above 2017. Removed those outliers. We can observe that the distribution is left skewed.\n",
"Most of the cars were registered between 1998-2008. \n",
"It can be explained from the fact that most cars are used for max. 10-15 years and then sell/discard them.***"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" mean_price\n",
"audi 9718.90\n",
"mercedes_benz 8846.68\n",
"bmw 8637.41\n",
"volkswagen 5943.71\n",
"ford 4289.74\n",
"opel 3173.85"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top_brands_df = pd.DataFrame(bmp_series, columns=['mean_price'])\n",
"top_brands_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Assign the other series as a new column in this dataframe.**"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
mean_price
\n",
"
mean_mileage
\n",
"
\n",
" \n",
" \n",
"
\n",
"
audi
\n",
"
9718.90
\n",
"
128830.76
\n",
"
\n",
"
\n",
"
mercedes_benz
\n",
"
8846.68
\n",
"
131034.56
\n",
"
\n",
"
\n",
"
bmw
\n",
"
8637.41
\n",
"
132575.41
\n",
"
\n",
"
\n",
"
volkswagen
\n",
"
5943.71
\n",
"
128365.67
\n",
"
\n",
"
\n",
"
ford
\n",
"
4289.74
\n",
"
123839.23
\n",
"
\n",
"
\n",
"
opel
\n",
"
3173.85
\n",
"
128805.64
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" mean_price mean_mileage\n",
"audi 9718.90 128830.76\n",
"mercedes_benz 8846.68 131034.56\n",
"bmw 8637.41 132575.41\n",
"volkswagen 5943.71 128365.67\n",
"ford 4289.74 123839.23\n",
"opel 3173.85 128805.64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top_brands_df['mean_mileage'] = bmm_series\n",
"top_brands_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***Here we can observe that car mileage/brand does not vary much as compared to the car prices/brand. While mean price difference for most expensive car brand (Audi) and most economical car brand (Opel) was a gigantic 71%. However, same is not true for mean mileage, with total difference between top mileage giving brand(BMW) & least mileage giving brand(Ford) being mere 7%.***"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}