{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Airbnb Homes in Vancouver - Data Cleaning\n",
"#### Arash Tavassoli - April 2019\n",
"---\n",
"### Introduction\n",
"\n",
"In less than 10 years since its official launch, airbnb has now found its place in many of our lives and continues to prove itself as an affordable and convenient alternative to conventional hotel services. This notebook cleans and prepares the data before importing into a Tableau workbook that will visualize airbnb's presence in the City of Vancouver based on publicly available data from airbnb-owned website, Inside-airbnb.\n",
"\n",
"The Tableau visualization is available in [here](https://public.tableau.com/views/AirbnbHomesinVancouver/Dashboard?:embed=y&:display_count=yes).\n",
"\n",
"### Data Source\n",
"\n",
"The data is downloaded from Inside-airbnb on April 29, 2019 and reflects the detailed listing data, calendar data and listing reviews for 4,929 uniquely identified homes in Vancouver, BC, that were active on March 8, 2019.\n",
"\n",
"The data is dowloaded from [Inside Airbnb website](http://insideairbnb.com/get-the-data.html) where information is available under a Creative Commons CC0 1.0 Universal (CC0 1.0) \"Public Domain Dedication\" license.\n",
"\n",
"The datasets are specific to the City of Vancouver and are provided in 5 separate CSV and 1 GeoJSON files:\n",
"\n",
"|File Name | Description \n",
"|:---------|:---------\n",
"|`calendar.csv` | Detailed Calendar Data for listings in Vancouver\n",
"|`listings.csv` | Summary information and metrics for listings in Vancouver\n",
"|`listings-2.csv` | Detailed Listings data for Vancouver\n",
"|`reviews.csv` | Summary Review data and Listing ID\n",
"|`reviews-2.csv` | Detailed Review Data for listings in Vancouver\n",
"|`neighbourhoods.csv` | Neighbourhood list for geo filter (ourced from city or open source GIS files)\n",
"|`neighbourhoods.geojson` | GeoJSON file of neighbourhoods of the city"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Importing libraries:\n",
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Importing raw datasets:\n",
"df_cal = pd.read_csv('Data/calendar.csv')\n",
"df_listings = pd.read_csv('Data/listings.csv')\n",
"df_listings_more = pd.read_csv('Data/listings-2.csv')\n",
"df_neighb = pd.read_csv('Data/neighbourhoods.csv')\n",
"df_reviews = pd.read_csv('Data/reviews.csv')\n",
"df_reviews_more = pd.read_csv('Data/reviews-2.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cleaning the `df_listings` and `df_listings_more` datasets:\n",
"\n",
"We start by looking at the two listing datasets: `df_listings` and `df_listings_more`"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of df_listings:\t\t (4938, 16)\n",
"Shape of df_listings_more:\t (4938, 106)\n"
]
}
],
"source": [
"print('Shape of df_listings:\\t\\t', df_listings.shape)\n",
"print('Shape of df_listings_more:\\t', df_listings_more.shape)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" \n",
" \n",
" id | \n",
" 10080 | \n",
" 11400 | \n",
" 13188 | \n",
"
\n",
" \n",
" name | \n",
" D1 - Million Dollar View 2 BR | \n",
" Central Lovely Rm in Victorian Home | \n",
" Garden level studio in ideal loc. | \n",
"
\n",
" \n",
" host_id | \n",
" 30899 | \n",
" 42305 | \n",
" 51466 | \n",
"
\n",
" \n",
" host_name | \n",
" Rami | \n",
" Jaynne | \n",
" Family Guns | \n",
"
\n",
" \n",
" neighbourhood_group | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" neighbourhood | \n",
" Downtown | \n",
" Riley Park | \n",
" Riley Park | \n",
"
\n",
" \n",
" latitude | \n",
" 49.2877 | \n",
" 49.2538 | \n",
" 49.2458 | \n",
"
\n",
" \n",
" longitude | \n",
" -123.121 | \n",
" -123.113 | \n",
" -123.105 | \n",
"
\n",
" \n",
" room_type | \n",
" Entire home/apt | \n",
" Private room | \n",
" Entire home/apt | \n",
"
\n",
" \n",
" price | \n",
" 296 | \n",
" 60 | \n",
" 120 | \n",
"
\n",
" \n",
" minimum_nights | \n",
" 60 | \n",
" 364 | \n",
" 2 | \n",
"
\n",
" \n",
" number_of_reviews | \n",
" 16 | \n",
" 70 | \n",
" 171 | \n",
"
\n",
" \n",
" last_review | \n",
" 2017-02-26 | \n",
" 2016-06-14 | \n",
" 2019-02-24 | \n",
"
\n",
" \n",
" reviews_per_month | \n",
" 0.18 | \n",
" 0.64 | \n",
" 1.55 | \n",
"
\n",
" \n",
" calculated_host_listings_count | \n",
" 33 | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" availability_365 | \n",
" 191 | \n",
" 365 | \n",
" 214 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 \\\n",
"id 10080 \n",
"name D1 - Million Dollar View 2 BR \n",
"host_id 30899 \n",
"host_name Rami \n",
"neighbourhood_group NaN \n",
"neighbourhood Downtown \n",
"latitude 49.2877 \n",
"longitude -123.121 \n",
"room_type Entire home/apt \n",
"price 296 \n",
"minimum_nights 60 \n",
"number_of_reviews 16 \n",
"last_review 2017-02-26 \n",
"reviews_per_month 0.18 \n",
"calculated_host_listings_count 33 \n",
"availability_365 191 \n",
"\n",
" 1 \\\n",
"id 11400 \n",
"name Central Lovely Rm in Victorian Home \n",
"host_id 42305 \n",
"host_name Jaynne \n",
"neighbourhood_group NaN \n",
"neighbourhood Riley Park \n",
"latitude 49.2538 \n",
"longitude -123.113 \n",
"room_type Private room \n",
"price 60 \n",
"minimum_nights 364 \n",
"number_of_reviews 70 \n",
"last_review 2016-06-14 \n",
"reviews_per_month 0.64 \n",
"calculated_host_listings_count 1 \n",
"availability_365 365 \n",
"\n",
" 2 \n",
"id 13188 \n",
"name Garden level studio in ideal loc. \n",
"host_id 51466 \n",
"host_name Family Guns \n",
"neighbourhood_group NaN \n",
"neighbourhood Riley Park \n",
"latitude 49.2458 \n",
"longitude -123.105 \n",
"room_type Entire home/apt \n",
"price 120 \n",
"minimum_nights 2 \n",
"number_of_reviews 171 \n",
"last_review 2019-02-24 \n",
"reviews_per_month 1.55 \n",
"calculated_host_listings_count 1 \n",
"availability_365 214 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_listings.head(3).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We drop the columns that we do not need in the analysis (not required or repeated on other datasets):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df_listings.drop(['neighbourhood_group', 'last_review', 'reviews_per_month', 'availability_365', 'neighbourhood'], \n",
" axis = 1, inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id 0\n",
"name 1\n",
"host_id 0\n",
"host_name 0\n",
"latitude 0\n",
"longitude 0\n",
"room_type 0\n",
"price 0\n",
"minimum_nights 0\n",
"number_of_reviews 0\n",
"calculated_host_listings_count 0\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Checking for NaNs:\n",
"df_listings.isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll deal with that one missing value later."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" \n",
" \n",
" id | \n",
" 10080 | \n",
" 11400 | \n",
"
\n",
" \n",
" listing_url | \n",
" https://www.airbnb.com/rooms/10080 | \n",
" https://www.airbnb.com/rooms/11400 | \n",
"
\n",
" \n",
" scrape_id | \n",
" 20190308052700 | \n",
" 20190308052700 | \n",
"
\n",
" \n",
" last_scraped | \n",
" 2019-03-08 | \n",
" 2019-03-08 | \n",
"
\n",
" \n",
" name | \n",
" D1 - Million Dollar View 2 BR | \n",
" Central Lovely Rm in Victorian Home | \n",
"
\n",
" \n",
" summary | \n",
" Stunning two bedroom, two bathroom apartment. ... | \n",
" Well-appointed room with a view of the garden ... | \n",
"
\n",
" \n",
" space | \n",
" Bed setup: 2 x queen, I can add up to 2 twin s... | \n",
" Centrally-located lovely, quiet home on tree-l... | \n",
"
\n",
" \n",
" description | \n",
" Stunning two bedroom, two bathroom apartment. ... | \n",
" Well-appointed room with a view of the garden ... | \n",
"
\n",
" \n",
" experiences_offered | \n",
" none | \n",
" none | \n",
"
\n",
" \n",
" neighborhood_overview | \n",
" NaN | \n",
" Very quiet residential area, yet only 1-1/2 bl... | \n",
"
\n",
" \n",
" notes | \n",
" 1. CHECK-IN TIME IS AFTER 3PM PST AND CHECK-OU... | \n",
" NaN | \n",
"
\n",
" \n",
" transit | \n",
" NaN | \n",
" Bus stop 1-1/2 blocks away. 15 minutes to down... | \n",
"
\n",
" \n",
" access | \n",
" There is no access to the building ammenities. | \n",
" Access to my home and garden. Shared bathroom.... | \n",
"
\n",
" \n",
" interaction | \n",
" NaN | \n",
" I live here so will interact with the guest wh... | \n",
"
\n",
" \n",
" house_rules | \n",
" 1. CHECK-IN TIME IS AFTER 3 PM PST AND CHECK-O... | \n",
" -No smoking indoors. -Recycling, composting ... | \n",
"
\n",
" \n",
" thumbnail_url | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" medium_url | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" picture_url | \n",
" https://a0.muscache.com/im/pictures/55778229/c... | \n",
" https://a0.muscache.com/im/pictures/74189163/a... | \n",
"
\n",
" \n",
" xl_picture_url | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" host_id | \n",
" 30899 | \n",
" 42305 | \n",
"
\n",
" \n",
" host_url | \n",
" https://www.airbnb.com/users/show/30899 | \n",
" https://www.airbnb.com/users/show/42305 | \n",
"
\n",
" \n",
" host_name | \n",
" Rami | \n",
" Jaynne | \n",
"
\n",
" \n",
" host_since | \n",
" 2009-08-10 | \n",
" 2009-09-30 | \n",
"
\n",
" \n",
" host_location | \n",
" Vancouver, British Columbia, Canada | \n",
" Vancouver, British Columbia, Canada | \n",
"
\n",
" \n",
" host_about | \n",
" I will be happy to host you. | \n",
" I am a self-employed graphic designer, AutoCAD... | \n",
"
\n",
" \n",
" host_response_time | \n",
" within a few hours | \n",
" NaN | \n",
"
\n",
" \n",
" host_response_rate | \n",
" 100% | \n",
" NaN | \n",
"
\n",
" \n",
" host_acceptance_rate | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" host_is_superhost | \n",
" f | \n",
" f | \n",
"
\n",
" \n",
" host_thumbnail_url | \n",
" https://a0.muscache.com/im/pictures/user/55302... | \n",
" https://a0.muscache.com/im/users/42305/profile... | \n",
"
\n",
" \n",
" host_picture_url | \n",
" https://a0.muscache.com/im/pictures/user/55302... | \n",
" https://a0.muscache.com/im/users/42305/profile... | \n",
"
\n",
" \n",
" host_neighbourhood | \n",
" Coal Harbour | \n",
" Riley Park | \n",
"
\n",
" \n",
" host_listings_count | \n",
" 33 | \n",
" 1 | \n",
"
\n",
" \n",
" host_total_listings_count | \n",
" 33 | \n",
" 1 | \n",
"
\n",
" \n",
" host_verifications | \n",
" ['email', 'phone', 'reviews', 'jumio', 'govern... | \n",
" ['email', 'phone', 'reviews', 'jumio', 'govern... | \n",
"
\n",
" \n",
" host_has_profile_pic | \n",
" t | \n",
" t | \n",
"
\n",
" \n",
" host_identity_verified | \n",
" t | \n",
" t | \n",
"
\n",
" \n",
" street | \n",
" Vancouver, BC, Canada | \n",
" Vancouver, BC, Canada | \n",
"
\n",
" \n",
" neighbourhood | \n",
" Coal Harbour | \n",
" Riley Park | \n",
"
\n",
" \n",
" neighbourhood_cleansed | \n",
" Downtown | \n",
" Riley Park | \n",
"
\n",
" \n",
" neighbourhood_group_cleansed | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" city | \n",
" Vancouver | \n",
" Vancouver | \n",
"
\n",
" \n",
" state | \n",
" BC | \n",
" BC | \n",
"
\n",
" \n",
" zipcode | \n",
" V6E 2P4 | \n",
" V5Y | \n",
"
\n",
" \n",
" market | \n",
" Vancouver | \n",
" Vancouver | \n",
"
\n",
" \n",
" smart_location | \n",
" Vancouver, Canada | \n",
" Vancouver, Canada | \n",
"
\n",
" \n",
" country_code | \n",
" CA | \n",
" CA | \n",
"
\n",
" \n",
" country | \n",
" Canada | \n",
" Canada | \n",
"
\n",
" \n",
" latitude | \n",
" 49.2877 | \n",
" 49.2538 | \n",
"
\n",
" \n",
" longitude | \n",
" -123.121 | \n",
" -123.113 | \n",
"
\n",
" \n",
" is_location_exact | \n",
" t | \n",
" t | \n",
"
\n",
" \n",
" property_type | \n",
" Condominium | \n",
" House | \n",
"
\n",
" \n",
" room_type | \n",
" Entire home/apt | \n",
" Private room | \n",
"
\n",
" \n",
" accommodates | \n",
" 6 | \n",
" 1 | \n",
"
\n",
" \n",
" bathrooms | \n",
" 2 | \n",
" 1 | \n",
"
\n",
" \n",
" bedrooms | \n",
" 2 | \n",
" 1 | \n",
"
\n",
" \n",
" beds | \n",
" 4 | \n",
" 1 | \n",
"
\n",
" \n",
" bed_type | \n",
" Real Bed | \n",
" Real Bed | \n",
"
\n",
" \n",
" amenities | \n",
" {TV,\"Cable TV\",Internet,Wifi,\"Air conditioning... | \n",
" {TV,\"Cable TV\",Internet,Wifi,\"Free parking on ... | \n",
"
\n",
" \n",
" square_feet | \n",
" 1200 | \n",
" 0 | \n",
"
\n",
" \n",
" price | \n",
" $296.00 | \n",
" $60.00 | \n",
"
\n",
" \n",
" weekly_price | \n",
" NaN | \n",
" $385.00 | \n",
"
\n",
" \n",
" monthly_price | \n",
" $4,750.00 | \n",
" $1,400.00 | \n",
"
\n",
" \n",
" security_deposit | \n",
" $0.00 | \n",
" NaN | \n",
"
\n",
" \n",
" cleaning_fee | \n",
" $195.00 | \n",
" NaN | \n",
"
\n",
" \n",
" guests_included | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" extra_people | \n",
" $0.00 | \n",
" $0.00 | \n",
"
\n",
" \n",
" minimum_nights | \n",
" 60 | \n",
" 364 | \n",
"
\n",
" \n",
" maximum_nights | \n",
" 1124 | \n",
" 364 | \n",
"
\n",
" \n",
" minimum_minimum_nights | \n",
" 60 | \n",
" 364 | \n",
"
\n",
" \n",
" maximum_minimum_nights | \n",
" 60 | \n",
" 364 | \n",
"
\n",
" \n",
" minimum_maximum_nights | \n",
" 1124 | \n",
" 364 | \n",
"
\n",
" \n",
" maximum_maximum_nights | \n",
" 1124 | \n",
" 364 | \n",
"
\n",
" \n",
" minimum_nights_avg_ntm | \n",
" 60 | \n",
" 364 | \n",
"
\n",
" \n",
" maximum_nights_avg_ntm | \n",
" 1124 | \n",
" 364 | \n",
"
\n",
" \n",
" calendar_updated | \n",
" today | \n",
" 11 months ago | \n",
"
\n",
" \n",
" has_availability | \n",
" t | \n",
" t | \n",
"
\n",
" \n",
" availability_30 | \n",
" 0 | \n",
" 30 | \n",
"
\n",
" \n",
" availability_60 | \n",
" 0 | \n",
" 60 | \n",
"
\n",
" \n",
" availability_90 | \n",
" 0 | \n",
" 90 | \n",
"
\n",
" \n",
" availability_365 | \n",
" 191 | \n",
" 365 | \n",
"
\n",
" \n",
" calendar_last_scraped | \n",
" 2019-03-08 | \n",
" 2019-03-08 | \n",
"
\n",
" \n",
" number_of_reviews | \n",
" 16 | \n",
" 70 | \n",
"
\n",
" \n",
" number_of_reviews_ltm | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" first_review | \n",
" 2011-11-15 | \n",
" 2010-02-18 | \n",
"
\n",
" \n",
" last_review | \n",
" 2017-02-26 | \n",
" 2016-06-14 | \n",
"
\n",
" \n",
" review_scores_rating | \n",
" 93 | \n",
" 92 | \n",
"
\n",
" \n",
" review_scores_accuracy | \n",
" 9 | \n",
" 9 | \n",
"
\n",
" \n",
" review_scores_cleanliness | \n",
" 9 | \n",
" 10 | \n",
"
\n",
" \n",
" review_scores_checkin | \n",
" 10 | \n",
" 10 | \n",
"
\n",
" \n",
" review_scores_communication | \n",
" 9 | \n",
" 10 | \n",
"
\n",
" \n",
" review_scores_location | \n",
" 10 | \n",
" 10 | \n",
"
\n",
" \n",
" review_scores_value | \n",
" 9 | \n",
" 9 | \n",
"
\n",
" \n",
" requires_license | \n",
" t | \n",
" t | \n",
"
\n",
" \n",
" license | \n",
" 18-476608 | \n",
" NaN | \n",
"
\n",
" \n",
" jurisdiction_names | \n",
" {\"British Columbia\",\" Canada\",\" Vancouver\",\" B... | \n",
" {\"British Columbia\",\" Canada\",\" Vancouver\",\" B... | \n",
"
\n",
" \n",
" instant_bookable | \n",
" f | \n",
" f | \n",
"
\n",
" \n",
" is_business_travel_ready | \n",
" f | \n",
" f | \n",
"
\n",
" \n",
" cancellation_policy | \n",
" strict_14_with_grace_period | \n",
" strict_14_with_grace_period | \n",
"
\n",
" \n",
" require_guest_profile_picture | \n",
" f | \n",
" t | \n",
"
\n",
" \n",
" require_guest_phone_verification | \n",
" f | \n",
" t | \n",
"
\n",
" \n",
" calculated_host_listings_count | \n",
" 33 | \n",
" 1 | \n",
"
\n",
" \n",
" calculated_host_listings_count_entire_homes | \n",
" 33 | \n",
" 0 | \n",
"
\n",
" \n",
" calculated_host_listings_count_private_rooms | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" calculated_host_listings_count_shared_rooms | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" reviews_per_month | \n",
" 0.18 | \n",
" 0.64 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 \\\n",
"id 10080 \n",
"listing_url https://www.airbnb.com/rooms/10080 \n",
"scrape_id 20190308052700 \n",
"last_scraped 2019-03-08 \n",
"name D1 - Million Dollar View 2 BR \n",
"summary Stunning two bedroom, two bathroom apartment. ... \n",
"space Bed setup: 2 x queen, I can add up to 2 twin s... \n",
"description Stunning two bedroom, two bathroom apartment. ... \n",
"experiences_offered none \n",
"neighborhood_overview NaN \n",
"notes 1. CHECK-IN TIME IS AFTER 3PM PST AND CHECK-OU... \n",
"transit NaN \n",
"access There is no access to the building ammenities. \n",
"interaction NaN \n",
"house_rules 1. CHECK-IN TIME IS AFTER 3 PM PST AND CHECK-O... \n",
"thumbnail_url NaN \n",
"medium_url NaN \n",
"picture_url https://a0.muscache.com/im/pictures/55778229/c... \n",
"xl_picture_url NaN \n",
"host_id 30899 \n",
"host_url https://www.airbnb.com/users/show/30899 \n",
"host_name Rami \n",
"host_since 2009-08-10 \n",
"host_location Vancouver, British Columbia, Canada \n",
"host_about I will be happy to host you. \n",
"host_response_time within a few hours \n",
"host_response_rate 100% \n",
"host_acceptance_rate NaN \n",
"host_is_superhost f \n",
"host_thumbnail_url https://a0.muscache.com/im/pictures/user/55302... \n",
"host_picture_url https://a0.muscache.com/im/pictures/user/55302... \n",
"host_neighbourhood Coal Harbour \n",
"host_listings_count 33 \n",
"host_total_listings_count 33 \n",
"host_verifications ['email', 'phone', 'reviews', 'jumio', 'govern... \n",
"host_has_profile_pic t \n",
"host_identity_verified t \n",
"street Vancouver, BC, Canada \n",
"neighbourhood Coal Harbour \n",
"neighbourhood_cleansed Downtown \n",
"neighbourhood_group_cleansed NaN \n",
"city Vancouver \n",
"state BC \n",
"zipcode V6E 2P4 \n",
"market Vancouver \n",
"smart_location Vancouver, Canada \n",
"country_code CA \n",
"country Canada \n",
"latitude 49.2877 \n",
"longitude -123.121 \n",
"is_location_exact t \n",
"property_type Condominium \n",
"room_type Entire home/apt \n",
"accommodates 6 \n",
"bathrooms 2 \n",
"bedrooms 2 \n",
"beds 4 \n",
"bed_type Real Bed \n",
"amenities {TV,\"Cable TV\",Internet,Wifi,\"Air conditioning... \n",
"square_feet 1200 \n",
"price $296.00 \n",
"weekly_price NaN \n",
"monthly_price $4,750.00 \n",
"security_deposit $0.00 \n",
"cleaning_fee $195.00 \n",
"guests_included 1 \n",
"extra_people $0.00 \n",
"minimum_nights 60 \n",
"maximum_nights 1124 \n",
"minimum_minimum_nights 60 \n",
"maximum_minimum_nights 60 \n",
"minimum_maximum_nights 1124 \n",
"maximum_maximum_nights 1124 \n",
"minimum_nights_avg_ntm 60 \n",
"maximum_nights_avg_ntm 1124 \n",
"calendar_updated today \n",
"has_availability t \n",
"availability_30 0 \n",
"availability_60 0 \n",
"availability_90 0 \n",
"availability_365 191 \n",
"calendar_last_scraped 2019-03-08 \n",
"number_of_reviews 16 \n",
"number_of_reviews_ltm 0 \n",
"first_review 2011-11-15 \n",
"last_review 2017-02-26 \n",
"review_scores_rating 93 \n",
"review_scores_accuracy 9 \n",
"review_scores_cleanliness 9 \n",
"review_scores_checkin 10 \n",
"review_scores_communication 9 \n",
"review_scores_location 10 \n",
"review_scores_value 9 \n",
"requires_license t \n",
"license 18-476608 \n",
"jurisdiction_names {\"British Columbia\",\" Canada\",\" Vancouver\",\" B... \n",
"instant_bookable f \n",
"is_business_travel_ready f \n",
"cancellation_policy strict_14_with_grace_period \n",
"require_guest_profile_picture f \n",
"require_guest_phone_verification f \n",
"calculated_host_listings_count 33 \n",
"calculated_host_listings_count_entire_homes 33 \n",
"calculated_host_listings_count_private_rooms 0 \n",
"calculated_host_listings_count_shared_rooms 0 \n",
"reviews_per_month 0.18 \n",
"\n",
" 1 \n",
"id 11400 \n",
"listing_url https://www.airbnb.com/rooms/11400 \n",
"scrape_id 20190308052700 \n",
"last_scraped 2019-03-08 \n",
"name Central Lovely Rm in Victorian Home \n",
"summary Well-appointed room with a view of the garden ... \n",
"space Centrally-located lovely, quiet home on tree-l... \n",
"description Well-appointed room with a view of the garden ... \n",
"experiences_offered none \n",
"neighborhood_overview Very quiet residential area, yet only 1-1/2 bl... \n",
"notes NaN \n",
"transit Bus stop 1-1/2 blocks away. 15 minutes to down... \n",
"access Access to my home and garden. Shared bathroom.... \n",
"interaction I live here so will interact with the guest wh... \n",
"house_rules -No smoking indoors. -Recycling, composting ... \n",
"thumbnail_url NaN \n",
"medium_url NaN \n",
"picture_url https://a0.muscache.com/im/pictures/74189163/a... \n",
"xl_picture_url NaN \n",
"host_id 42305 \n",
"host_url https://www.airbnb.com/users/show/42305 \n",
"host_name Jaynne \n",
"host_since 2009-09-30 \n",
"host_location Vancouver, British Columbia, Canada \n",
"host_about I am a self-employed graphic designer, AutoCAD... \n",
"host_response_time NaN \n",
"host_response_rate NaN \n",
"host_acceptance_rate NaN \n",
"host_is_superhost f \n",
"host_thumbnail_url https://a0.muscache.com/im/users/42305/profile... \n",
"host_picture_url https://a0.muscache.com/im/users/42305/profile... \n",
"host_neighbourhood Riley Park \n",
"host_listings_count 1 \n",
"host_total_listings_count 1 \n",
"host_verifications ['email', 'phone', 'reviews', 'jumio', 'govern... \n",
"host_has_profile_pic t \n",
"host_identity_verified t \n",
"street Vancouver, BC, Canada \n",
"neighbourhood Riley Park \n",
"neighbourhood_cleansed Riley Park \n",
"neighbourhood_group_cleansed NaN \n",
"city Vancouver \n",
"state BC \n",
"zipcode V5Y \n",
"market Vancouver \n",
"smart_location Vancouver, Canada \n",
"country_code CA \n",
"country Canada \n",
"latitude 49.2538 \n",
"longitude -123.113 \n",
"is_location_exact t \n",
"property_type House \n",
"room_type Private room \n",
"accommodates 1 \n",
"bathrooms 1 \n",
"bedrooms 1 \n",
"beds 1 \n",
"bed_type Real Bed \n",
"amenities {TV,\"Cable TV\",Internet,Wifi,\"Free parking on ... \n",
"square_feet 0 \n",
"price $60.00 \n",
"weekly_price $385.00 \n",
"monthly_price $1,400.00 \n",
"security_deposit NaN \n",
"cleaning_fee NaN \n",
"guests_included 1 \n",
"extra_people $0.00 \n",
"minimum_nights 364 \n",
"maximum_nights 364 \n",
"minimum_minimum_nights 364 \n",
"maximum_minimum_nights 364 \n",
"minimum_maximum_nights 364 \n",
"maximum_maximum_nights 364 \n",
"minimum_nights_avg_ntm 364 \n",
"maximum_nights_avg_ntm 364 \n",
"calendar_updated 11 months ago \n",
"has_availability t \n",
"availability_30 30 \n",
"availability_60 60 \n",
"availability_90 90 \n",
"availability_365 365 \n",
"calendar_last_scraped 2019-03-08 \n",
"number_of_reviews 70 \n",
"number_of_reviews_ltm 0 \n",
"first_review 2010-02-18 \n",
"last_review 2016-06-14 \n",
"review_scores_rating 92 \n",
"review_scores_accuracy 9 \n",
"review_scores_cleanliness 10 \n",
"review_scores_checkin 10 \n",
"review_scores_communication 10 \n",
"review_scores_location 10 \n",
"review_scores_value 9 \n",
"requires_license t \n",
"license NaN \n",
"jurisdiction_names {\"British Columbia\",\" Canada\",\" Vancouver\",\" B... \n",
"instant_bookable f \n",
"is_business_travel_ready f \n",
"cancellation_policy strict_14_with_grace_period \n",
"require_guest_profile_picture t \n",
"require_guest_phone_verification t \n",
"calculated_host_listings_count 1 \n",
"calculated_host_listings_count_entire_homes 0 \n",
"calculated_host_listings_count_private_rooms 1 \n",
"calculated_host_listings_count_shared_rooms 0 \n",
"reviews_per_month 0.64 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.set_option('display.max_rows', 500)\n",
"pd.set_option('display.max_columns', 500)\n",
"\n",
"df_listings_more.head(2).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not all columns will be required for visualization so let's do some clean-up:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"columns_to_keep = ['id', 'summary', 'space', 'description', 'host_listings_count', 'host_total_listings_count',\n",
" 'neighbourhood_cleansed', 'is_location_exact', 'property_type', 'accommodates', 'bedrooms',\n",
" 'bathrooms', 'beds', 'amenities', 'availability_30', 'availability_60', 'availability_90',\n",
" 'availability_365']\n",
"\n",
"df_listings_more = df_listings_more.loc[:,columns_to_keep]"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id int64\n",
"summary object\n",
"space object\n",
"description object\n",
"host_listings_count int64\n",
"host_total_listings_count int64\n",
"neighbourhood_cleansed object\n",
"is_location_exact object\n",
"property_type object\n",
"accommodates int64\n",
"bedrooms float64\n",
"bathrooms float64\n",
"beds float64\n",
"amenities object\n",
"availability_30 int64\n",
"availability_60 int64\n",
"availability_90 int64\n",
"availability_365 int64\n",
"dtype: object"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_listings_more.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id 0\n",
"summary 167\n",
"space 1057\n",
"description 81\n",
"host_listings_count 0\n",
"host_total_listings_count 0\n",
"neighbourhood_cleansed 0\n",
"is_location_exact 0\n",
"property_type 0\n",
"accommodates 0\n",
"bedrooms 3\n",
"bathrooms 0\n",
"beds 5\n",
"amenities 0\n",
"availability_30 0\n",
"availability_60 0\n",
"availability_90 0\n",
"availability_365 0\n",
"dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_listings_more.isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In case of missing data in `summary`, `space` and `description` columns we will replace NaN's with empty strings:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"df_listings_more['summary'] = df_listings_more['summary'].fillna('')\n",
"df_listings_more['space'] = df_listings_more['space'].fillna('')\n",
"df_listings_more['description'] = df_listings_more['description'].fillna('')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id 0\n",
"summary 0\n",
"space 0\n",
"description 0\n",
"host_listings_count 0\n",
"host_total_listings_count 0\n",
"neighbourhood_cleansed 0\n",
"is_location_exact 0\n",
"property_type 0\n",
"accommodates 0\n",
"bedrooms 3\n",
"bathrooms 0\n",
"beds 5\n",
"amenities 0\n",
"availability_30 0\n",
"availability_60 0\n",
"availability_90 0\n",
"availability_365 0\n",
"dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_listings_more.isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will deal with the remaining NaN's in the following sections."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now merge the two datasets, `df_listings` and `df_listings_more` into one single dataset and call it `df_listings_clean`:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"df_listings_clean = df_listings.merge(df_listings_more, on = 'id')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id 0\n",
"name 1\n",
"host_id 0\n",
"host_name 0\n",
"latitude 0\n",
"longitude 0\n",
"room_type 0\n",
"price 0\n",
"minimum_nights 0\n",
"number_of_reviews 0\n",
"calculated_host_listings_count 0\n",
"summary 0\n",
"space 0\n",
"description 0\n",
"host_listings_count 0\n",
"host_total_listings_count 0\n",
"neighbourhood_cleansed 0\n",
"is_location_exact 0\n",
"property_type 0\n",
"accommodates 0\n",
"bedrooms 3\n",
"bathrooms 0\n",
"beds 5\n",
"amenities 0\n",
"availability_30 0\n",
"availability_60 0\n",
"availability_90 0\n",
"availability_365 0\n",
"dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_listings_clean.isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now drop the 9 remaining rows with missing values in `name`, `bedrooms` and `beds` columns."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"df_listings_clean.dropna(axis = 0, inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_listings_clean.isna().sum().any()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"id int64\n",
"name object\n",
"host_id int64\n",
"host_name object\n",
"latitude float64\n",
"longitude float64\n",
"room_type object\n",
"price int64\n",
"minimum_nights int64\n",
"number_of_reviews int64\n",
"calculated_host_listings_count int64\n",
"summary object\n",
"space object\n",
"description object\n",
"host_listings_count int64\n",
"host_total_listings_count int64\n",
"neighbourhood_cleansed object\n",
"is_location_exact object\n",
"property_type object\n",
"accommodates int64\n",
"bedrooms float64\n",
"bathrooms float64\n",
"beds float64\n",
"amenities object\n",
"availability_30 int64\n",
"availability_60 int64\n",
"availability_90 int64\n",
"availability_365 int64\n",
"dtype: object"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_listings_clean.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional help for visulaization down the road let's change data type in some columns to appropriate integer type:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"df_listings_clean['bedrooms'] = df_listings_clean['bedrooms'].astype('int64')\n",
"df_listings_clean['bathrooms'] = df_listings_clean['bathrooms'].astype('int64')\n",
"df_listings_clean['beds'] = df_listings_clean['beds'].astype('int64')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### EDA on different neighbourhoods:\n",
"\n",
"Although Tableau will be used down the road to visualize different measures in different neighbourhoods we can also have a quick look at different features across neighbourhoods in here to get an understanding that will help us choose the best visualizations in Tableau:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# Creating an empty dataframe to store different measures for different neighbourhoods:\n",
"df_neighbourhood = pd.DataFrame(index = df_listings_clean['neighbourhood_cleansed'].unique())\n",
"\n",
"# Average nighly price in different neighbourhoods:\n",
"df_neighbourhood['Avg_price'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['price'].mean().round()\n",
"\n",
"# Total number of listings in different neighbourhoods:\n",
"df_neighbourhood['Listing_count'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['id'].count()\n",
"\n",
"# Average 30-Day availability in different neighbourhoods:\n",
"df_neighbourhood['Avg_30_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_30']\\\n",
" .mean().round().astype(int)\n",
"\n",
"# Average 60-Day availability in different neighbourhoods:\n",
"df_neighbourhood['Avg_60_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_60']\\\n",
" .mean().round().astype(int)\n",
"\n",
"# Average 90-Day availability in different neighbourhoods:\n",
"df_neighbourhood['Avg_90_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_90']\\\n",
" .mean().round().astype(int)\n",
"\n",
"# Average 365-Day availability in different neighbourhoods:\n",
"df_neighbourhood['Avg_365_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_365']\\\n",
" .mean().round().astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Avg_price | \n",
" Listing_count | \n",
" Avg_30_availability | \n",
" Avg_60_availability | \n",
" Avg_90_availability | \n",
" Avg_365_availability | \n",
"
\n",
" \n",
" \n",
" \n",
" Downtown | \n",
" 196.0 | \n",
" 1064 | \n",
" 11 | \n",
" 25 | \n",
" 41 | \n",
" 147 | \n",
"
\n",
" \n",
" Riley Park | \n",
" 145.0 | \n",
" 275 | \n",
" 10 | \n",
" 21 | \n",
" 32 | \n",
" 117 | \n",
"
\n",
" \n",
" Downtown Eastside | \n",
" 176.0 | \n",
" 271 | \n",
" 12 | \n",
" 25 | \n",
" 38 | \n",
" 129 | \n",
"
\n",
" \n",
" West End | \n",
" 168.0 | \n",
" 362 | \n",
" 10 | \n",
" 22 | \n",
" 36 | \n",
" 117 | \n",
"
\n",
" \n",
" Kensington-Cedar Cottage | \n",
" 134.0 | \n",
" 319 | \n",
" 10 | \n",
" 22 | \n",
" 35 | \n",
" 136 | \n",
"
\n",
" \n",
" Hastings-Sunrise | \n",
" 120.0 | \n",
" 247 | \n",
" 12 | \n",
" 26 | \n",
" 42 | \n",
" 138 | \n",
"
\n",
" \n",
" Renfrew-Collingwood | \n",
" 102.0 | \n",
" 200 | \n",
" 12 | \n",
" 27 | \n",
" 43 | \n",
" 153 | \n",
"
\n",
" \n",
" Mount Pleasant | \n",
" 150.0 | \n",
" 294 | \n",
" 10 | \n",
" 22 | \n",
" 35 | \n",
" 126 | \n",
"
\n",
" \n",
" Grandview-Woodland | \n",
" 135.0 | \n",
" 247 | \n",
" 8 | \n",
" 17 | \n",
" 28 | \n",
" 116 | \n",
"
\n",
" \n",
" Fairview | \n",
" 144.0 | \n",
" 151 | \n",
" 7 | \n",
" 18 | \n",
" 31 | \n",
" 132 | \n",
"
\n",
" \n",
" Kitsilano | \n",
" 190.0 | \n",
" 424 | \n",
" 10 | \n",
" 22 | \n",
" 34 | \n",
" 133 | \n",
"
\n",
" \n",
" Kerrisdale | \n",
" 173.0 | \n",
" 69 | \n",
" 15 | \n",
" 31 | \n",
" 47 | \n",
" 145 | \n",
"
\n",
" \n",
" Arbutus Ridge | \n",
" 198.0 | \n",
" 84 | \n",
" 13 | \n",
" 29 | \n",
" 47 | \n",
" 156 | \n",
"
\n",
" \n",
" Sunset | \n",
" 172.0 | \n",
" 106 | \n",
" 13 | \n",
" 30 | \n",
" 48 | \n",
" 142 | \n",
"
\n",
" \n",
" Dunbar Southlands | \n",
" 190.0 | \n",
" 138 | \n",
" 13 | \n",
" 30 | \n",
" 48 | \n",
" 170 | \n",
"
\n",
" \n",
" Marpole | \n",
" 129.0 | \n",
" 166 | \n",
" 17 | \n",
" 38 | \n",
" 60 | \n",
" 167 | \n",
"
\n",
" \n",
" Killarney | \n",
" 91.0 | \n",
" 67 | \n",
" 15 | \n",
" 34 | \n",
" 53 | \n",
" 140 | \n",
"
\n",
" \n",
" Shaughnessy | \n",
" 216.0 | \n",
" 87 | \n",
" 12 | \n",
" 26 | \n",
" 39 | \n",
" 107 | \n",
"
\n",
" \n",
" South Cambie | \n",
" 185.0 | \n",
" 59 | \n",
" 10 | \n",
" 22 | \n",
" 35 | \n",
" 125 | \n",
"
\n",
" \n",
" Strathcona | \n",
" 128.0 | \n",
" 22 | \n",
" 8 | \n",
" 20 | \n",
" 32 | \n",
" 129 | \n",
"
\n",
" \n",
" West Point Grey | \n",
" 216.0 | \n",
" 88 | \n",
" 13 | \n",
" 27 | \n",
" 40 | \n",
" 133 | \n",
"
\n",
" \n",
" Oakridge | \n",
" 141.0 | \n",
" 88 | \n",
" 15 | \n",
" 34 | \n",
" 52 | \n",
" 136 | \n",
"
\n",
" \n",
" Victoria-Fraserview | \n",
" 90.0 | \n",
" 101 | \n",
" 19 | \n",
" 41 | \n",
" 65 | \n",
" 228 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Avg_price Listing_count Avg_30_availability \\\n",
"Downtown 196.0 1064 11 \n",
"Riley Park 145.0 275 10 \n",
"Downtown Eastside 176.0 271 12 \n",
"West End 168.0 362 10 \n",
"Kensington-Cedar Cottage 134.0 319 10 \n",
"Hastings-Sunrise 120.0 247 12 \n",
"Renfrew-Collingwood 102.0 200 12 \n",
"Mount Pleasant 150.0 294 10 \n",
"Grandview-Woodland 135.0 247 8 \n",
"Fairview 144.0 151 7 \n",
"Kitsilano 190.0 424 10 \n",
"Kerrisdale 173.0 69 15 \n",
"Arbutus Ridge 198.0 84 13 \n",
"Sunset 172.0 106 13 \n",
"Dunbar Southlands 190.0 138 13 \n",
"Marpole 129.0 166 17 \n",
"Killarney 91.0 67 15 \n",
"Shaughnessy 216.0 87 12 \n",
"South Cambie 185.0 59 10 \n",
"Strathcona 128.0 22 8 \n",
"West Point Grey 216.0 88 13 \n",
"Oakridge 141.0 88 15 \n",
"Victoria-Fraserview 90.0 101 19 \n",
"\n",
" Avg_60_availability Avg_90_availability \\\n",
"Downtown 25 41 \n",
"Riley Park 21 32 \n",
"Downtown Eastside 25 38 \n",
"West End 22 36 \n",
"Kensington-Cedar Cottage 22 35 \n",
"Hastings-Sunrise 26 42 \n",
"Renfrew-Collingwood 27 43 \n",
"Mount Pleasant 22 35 \n",
"Grandview-Woodland 17 28 \n",
"Fairview 18 31 \n",
"Kitsilano 22 34 \n",
"Kerrisdale 31 47 \n",
"Arbutus Ridge 29 47 \n",
"Sunset 30 48 \n",
"Dunbar Southlands 30 48 \n",
"Marpole 38 60 \n",
"Killarney 34 53 \n",
"Shaughnessy 26 39 \n",
"South Cambie 22 35 \n",
"Strathcona 20 32 \n",
"West Point Grey 27 40 \n",
"Oakridge 34 52 \n",
"Victoria-Fraserview 41 65 \n",
"\n",
" Avg_365_availability \n",
"Downtown 147 \n",
"Riley Park 117 \n",
"Downtown Eastside 129 \n",
"West End 117 \n",
"Kensington-Cedar Cottage 136 \n",
"Hastings-Sunrise 138 \n",
"Renfrew-Collingwood 153 \n",
"Mount Pleasant 126 \n",
"Grandview-Woodland 116 \n",
"Fairview 132 \n",
"Kitsilano 133 \n",
"Kerrisdale 145 \n",
"Arbutus Ridge 156 \n",
"Sunset 142 \n",
"Dunbar Southlands 170 \n",
"Marpole 167 \n",
"Killarney 140 \n",
"Shaughnessy 107 \n",
"South Cambie 125 \n",
"Strathcona 129 \n",
"West Point Grey 133 \n",
"Oakridge 136 \n",
"Victoria-Fraserview 228 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_neighbourhood"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding the most common words in Reviews and listing Summaries:\n",
"\n",
"With the aim of making word clouds of most common words in property summary (description) and reviews we go through all reviews and listing summaries in each neghbourhood and will feed Tableau with tables that list the top 50 words used in reviews and listing summaries, in each neighbourhood:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Funtion to return the top 50 words:\n",
"\n",
"from operator import itemgetter\n",
"\n",
"def top_word_counter(df, filter_col, filter_match, lookup_col):\n",
"\n",
" words = list(entry.lower().strip() for entry in df[df[filter_col] == filter_match][lookup_col].astype(str))\n",
" wordcount = {}\n",
" \n",
" # Words that will not be counted:\n",
" stopwords = [\"i\", \"me\", \"my\", \"myself\", \"we\", \"our\", \"ours\", \"ourselves\", \"you\", \"your\", \"yours\", \n",
" \"yourself\", \"yourselves\", \"he\", \"him\", \"his\", \"himself\", \"she\", \"her\", \"hers\", \"herself\", \n",
" \"it\", \"its\", \"itself\", \"they\", \"them\", \"their\", \"theirs\", \"themselves\", \"what\", \"which\", \n",
" \"who\", \"whom\", \"this\", \"that\", \"these\", \"those\", \"am\", \"is\", \"are\", \"was\", \"were\", \"be\", \n",
" \"been\", \"being\", \"have\", \"has\", \"had\", \"having\", \"do\", \"does\", \"did\", \"doing\", \"a\", \"an\", \n",
" \"the\", \"and\", \"but\", \"if\", \"or\", \"because\", \"as\", \"until\", \"while\", \"of\", \"at\", \"by\", \"for\", \n",
" \"with\", \"about\", \"against\", \"between\", \"into\", \"through\", \"during\", \"before\", \"after\", \"above\", \n",
" \"below\", \"to\", \"from\", \"up\", \"down\", \"in\", \"out\", \"on\", \"off\", \"over\", \"under\", \"again\", \"further\", \n",
" \"then\", \"once\", \"here\", \"there\", \"when\", \"where\", \"why\", \"how\", \"all\", \"any\", \"both\", \"each\", \n",
" \"few\", \"more\", \"most\", \"other\", \"some\", \"such\", \"no\", \"nor\", \"not\", \"only\", \"own\", \"same\", \"so\", \n",
" \"than\", \"too\", \"very\", \"s\", \"t\", \"can\", \"will\", \"just\", \"don\", \"should\", \"now\", \"\"]\n",
"\n",
" symbols_to_replace = [\".\", \",\", \":\", \";\", \"\\\"\", \"/\", \"!\", \"?\", \"“\", \"‘\", \"*\", \")\", \"(\", \"|\", \n",
" \">\", \"<\", \"{\", \"}\", \"&\", \"^\", \"~\", \"+\"]\n",
" \n",
" seperator = ' '\n",
" text = seperator.join(words)\n",
"\n",
" for word in text.split():\n",
" for symbol in symbols_to_replace:\n",
" word = word.replace(symbol,\"\")\n",
"\n",
" if word not in stopwords:\n",
" if word not in wordcount:\n",
" wordcount[word] = 1\n",
" else:\n",
" wordcount[word] += 1\n",
"\n",
" top_words = sorted(wordcount.items(), key=itemgetter(1), reverse = True)[0:50]\n",
" \n",
" return [(filter_match, i[0], i[1]) for i in top_words]"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Finding the top 50 words in summaries for each neighbouthood:\n",
"summary_words = []\n",
"\n",
"for neighb in df_listings_clean['neighbourhood_cleansed'].unique():\n",
" summary_words += top_word_counter(df_listings_clean, 'neighbourhood_cleansed', neighb, 'summary')\n",
"\n",
"summary_words = pd.DataFrame(summary_words)\n",
"summary_words.columns = ['neighbourhood_cleansed', 'Word', 'Word_Count']"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# Cleaning the reviews dataframe before checking for most used words:\n",
"reviews_clean = df_reviews_more.drop(['id', 'date', 'reviewer_id', 'reviewer_name'] , axis = 1)\\\n",
" .rename(columns = {'listing_id': 'id', 'comments': 'review'})\\\n",
" .merge(df_listings_more.loc[:, ['id', 'neighbourhood_cleansed']], on = 'id')\n",
"\n",
"# Finding the top 50 words in reviews for each neighbouthood:\n",
"review_words = []\n",
"\n",
"for neighb in reviews_clean['neighbourhood_cleansed'].unique():\n",
" review_words += top_word_counter(reviews_clean, 'neighbourhood_cleansed', neighb, 'review')\n",
"\n",
"review_words = pd.DataFrame(review_words)\n",
"review_words.columns = ['neighbourhood_cleansed', 'Word', 'Word_Count']"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Exporting new, clean dataframes as csv files that will feed into Tableau:\n",
"\n",
"df_listings_clean.to_csv('Data/clean_data.csv')\n",
"review_words.to_csv('Data/review_words.csv')\n",
"summary_words.to_csv('Data/summary_words.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data visualization to continue on Tableau."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}