{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Airbnb Homes in Vancouver - Data Cleaning\n", "#### Arash Tavassoli - April 2019\n", "---\n", "### Introduction\n", "\n", "In less than 10 years since its official launch, airbnb has now found its place in many of our lives and continues to prove itself as an affordable and convenient alternative to conventional hotel services. This notebook cleans and prepares the data before importing into a Tableau workbook that will visualize airbnb's presence in the City of Vancouver based on publicly available data from airbnb-owned website, Inside-airbnb.\n", "\n", "The Tableau visualization is available in [here](https://public.tableau.com/views/AirbnbHomesinVancouver/Dashboard?:embed=y&:display_count=yes).\n", "\n", "### Data Source\n", "\n", "The data is downloaded from Inside-airbnb on April 29, 2019 and reflects the detailed listing data, calendar data and listing reviews for 4,929 uniquely identified homes in Vancouver, BC, that were active on March 8, 2019.\n", "\n", "The data is dowloaded from [Inside Airbnb website](http://insideairbnb.com/get-the-data.html) where information is available under a Creative Commons CC0 1.0 Universal (CC0 1.0) \"Public Domain Dedication\" license.\n", "\n", "The datasets are specific to the City of Vancouver and are provided in 5 separate CSV and 1 GeoJSON files:\n", "\n", "|File Name | Description \n", "|:---------|:---------\n", "|`calendar.csv` | Detailed Calendar Data for listings in Vancouver\n", "|`listings.csv` | Summary information and metrics for listings in Vancouver\n", "|`listings-2.csv` | Detailed Listings data for Vancouver\n", "|`reviews.csv` | Summary Review data and Listing ID\n", "|`reviews-2.csv` | Detailed Review Data for listings in Vancouver\n", "|`neighbourhoods.csv` | Neighbourhood list for geo filter (ourced from city or open source GIS files)\n", "|`neighbourhoods.geojson` | GeoJSON file of neighbourhoods of the city" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Importing libraries:\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Importing raw datasets:\n", "df_cal = pd.read_csv('Data/calendar.csv')\n", "df_listings = pd.read_csv('Data/listings.csv')\n", "df_listings_more = pd.read_csv('Data/listings-2.csv')\n", "df_neighb = pd.read_csv('Data/neighbourhoods.csv')\n", "df_reviews = pd.read_csv('Data/reviews.csv')\n", "df_reviews_more = pd.read_csv('Data/reviews-2.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cleaning the `df_listings` and `df_listings_more` datasets:\n", "\n", "We start by looking at the two listing datasets: `df_listings` and `df_listings_more`" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape of df_listings:\t\t (4938, 16)\n", "Shape of df_listings_more:\t (4938, 106)\n" ] } ], "source": [ "print('Shape of df_listings:\\t\\t', df_listings.shape)\n", "print('Shape of df_listings_more:\\t', df_listings_more.shape)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
id100801140013188
nameD1 - Million Dollar View 2 BRCentral Lovely Rm in Victorian HomeGarden level studio in ideal loc.
host_id308994230551466
host_nameRamiJaynneFamily Guns
neighbourhood_groupNaNNaNNaN
neighbourhoodDowntownRiley ParkRiley Park
latitude49.287749.253849.2458
longitude-123.121-123.113-123.105
room_typeEntire home/aptPrivate roomEntire home/apt
price29660120
minimum_nights603642
number_of_reviews1670171
last_review2017-02-262016-06-142019-02-24
reviews_per_month0.180.641.55
calculated_host_listings_count3311
availability_365191365214
\n", "
" ], "text/plain": [ " 0 \\\n", "id 10080 \n", "name D1 - Million Dollar View 2 BR \n", "host_id 30899 \n", "host_name Rami \n", "neighbourhood_group NaN \n", "neighbourhood Downtown \n", "latitude 49.2877 \n", "longitude -123.121 \n", "room_type Entire home/apt \n", "price 296 \n", "minimum_nights 60 \n", "number_of_reviews 16 \n", "last_review 2017-02-26 \n", "reviews_per_month 0.18 \n", "calculated_host_listings_count 33 \n", "availability_365 191 \n", "\n", " 1 \\\n", "id 11400 \n", "name Central Lovely Rm in Victorian Home \n", "host_id 42305 \n", "host_name Jaynne \n", "neighbourhood_group NaN \n", "neighbourhood Riley Park \n", "latitude 49.2538 \n", "longitude -123.113 \n", "room_type Private room \n", "price 60 \n", "minimum_nights 364 \n", "number_of_reviews 70 \n", "last_review 2016-06-14 \n", "reviews_per_month 0.64 \n", "calculated_host_listings_count 1 \n", "availability_365 365 \n", "\n", " 2 \n", "id 13188 \n", "name Garden level studio in ideal loc. \n", "host_id 51466 \n", "host_name Family Guns \n", "neighbourhood_group NaN \n", "neighbourhood Riley Park \n", "latitude 49.2458 \n", "longitude -123.105 \n", "room_type Entire home/apt \n", "price 120 \n", "minimum_nights 2 \n", "number_of_reviews 171 \n", "last_review 2019-02-24 \n", "reviews_per_month 1.55 \n", "calculated_host_listings_count 1 \n", "availability_365 214 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_listings.head(3).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We drop the columns that we do not need in the analysis (not required or repeated on other datasets):" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "df_listings.drop(['neighbourhood_group', 'last_review', 'reviews_per_month', 'availability_365', 'neighbourhood'], \n", " axis = 1, inplace = True)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id 0\n", "name 1\n", "host_id 0\n", "host_name 0\n", "latitude 0\n", "longitude 0\n", "room_type 0\n", "price 0\n", "minimum_nights 0\n", "number_of_reviews 0\n", "calculated_host_listings_count 0\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Checking for NaNs:\n", "df_listings.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll deal with that one missing value later." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
id1008011400
listing_urlhttps://www.airbnb.com/rooms/10080https://www.airbnb.com/rooms/11400
scrape_id2019030805270020190308052700
last_scraped2019-03-082019-03-08
nameD1 - Million Dollar View 2 BRCentral Lovely Rm in Victorian Home
summaryStunning two bedroom, two bathroom apartment. ...Well-appointed room with a view of the garden ...
spaceBed setup: 2 x queen, I can add up to 2 twin s...Centrally-located lovely, quiet home on tree-l...
descriptionStunning two bedroom, two bathroom apartment. ...Well-appointed room with a view of the garden ...
experiences_offerednonenone
neighborhood_overviewNaNVery quiet residential area, yet only 1-1/2 bl...
notes1. CHECK-IN TIME IS AFTER 3PM PST AND CHECK-OU...NaN
transitNaNBus stop 1-1/2 blocks away. 15 minutes to down...
accessThere is no access to the building ammenities.Access to my home and garden. Shared bathroom....
interactionNaNI live here so will interact with the guest wh...
house_rules1. CHECK-IN TIME IS AFTER 3 PM PST AND CHECK-O...-No smoking indoors. -Recycling, composting ...
thumbnail_urlNaNNaN
medium_urlNaNNaN
picture_urlhttps://a0.muscache.com/im/pictures/55778229/c...https://a0.muscache.com/im/pictures/74189163/a...
xl_picture_urlNaNNaN
host_id3089942305
host_urlhttps://www.airbnb.com/users/show/30899https://www.airbnb.com/users/show/42305
host_nameRamiJaynne
host_since2009-08-102009-09-30
host_locationVancouver, British Columbia, CanadaVancouver, British Columbia, Canada
host_aboutI will be happy to host you.I am a self-employed graphic designer, AutoCAD...
host_response_timewithin a few hoursNaN
host_response_rate100%NaN
host_acceptance_rateNaNNaN
host_is_superhostff
host_thumbnail_urlhttps://a0.muscache.com/im/pictures/user/55302...https://a0.muscache.com/im/users/42305/profile...
host_picture_urlhttps://a0.muscache.com/im/pictures/user/55302...https://a0.muscache.com/im/users/42305/profile...
host_neighbourhoodCoal HarbourRiley Park
host_listings_count331
host_total_listings_count331
host_verifications['email', 'phone', 'reviews', 'jumio', 'govern...['email', 'phone', 'reviews', 'jumio', 'govern...
host_has_profile_pictt
host_identity_verifiedtt
streetVancouver, BC, CanadaVancouver, BC, Canada
neighbourhoodCoal HarbourRiley Park
neighbourhood_cleansedDowntownRiley Park
neighbourhood_group_cleansedNaNNaN
cityVancouverVancouver
stateBCBC
zipcodeV6E 2P4V5Y
marketVancouverVancouver
smart_locationVancouver, CanadaVancouver, Canada
country_codeCACA
countryCanadaCanada
latitude49.287749.2538
longitude-123.121-123.113
is_location_exacttt
property_typeCondominiumHouse
room_typeEntire home/aptPrivate room
accommodates61
bathrooms21
bedrooms21
beds41
bed_typeReal BedReal Bed
amenities{TV,\"Cable TV\",Internet,Wifi,\"Air conditioning...{TV,\"Cable TV\",Internet,Wifi,\"Free parking on ...
square_feet12000
price$296.00$60.00
weekly_priceNaN$385.00
monthly_price$4,750.00$1,400.00
security_deposit$0.00NaN
cleaning_fee$195.00NaN
guests_included11
extra_people$0.00$0.00
minimum_nights60364
maximum_nights1124364
minimum_minimum_nights60364
maximum_minimum_nights60364
minimum_maximum_nights1124364
maximum_maximum_nights1124364
minimum_nights_avg_ntm60364
maximum_nights_avg_ntm1124364
calendar_updatedtoday11 months ago
has_availabilitytt
availability_30030
availability_60060
availability_90090
availability_365191365
calendar_last_scraped2019-03-082019-03-08
number_of_reviews1670
number_of_reviews_ltm00
first_review2011-11-152010-02-18
last_review2017-02-262016-06-14
review_scores_rating9392
review_scores_accuracy99
review_scores_cleanliness910
review_scores_checkin1010
review_scores_communication910
review_scores_location1010
review_scores_value99
requires_licensett
license18-476608NaN
jurisdiction_names{\"British Columbia\",\" Canada\",\" Vancouver\",\" B...{\"British Columbia\",\" Canada\",\" Vancouver\",\" B...
instant_bookableff
is_business_travel_readyff
cancellation_policystrict_14_with_grace_periodstrict_14_with_grace_period
require_guest_profile_pictureft
require_guest_phone_verificationft
calculated_host_listings_count331
calculated_host_listings_count_entire_homes330
calculated_host_listings_count_private_rooms01
calculated_host_listings_count_shared_rooms00
reviews_per_month0.180.64
\n", "
" ], "text/plain": [ " 0 \\\n", "id 10080 \n", "listing_url https://www.airbnb.com/rooms/10080 \n", "scrape_id 20190308052700 \n", "last_scraped 2019-03-08 \n", "name D1 - Million Dollar View 2 BR \n", "summary Stunning two bedroom, two bathroom apartment. ... \n", "space Bed setup: 2 x queen, I can add up to 2 twin s... \n", "description Stunning two bedroom, two bathroom apartment. ... \n", "experiences_offered none \n", "neighborhood_overview NaN \n", "notes 1. CHECK-IN TIME IS AFTER 3PM PST AND CHECK-OU... \n", "transit NaN \n", "access There is no access to the building ammenities. \n", "interaction NaN \n", "house_rules 1. CHECK-IN TIME IS AFTER 3 PM PST AND CHECK-O... \n", "thumbnail_url NaN \n", "medium_url NaN \n", "picture_url https://a0.muscache.com/im/pictures/55778229/c... \n", "xl_picture_url NaN \n", "host_id 30899 \n", "host_url https://www.airbnb.com/users/show/30899 \n", "host_name Rami \n", "host_since 2009-08-10 \n", "host_location Vancouver, British Columbia, Canada \n", "host_about I will be happy to host you. \n", "host_response_time within a few hours \n", "host_response_rate 100% \n", "host_acceptance_rate NaN \n", "host_is_superhost f \n", "host_thumbnail_url https://a0.muscache.com/im/pictures/user/55302... \n", "host_picture_url https://a0.muscache.com/im/pictures/user/55302... \n", "host_neighbourhood Coal Harbour \n", "host_listings_count 33 \n", "host_total_listings_count 33 \n", "host_verifications ['email', 'phone', 'reviews', 'jumio', 'govern... \n", "host_has_profile_pic t \n", "host_identity_verified t \n", "street Vancouver, BC, Canada \n", "neighbourhood Coal Harbour \n", "neighbourhood_cleansed Downtown \n", "neighbourhood_group_cleansed NaN \n", "city Vancouver \n", "state BC \n", "zipcode V6E 2P4 \n", "market Vancouver \n", "smart_location Vancouver, Canada \n", "country_code CA \n", "country Canada \n", "latitude 49.2877 \n", "longitude -123.121 \n", "is_location_exact t \n", "property_type Condominium \n", "room_type Entire home/apt \n", "accommodates 6 \n", "bathrooms 2 \n", "bedrooms 2 \n", "beds 4 \n", "bed_type Real Bed \n", "amenities {TV,\"Cable TV\",Internet,Wifi,\"Air conditioning... \n", "square_feet 1200 \n", "price $296.00 \n", "weekly_price NaN \n", "monthly_price $4,750.00 \n", "security_deposit $0.00 \n", "cleaning_fee $195.00 \n", "guests_included 1 \n", "extra_people $0.00 \n", "minimum_nights 60 \n", "maximum_nights 1124 \n", "minimum_minimum_nights 60 \n", "maximum_minimum_nights 60 \n", "minimum_maximum_nights 1124 \n", "maximum_maximum_nights 1124 \n", "minimum_nights_avg_ntm 60 \n", "maximum_nights_avg_ntm 1124 \n", "calendar_updated today \n", "has_availability t \n", "availability_30 0 \n", "availability_60 0 \n", "availability_90 0 \n", "availability_365 191 \n", "calendar_last_scraped 2019-03-08 \n", "number_of_reviews 16 \n", "number_of_reviews_ltm 0 \n", "first_review 2011-11-15 \n", "last_review 2017-02-26 \n", "review_scores_rating 93 \n", "review_scores_accuracy 9 \n", "review_scores_cleanliness 9 \n", "review_scores_checkin 10 \n", "review_scores_communication 9 \n", "review_scores_location 10 \n", "review_scores_value 9 \n", "requires_license t \n", "license 18-476608 \n", "jurisdiction_names {\"British Columbia\",\" Canada\",\" Vancouver\",\" B... \n", "instant_bookable f \n", "is_business_travel_ready f \n", "cancellation_policy strict_14_with_grace_period \n", "require_guest_profile_picture f \n", "require_guest_phone_verification f \n", "calculated_host_listings_count 33 \n", "calculated_host_listings_count_entire_homes 33 \n", "calculated_host_listings_count_private_rooms 0 \n", "calculated_host_listings_count_shared_rooms 0 \n", "reviews_per_month 0.18 \n", "\n", " 1 \n", "id 11400 \n", "listing_url https://www.airbnb.com/rooms/11400 \n", "scrape_id 20190308052700 \n", "last_scraped 2019-03-08 \n", "name Central Lovely Rm in Victorian Home \n", "summary Well-appointed room with a view of the garden ... \n", "space Centrally-located lovely, quiet home on tree-l... \n", "description Well-appointed room with a view of the garden ... \n", "experiences_offered none \n", "neighborhood_overview Very quiet residential area, yet only 1-1/2 bl... \n", "notes NaN \n", "transit Bus stop 1-1/2 blocks away. 15 minutes to down... \n", "access Access to my home and garden. Shared bathroom.... \n", "interaction I live here so will interact with the guest wh... \n", "house_rules -No smoking indoors. -Recycling, composting ... \n", "thumbnail_url NaN \n", "medium_url NaN \n", "picture_url https://a0.muscache.com/im/pictures/74189163/a... \n", "xl_picture_url NaN \n", "host_id 42305 \n", "host_url https://www.airbnb.com/users/show/42305 \n", "host_name Jaynne \n", "host_since 2009-09-30 \n", "host_location Vancouver, British Columbia, Canada \n", "host_about I am a self-employed graphic designer, AutoCAD... \n", "host_response_time NaN \n", "host_response_rate NaN \n", "host_acceptance_rate NaN \n", "host_is_superhost f \n", "host_thumbnail_url https://a0.muscache.com/im/users/42305/profile... \n", "host_picture_url https://a0.muscache.com/im/users/42305/profile... \n", "host_neighbourhood Riley Park \n", "host_listings_count 1 \n", "host_total_listings_count 1 \n", "host_verifications ['email', 'phone', 'reviews', 'jumio', 'govern... \n", "host_has_profile_pic t \n", "host_identity_verified t \n", "street Vancouver, BC, Canada \n", "neighbourhood Riley Park \n", "neighbourhood_cleansed Riley Park \n", "neighbourhood_group_cleansed NaN \n", "city Vancouver \n", "state BC \n", "zipcode V5Y \n", "market Vancouver \n", "smart_location Vancouver, Canada \n", "country_code CA \n", "country Canada \n", "latitude 49.2538 \n", "longitude -123.113 \n", "is_location_exact t \n", "property_type House \n", "room_type Private room \n", "accommodates 1 \n", "bathrooms 1 \n", "bedrooms 1 \n", "beds 1 \n", "bed_type Real Bed \n", "amenities {TV,\"Cable TV\",Internet,Wifi,\"Free parking on ... \n", "square_feet 0 \n", "price $60.00 \n", "weekly_price $385.00 \n", "monthly_price $1,400.00 \n", "security_deposit NaN \n", "cleaning_fee NaN \n", "guests_included 1 \n", "extra_people $0.00 \n", "minimum_nights 364 \n", "maximum_nights 364 \n", "minimum_minimum_nights 364 \n", "maximum_minimum_nights 364 \n", "minimum_maximum_nights 364 \n", "maximum_maximum_nights 364 \n", "minimum_nights_avg_ntm 364 \n", "maximum_nights_avg_ntm 364 \n", "calendar_updated 11 months ago \n", "has_availability t \n", "availability_30 30 \n", "availability_60 60 \n", "availability_90 90 \n", "availability_365 365 \n", "calendar_last_scraped 2019-03-08 \n", "number_of_reviews 70 \n", "number_of_reviews_ltm 0 \n", "first_review 2010-02-18 \n", "last_review 2016-06-14 \n", "review_scores_rating 92 \n", "review_scores_accuracy 9 \n", "review_scores_cleanliness 10 \n", "review_scores_checkin 10 \n", "review_scores_communication 10 \n", "review_scores_location 10 \n", "review_scores_value 9 \n", "requires_license t \n", "license NaN \n", "jurisdiction_names {\"British Columbia\",\" Canada\",\" Vancouver\",\" B... \n", "instant_bookable f \n", "is_business_travel_ready f \n", "cancellation_policy strict_14_with_grace_period \n", "require_guest_profile_picture t \n", "require_guest_phone_verification t \n", "calculated_host_listings_count 1 \n", "calculated_host_listings_count_entire_homes 0 \n", "calculated_host_listings_count_private_rooms 1 \n", "calculated_host_listings_count_shared_rooms 0 \n", "reviews_per_month 0.64 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.set_option('display.max_rows', 500)\n", "pd.set_option('display.max_columns', 500)\n", "\n", "df_listings_more.head(2).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not all columns will be required for visualization so let's do some clean-up:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "columns_to_keep = ['id', 'summary', 'space', 'description', 'host_listings_count', 'host_total_listings_count',\n", " 'neighbourhood_cleansed', 'is_location_exact', 'property_type', 'accommodates', 'bedrooms',\n", " 'bathrooms', 'beds', 'amenities', 'availability_30', 'availability_60', 'availability_90',\n", " 'availability_365']\n", "\n", "df_listings_more = df_listings_more.loc[:,columns_to_keep]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id int64\n", "summary object\n", "space object\n", "description object\n", "host_listings_count int64\n", "host_total_listings_count int64\n", "neighbourhood_cleansed object\n", "is_location_exact object\n", "property_type object\n", "accommodates int64\n", "bedrooms float64\n", "bathrooms float64\n", "beds float64\n", "amenities object\n", "availability_30 int64\n", "availability_60 int64\n", "availability_90 int64\n", "availability_365 int64\n", "dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_listings_more.dtypes" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id 0\n", "summary 167\n", "space 1057\n", "description 81\n", "host_listings_count 0\n", "host_total_listings_count 0\n", "neighbourhood_cleansed 0\n", "is_location_exact 0\n", "property_type 0\n", "accommodates 0\n", "bedrooms 3\n", "bathrooms 0\n", "beds 5\n", "amenities 0\n", "availability_30 0\n", "availability_60 0\n", "availability_90 0\n", "availability_365 0\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_listings_more.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In case of missing data in `summary`, `space` and `description` columns we will replace NaN's with empty strings:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "df_listings_more['summary'] = df_listings_more['summary'].fillna('')\n", "df_listings_more['space'] = df_listings_more['space'].fillna('')\n", "df_listings_more['description'] = df_listings_more['description'].fillna('')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id 0\n", "summary 0\n", "space 0\n", "description 0\n", "host_listings_count 0\n", "host_total_listings_count 0\n", "neighbourhood_cleansed 0\n", "is_location_exact 0\n", "property_type 0\n", "accommodates 0\n", "bedrooms 3\n", "bathrooms 0\n", "beds 5\n", "amenities 0\n", "availability_30 0\n", "availability_60 0\n", "availability_90 0\n", "availability_365 0\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_listings_more.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will deal with the remaining NaN's in the following sections." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now merge the two datasets, `df_listings` and `df_listings_more` into one single dataset and call it `df_listings_clean`:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "df_listings_clean = df_listings.merge(df_listings_more, on = 'id')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id 0\n", "name 1\n", "host_id 0\n", "host_name 0\n", "latitude 0\n", "longitude 0\n", "room_type 0\n", "price 0\n", "minimum_nights 0\n", "number_of_reviews 0\n", "calculated_host_listings_count 0\n", "summary 0\n", "space 0\n", "description 0\n", "host_listings_count 0\n", "host_total_listings_count 0\n", "neighbourhood_cleansed 0\n", "is_location_exact 0\n", "property_type 0\n", "accommodates 0\n", "bedrooms 3\n", "bathrooms 0\n", "beds 5\n", "amenities 0\n", "availability_30 0\n", "availability_60 0\n", "availability_90 0\n", "availability_365 0\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_listings_clean.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now drop the 9 remaining rows with missing values in `name`, `bedrooms` and `beds` columns." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "df_listings_clean.dropna(axis = 0, inplace = True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_listings_clean.isna().sum().any()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id int64\n", "name object\n", "host_id int64\n", "host_name object\n", "latitude float64\n", "longitude float64\n", "room_type object\n", "price int64\n", "minimum_nights int64\n", "number_of_reviews int64\n", "calculated_host_listings_count int64\n", "summary object\n", "space object\n", "description object\n", "host_listings_count int64\n", "host_total_listings_count int64\n", "neighbourhood_cleansed object\n", "is_location_exact object\n", "property_type object\n", "accommodates int64\n", "bedrooms float64\n", "bathrooms float64\n", "beds float64\n", "amenities object\n", "availability_30 int64\n", "availability_60 int64\n", "availability_90 int64\n", "availability_365 int64\n", "dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_listings_clean.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an optional help for visulaization down the road let's change data type in some columns to appropriate integer type:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "df_listings_clean['bedrooms'] = df_listings_clean['bedrooms'].astype('int64')\n", "df_listings_clean['bathrooms'] = df_listings_clean['bathrooms'].astype('int64')\n", "df_listings_clean['beds'] = df_listings_clean['beds'].astype('int64')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EDA on different neighbourhoods:\n", "\n", "Although Tableau will be used down the road to visualize different measures in different neighbourhoods we can also have a quick look at different features across neighbourhoods in here to get an understanding that will help us choose the best visualizations in Tableau:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Creating an empty dataframe to store different measures for different neighbourhoods:\n", "df_neighbourhood = pd.DataFrame(index = df_listings_clean['neighbourhood_cleansed'].unique())\n", "\n", "# Average nighly price in different neighbourhoods:\n", "df_neighbourhood['Avg_price'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['price'].mean().round()\n", "\n", "# Total number of listings in different neighbourhoods:\n", "df_neighbourhood['Listing_count'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['id'].count()\n", "\n", "# Average 30-Day availability in different neighbourhoods:\n", "df_neighbourhood['Avg_30_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_30']\\\n", " .mean().round().astype(int)\n", "\n", "# Average 60-Day availability in different neighbourhoods:\n", "df_neighbourhood['Avg_60_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_60']\\\n", " .mean().round().astype(int)\n", "\n", "# Average 90-Day availability in different neighbourhoods:\n", "df_neighbourhood['Avg_90_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_90']\\\n", " .mean().round().astype(int)\n", "\n", "# Average 365-Day availability in different neighbourhoods:\n", "df_neighbourhood['Avg_365_availability'] = df_listings_clean.groupby(by = 'neighbourhood_cleansed')['availability_365']\\\n", " .mean().round().astype(int)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Avg_priceListing_countAvg_30_availabilityAvg_60_availabilityAvg_90_availabilityAvg_365_availability
Downtown196.01064112541147
Riley Park145.0275102132117
Downtown Eastside176.0271122538129
West End168.0362102236117
Kensington-Cedar Cottage134.0319102235136
Hastings-Sunrise120.0247122642138
Renfrew-Collingwood102.0200122743153
Mount Pleasant150.0294102235126
Grandview-Woodland135.024781728116
Fairview144.015171831132
Kitsilano190.0424102234133
Kerrisdale173.069153147145
Arbutus Ridge198.084132947156
Sunset172.0106133048142
Dunbar Southlands190.0138133048170
Marpole129.0166173860167
Killarney91.067153453140
Shaughnessy216.087122639107
South Cambie185.059102235125
Strathcona128.02282032129
West Point Grey216.088132740133
Oakridge141.088153452136
Victoria-Fraserview90.0101194165228
\n", "
" ], "text/plain": [ " Avg_price Listing_count Avg_30_availability \\\n", "Downtown 196.0 1064 11 \n", "Riley Park 145.0 275 10 \n", "Downtown Eastside 176.0 271 12 \n", "West End 168.0 362 10 \n", "Kensington-Cedar Cottage 134.0 319 10 \n", "Hastings-Sunrise 120.0 247 12 \n", "Renfrew-Collingwood 102.0 200 12 \n", "Mount Pleasant 150.0 294 10 \n", "Grandview-Woodland 135.0 247 8 \n", "Fairview 144.0 151 7 \n", "Kitsilano 190.0 424 10 \n", "Kerrisdale 173.0 69 15 \n", "Arbutus Ridge 198.0 84 13 \n", "Sunset 172.0 106 13 \n", "Dunbar Southlands 190.0 138 13 \n", "Marpole 129.0 166 17 \n", "Killarney 91.0 67 15 \n", "Shaughnessy 216.0 87 12 \n", "South Cambie 185.0 59 10 \n", "Strathcona 128.0 22 8 \n", "West Point Grey 216.0 88 13 \n", "Oakridge 141.0 88 15 \n", "Victoria-Fraserview 90.0 101 19 \n", "\n", " Avg_60_availability Avg_90_availability \\\n", "Downtown 25 41 \n", "Riley Park 21 32 \n", "Downtown Eastside 25 38 \n", "West End 22 36 \n", "Kensington-Cedar Cottage 22 35 \n", "Hastings-Sunrise 26 42 \n", "Renfrew-Collingwood 27 43 \n", "Mount Pleasant 22 35 \n", "Grandview-Woodland 17 28 \n", "Fairview 18 31 \n", "Kitsilano 22 34 \n", "Kerrisdale 31 47 \n", "Arbutus Ridge 29 47 \n", "Sunset 30 48 \n", "Dunbar Southlands 30 48 \n", "Marpole 38 60 \n", "Killarney 34 53 \n", "Shaughnessy 26 39 \n", "South Cambie 22 35 \n", "Strathcona 20 32 \n", "West Point Grey 27 40 \n", "Oakridge 34 52 \n", "Victoria-Fraserview 41 65 \n", "\n", " Avg_365_availability \n", "Downtown 147 \n", "Riley Park 117 \n", "Downtown Eastside 129 \n", "West End 117 \n", "Kensington-Cedar Cottage 136 \n", "Hastings-Sunrise 138 \n", "Renfrew-Collingwood 153 \n", "Mount Pleasant 126 \n", "Grandview-Woodland 116 \n", "Fairview 132 \n", "Kitsilano 133 \n", "Kerrisdale 145 \n", "Arbutus Ridge 156 \n", "Sunset 142 \n", "Dunbar Southlands 170 \n", "Marpole 167 \n", "Killarney 140 \n", "Shaughnessy 107 \n", "South Cambie 125 \n", "Strathcona 129 \n", "West Point Grey 133 \n", "Oakridge 136 \n", "Victoria-Fraserview 228 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_neighbourhood" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding the most common words in Reviews and listing Summaries:\n", "\n", "With the aim of making word clouds of most common words in property summary (description) and reviews we go through all reviews and listing summaries in each neghbourhood and will feed Tableau with tables that list the top 50 words used in reviews and listing summaries, in each neighbourhood:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Funtion to return the top 50 words:\n", "\n", "from operator import itemgetter\n", "\n", "def top_word_counter(df, filter_col, filter_match, lookup_col):\n", "\n", " words = list(entry.lower().strip() for entry in df[df[filter_col] == filter_match][lookup_col].astype(str))\n", " wordcount = {}\n", " \n", " # Words that will not be counted:\n", " stopwords = [\"i\", \"me\", \"my\", \"myself\", \"we\", \"our\", \"ours\", \"ourselves\", \"you\", \"your\", \"yours\", \n", " \"yourself\", \"yourselves\", \"he\", \"him\", \"his\", \"himself\", \"she\", \"her\", \"hers\", \"herself\", \n", " \"it\", \"its\", \"itself\", \"they\", \"them\", \"their\", \"theirs\", \"themselves\", \"what\", \"which\", \n", " \"who\", \"whom\", \"this\", \"that\", \"these\", \"those\", \"am\", \"is\", \"are\", \"was\", \"were\", \"be\", \n", " \"been\", \"being\", \"have\", \"has\", \"had\", \"having\", \"do\", \"does\", \"did\", \"doing\", \"a\", \"an\", \n", " \"the\", \"and\", \"but\", \"if\", \"or\", \"because\", \"as\", \"until\", \"while\", \"of\", \"at\", \"by\", \"for\", \n", " \"with\", \"about\", \"against\", \"between\", \"into\", \"through\", \"during\", \"before\", \"after\", \"above\", \n", " \"below\", \"to\", \"from\", \"up\", \"down\", \"in\", \"out\", \"on\", \"off\", \"over\", \"under\", \"again\", \"further\", \n", " \"then\", \"once\", \"here\", \"there\", \"when\", \"where\", \"why\", \"how\", \"all\", \"any\", \"both\", \"each\", \n", " \"few\", \"more\", \"most\", \"other\", \"some\", \"such\", \"no\", \"nor\", \"not\", \"only\", \"own\", \"same\", \"so\", \n", " \"than\", \"too\", \"very\", \"s\", \"t\", \"can\", \"will\", \"just\", \"don\", \"should\", \"now\", \"\"]\n", "\n", " symbols_to_replace = [\".\", \",\", \":\", \";\", \"\\\"\", \"/\", \"!\", \"?\", \"“\", \"‘\", \"*\", \")\", \"(\", \"|\", \n", " \">\", \"<\", \"{\", \"}\", \"&\", \"^\", \"~\", \"+\"]\n", " \n", " seperator = ' '\n", " text = seperator.join(words)\n", "\n", " for word in text.split():\n", " for symbol in symbols_to_replace:\n", " word = word.replace(symbol,\"\")\n", "\n", " if word not in stopwords:\n", " if word not in wordcount:\n", " wordcount[word] = 1\n", " else:\n", " wordcount[word] += 1\n", "\n", " top_words = sorted(wordcount.items(), key=itemgetter(1), reverse = True)[0:50]\n", " \n", " return [(filter_match, i[0], i[1]) for i in top_words]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Finding the top 50 words in summaries for each neighbouthood:\n", "summary_words = []\n", "\n", "for neighb in df_listings_clean['neighbourhood_cleansed'].unique():\n", " summary_words += top_word_counter(df_listings_clean, 'neighbourhood_cleansed', neighb, 'summary')\n", "\n", "summary_words = pd.DataFrame(summary_words)\n", "summary_words.columns = ['neighbourhood_cleansed', 'Word', 'Word_Count']" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Cleaning the reviews dataframe before checking for most used words:\n", "reviews_clean = df_reviews_more.drop(['id', 'date', 'reviewer_id', 'reviewer_name'] , axis = 1)\\\n", " .rename(columns = {'listing_id': 'id', 'comments': 'review'})\\\n", " .merge(df_listings_more.loc[:, ['id', 'neighbourhood_cleansed']], on = 'id')\n", "\n", "# Finding the top 50 words in reviews for each neighbouthood:\n", "review_words = []\n", "\n", "for neighb in reviews_clean['neighbourhood_cleansed'].unique():\n", " review_words += top_word_counter(reviews_clean, 'neighbourhood_cleansed', neighb, 'review')\n", "\n", "review_words = pd.DataFrame(review_words)\n", "review_words.columns = ['neighbourhood_cleansed', 'Word', 'Word_Count']" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Exporting new, clean dataframes as csv files that will feed into Tableau:\n", "\n", "df_listings_clean.to_csv('Data/clean_data.csv')\n", "review_words.to_csv('Data/review_words.csv')\n", "summary_words.to_csv('Data/summary_words.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data visualization to continue on Tableau." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }