{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AirBnB Listings by Reputation and Description\n", "by [Talha Oz](http://talhaoz.com)\n", "\n", "## Abstract\n", "\n", "Reviewing is a feedback mechanism that e-commerce sites leverage to help their customers make better-informed purchase decisions on their platforms. Although the biggest online sellers such as Amazon and eBay allow their users to filter search results by seller reputation, the leading space-sharing platform AirBnB lacks this crucial feature. Even more disappointingly, AirBnB does not allow its users to search for keywords within listing contents (descriptions). In this project, I create a demo geo-web application to meet these needs of AirBnB users. The application allows its users i) to filter the listings by review scores for six reputation categories, ii) to search in listing descriptions, and iii) to experience better visualization through a different marker for each listing room type and a clustered-listings view. I demonstrate the application for the Washington, D.C. area using a publicly available AirBnB listings dataset." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "pd.set_option('display.float_format', lambda x: '%.2f' % x)\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "colpal = sns.color_palette(\"hls\", 7)\n", "sns.set(palette=colpal, style='ticks', rc={\"figure.figsize\":(7.75,5),'savefig.dpi':150})\n", "\n", "import folium\n", "from IPython.display import HTML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "AirBnB is one of the greatest success stories of the sharing economy, a website valued at $25.5 billion as of November 2015 [1] where hosts provide lodging spaces and guests rent them. 
Hosts rent three types of spaces on this platform: `entire home`, `private room`, and `shared room`. After their stay, guests leave feedback for the hosts, some of which is public, reflecting the quality of the experience they received. In addition to free-text comments, AirBnB provides six review categories in which guests can rate their experience from zero to five (in increments of 0.5). This richness of feedback types is very valuable, as trust is of great importance in the sharing economy.\n", "\n", "One of the main parts, if not the main part, of a listing on the AirBnB website is the free-text `description` section, where hosts strive to describe their property as attractively as possible. Surprisingly, though, the website currently does not allow searching within it.\n", "\n", "It is unfortunate that guests can see the listing descriptions as well as the review scores, yet cannot narrow their search using this information. In this project I create a geo-web application to overcome this problem. \n", "In this report, I first introduce AirBnB and describe the purpose of my demo geo-web application in this (Introduction) section. In the next (Data) section I provide some characteristics of the dataset on which I built the demo application. The third section covers the Design of the application, discussed under two subsections: Back End and Front End. I then conclude the report with the Conclusion and Discussion section. Tables and code snippets are added to the appendices where necessary. 
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of records: 3723\n", "Number of columns: 91\n", "listing_url, scrape_id, last_scraped, name, summary, space, description, experiences_offered, neighborhood_overview, notes, transit, thumbnail_url, medium_url, picture_url, xl_picture_url, host_id, host_url, host_name, host_since, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, street, neighbourhood, neighbourhood_cleansed, neighbourhood_group_cleansed, city, state, zipcode, market, smart_location, country_code, country, latitude, longitude, is_location_exact, property_type, room_type, accommodates, bathrooms, bedrooms, beds, bed_type, amenities, square_feet, price, weekly_price, monthly_price, security_deposit, cleaning_fee, guests_included, extra_people, minimum_nights, maximum_nights, calendar_updated, has_availability, availability_30, availability_60, availability_90, availability_365, calendar_last_scraped, number_of_reviews, first_review, last_review, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, requires_license, license, jurisdiction_names, instant_bookable, cancellation_policy, require_guest_profile_picture, require_guest_phone_verification, calculated_host_listings_count, reviews_per_month\n" ] } ], "source": [ "# read data in \n", "df = pd.read_csv('data/listings.csv',index_col='id')\n", "print('Number of records:',df.shape[0])\n", "print('Number of columns:',df.shape[1])\n", "print(', '.join(df.columns)) #see the columns 
starting with review_score..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Source\n", "AirBnB (as of December 1, 2015) does not provide a publicly accessible application programming interface (API) for developers to collect information about the listings on its platform. However, enthusiastic hackers have managed to collect the listing data by implementing web scrapers (a search for ‘airbnb data’ on GitHub lists some). \n", "\n", "The dataset (`listings.csv`) used in this study was retrieved from the [insideairbnb.com](http://insideairbnb.com/get-the-data.html) website and is also available in the public repository of this project. The original data source provides the date on which the listings were scraped, which is October 3, 2015 for Washington, D.C.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# pre-process the data\n", "revcols = 'host_id\thost_listings_count\tnumber_of_reviews\treview_scores_rating\treview_scores_accuracy\treview_scores_cleanliness\treview_scores_checkin\treview_scores_communication\treview_scores_location\treview_scores_value'.split('\\t')\n", "# keep only listings with all review fields present, best-rated first\n", "df = df.dropna(subset=revcols).sort_values('review_scores_rating',ascending=False)\n", "# shorten 'review_scores_<x>' column names to '<x>'\n", "df = df.rename(columns=dict(zip(revcols[3:],[c.split('_')[-1] for c in revcols[3:]])))\n", "df = df.rename(columns={'neighbourhood_cleansed':'neighborhood'})\n", "# names of the six review categories (overall rating excluded)\n", "revcols = [c.split('_')[-1] for c in revcols[4:]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Descriptive Statistics\n", "There are a total of 3723 listings in the Washington, D.C. area in the dataset. The very first question one might ask concerns the spatial distribution of these listings. Are people in the Georgetown area more willing to host (list) their properties on AirBnB than those in Foggy Bottom? Which neighborhoods lead in listing count? 
To answer questions of this kind, I created a table (Appendix A) as well as a map (Figure 1) showing the number of listings per neighborhood.\n", "\n", "The AirBnB listings dataset is also attribute-rich: `listings.csv` has 91 columns (Appendix A), including listing id, name, neighborhood, room type, description, latitude, longitude, host id, host listings count, number of reviews, and review scores. In addition to the overall review score, each reviewed listing has scores for six review categories: accuracy, check-in, cleanliness, communication, location, and value. One might then wonder what the average review score for each category is. Across all six categories, I found (Figure 2) that guest satisfaction is generally very high; value and cleanliness are the lowest two with 9.32 and 9.33 respectively, and communication is the highest with 9.75 (of course, one should not infer from these results that the hosts in the capital are good communicators but dirty, much like the main theme of the city, politics itself). I should note that only 2846 of the 3723 listings have been reviewed at least once.\n", "\n", "For data management operations, Python’s Pandas library [7] is used, and Folium [4] and Seaborn [8] are used for visualization." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
|  | neighborhood | rating |
|---|---|---|
| 0 | Columbia Heights, Mt. Pleasant, Pleasant Plain... | 351 |
| 1 | Dupont Circle, Connecticut Avenue/K Street | 285 |
| 2 | Capitol Hill, Lincoln Park | 242 |
| 3 | Shaw, Logan Circle | 239 |
| 4 | Union Station, Stanton Park, Kingman Park | 234 |
| 5 | Edgewood, Bloomingdale, Truxton Circle, Eckington | 194 |
| 6 | Kalorama Heights, Adams Morgan, Lanier Heights | 183 |
| 7 | Brightwood Park, Crestwood, Petworth | 140 |
| 8 | Downtown, Chinatown, Penn Quarters, Mount Vern... | 135 |
| 9 | Howard University, Le Droit Park, Cardozo/Shaw | 114 |
| 10 | West End, Foggy Bottom, GWU | 95 |
| 11 | Georgetown, Burleith/Hillandale | 87 |
| 12 | Southwest Employment Area, Southwest/Waterfron... | 63 |
| 13 | Ivy City, Arboretum, Trinidad, Carver Langston | 52 |
| 14 | Takoma, Brightwood, Manor Park | 47 |
| 15 | Brookland, Brentwood, Langdon | 44 |
| 16 | Cathedral Heights, McLean Gardens, Glover Park | 43 |
| 17 | Cleveland Park, Woodley Park, Massachusetts Av... | 41 |
| 18 | Spring Valley, Palisades, Wesley Heights, Foxh... | 28 |
| 19 | North Michigan Park, Michigan Park, University... | 26 |
| 20 | Historic Anacostia | 24 |
| 21 | Friendship Heights, American University Park, ... | 23 |
| 22 | North Cleveland Park, Forest Hills, Van Ness | 22 |
| 23 | Twining, Fairlawn, Randle Highlands, Penn Bran... | 18 |
| 24 | Colonial Village, Shepherd Park, North Portal ... | 16 |
| 25 | Hawthorne, Barnaby Woods, Chevy Chase | 15 |
| 26 | Woodridge, Fort Lincoln, Gateway | 12 |
| 27 | Near Southeast, Navy Yard | 12 |
| 28 | Lamont Riggs, Queens Chapel, Fort Totten, Plea... | 9 |
| 29 | Capitol View, Marshall Heights, Benning Heights | 7 |
| 30 | Sheridan, Barry Farm, Buena Vista | 7 |
| 31 | River Terrace, Benning, Greenway, Dupont Park | 6 |
| 32 | Douglas, Shipley Terrace | 6 |
| 33 | Mayfair, Hillbrook, Mahaning Heights | 6 |
| 34 | Eastland Gardens, Kenilworth | 5 |
| 35 | Congress Heights, Bellevue, Washington Highlands | 5 |
| 36 | Fairfax Village, Naylor Gardens, Hillcrest, Su... | 4 |
| 37 | Deanwood, Burrville, Grant Park, Lincoln Heigh... | 3 |
| 38 | Woodland/Fort Stanton, Garfield Heights, Knox ... | 3 |