{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "The purpose of this five-part Ipython Notebook series is to gather, clean, visaulize, and model data for my final project in General Assembly's data science course. \n", "\n", "Using listings on craigslist, the below code is a set of functions which programatically specifies markets of interest and extracts listing data. All defined functions are used in the main class near the bottom of the notebook - which runs all functions to collect the data. \n", "\n", "The project data consists of listing data - text descriptions, geolocation, image attributes, and listiing metadata - for rentals in the North Virginia, Maryland, and Washington DC." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import requests # Helps construct the request to send to the API\n", "import json # JSON helper functions\n", "from bs4 import BeautifulSoup #Data Scraping Library\n", "import pandas as pd\n", "import time" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Specify markets to search for housing ads. \n", "# Each market has 2 levels - e.g [washingtondc, nva] for listings in North VA would create the start of the \n", "# following URL: http://washingtondc.craigslist.org/search/nva/\n", "\n", "def define_markets_to_search():\n", " \n", " #To get listings for new locations, add the new location in the markets list\n", " markets = [['washingtondc','nva'],['washingtondc','mld'],['washingtondc','doc']]\n", " return markets" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Determines how many pages of listings exist in each market - listings are split into pages with 100 listings per page\n", "# A market (e.g. 
{ "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Count how many listings exist in a market - results are split into pages of 100 listings each,\n",
"# so a market (e.g. Northern Virginia) with 2500 listings spans 25 pages of results\n",
"\n",
"def count_number_of_listings(market):\n",
"\n",
"    #create the search url for the specified market and subregion\n",
"    url = 'http://' + market[0] + '.craigslist.org/search/' + market[1] + '/apa' #apa == apartment\n",
"\n",
"    # Make the request\n",
"    response = requests.get(url)\n",
"\n",
"    #place data in a Beautiful Soup object\n",
"    soup = BeautifulSoup(response.text)\n",
"\n",
"    #the totalcount span holds the total number of listings, which determines how many pages to walk\n",
"    total_listings = int(soup.find_all('span', class_='totalcount')[0].text)\n",
"    return total_listings" ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Get a list of all listing ids - a market with 100 listings would return 100 distinct ids\n",
"\n",
"def get_listing_url_list(market, total_listings):\n",
"    listing_ids = [] #collect the unique listing ids, which make each listing URL distinct\n",
"\n",
"    for page in xrange(0, total_listings, 100):\n",
"\n",
"        #create the url for each page of results in the specified market\n",
"        if page == 0:\n",
"            url = 'http://' + market[0] + '.craigslist.org/search/' + market[1] + '/apa'\n",
"        else:\n",
"            #every page of listings after the first appends an offset to the end of the URL\n",
"            url = 'http://' + market[0] + '.craigslist.org/search/' + market[1] + '/apa?s=' + str(page) + '&'\n",
"\n",
"        # delay the script each time it gets a new page of listing urls (100 per page) to avoid burdening the server\n",
"        time.sleep(0.5) #seconds\n",
"\n",
"        # Make the request\n",
"        response = requests.get(url)\n",
"\n",
"        #place data in a Beautiful Soup object\n",
"        soup = BeautifulSoup(response.text)\n",
"\n",
"        #collect the id of each listing on the page\n",
"        data_pid = soup.find_all('p', class_='row')\n",
"        for listing_id in data_pid:\n",
"            listing_ids.append(listing_id['data-pid'])\n",
"\n",
"        print page, url #prints the page offset and URL for each page of listings\n",
"\n",
"    return listing_ids[0:5] #currently limited to five listings for testing - remove the slicing to collect full results" ] },
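{ "cell_type": "markdown", "metadata": {}, "source": [
"The pagination above steps through result offsets in increments of 100, one per page of listings. A minimal sketch of the offsets generated for a hypothetical market with 2,500 listings:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Sketch: page offsets generated for a hypothetical market with 2500 listings\n",
"# xrange(0, total_listings, 100) yields 0, 100, ..., 2400 - one offset per page\n",
"total_listings = 2500\n",
"offsets = list(xrange(0, total_listings, 100))\n",
"print offsets[0:3], '...', offsets[-1]\n",
"print len(offsets), 'pages'" ] },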
\n", "# This code is rerun in a for loop to create a distinct url and get the data for each listing of interest\n", "\n", "def get_craigslist_listing(id, market):\n", " url = 'http://' + market[0] + '.craigslist.org/' + market[1] + '/apa/' + id + '.html'\n", " \n", " # Make the request\n", " response = requests.post(url)\n", "\n", " # Confirm the response worked\n", " response.status_code\n", " \n", " #place data in Beautiful soup object\n", " soup = BeautifulSoup(response.text)\n", "\n", " return soup, url" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Craigslist example listing page\n", "\n", "from IPython.display import HTML\n", "HTML('')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Get the property attributes from the beautiful soup object for each listing, including # of bedrooms, bathrooms, and\n", "# square feet; housing type, pets allowed, laundry availability, parking access, smoking permissions, and availability\n", "\n", "def get_property_attributes(property):\n", "\n", " #create empty property dict to collect property attributes\n", " attribute_dict = {}\n", "\n", " try:\n", " attributes_data = property.find('p',class_='attrgroup').find_all('span')\n", "\n", " for attribute in attributes_data:\n", " if 'BR' in attribute.text:\n", " attribute_dict['bedroom'] = attribute.text.split('/')[0].replace('BR','') #only keep the number\n", " if 'Ba' in attribute.text:\n", " attribute_dict['bathroom'] = attribute.text.split('/')[1].replace('Ba','')#only keep the number\n", " if 'ft' in attribute.text:\n", " attribute_dict['square_footage'] = attribute.text.replace('ft2','') #only keep the number\n", " if attribute.text in ['apartment', 'condo', 'cottage/cabin', 'duplex', 'flat', \n", " 'house', 'in-law', 'loft', 'townhouse','manufactured', 'assisted living', 'land']:\n", " attribute_dict['housing_type'] = attribute.text\n", " if 'cat' in attribute.text:\n", " attribute_dict['cat'] = attribute.text\n", " if 'dog' in attribute.text:\n", " attribute_dict['dog'] = attribute.text\n", " if attribute.text in ['w/d in unit','laundry in bldg','laundry on site','w/d hookups']:\n", " attribute_dict['laundry'] = attribute.text \n", " if attribute.text in ['carport', 'attached garage', 'detached garage', 'off-street parking', 'street parking', 'valet parking']:\n", " attribute_dict['parking'] = attribute.text\n", " if 'smoking' in attribute.text:\n", " attribute_dict['smoking'] = attribute.text\n", "\n", " attribute_dict['availability'] = property.find('span', class_='housing_movein_now property_date')['today_msg']\n", " attribute_dict['date_available'] = property.find('span', class_='housing_movein_now property_date')['date']\n", "\n", " return attribute_dict\n", " except: pass" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Get the listed price - listings without pricing data are not relevant and will be discarded in later code\n", "\n", "def get_property_price(property):\n", "\n", " price = property.find('h2')\n", " price = price.contents[3].text\n", " price = price.split()\n", " return price[0]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Get the date and time of the posting (e.g. 
{ "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Get the listed price - listings without pricing data are not relevant and are discarded in later code\n",
"\n",
"def get_property_price(property):\n",
"\n",
"    #the price appears in the listing title (h2) element\n",
"    price = property.find('h2')\n",
"    price = price.contents[3].text\n",
"    price = price.split()\n",
"    return price[0]" ] },
{ "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Get the date and time of the posting (e.g. 2015-01-01 12:00pm)\n",
"\n",
"def get_posting_date_and_time(property):\n",
"\n",
"    posting_time = property.find('p', class_='postinginfo').find('time').text\n",
"    return posting_time" ] },
{ "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Get the user-created text description of the listing\n",
"\n",
"def get_property_description(property):\n",
"\n",
"    listing_description = property.find('section', class_='userbody').find('section')\n",
"    return listing_description.text" ] },
{ "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [
"#Get image attributes, such as the number of images and the average image size\n",
"\n",
"def get_image_data(property):\n",
"\n",
"    #track the number of images and the average image size (pixel area)\n",
"    average_image_size = 0\n",
"    image_size_sum = 0\n",
"    image_number = 0\n",
"\n",
"    try:\n",
"        images = property.find('figure').find_all('div')\n",
"        for pic in images[-1]:\n",
"            image_number = int(pic['title']) #image titles are numbered, so the last (max) title is the image count\n",
"            image_size = pic['href'].split('_')[-1].split('.')[0].split('x') #each href ends in a WIDTHxHEIGHT token\n",
"            image_size_sum += int(image_size[0]) * int(image_size[1]) #sum of image areas (width * height)\n",
"    except: pass\n",
"\n",
"    try:\n",
"        average_image_size = image_size_sum / image_number\n",
"    except: pass #listings without images would divide by zero - keep the default of 0\n",
"\n",
"    return image_number, average_image_size" ] },
{ "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Get location data, including latitude, longitude, country, city, state, & the googlemaps metric: location_data_accuracy\n",
"\n",
"def get_property_location(property):\n",
"\n",
"    #place all location data into a dict\n",
"    location_dict = {}\n",
"\n",
"    #get longitude, latitude, and the location accuracy metric, when present\n",
"    location = property.find('div', class_='viewposting')\n",
"    if location is not None:\n",
"        for key, attr in [('location_data_accuracy', 'data-accuracy'),\n",
"                          ('latitude', 'data-latitude'),\n",
"                          ('longitude', 'data-longitude')]:\n",
"            try:\n",
"                location_dict[key] = location[attr]\n",
"            except KeyError: pass\n",
"\n",
"    #get country, state, and city\n",
"    try:\n",
"        #use the yahoo maps link - easier to extract data from than the google maps link\n",
"        map_link = property.find('p', class_='mapaddress').find_all('a')[1]['href']\n",
"        location_dict['country'] = map_link.split('country=')[1]\n",
"        location_dict['state'] = map_link.split('csz=')[1].split('&')[0].split('+')[1]\n",
"        location_dict['city'] = map_link.split('csz=')[1].split('&')[0].split('+')[0]\n",
"    except: pass\n",
"\n",
"    #get the street address, when listed\n",
"    try:\n",
"        location_dict['address'] = property.find('div', class_='mapaddress').text\n",
"    except: pass\n",
"\n",
"    return location_dict" ] },
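{ "cell_type": "markdown", "metadata": {}, "source": [
"Two quick checks on the parsing logic above, using made-up inputs of the same shape the scrapers expect. First, the image-size arithmetic: each image href ends in a WIDTHxHEIGHT token, and the average is total pixel area divided by the image count." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Sketch: average image area from the WIDTHxHEIGHT token at the end of each href\n",
"# the hrefs below are made up for illustration\n",
"hrefs = ['images/listing_600x450.jpg', 'images/listing_300x225.jpg']\n",
"image_size_sum = 0\n",
"for href in hrefs:\n",
"    width, height = href.split('_')[-1].split('.')[0].split('x')\n",
"    image_size_sum += int(width) * int(height)\n",
"print image_size_sum / len(hrefs) #average pixel area per image" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [
"Second, the city/state/country extraction in `get_property_location`, which splits pieces out of the yahoo maps href. The href below is invented to mirror the shape the parsing code expects:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Sketch: splitting city, state, and country out of a yahoo-maps-style href\n",
"# the href below is invented to mirror the shape the parsing code expects\n",
"href = 'http://maps.yahoo.com/?csz=Alexandria+VA&country=US'\n",
"print href.split('country=')[1] #US\n",
"print href.split('csz=')[1].split('&')[0].split('+')[1] #VA\n",
"print href.split('csz=')[1].split('&')[0].split('+')[0] #Alexandria" ] },
{ "cell_type": "code", "execution_count": 24, "metadata": 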
{ "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 http://washingtondc.craigslist.org/search/nva/apa\n", "100 http://washingtondc.craigslist.org/search/nva/apa?s=100&\n", "200 http://washingtondc.craigslist.org/search/nva/apa?s=200&\n", "300 http://washingtondc.craigslist.org/search/nva/apa?s=300&\n", "400 http://washingtondc.craigslist.org/search/nva/apa?s=400&\n", "500 http://washingtondc.craigslist.org/search/nva/apa?s=500&\n", "600 http://washingtondc.craigslist.org/search/nva/apa?s=600&\n", "700 http://washingtondc.craigslist.org/search/nva/apa?s=700&\n", "800 http://washingtondc.craigslist.org/search/nva/apa?s=800&\n", "900 http://washingtondc.craigslist.org/search/nva/apa?s=900&\n", "1000 http://washingtondc.craigslist.org/search/nva/apa?s=1000&\n", "1100 http://washingtondc.craigslist.org/search/nva/apa?s=1100&\n", "1200 http://washingtondc.craigslist.org/search/nva/apa?s=1200&\n", "1300 http://washingtondc.craigslist.org/search/nva/apa?s=1300&\n", "1400 http://washingtondc.craigslist.org/search/nva/apa?s=1400&\n", "1500 http://washingtondc.craigslist.org/search/nva/apa?s=1500&\n", "1600 http://washingtondc.craigslist.org/search/nva/apa?s=1600&\n", "1700 http://washingtondc.craigslist.org/search/nva/apa?s=1700&\n", "1800 http://washingtondc.craigslist.org/search/nva/apa?s=1800&\n", "1900 http://washingtondc.craigslist.org/search/nva/apa?s=1900&\n", "2000 http://washingtondc.craigslist.org/search/nva/apa?s=2000&\n", "2100 http://washingtondc.craigslist.org/search/nva/apa?s=2100&\n", "2200 http://washingtondc.craigslist.org/search/nva/apa?s=2200&\n", "2300 http://washingtondc.craigslist.org/search/nva/apa?s=2300&\n", "2400 http://washingtondc.craigslist.org/search/nva/apa?s=2400&\n", "0 http://washingtondc.craigslist.org/search/mld/apa\n", "100 http://washingtondc.craigslist.org/search/mld/apa?s=100&\n", "200 http://washingtondc.craigslist.org/search/mld/apa?s=200&\n", "300 http://washingtondc.craigslist.org/search/mld/apa?s=300&\n", "400 http://washingtondc.craigslist.org/search/mld/apa?s=400&\n", "500 http://washingtondc.craigslist.org/search/mld/apa?s=500&\n", "600 http://washingtondc.craigslist.org/search/mld/apa?s=600&\n", "700 http://washingtondc.craigslist.org/search/mld/apa?s=700&\n", "800 http://washingtondc.craigslist.org/search/mld/apa?s=800&\n", "900 http://washingtondc.craigslist.org/search/mld/apa?s=900&\n", "1000 http://washingtondc.craigslist.org/search/mld/apa?s=1000&\n", "1100 http://washingtondc.craigslist.org/search/mld/apa?s=1100&\n", "1200 http://washingtondc.craigslist.org/search/mld/apa?s=1200&\n", "1300 http://washingtondc.craigslist.org/search/mld/apa?s=1300&\n", "1400 http://washingtondc.craigslist.org/search/mld/apa?s=1400&\n", "1500 http://washingtondc.craigslist.org/search/mld/apa?s=1500&\n", "1600 http://washingtondc.craigslist.org/search/mld/apa?s=1600&\n", "1700 http://washingtondc.craigslist.org/search/mld/apa?s=1700&\n", "1800 http://washingtondc.craigslist.org/search/mld/apa?s=1800&\n", "1900 http://washingtondc.craigslist.org/search/mld/apa?s=1900&\n", "2000 http://washingtondc.craigslist.org/search/mld/apa?s=2000&\n", "2100 http://washingtondc.craigslist.org/search/mld/apa?s=2100&\n", "2200 http://washingtondc.craigslist.org/search/mld/apa?s=2200&\n", "2300 http://washingtondc.craigslist.org/search/mld/apa?s=2300&\n", "2400 http://washingtondc.craigslist.org/search/mld/apa?s=2400&\n", "0 http://washingtondc.craigslist.org/search/doc/apa\n", "100 
http://washingtondc.craigslist.org/search/doc/apa?s=100&\n",
"200 http://washingtondc.craigslist.org/search/doc/apa?s=200&\n",
"300 http://washingtondc.craigslist.org/search/doc/apa?s=300&\n",
"400 http://washingtondc.craigslist.org/search/doc/apa?s=400&\n",
"500 http://washingtondc.craigslist.org/search/doc/apa?s=500&\n",
"600 http://washingtondc.craigslist.org/search/doc/apa?s=600&\n",
"700 http://washingtondc.craigslist.org/search/doc/apa?s=700&\n",
"800 http://washingtondc.craigslist.org/search/doc/apa?s=800&\n",
"900 http://washingtondc.craigslist.org/search/doc/apa?s=900&\n",
"1000 http://washingtondc.craigslist.org/search/doc/apa?s=1000&\n",
"1100 http://washingtondc.craigslist.org/search/doc/apa?s=1100&\n",
"1200 http://washingtondc.craigslist.org/search/doc/apa?s=1200&\n",
"1300 http://washingtondc.craigslist.org/search/doc/apa?s=1300&\n",
"1400 http://washingtondc.craigslist.org/search/doc/apa?s=1400&\n",
"1500 http://washingtondc.craigslist.org/search/doc/apa?s=1500&\n",
"1600 http://washingtondc.craigslist.org/search/doc/apa?s=1600&\n",
"1700 http://washingtondc.craigslist.org/search/doc/apa?s=1700&\n",
"1800 http://washingtondc.craigslist.org/search/doc/apa?s=1800&\n",
"1900 http://washingtondc.craigslist.org/search/doc/apa?s=1900&\n",
"2000 http://washingtondc.craigslist.org/search/doc/apa?s=2000&\n",
"2100 http://washingtondc.craigslist.org/search/doc/apa?s=2100&\n",
"2200 http://washingtondc.craigslist.org/search/doc/apa?s=2200&\n",
"2300 http://washingtondc.craigslist.org/search/doc/apa?s=2300&\n",
"2400 http://washingtondc.craigslist.org/search/doc/apa?s=2400&\n" ] } ], "source": [
"# Main loop - iterate through all the listings in listing_ids and extract features\n",
"\n",
"#initialize lists to collect attribute and location data\n",
"property_attributes_data = []\n",
"property_location_data = []\n",
"\n",
"markets = define_markets_to_search() #Define which markets to search (e.g. Washington DC)\n",
"\n",
"for market in markets:\n",
"    total_listings = count_number_of_listings(market) #Count the number of listings in the specified market\n",
"    listing_ids = get_listing_url_list(market, total_listings) #Get a list of all the individual listings to search\n",
"\n",
"    for id in listing_ids:\n",
"\n",
"        time.sleep(0.5) # delay the script each time it gets a new listing\n",
"        try:\n",
"            #get the listing HTML and the listing URL\n",
"            property, url = get_craigslist_listing(id, market)\n",
"        except: continue #if the request fails, skip to the next listing rather than reusing stale data\n",
"\n",
"        try:\n",
"            #create the initial dict of property attributes\n",
"            property_attributes = get_property_attributes(property)\n",
"            property_attributes['url'] = url\n",
"\n",
"            #add price, description, posting time, and image data to the dict\n",
"            property_attributes['price'] = get_property_price(property)\n",
"            property_attributes['description'] = get_property_description(property).encode('utf-8')\n",
"            property_attributes['time_of_posting'] = get_posting_date_and_time(property)\n",
"\n",
"            image_number, average_image_size = get_image_data(property)\n",
"            property_attributes['image_number'] = image_number\n",
"            property_attributes['average_image_size'] = average_image_size\n",
"\n",
"            #add the property attributes to the property_attributes_data list\n",
"            property_attributes = pd.Series(property_attributes, name=id)\n",
"            property_attributes_data.append(property_attributes)\n",
"        except: pass #skip listings that are missing required attributes\n",
"\n",
"        try:\n",
"            #add the location data to the location data list\n",
"            location = get_property_location(property)\n",
"            location = pd.Series(location, name=id)\n",
"            property_location_data.append(location)\n",
"        except: pass\n",
"\n",
"#put the lists into DataFrames\n",
"property_attributes_dataframe = pd.DataFrame(property_attributes_data)\n",
"property_location_dataframe = pd.DataFrame(property_location_data)" ] },
{ "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Merge the property attributes and location dataframes on their shared listing-id index\n",
"\n",
"dat = pd.merge(property_location_dataframe, property_attributes_dataframe, left_index=True, right_index=True)" ] },
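{ "cell_type": "markdown", "metadata": {}, "source": [
"Because `left_index` and `right_index` are set, `pd.merge` joins the two frames on their shared listing-id index, and with the default inner join any id missing from either frame drops out. A minimal sketch with toy frames (toy values, not scraped data):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [
"# Sketch: an index-on-index merge keeps only the ids present in both frames\n",
"# (an inner join) - the toy frames below are illustrative\n",
"left = pd.DataFrame({'latitude': ['38.8', '38.9']}, index=['id1', 'id2'])\n",
"right = pd.DataFrame({'price': ['$1310', '$860']}, index=['id2', 'id3'])\n",
"print pd.merge(left, right, left_index=True, right_index=True) #only id2 survives" ] },
{ "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "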
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
citycountrylatitudelocation_data_accuracylongitudestateavailabilityaverage_image_sizebathroombedroom...doghousing_typeimage_numberlaundryparkingpricesmokingsquare_footagetime_of_postingurl
4959351766NaNNaNNaNNaNNaNNaNavailable now27000012...NaNcondo18NaNNaN$1310NaNNaN2015-04-01 2:44pmhttp://washingtondc.craigslist.org/mld/apa/495...
4959370650AlexandriaUS38.80600022-77.052900DCavailable now0NaN0...dogs are OK - wooofhouse0NaNattached garage$860NaNNaN2015-04-01 2:55pmhttp://washingtondc.craigslist.org/mld/apa/495...
\n", "

2 rows × 23 columns

\n", "
" ], "text/plain": [ " city country latitude location_data_accuracy longitude \\\n", "4959351766 NaN NaN NaN NaN NaN \n", "4959370650 Alexandria US 38.806000 22 -77.052900 \n", "\n", " state availability average_image_size bathroom bedroom \\\n", "4959351766 NaN available now 270000 1 2 \n", "4959370650 DC available now 0 NaN 0 \n", "\n", " ... \\\n", "4959351766 ... \n", "4959370650 ... \n", "\n", " dog housing_type image_number laundry \\\n", "4959351766 NaN condo 18 NaN \n", "4959370650 dogs are OK - wooof house 0 NaN \n", "\n", " parking price smoking square_footage time_of_posting \\\n", "4959351766 NaN $1310 NaN NaN 2015-04-01 2:44pm \n", "4959370650 attached garage $860 NaN NaN 2015-04-01 2:55pm \n", "\n", " url \n", "4959351766 http://washingtondc.craigslist.org/mld/apa/495... \n", "4959370650 http://washingtondc.craigslist.org/mld/apa/495... \n", "\n", "[2 rows x 23 columns]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check results before converting to a csv\n", "len(dat)\n", "dat[0:2]" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create a csv with a utf-8 encoding\n", "\n", "dat.to_csv(r'C:\\Users\\alsherman\\Desktop\\GitHub\\DataScience_GeneralAssembly\\Data\\Craigslist_Data_May_3_.csv', encoding='utf-8')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.8" } }, "nbformat": 4, "nbformat_minor": 0 }