{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MusicBrainz place geocoder\n", "\n", "To see this analysis live, check out my article [\"Analyzing Last.fm Listening History\"](http://geoffboeing.com/2016/05/analyzing-lastfm-history/)\n", "\n", "This notebook loads a set of artists from MusicBrainz, created by the [musicbrainz_downloader](musicbrainz_downloader.ipynb) notebook. It then takes each artist's place name (i.e., either where they're from or the place they're most associated with, as determined in that notebook) and geocodes it to latitude-longitude coordinates. Finally, it maps the artists.\n", "\n", "Nominatim API documentation: https://wiki.openstreetmap.org/wiki/Nominatim\n", "\n", "Sample Nominatim query: https://nominatim.openstreetmap.org/search?format=json&q=brixton,london,england" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd, numpy as np, matplotlib.pyplot as plt, time, requests\n", "from mpl_toolkits.basemap import Basemap\n", "from geopy.distance import great_circle\n", "\n", "%matplotlib inline\n", "pause = 0.75  # seconds to sleep before each geocoding request" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define geocoding functions\n", "\n", "Using the Nominatim and Google geocoding APIs" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def geocode_nominatim(address):\n", "    # returns a 'lat,lng' string for the top result, or None if there are no results\n", "    time.sleep(pause)\n", "    url = u'https://nominatim.openstreetmap.org/search?format=json&q={}'\n", "    request = url.format(address)\n", "    response = requests.get(request)\n", "    data = response.json()\n", "    if len(data) > 0:\n", "        return '{},{}'.format(data[0]['lat'], data[0]['lon'])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def geocode_google(address):\n", "    # returns a 'lat,lng' string for the top result, or None if there are no results\n", "    time.sleep(pause)\n", "    url = u'http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address={}'\n", "    request = url.format(address)\n",
"    response = requests.get(request)\n", "    data = response.json()\n", "    if len(data['results']) > 0:\n", "        latitude = data['results'][0]['geometry']['location']['lat']\n", "        longitude = data['results'][0]['geometry']['location']['lng']\n", "        return '{},{}'.format(latitude, longitude)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test it" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "address = u\"Brixton, London, England, United Kingdom\"\n", "latlng_google = geocode_google(address)\n", "latlng_nominatim = geocode_nominatim(address)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "51.4612794,-0.1156148 google\n", "51.4568044,-0.1167958 nominatim\n", "0.31 miles apart\n" ] } ], "source": [ "print '{} google'.format(latlng_google)\n", "print '{} nominatim'.format(latlng_nominatim)\n", "print '{} miles apart'.format(round(great_circle(latlng_google, latlng_nominatim).miles, 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run it" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "428 total artists\n", "231 unique places\n" ] } ], "source": [ "artists = pd.read_csv('data/mb.csv', encoding='utf-8')\n", "print '{:,} total artists'.format(len(artists))\n", "\n", "# drop NaNs and get the unique set of places\n", "addresses = pd.Series(artists['place_full'].dropna().sort_values().unique())\n", "print '{:,} unique places'.format(len(addresses))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "24 countries with more detail\n", "217 unique addresses to geocode\n" ] } ], "source": [ "def get_country_if_more_detail(address):\n", "    # returns the last comma-separated token (the country) if the address has more than one token, else None\n", "    tokens = address.split(',')\n",
"    if len(tokens) > 1:\n", "        return tokens[-1].strip()\n", "\n", "# if a place contains only a country name, check whether that country appears with more detail elsewhere in the list of places\n", "# countries_with_more_detail contains all the countries that appear at the end of comma-separated address strings\n", "countries_with_more_detail = pd.Series(addresses.map(get_country_if_more_detail).dropna().sort_values().unique())\n", "print '{:,} countries with more detail'.format(len(countries_with_more_detail))\n", "\n", "# if so, discard the country-name-only instance, since that country is represented elsewhere in the list with finer-grained info\n", "# i.e., keep 'estonia' if there is no 'tallinn, estonia' elsewhere in the list,\n", "# but discard 'russia' if 'moscow, russia' exists elsewhere in the list\n", "addresses_to_geocode = addresses[~addresses.isin(countries_with_more_detail)]\n", "print '{:,} unique addresses to geocode'.format(len(addresses_to_geocode))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210\n" ] } ], "source": [ "# geocode (with nominatim) each retained address (i.e., full place name string)\n", "start_time = time.time()\n", "\n", "latlng_dict = {}\n", "for address, n in zip(addresses_to_geocode, range(len(addresses_to_geocode))):\n", "    if n % 10 == 0: print n,\n", "    latlng_dict[address] = geocode_nominatim(address)\n", "\n", "finish_time = time.time()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nominatim geocoded 217 addresses in 191 seconds\n", "received 208 non-null lat-longs\n" ] } ], "source": [ "print 'nominatim geocoded {:,} addresses in {:,} seconds'.format(len(addresses_to_geocode), int(finish_time-start_time))\n", "print 'received {:,} non-null
lat-longs'.format(len([key for key in latlng_dict if latlng_dict[key] is not None]))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# which addresses failed to geocode successfully?\n", "addresses_to_geocode = [ key for key in latlng_dict if latlng_dict[key] is None ]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n" ] } ], "source": [ "# now geocode (with google) each address that failed with nominatim\n", "if len(addresses_to_geocode) < 2500: # daily google request limit\n", "    start_time = time.time()\n", "    for address, n in zip(addresses_to_geocode, range(len(addresses_to_geocode))):\n", "        if n % 10 == 0: print n,\n", "        latlng_dict[address] = geocode_google(address)\n", "    finish_time = time.time()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "google geocoded 9 addresses in 7 seconds\n", "received 217 non-null lat-longs\n" ] } ], "source": [ "print 'google geocoded {:,} addresses in {:,} seconds'.format(len(addresses_to_geocode), int(finish_time-start_time))\n", "print 'received {:,} non-null lat-longs'.format(len([key for key in latlng_dict if latlng_dict[key] is not None]))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n",
" | name | place_full | place_latlng |\n",
"---|---|---|---|\n",
"306 | Willie Nelson | Abbott, Hill County, Texas, United States | 31.8848809,-97.073336 |\n",
"31 | Linkin Park | Agoura Hills, Los Angeles County, California, ... | 34.1363945,-118.7745347 |\n",
"332 | Sum 41 | Ajax, Ontario, Canada | 43.8492143,-79.0241784 |\n",
"60 | Demi Lovato | Albuquerque, Bernalillo County, New Mexico, Un... | 35.0841034,-106.650985 |\n",
"209 | Jeff Buckley | Anaheim, Orange County, California, United States | 33.8347516,-117.9117319 |\n", "