Data Science

APIs, JSON, NoSQL Databases

Alessandro Gagliardi
Sr. Data Scientist, Glassdoor.com

Last Time

  • Relational Databases
    • Normalization
    • Natural vs. Artificial Keys
    • Star Schema
    • JOIN
    • GROUP BY
    • MAX
  • Pandas
    • .head()
    • .join()
    • .merge()
    • .hist()
    • .plot()
    • .mean()
    • .map()
    • .groupby()
    • .count()
    • .resample()

Questions?

Agenda

  1. APIs & JSON
  2. NoSQL Databases
  3. Lab: Twitter & MongoDB

1. APIs & JSON

Speaking broadly:

An application programming interface (API) specifies how some software components should interact with each other.

More specifically:

A web API is a programmatic interface to a defined request-response message system, typically expressed in JSON or XML, which is exposed via the web—most commonly by means of an HTTP-based web server.

from Wikipedia

Web APIs allow people to interact with the structures of an application to:

  • get
  • put
  • delete
  • update

data

Best practice for web APIs is to follow RESTful principles.

Think of some web services you might like to get data from. Perhaps they have APIs?

REST = REpresentational State Transfer

REST vs. SQL

GET ( ~ SELECT)
POST ( ~ INSERT)
PUT ( ~ UPDATE, or upsert)
DELETE ( ~ DELETE)

RESTful web API HTTP methods

Collection URI, such as http://example.com/resources

  • GET: List the URIs and perhaps other details of the collection's members.
  • PUT: Replace the entire collection with another collection.
  • POST: Create a new entry in the collection. The new entry's URI is assigned automatically and is usually returned by the operation.
  • DELETE: Delete the entire collection.

Element URI, such as http://example.com/resources/item17

  • GET: Retrieve a representation of the addressed member of the collection, expressed in an appropriate Internet media type.
  • PUT: Replace the addressed member of the collection, or if it doesn't exist, create it.
  • POST: Not generally used. Treat the addressed member as a collection in its own right and create a new entry in it.
  • DELETE: Delete the addressed member of the collection.
From http://en.wikipedia.org/wiki/Representational_state_transfer
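
The collection/element semantics above can be sketched with a toy in-memory store. This is purely illustrative code (all names here are made up); a real API sits behind an HTTP server, but the method-to-operation mapping is the same:

```python
# Toy in-memory "collection" illustrating REST method semantics.
class Collection(object):
    def __init__(self):
        self._items = {}
        self._next_id = 1

    def get(self, item_id=None):          # GET ~ SELECT
        if item_id is None:
            return list(self._items.keys())
        return self._items[item_id]

    def post(self, item):                 # POST ~ INSERT (server assigns the id/URI)
        item_id = self._next_id
        self._next_id += 1
        self._items[item_id] = item
        return item_id

    def put(self, item_id, item):         # PUT ~ UPDATE (or create: "upsert")
        self._items[item_id] = item

    def delete(self, item_id):            # DELETE ~ DELETE
        del self._items[item_id]

resources = Collection()
new_id = resources.post({'name': 'item17'})          # create; id returned
resources.put(new_id, {'name': 'item17', 'tag': 'demo'})  # replace that element
```

Note how POST lets the server assign the identifier, while PUT addresses an identifier the client already knows, which is why PUT can double as create.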

HTTP requests can be handled easily using Python's requests library.

First we will load our credentials which we keep in a YAML file for safe keeping.
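
The credentials file might look something like this; the key names (USER, PASS, etc.) follow from how the file is used below, but the values shown here are placeholders:

```yaml
# api_cred.yml -- example layout (values are placeholders)
USER: your_github_username
PASS: your_github_password
API_KEY: your_twitter_api_key
API_SECRET: your_twitter_api_secret
ACCESS_TOKEN: your_twitter_access_token
ACCESS_TOKEN_SECRET: your_twitter_access_token_secret
```

Keeping credentials in a separate file like this (and out of version control) means you can share the notebook without sharing your secrets.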

In [1]:
import yaml
credentials = yaml.safe_load(open('/Users/alessandro.gagliardi/api_cred.yml'))

Then we pass those credentials in to a GET request using the requests library. In this case, I am querying my own user data from Github:

In [2]:
import requests
r = requests.get('https://api.github.com/user', 
                 auth=(credentials['USER'], credentials['PASS']))

Requests gives us an object from which we can read its content.

In [3]:
r.content
Out[3]:
'{"login":"eklypse","id":896607,"avatar_url":"https://avatars.githubusercontent.com/u/896607?","gravatar_id":"42c577edc388cc9d1050927da89d47cc","url":"https://api.github.com/users/eklypse","html_url":"https://github.com/eklypse","followers_url":"https://api.github.com/users/eklypse/followers","following_url":"https://api.github.com/users/eklypse/following{/other_user}","gists_url":"https://api.github.com/users/eklypse/gists{/gist_id}","starred_url":"https://api.github.com/users/eklypse/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/eklypse/subscriptions","organizations_url":"https://api.github.com/users/eklypse/orgs","repos_url":"https://api.github.com/users/eklypse/repos","events_url":"https://api.github.com/users/eklypse/events{/privacy}","received_events_url":"https://api.github.com/users/eklypse/received_events","type":"User","site_admin":false,"name":"Alessandro D. Gagliardi","company":"Glassdoor.com","blog":"twitter.com/MadDataScience","location":"San Francisco","email":null,"hireable":false,"bio":null,"public_repos":7,"public_gists":1,"followers":6,"following":21,"created_at":"2011-07-05T20:17:04Z","updated_at":"2014-04-17T04:29:30Z","private_gists":1,"total_private_repos":2,"owned_private_repos":0,"disk_usage":88449,"collaborators":0,"plan":{"name":"free","space":307200,"collaborators":0,"private_repos":0}}'

One of the reasons we like JSON is that it is easy to transform into a Python dict object using the json library:

In [4]:
import json
user = json.loads(r.content)
user
Out[4]:
{u'avatar_url': u'https://avatars.githubusercontent.com/u/896607?',
 u'bio': None,
 u'blog': u'twitter.com/MadDataScience',
 u'collaborators': 0,
 u'company': u'Glassdoor.com',
 u'created_at': u'2011-07-05T20:17:04Z',
 u'disk_usage': 88449,
 u'email': None,
 u'events_url': u'https://api.github.com/users/eklypse/events{/privacy}',
 u'followers': 6,
 u'followers_url': u'https://api.github.com/users/eklypse/followers',
 u'following': 21,
 u'following_url': u'https://api.github.com/users/eklypse/following{/other_user}',
 u'gists_url': u'https://api.github.com/users/eklypse/gists{/gist_id}',
 u'gravatar_id': u'42c577edc388cc9d1050927da89d47cc',
 u'hireable': False,
 u'html_url': u'https://github.com/eklypse',
 u'id': 896607,
 u'location': u'San Francisco',
 u'login': u'eklypse',
 u'name': u'Alessandro D. Gagliardi',
 u'organizations_url': u'https://api.github.com/users/eklypse/orgs',
 u'owned_private_repos': 0,
 u'plan': {u'collaborators': 0,
  u'name': u'free',
  u'private_repos': 0,
  u'space': 307200},
 u'private_gists': 1,
 u'public_gists': 1,
 u'public_repos': 7,
 u'received_events_url': u'https://api.github.com/users/eklypse/received_events',
 u'repos_url': u'https://api.github.com/users/eklypse/repos',
 u'site_admin': False,
 u'starred_url': u'https://api.github.com/users/eklypse/starred{/owner}{/repo}',
 u'subscriptions_url': u'https://api.github.com/users/eklypse/subscriptions',
 u'total_private_repos': 2,
 u'type': u'User',
 u'updated_at': u'2014-04-17T04:29:30Z',
 u'url': u'https://api.github.com/users/eklypse'}
In [5]:
print user.keys()
[u'disk_usage', u'private_gists', u'public_repos', u'site_admin', u'subscriptions_url', u'gravatar_id', u'hireable', u'id', u'followers_url', u'following_url', u'collaborators', u'total_private_repos', u'blog', u'followers', u'location', u'type', u'email', u'bio', u'gists_url', u'owned_private_repos', u'company', u'events_url', u'html_url', u'updated_at', u'plan', u'received_events_url', u'starred_url', u'public_gists', u'name', u'organizations_url', u'url', u'created_at', u'avatar_url', u'repos_url', u'following', u'login']
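
json.loads and its inverse json.dumps work on any JSON-like structure. Here is a small standalone example (the string is made up, mirroring the shape of the GitHub response above):

```python
import json

# A small JSON string echoing the structure of the GitHub response
raw = '{"login": "eklypse", "plan": {"name": "free", "space": 307200}}'

user = json.loads(raw)           # JSON text -> Python dict
space = user['plan']['space']    # nested JSON objects become nested dicts

round_trip = json.loads(json.dumps(user))  # dict -> text -> dict, unchanged
```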

We can access values in this dict directly (such as my hireable status) and even render the URL of my avatar:

In [6]:
from IPython.display import Image
print "Hireable: {}".format(user.get('hireable'))
Image(url=user.get('avatar_url'))
Hireable: False

Out[6]:

Twitter API

Twitter has no fewer than 10 Python libraries. We'll be using Python Twitter Tools because it's what's used in Mining the Social Web.

Some services (like Twitter) have official or community-maintained Python libraries that make using their API even easier.

In [7]:
import twitter

auth = twitter.oauth.OAuth(credentials['ACCESS_TOKEN'], 
                           credentials['ACCESS_TOKEN_SECRET'],
                           credentials['API_KEY'],
                           credentials['API_SECRET'])

twitter_api = twitter.Twitter(auth=auth)

print twitter_api
<twitter.api.Twitter object at 0x1062ec810>

Using a library like this, we don't even need to specify the URL (that's handled internally), and it's easy to do something like search for tweets mentioning #bigdata.

The results are transformed into a Python object (which in this case is a thin wrapper around a dict).

In [8]:
bigdata = twitter_api.search.tweets(q='#bigdata', count=5)
type(bigdata)
Out[8]:
twitter.api.TwitterDictResponse
In [9]:
for status in bigdata['statuses']:
    print status.get('text')
RT @mapr: Gear up for today's @ApacheMahout meetup! Hosted by Intuit in Mountain View @ 6pm. http://t.co/wM0NwPjtdZ #machinelearning #hadoo…
RT @iron_light: "Knowledge Ecology" translates to dynamic "Knowledge Management" #JoinTheConversation http://t.co/83HrO0sOMc #bigdata http:…
I hate that stores use my phone #, as a #database ID. Bad idea for these mobile times. I will forever be tied to 2004 cell. #bigdata
RT @BigDataBorat: Result of Pokémon or #bigdata survey in: Tokutek is database most likely confuse as Pokémon http://t.co/YaNKY5wfSJ
RT @BigDataBorat: Result of Pokémon or #bigdata survey in: Tokutek is database most likely confuse as Pokémon http://t.co/YaNKY5wfSJ
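
Each status in the v1.1 Search API response is itself a dict, with hashtags already parsed out under entities. The status below is fabricated (so this runs without a live API call), but its shape follows the standard v1.1 status structure:

```python
# A fabricated status dict in the shape returned by the v1.1 Search API:
# hashtags live under status['entities']['hashtags'].
status = {
    'text': 'Tokutek is database most likely confuse as Pokemon #bigdata',
    'entities': {'hashtags': [{'text': 'bigdata', 'indices': [51, 59]}]},
}

def hashtags(status):
    """Return the hashtag strings mentioned in a status."""
    return [h['text'] for h in status['entities']['hashtags']]

tags = hashtags(status)
```

In the lab you can apply the same function to each element of bigdata['statuses'].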

2. NoSQL

NoSQL databases are a newer family of databases.

The name NoSQL (often expanded as "not only SQL") refers to the lack of a fixed relational structure between stored objects. Data are semi-structured.

Most importantly, they attempt to minimize the need for JOIN operations, or to address other needs (such as scale and availability).

This tends to be good for engineers but bad for data scientists, who often want to relate and aggregate data across records.

Still, NoSQL databases have their uses.

What makes a NoSQL database?

  • Doesn't use SQL as query language
    • usually a more primitive query language
    • sometimes key/value only
  • BASE rather than ACID
    • that is, sacrifices consistency for availability
  • Schemaless
    • that is, data need not conform to a predefined schema (i.e. semi-structured)

BASE vs ACID

  • ACID
    • Atomicity
    • Consistency
    • Isolation
    • Durability
  • BASE
    • Basically Available
    • Soft-state
    • Eventual consistency

CAP

  • Consistency
    • all nodes always give the same answer
  • Availability
    • nodes always answer queries and accept updates
  • Partition-tolerance
    • system continues working even if messages between nodes are lost (i.e. the network is partitioned)

CAP Theorem: Pick two

Eventual consistency

  • A key property of non-ACID systems
  • Means
    • if no further changes made,
    • eventually all nodes will be consistent
  • In itself eventual consistency is a very weak guarantee
    • when is "eventually"?
    • in practice it means the system can be inconsistent at any time
  • Stronger guarantees are sometimes made
    • with prediction and measuring, actual behavior can be quantified
    • in practice, systems often appear strongly consistent
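
A toy simulation makes the guarantee concrete. This is a sketch of the idea, not any real database's replication protocol: a write is acknowledged by one replica and propagated lazily, so reads from other replicas can be stale until replication catches up.

```python
# Toy model of eventual consistency: writes land on one replica
# and are replicated to the others later.
class Replica(object):
    def __init__(self):
        self.data = {}

class Cluster(object):
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []                      # replication log, not yet applied

    def write(self, key, value):
        self.replicas[0].data[key] = value     # acknowledged immediately...
        self.pending.append((key, value))      # ...replicated later

    def read(self, key, replica=0):
        return self.replicas[replica].data.get(key)

    def sync(self):
        # With no further writes, applying the log makes all nodes agree.
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending = []

cluster = Cluster()
cluster.write('x', 1)
stale = cluster.read('x', replica=2)   # None: replica 2 hasn't seen the write
cluster.sync()
fresh = cluster.read('x', replica=2)   # now consistent
```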

NoSQL Examples

  • Memcached
  • Cassandra
  • MongoDB

More Examples:

  • Apache HBase
  • CouchDB
  • DynamoDB

Memcached:

  • Developed by LiveJournal
  • Distributed key-value store (like a Python dict)
  • Supports two very fast operations: get and set

Memcached is best used for storing application configuration settings, essentially caching those settings.
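
The get/set interface can be mimicked in-process with a plain dict. This is a sketch only: the real memcached is a networked service you would talk to through a client library. One memcached behavior worth modeling is that entries expire, so a get can miss even for a key you set:

```python
import time

# Minimal in-process stand-in for memcached's get/set interface.
class FakeMemcache(object):
    def __init__(self):
        self._store = {}

    def set(self, key, value, ttl=60):
        # Store the value along with its expiration time.
        self._store[key] = (value, time.time() + ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.time() > expires:       # expired entries behave as misses
            del self._store[key]
            return None
        return value

cache = FakeMemcache()
cache.set('app_config', {'theme': 'dark'})
hit = cache.get('app_config')     # {'theme': 'dark'}
miss = cache.get('unknown_key')   # None
```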

Cassandra:

  • Developed by Facebook
  • Messages application and Inbox Search
  • Key-Value (ish)
    • Supports query by key or key range
  • Very fast writing speeds
  • Useful for record keeping, logging

Mongo:

  • Developed by 10Gen (now MongoDB, Inc)
  • Document and Collection Structure
  • BSON (JSON-like) Storage system
  • Aggregation Framework

When might you want to use a NoSQL database? When not?

Mongo

What is MongoDB?

MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.

Document Database

A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.

A MongoDB document.

The advantages of using documents are:

  • Documents (i.e. objects) correspond to native data types in many programming languages.
  • Embedded documents and arrays reduce the need for expensive joins.
  • Dynamic schema supports fluent polymorphism.

Notice how similar this looks to a Python dictionary.
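
For example, a document with embedded documents and arrays maps directly onto a Python dict. The field names below are made up for illustration; this is the shape of a document, not output from a real database:

```python
# A MongoDB-style document: field/value pairs, with an embedded
# document and arrays of values and of documents (fields made up).
doc = {
    'name': 'Alessandro',
    'company': 'Glassdoor.com',
    'languages': ['Python', 'SQL'],                # array
    'plan': {'name': 'free', 'space': 307200},     # embedded document
    'repos': [                                     # array of documents
        {'name': 'DS_Lab04-API', 'public': True},
    ],
}

# No JOIN needed: related data is embedded right in the document.
repo_names = [r['name'] for r in doc['repos']]
```

With pymongo, a dict like this is what you would pass to an insert and what a query would hand back.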

Let's get started:

Open DS_Lab04-API