{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mining the Social Web\n", "\n", "## Mining Facebook\n", "\n", "This Jupyter Notebook provides an interactive way to follow along with the video lectures. The intent behind this notebook is to reinforce the concepts in a fun, convenient, and effective way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Facebook API Access\n", "\n", "Facebook implements OAuth 2.0 as its standard authentication mechanism, but provides a convenient way for you to get an _access token_ for development purposes, and we'll opt to take advantage of that convenience in this notebook.\n", "\n", "To get started, log in to your Facebook account and go to https://developers.facebook.com/tools/explorer/ to obtain an ACCESS_TOKEN, and then paste it into the code cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Copy and paste in the value you just got from the inline frame into this variable and execute this cell.\n", "# Keep in mind that you could have just gone to https://developers.facebook.com/tools/access_token/\n", "# and retrieved the \"User Token\" value from the Access Token Tool\n", "\n", "ACCESS_TOKEN = ''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1. Making Graph API requests over HTTP" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests # pip install requests\n", "import json\n", "\n", "base_url = 'https://graph.facebook.com/me'\n", "\n", "# Specify which fields to retrieve\n", "fields = 'id,name,likes.limit(10){about}'\n", "\n", "url = '{0}?fields={1}&access_token={2}'.format(base_url, fields, ACCESS_TOKEN)\n", "\n", "# This API is HTTP-based and could be requested in the browser,\n", "# with a command line utlity like curl, or using just about\n", "# any programming language by making a request to the URL.\n", "# Click the hyperlink that appears in your notebook output\n", "# when you execute this code cell to see for yourself...\n", "print(url)\n", "\n", "# Interpret the response as JSON and convert back\n", "# to Python data structures\n", "content = requests.get(url).json()\n", "\n", "# Pretty-print the JSON and display it\n", "print(json.dumps(content, indent=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2. Querying the Graph API with Python\n", "\n", "Facebook SDK for Python API reference:\n", "http://facebook-sdk.readthedocs.io/en/v2.0.0/api.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import facebook # pip install facebook-sdk\n", "import json\n", "\n", "# A helper function to pretty-print Python objects as JSON\n", "def pp(o):\n", " print(json.dumps(o, indent=1))\n", "\n", "# Create a connection to the Graph API with your access token\n", "g = facebook.GraphAPI(ACCESS_TOKEN, version='2.7')\n", "\n", "# Execute a few example queries:\n", "\n", "# Get my ID\n", "pp(g.get_object('me'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the connections to an ID\n", "# Example connection names: 'feed', 'likes', 'groups', 'posts'\n", "pp(g.get_connections(id='me', connection_name='likes'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Search for a location, may require approved app\n", "pp(g.request(\"search\", {'type': 'place', 'center': '40.749444, -73.968056', 'fields': 'name, location'}))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 3. Querying the Graph API for Mining the Social Web and Counting Fans" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Search for a page's ID by name\n", "pp(g.request(\"search\", {'q': 'Mining the Social Web', 'type': 'page'}))\n", "\n", "# Grab the ID for the book and check the number of fans\n", "mtsw_id = '146803958708175'\n", "pp(g.get_object(id=mtsw_id, fields=['fan_count']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 4. Querying the Graph API for Open Graph objects by their URLs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# MTSW catalog link\n", "pp(g.get_object('http://shop.oreilly.com/product/0636920030195.do'))\n", "\n", "# PCI catalog link\n", "pp(g.get_object('http://shop.oreilly.com/product/9780596529321.do'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 5. Counting total number of page fans" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The following code may require the developer's app be submitted for review and\n", "# approved. See https://developers.facebook.com/docs/apps/review\n", "\n", "# Take, for example, three popular musicians and their page IDs.\n", "taylor_swift_id = '19614945368'\n", "drake_id = '83711079303'\n", "beyonce_id = '28940545600'\n", "\n", "# Declare a helper function for retrieving the total number of fans ('likes') a page has \n", "def get_total_fans(page_id):\n", " return int(g.get_object(id=page_id, fields=['fan_count'])['fan_count'])\n", "\n", "tswift_fans = get_total_fans(taylor_swift_id)\n", "drake_fans = get_total_fans(drake_id)\n", "beyonce_fans = get_total_fans(beyonce_id)\n", "\n", "print('Taylor Swift: {0} fans on Facebook'.format(tswift_fans))\n", "print('Drake: {0} fans on Facebook'.format(drake_fans))\n", "print('Beyoncé: {0} fans on Facebook'.format(beyonce_fans))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 6. Retrieving a page's feed" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Declare a helper function for retrieving the official feed from a given page.\n", "def retrieve_page_feed(page_id, n_posts):\n", " \"\"\"Retrieve the first n_posts from a page's feed in reverse\n", " chronological order.\"\"\"\n", " feed = g.get_connections(page_id, 'posts')\n", " posts = []\n", " posts.extend(feed['data'])\n", "\n", " while len(posts) < n_posts:\n", " try:\n", " feed = requests.get(feed['paging']['next']).json()\n", " posts.extend(feed['data'])\n", " except KeyError:\n", " # When there are no more posts in the feed, break\n", " print('Reached end of feed.')\n", " break\n", " \n", " if len(posts) > n_posts:\n", " posts = posts[:n_posts]\n", "\n", " print('{} items retrieved from feed'.format(len(posts)))\n", " return posts\n", "\n", "# Declare a helper function for returning the message content of a post\n", "def get_post_message(post):\n", " try:\n", " message = post['story']\n", " except KeyError:\n", " # Post may have 'message' instead of 'story'\n", " pass\n", " try:\n", " message = post['message']\n", " except KeyError:\n", " # Post has neither\n", " message = ''\n", " return message.replace('\\n', ' ')\n", "\n", "# Retrieve the last 5 items from their feeds\n", "for artist in [taylor_swift_id, drake_id, beyonce_id]:\n", " print()\n", " feed = retrieve_page_feed(artist, 5)\n", " for i, post in enumerate(feed):\n", " message = get_post_message(post)[:50]\n", " print('{0} - {1}...'.format(i+1, message))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 7. Measuring engagement" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Measure the response to a post in terms of likes, shares, and comments\n", "def measure_response(post_id):\n", " \"\"\"Returns the number of likes, shares, and comments on a \n", " given post as a measure of user engagement.\"\"\"\n", " likes = g.get_object(id=post_id, \n", " fields=['likes.limit(0).summary(true)'])\\\n", " ['likes']['summary']['total_count']\n", " shares = g.get_object(id=post_id, \n", " fields=['shares.limit(0).summary(true)'])\\\n", " ['shares']['count']\n", " comments = g.get_object(id=post_id, \n", " fields=['comments.limit(0).summary(true)'])\\\n", " ['comments']['summary']['total_count']\n", " return likes, shares, comments\n", "\n", "# Measure the relative share of a page's fans engaging with a post\n", "def measure_engagement(post_id, total_fans):\n", " \"\"\"Returns the number of likes, shares, and comments on a \n", " given post as a measure of user engagement.\"\"\"\n", " likes = g.get_object(id=post_id, \n", " fields=['likes.limit(0).summary(true)'])\\\n", " ['likes']['summary']['total_count']\n", " shares = g.get_object(id=post_id, \n", " fields=['shares.limit(0).summary(true)'])\\\n", " ['shares']['count']\n", " comments = g.get_object(id=post_id, \n", " fields=['comments.limit(0).summary(true)'])\\\n", " ['comments']['summary']['total_count']\n", " likes_pct = likes / total_fans * 100.0\n", " shares_pct = shares / total_fans * 100.0\n", " comments_pct = comments / total_fans * 100.0\n", " return likes_pct, shares_pct, comments_pct\n", "\n", "# Retrieve the last 5 items from the artists' feeds, print the\n", "# reaction and the degree of engagement\n", "artist_dict = {'Taylor Swift': taylor_swift_id,\n", " 'Drake': drake_id,\n", " 'Beyoncé': beyonce_id}\n", "for name, page_id in artist_dict.items():\n", " print()\n", " print(name)\n", " print('------------')\n", " feed = retrieve_page_feed(page_id, 5)\n", " total_fans = get_total_fans(page_id)\n", " \n", " for i, post in enumerate(feed):\n", " message = get_post_message(post)[:30]\n", " post_id = post['id']\n", " likes, shares, comments = measure_response(post_id)\n", " likes_pct, shares_pct, comments_pct = measure_engagement(post_id, total_fans)\n", " print('{0} - {1}...'.format(i+1, message))\n", " print(' Likes {0} ({1:7.5f}%)'.format(likes, likes_pct))\n", " print(' Shares {0} ({1:7.5f}%)'.format(shares, shares_pct))\n", " print(' Comments {0} ({1:7.5f}%)'.format(comments, comments_pct))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 8. Storing data in a pandas DataFrame" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd # pip install pandas\n", "\n", "# Create a Pandas DataFrame to contain artist page\n", "# feed information\n", "columns = ['Name',\n", " 'Total Fans',\n", " 'Post Number',\n", " 'Post Date',\n", " 'Headline',\n", " 'Likes',\n", " 'Shares',\n", " 'Comments',\n", " 'Rel. Likes',\n", " 'Rel. Shares',\n", " 'Rel. Comments']\n", "musicians = pd.DataFrame(columns=columns)\n", "\n", "# Build the DataFrame by adding the last 10 posts and their audience\n", "# reaction for each of the artists\n", "for page_id in [taylor_swift_id, drake_id, beyonce_id]:\n", " name = g.get_object(id=page_id)['name']\n", " fans = get_total_fans(page_id)\n", " feed = retrieve_page_feed(page_id, 10)\n", " for i, post in enumerate(feed):\n", " likes, shares, comments = measure_response(post['id'])\n", " likes_pct, shares_pct, comments_pct = measure_engagement(post['id'], fans)\n", " musicians = musicians.append({'Name': name,\n", " 'Total Fans': fans,\n", " 'Post Number': i+1,\n", " 'Post Date': post['created_time'],\n", " 'Headline': get_post_message(post),\n", " 'Likes': likes,\n", " 'Shares': shares,\n", " 'Comments': comments,\n", " 'Rel. Likes': likes_pct,\n", " 'Rel. Shares': shares_pct,\n", " 'Rel. Comments': comments_pct,\n", " }, ignore_index=True)\n", "# Fix the dtype of a few columns\n", "for col in ['Post Number', 'Total Fans', 'Likes', 'Shares', 'Comments']:\n", " musicians[col] = musicians[col].astype(int)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Show a preview of the DataFrame\n", "musicians.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 9. Visualizing data stored in a pandas DataFrame" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib # pip install matplotlib\n", "%matplotlib inline\n", "\n", "musicians[musicians['Name'] == 'Drake'].plot(x='Post Number', y='Likes', kind='bar')\n", "musicians[musicians['Name'] == 'Drake'].plot(x='Post Number', y='Shares', kind='bar')\n", "musicians[musicians['Name'] == 'Drake'].plot(x='Post Number', y='Comments', kind='bar')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "musicians[musicians['Name'] == 'Drake'].plot(x='Post Number', y='Rel. Likes', kind='bar')\n", "musicians[musicians['Name'] == 'Drake'].plot(x='Post Number', y='Rel. Shares', kind='bar')\n", "musicians[musicians['Name'] == 'Drake'].plot(x='Post Number', y='Rel. Comments', kind='bar')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 10. Comparing different artists to each other" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Reset the index to a multi-index\n", "musicians = musicians.set_index(['Name','Post Number'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# The unstack method pivots the index labels\n", "# and lets you get data columns grouped by artist\n", "musicians.unstack(level=0)['Likes']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Plot the comparative reactions to each artist's last 10 Facebook posts\n", "plot = musicians.unstack(level=0)['Likes'].plot(kind='bar', subplots=False, figsize=(10,5), width=0.8)\n", "plot.set_xlabel('10 Latest Posts')\n", "plot.set_ylabel('Number of Likes Received')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Plot the engagement of each artist's Facebook fan base to the last 10 posts\n", "plot = musicians.unstack(level=0)['Rel. Likes'].plot(kind='bar', subplots=False, figsize=(10,5), width=0.8)\n", "plot.set_xlabel('10 Latest Posts')\n", "plot.set_ylabel('Likes / Total Fans (%)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 11. Calculate average engagement" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "print('Average Likes / Total Fans')\n", "print(musicians.unstack(level=0)['Rel. Likes'].mean())\n", "\n", "print('\\nAverage Shares / Total Fans')\n", "print(musicians.unstack(level=0)['Rel. Shares'].mean())\n", "\n", "print('\\nAverage Comments / Total Fans')\n", "print(musicians.unstack(level=0)['Rel. Comments'].mean())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }