{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Python Tour of Data Science: Data Acquisition & Exploration \n", "\n", "[Michaƫl Defferrard](http://deff.ch), *PhD student*, [EPFL](http://epfl.ch) [LTS2](http://lts2.epfl.ch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1 Exercise: problem definition\n", "\n", "Theme of the exercise: **understand the impact of your communication on social networks**. A real life situation: the marketing team needs help in identifying which were the most engaging posts they made on social platforms to prepare their next [AdWords](https://www.google.com/adwords/) campaign.\n", "\n", "As you probably don't have a company (yet?), you can either use your own social network profile as if it were the company's one or choose an established entity, e.g. EPFL. You will need to be registered in FB or Twitter to generate access tokens. If you're not, either ask a classmate to create a token for you or create a fake / temporary account for yourself (no need to follow other people, we can fetch public data).\n", "\n", "At the end of the exercise, you should have two datasets (Facebook & Twitter) and have used them to answer the following questions, for both Facebook and Twitter.\n", "1. How many followers / friends / likes has your chosen profile ?\n", "2. How many posts / tweets in the last year ?\n", "3. What were the 5 most liked posts / tweets ?\n", "4. Plot histograms of number of likes and comments / retweets.\n", "5. Plot basic statistics and an histogram of text lenght.\n", "6. Is there any correlation between the lenght of the text and the number of likes ?\n", "7. Be curious and explore your data. Did you find something interesting or surprising ?\n", " 1. Create at least one interactive plot (with bokeh) to explore an intuition (e.g. does the posting time plays a role)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2 Ressources\n", "\n", "Here are some links you may find useful to complete that exercise.\n", "\n", "Web APIs: these are the references.\n", "* [Facebook Graph API](https://developers.facebook.com/docs/graph-api)\n", "* [Twitter REST API](https://dev.twitter.com/rest/public)\n", "\n", "Tutorials:\n", "* [Mining the Social Web](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition)\n", "* [Mining Twitter data with Python](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)\n", "* [Simple Python Facebook Scraper](http://simplebeautifuldata.com/2014/08/25/simple-python-facebook-scraper-part-1/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3 Web scraping\n", "\n", "Tasks:\n", "1. Download the relevant information from Facebook and Twitter. Try to minimize the quantity of collected data to the minimum required to answer the questions.\n", "2. Build two SQLite databases, one for Facebook and the other for Twitter, using [pandas](http://pandas.pydata.org/) and [SQLAlchemy](http://www.sqlalchemy.org/).\n", " 1. For FB, each row is a post, and the columns are at least (you can include more if you want): the post id, the message (i.e. the text), the time when it was posted, the number of likes and the number of comments.\n", " 2. For Twitter, each row is a tweet, and the columns are at least: the tweet id, the text, the creation time, the number of likes (was called favorite before) and the number of retweets.\n", "\n", "Note that some data cleaning is already necessary. E.g. there are some FB posts without *message*, i.e. without text. Some tweets are also just retweets without any more information. Should they be collected ?" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Number of posts / tweets to retrieve.\n", "# Small value for development, then increase to collect final data.\n", "n = 20 # 4000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.1 Facebook\n", "\n", "There is two ways to scrape data from Facebook, you can choose one or combine them.\n", "1. The low-level approach, sending HTTP requests and receiving JSON responses to / from their Graph API. That can be achieved with the json and [requests](python-requests.org) packages (altough you can use urllib or urllib2, requests has a better API). The knowledge you'll acquire using that method will be useful to query other web APIs than FB. This method is also more flexible.\n", "2. The high-level approach, using a [Python SDK](http://facebook-sdk.readthedocs.io). The code you'll have to write for this method is gonna be shorter, but specific to the FB Graph API.\n", "\n", "You will need an access token, which can be created with the help of the [Graph Explorer](https://developers.facebook.com/tools/explorer). That tool may prove useful to test queries. Once you have your token, you may create a `credentials.ini` file with the following content:\n", "```\n", "[facebook]\n", "token = YOUR-FB-ACCESS-TOKEN\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import configparser\n", "\n", "credentials = configparser.ConfigParser()\n", "credentials.read('credentials.ini')\n", "token = credentials.get('facebook', 'token')\n", "\n", "# Or token = 'YOUR-FB-ACCESS-TOKEN'" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import requests # pip install requests\n", "import facebook # pip install facebook-sdk" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "page = 'EPFL.ch'" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 Twitter\n", "\n", "There exists a bunch of [Python-based clients](https://dev.twitter.com/overview/api/twitter-libraries#python) for Twitter. [Tweepy](http://tweepy.readthedocs.io) is a popular choice.\n", "\n", "You will need to create a [Twitter app](https://apps.twitter.com/) and copy the four tokens and secrets in the `credentials.ini` file:\n", "```\n", "[twitter]\n", "consumer_key = YOUR-CONSUMER-KEY\n", "consumer_secret = YOUR-CONSUMER-SECRET\n", "access_token = YOUR-ACCESS-TOKEN\n", "access_secret = YOUR-ACCESS-SECRET\n", "```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import tweepy # pip install tweepy\n", "\n", "# Read the confidential tokens and authenticate.\n", "auth = tweepy.OAuthHandler(credentials.get('twitter', 'consumer_key'), credentials.get('twitter', 'consumer_secret'))\n", "auth.set_access_token(credentials.get('twitter', 'access_token'), credentials.get('twitter', 'access_secret'))\n", "api = tweepy.API(auth)\n", "\n", "user = 'EPFL_en'" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4 Data analysis\n", "\n", "Answer the questions using [pandas](http://pandas.pydata.org/), [statsmodels](http://statsmodels.sourceforge.net/), [scipy.stats](http://docs.scipy.org/doc/scipy/reference/stats.html), [bokeh](http://bokeh.pydata.org)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "plt.style.use('ggplot')\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code here." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }