{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# [NTDS'17] demo 2: Twitter data acquisition\n", "[ntds'17]: https://github.com/mdeff/ntds_2017\n", "\n", "Michael Defferrard and Effrosyni Simou" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Objective" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this first lab session we will look into how to collect data from the Internet. Specifically, we will use the API (Application Programming Interface) of Twitter.\n", "\n", "We will also talk about the data cleaning process. While cleaning data is the [most time-consuming, least enjoyable Data Science task](http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says), it should be performed nonetheless.\n", "\n", "For this exercise you will need to be registered with Twitter and to generate access tokens. If you do not have an account on this social network, you can ask a friend to create tokens for you, or you can create a temporary account just for the needs of this class. 
\n", "\n", "You will need to create a [Twitter app](https://apps.twitter.com/) and copy the four tokens and secrets into the `credentials.ini` file:\n", "```\n", "[twitter]\n", "consumer_key = YOUR-CONSUMER-KEY\n", "consumer_secret = YOUR-CONSUMER-SECRET\n", "access_token = YOUR-ACCESS-TOKEN\n", "access_secret = YOUR-ACCESS-SECRET\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resources\n", "\n", "Here are some links you may find useful to complete this exercise.\n", "\n", "Web APIs:\n", "* [Twitter REST API](https://dev.twitter.com/rest/public)\n", "* [Tweepy Documentation](http://tweepy.readthedocs.io/en/v3.5.0/)\n", "\n", "Tutorials:\n", "* [Mining the Social Web](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition)\n", "* [Mining Twitter data with Python](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Web scraping\n", "\n", "There are several [Python-based clients](https://dev.twitter.com/overview/api/twitter-libraries#python) for Twitter. [Tweepy](http://tweepy.readthedocs.io) is a popular choice.\n", "\n", "Tasks:\n", "1. Download the relevant information from Twitter. Collect only the minimum amount of data required to answer the questions.\n", "2. Organize the collected data in a [pandas DataFrame](http://pandas.pydata.org/). 
Each row is a tweet, and the columns are at least: the tweet id, the text, the creation time, the number of likes (formerly called favorites) and the number of retweets.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import configparser\n", "\n", "import tweepy  # you will need to conda or pip install tweepy first\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Read the confidential tokens.\n", "credentials = configparser.ConfigParser()\n", "credentials.read(os.path.join('..', 'credentials.ini'))\n", "\n", "auth = tweepy.OAuthHandler(credentials.get('twitter', 'consumer_key'), credentials.get('twitter', 'consumer_secret'))\n", "auth.set_access_token(credentials.get('twitter', 'access_token'), credentials.get('twitter', 'access_secret'))\n", "\n", "api = tweepy.API(auth)\n", "\n", "user = 'EPFL_en'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that the API enforces rate limits per user access token. You can find out more about rate limits [here](https://developer.twitter.com/en/docs/basics/rate-limiting). To avoid rate limit errors when you need to make many requests to gather your data, you can construct your API instance as:\n", "\n", "```\n", "api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)\n", "```\n", "\n", "This will also notify you of how long the sleep period will be.\n", "\n", "It is good practice to limit the number of requests while developing, and then to increase it to collect all the necessary data."
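, "\n", "You can also check how many requests remain with `api.rate_limit_status()`, which returns a nested dictionary. The following sketch (the sample values are made up; the keys follow the rate limit status endpoint) shows how to extract the remaining user timeline calls:\n", "```\n", "# In a live session: status = api.rate_limit_status()\n", "# Hypothetical sample response (values made up):\n", "status = {'resources': {'statuses': {'/statuses/user_timeline': {'limit': 900, 'remaining': 897, 'reset': 1507210000}}}}\n", "limits = status['resources']['statuses']['/statuses/user_timeline']\n", "print('{} of {} requests remaining'.format(limits['remaining'], limits['limit']))\n", "```"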
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Number of posts / tweets to retrieve.\n", "# Small value for development, then increase to collect final data.\n", "n = 20 # 4000" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "my_user=api.get_user(user)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tweepy.models.User" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(my_user)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['__class__',\n", " '__delattr__',\n", " '__dict__',\n", " '__dir__',\n", " '__doc__',\n", " '__eq__',\n", " '__format__',\n", " '__ge__',\n", " '__getattribute__',\n", " '__getstate__',\n", " '__gt__',\n", " '__hash__',\n", " '__init__',\n", " '__init_subclass__',\n", " '__le__',\n", " '__lt__',\n", " '__module__',\n", " '__ne__',\n", " '__new__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__setattr__',\n", " '__sizeof__',\n", " '__str__',\n", " '__subclasshook__',\n", " '__weakref__',\n", " '_api',\n", " '_json',\n", " 'contributors_enabled',\n", " 'created_at',\n", " 'default_profile',\n", " 'default_profile_image',\n", " 'description',\n", " 'entities',\n", " 'favourites_count',\n", " 'follow',\n", " 'follow_request_sent',\n", " 'followers',\n", " 'followers_count',\n", " 'followers_ids',\n", " 'following',\n", " 'friends',\n", " 'friends_count',\n", " 'geo_enabled',\n", " 'has_extended_profile',\n", " 'id',\n", " 'id_str',\n", " 'is_translation_enabled',\n", " 'is_translator',\n", " 'lang',\n", " 'listed_count',\n", " 'lists',\n", " 'lists_memberships',\n", " 'lists_subscriptions',\n", " 'location',\n", " 'name',\n", " 'notifications',\n", " 'parse',\n", " 'parse_list',\n", " 'profile_background_color',\n", " 'profile_background_image_url',\n", " 
'profile_background_image_url_https',\n", " 'profile_background_tile',\n", " 'profile_banner_url',\n", " 'profile_image_url',\n", " 'profile_image_url_https',\n", " 'profile_link_color',\n", " 'profile_location',\n", " 'profile_sidebar_border_color',\n", " 'profile_sidebar_fill_color',\n", " 'profile_text_color',\n", " 'profile_use_background_image',\n", " 'protected',\n", " 'screen_name',\n", " 'status',\n", " 'statuses_count',\n", " 'time_zone',\n", " 'timeline',\n", " 'translator_type',\n", " 'unfollow',\n", " 'url',\n", " 'utc_offset',\n", " 'verified']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(my_user)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EPFL_en has 26096 followers\n" ] } ], "source": [ "followers = api.get_user(user).followers_count\n", "print('{} has {} followers'.format(user, followers))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tweepy handles much of the dirty work, like pagination. Have a look at how you can handle pagination with the Cursor objects in Tweepy with this [tutorial](http://docs.tweepy.org/en/v3.5.0/cursor_tutorial.html). 
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "tw = pd.DataFrame(columns=['id', 'text', 'time', 'likes', 'shares'])\n", "for tweet in tweepy.Cursor(api.user_timeline, screen_name=user).items(n):\n", "    serie = dict(id=tweet.id, text=tweet.text, time=tweet.created_at)\n", "    serie.update(dict(likes=tweet.favorite_count, shares=tweet.retweet_count))\n", "    tw = tw.append(serie, ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id                object\n", "text              object\n", "time      datetime64[ns]\n", "likes             object\n", "shares            object\n", "dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tw.dtypes" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "tw.id = tw.id.astype(np.int64)\n", "tw.likes = tw.likes.astype(np.int64)\n", "tw.shares = tw.shares.astype(np.int64)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id                 int64\n", "text              object\n", "time      datetime64[ns]\n", "likes              int64\n", "shares             int64\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tw.dtypes" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>id</th>\n", "      <th>text</th>\n", "      <th>time</th>\n", "      <th>likes</th>\n", "      <th>shares</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>915836109430693888</td>\n", "      <td>Two intelligent vehicles are better than one 🚗...</td>\n", "      <td>2017-10-05 07:08:38</td>\n", "      <td>4</td>\n", "      <td>3</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>915582684235206656</td>\n", "      <td>Congratulations to @EPFL_en start-up @lunaphor...</td>\n", "      <td>2017-10-04 14:21:37</td>\n", "      <td>14</td>\n", "      <td>4</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>915515076660129792</td>\n", "      <td>Our warmest congratulations to our neighbor Ja...</td>\n", "      <td>2017-10-04 09:52:58</td>\n", "      <td>121</td>\n", "      <td>60</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>914817738543173632</td>\n", "      <td>Our President @MartinVetterli in good company ...</td>\n", "      <td>2017-10-02 11:42:00</td>\n", "      <td>40</td>\n", "      <td>19</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>914794744605265920</td>\n", "      <td>\"Switzerland can and must be a leader in the d...</td>\n", "      <td>2017-10-02 10:10:37</td>\n", "      <td>24</td>\n", "      <td>11</td>\n", "    </tr>\n", "