{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# [NTDS'17] demo 2: Twitter data aquisition\n", "[ntds'17]: https://github.com/mdeff/ntds_2017\n", "\n", "Michael Defferrard and Effrosyni Simou" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Objective" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this first lab session we will look into how we can collect data from the Internet. Specifically, we will look into the API (Application Programming Interface) of Twitter. \n", "\n", "We will also talk about the data cleaning process. While cleaning data is the [most time-consuming, least enjoyable Data Science task](http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says), it should be perfomed nonetheless.\n", "\n", "For this exercise you will need to be registered with Twitter and to generate access tokens. If you do not have an account in this social network you can ask a friend to create a token for you or you can create a temporary account just for the needs of this class. \n", "\n", "You will need to create a [Twitter app](https://apps.twitter.com/) and copy the four tokens and secrets in the `credentials.ini` file:\n", "```\n", "[twitter]\n", "consumer_key = YOUR-CONSUMER-KEY\n", "consumer_secret = YOUR-CONSUMER-SECRET\n", "access_token = YOUR-ACCESS-TOKEN\n", "access_secret = YOUR-ACCESS-SECRET\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ressources\n", "\n", "Here are some links you may find useful to complete that exercise.\n", "\n", "Web APIs: \n", "* [Twitter REST API](https://dev.twitter.com/rest/public)\n", "* [Tweepy Documentation](http://tweepy.readthedocs.io/en/v3.5.0/)\n", "\n", "Tutorials:\n", "* [Mining the Social Web](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition)\n", "* [Mining Twitter data with Python](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Web scraping\n", "There exists a bunch of [Python-based clients](https://dev.twitter.com/overview/api/twitter-libraries#python) for Twitter. [Tweepy](http://tweepy.readthedocs.io) is a popular choice.\n", "\n", "Tasks:\n", "1. Download the relevant information from Twitter. Try to minimize the quantity of collected data to the minimum required to answer the questions.\n", "2. Organize the collected data in a [panda dataframe](http://pandas.pydata.org/). Each row is a tweet, and the columns are at least: the tweet id, the text, the creation time, the number of likes (was called favorite before) and the number of retweets.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import configparser\n", "\n", "import tweepy # you will need to conda or pip install tweepy first\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Read the confidential token.\n", "credentials = configparser.ConfigParser()\n", "credentials.read(os.path.join('..', 'credentials.ini'))\n", "\n", "auth = tweepy.OAuthHandler(credentials.get('twitter', 'consumer_key'), credentials.get('twitter', 'consumer_secret'))\n", "auth.set_access_token(credentials.get('twitter', 'access_token'), credentials.get('twitter', 'access_secret'))\n", "\n", "api = tweepy.API(auth) \n", "\n", "user = 'EPFL_en'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Keep in mind that there is rate limiting of the API on a per user access token. You can find out more about rate limits [here](https://developer.twitter.com/en/docs/basics/rate-limiting). In order to avoid getting a rate limit error when you need to make a lot of requests to gather your data you can construct your API instance as:\n", "\n", "api = tweepy.API(auth,wait_on_rate_limit=True, wait_on_rate_limit_notify=True)\n", "\n", "This will aslo notify you about how long the period of sleep will be.\n", "\n", "It is good practice to limit the amount of requests while developing, and then to increase to collect all the necessary data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Number of posts / tweets to retrieve.\n", "# Small value for development, then increase to collect final data.\n", "n = 20 # 4000" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "my_user=api.get_user(user)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tweepy.models.User" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(my_user)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['__class__',\n", " '__delattr__',\n", " '__dict__',\n", " '__dir__',\n", " '__doc__',\n", " '__eq__',\n", " '__format__',\n", " '__ge__',\n", " '__getattribute__',\n", " '__getstate__',\n", " '__gt__',\n", " '__hash__',\n", " '__init__',\n", " '__init_subclass__',\n", " '__le__',\n", " '__lt__',\n", " '__module__',\n", " '__ne__',\n", " '__new__',\n", " '__reduce__',\n", " '__reduce_ex__',\n", " '__repr__',\n", " '__setattr__',\n", " '__sizeof__',\n", " '__str__',\n", " '__subclasshook__',\n", " '__weakref__',\n", " '_api',\n", " '_json',\n", " 'contributors_enabled',\n", " 'created_at',\n", " 'default_profile',\n", " 'default_profile_image',\n", " 'description',\n", " 'entities',\n", " 'favourites_count',\n", " 'follow',\n", " 'follow_request_sent',\n", " 'followers',\n", " 'followers_count',\n", " 'followers_ids',\n", " 'following',\n", " 'friends',\n", " 'friends_count',\n", " 'geo_enabled',\n", " 'has_extended_profile',\n", " 'id',\n", " 'id_str',\n", " 'is_translation_enabled',\n", " 'is_translator',\n", " 'lang',\n", " 'listed_count',\n", " 'lists',\n", " 'lists_memberships',\n", " 'lists_subscriptions',\n", " 'location',\n", " 'name',\n", " 'notifications',\n", " 'parse',\n", " 'parse_list',\n", " 'profile_background_color',\n", " 'profile_background_image_url',\n", " 'profile_background_image_url_https',\n", " 'profile_background_tile',\n", " 'profile_banner_url',\n", " 'profile_image_url',\n", " 'profile_image_url_https',\n", " 'profile_link_color',\n", " 'profile_location',\n", " 'profile_sidebar_border_color',\n", " 'profile_sidebar_fill_color',\n", " 'profile_text_color',\n", " 'profile_use_background_image',\n", " 'protected',\n", " 'screen_name',\n", " 'status',\n", " 'statuses_count',\n", " 'time_zone',\n", " 'timeline',\n", " 'translator_type',\n", " 'unfollow',\n", " 'url',\n", " 'utc_offset',\n", " 'verified']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dir(my_user)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EPFL_en has 26096 followers\n" ] } ], "source": [ "followers = api.get_user(user).followers_count\n", "print('{} has {} followers'.format(user, followers))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tweepy handles much of the dirty work, like pagination. Have a look at how you can handle pagination with the Cursor objects in Tweepy with this [tutorial](http://docs.tweepy.org/en/v3.5.0/cursor_tutorial.html). " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "tw = pd.DataFrame(columns=['id', 'text', 'time', 'likes', 'shares'])\n", "for tweet in tweepy.Cursor(api.user_timeline, screen_name=user).items(n):\n", " serie = dict(id=tweet.id, text=tweet.text, time=tweet.created_at)\n", " serie.update(dict(likes=tweet.favorite_count, shares=tweet.retweet_count))\n", " tw = tw.append(serie, ignore_index=True)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id object\n", "text object\n", "time datetime64[ns]\n", "likes object\n", "shares object\n", "dtype: object" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tw.dtypes" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "tw.id = tw.id.astype(np.int64)\n", "tw.likes = tw.likes.astype(np.int64)\n", "tw.shares = tw.shares.astype(np.int64)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id int64\n", "text object\n", "time datetime64[ns]\n", "likes int64\n", "shares int64\n", "dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tw.dtypes" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idtexttimelikesshares
0915836109430693888Two intelligent vehicles are better than one 🚗...2017-10-05 07:08:3843
1915582684235206656Congratulations to @EPFL_en start-up @lunaphor...2017-10-04 14:21:37144
2915515076660129792Our warmest congratulations to our neighbor Ja...2017-10-04 09:52:5812160
3914817738543173632Our President @MartinVetterli in good company ...2017-10-02 11:42:004019
4914794744605265920\"Switzerland can and must be a leader in the d...2017-10-02 10:10:372411
\n", "
" ], "text/plain": [ " id text \\\n", "0 915836109430693888 Two intelligent vehicles are better than one 🚗... \n", "1 915582684235206656 Congratulations to @EPFL_en start-up @lunaphor... \n", "2 915515076660129792 Our warmest congratulations to our neighbor Ja... \n", "3 914817738543173632 Our President @MartinVetterli in good company ... \n", "4 914794744605265920 \"Switzerland can and must be a leader in the d... \n", "\n", " time likes shares \n", "0 2017-10-05 07:08:38 4 3 \n", "1 2017-10-04 14:21:37 14 4 \n", "2 2017-10-04 09:52:58 121 60 \n", "3 2017-10-02 11:42:00 40 19 \n", "4 2017-10-02 10:10:37 24 11 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tw.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Cleaning\n", "\n", "Problems come in two flavours:\n", "\n", "1. Missing data, i.e. unknown values.\n", "1. Errors in data, i.e. wrong values.\n", "\n", "The actions to be taken in each case is highly **data and problem specific**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, some tweets are just retweets without any more information. Should they be collected ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, it is time for you to start collecting data from Twitter! Have fun!" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "#your code here" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 1 }