# [NTDS'17] demo 2: Twitter data acquisition
[ntds'17]: https://github.com/mdeff/ntds_2017

Michael Defferrard and Effrosyni Simou

## Objective

In this first lab session we will look at how to collect data from the Internet. Specifically, we will use the Twitter API (Application Programming Interface).

We will also talk about the data cleaning process. While cleaning data is the [most time-consuming, least enjoyable Data Science task](http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says), it must be performed nonetheless.

For this exercise you need to be registered with Twitter and to generate access tokens. If you do not have an account on this social network, you can ask a friend to create a token for you, or you can create a temporary account just for this class.

You will need to create a [Twitter app](https://apps.twitter.com/) and copy the four tokens and secrets into the `credentials.ini` file:
```
[twitter]
consumer_key = YOUR-CONSUMER-KEY
consumer_secret = YOUR-CONSUMER-SECRET
access_token = YOUR-ACCESS-TOKEN
access_secret = YOUR-ACCESS-SECRET
```


## Resources

Here are some links you may find useful to complete this exercise.

Web APIs: 
* [Twitter REST API](https://dev.twitter.com/rest/public)
* [Tweepy Documentation](http://tweepy.readthedocs.io/en/v3.5.0/)

Tutorials:
* [Mining the Social Web](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition)
* [Mining Twitter data with Python](https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/)

## Web scraping
There are several [Python-based clients](https://dev.twitter.com/overview/api/twitter-libraries#python) for Twitter. [Tweepy](http://tweepy.readthedocs.io) is a popular choice.

Tasks:
1. Download the relevant information from Twitter. Collect only the data required to answer the questions.
2. Organize the collected data in a [pandas DataFrame](http://pandas.pydata.org/). Each row is a tweet, and the columns are at least: the tweet id, the text, the creation time, the number of likes (previously called favorites), and the number of retweets.



In [1]:
import os
import configparser

import tweepy  # you will need to conda or pip install tweepy first
import numpy as np
import pandas as pd

In [2]:
# Read the confidential token.
credentials = configparser.ConfigParser()
credentials.read(os.path.join('..', 'credentials.ini'))

auth = tweepy.OAuthHandler(credentials.get('twitter', 'consumer_key'), credentials.get('twitter', 'consumer_secret'))
auth.set_access_token(credentials.get('twitter', 'access_token'), credentials.get('twitter', 'access_secret'))

api = tweepy.API(auth) 

user = 'EPFL_en'

Keep in mind that the API is rate limited per user access token. You can find out more about rate limits [here](https://developer.twitter.com/en/docs/basics/rate-limiting). To avoid rate limit errors when you need to make many requests to gather your data, you can construct your API instance as:

    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

With these options, Tweepy sleeps until the limit resets and also notifies you of how long the sleep will be. Note that `wait_on_rate_limit_notify` belongs to Tweepy 3.x, the version documented in the links above; it was removed in Tweepy 4.0.
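
To make the idea concrete, here is a minimal sketch of what waiting on a rate limit amounts to: catch the error, sleep, and retry. It does not use Tweepy. `RateLimited`, `call_with_wait`, the `sleep` parameter, and the default waiting time are hypothetical choices made for illustration; the real exception class depends on your Tweepy version.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the rate limit error raised by an API client."""


def call_with_wait(func, wait_seconds=15 * 60, max_retries=3, sleep=time.sleep):
    """Call func(); on RateLimited, sleep and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            print('Rate limit reached, sleeping for {} seconds'.format(wait_seconds))
            sleep(wait_seconds)
```

Passing `sleep` as a parameter makes the helper easy to try out without actually waiting.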

It is good practice to limit the number of requests while developing, and to increase it only when collecting the final data.

In [3]:
# Number of posts / tweets to retrieve.
# Small value for development, then increase to collect final data.
n = 20  # 4000

In [4]:
my_user = api.get_user(user)

In [5]:
type(my_user)

tweepy.models.User

In [6]:
dir(my_user)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_api',
 '_json',
 'contributors_enabled',
 'created_at',
 'default_profile',
 'default_profile_image',
 'description',
 'entities',
 'favourites_count',
 'follow',
 'follow_request_sent',
 'followers',
 'followers_count',
 'followers_ids',
 'following',
 'friends',
 'friends_count',
 'geo_enabled',
 'has_extended_profile',
 'id',
 'id_str',
 'is_translation_enabled',
 'is_translator',
 'lang',
 'listed_count',
 'lists',
 'lists_memberships',
 'lists_subscriptions',
 'location',
 'name',
 'notifications',
 'parse',
 'parse_list',
 'profile_background_color',
 'profile_background_image_url',
 'profile_back

In [7]:
# Reuse the user object fetched above instead of issuing a second request.
followers = my_user.followers_count
print('{} has {} followers'.format(user, followers))

EPFL_en has 26096 followers


Tweepy handles much of the dirty work, like pagination. This [tutorial](http://docs.tweepy.org/en/v3.5.0/cursor_tutorial.html) shows how to handle pagination with Tweepy's Cursor objects.
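
Conceptually, a cursor keeps requesting the next page until it has yielded enough items. The sketch below illustrates this with a fake, in-memory data source. `FAKE_TIMELINE`, `fetch_page`, `iterate_items`, and the page size are assumptions made for illustration; this is not Tweepy's actual implementation.

```python
import itertools

# Fake data source standing in for a remote timeline (hypothetical).
FAKE_TIMELINE = ['tweet {}'.format(i) for i in range(45)]


def fetch_page(page, page_size=20):
    """Return one page of results, as a paginated web API would."""
    start = page * page_size
    return FAKE_TIMELINE[start:start + page_size]


def iterate_items(fetch, limit):
    """Yield up to `limit` items, requesting new pages only when needed."""
    def all_items():
        for page in itertools.count():
            results = fetch(page)
            if not results:  # an empty page means there is nothing left
                return
            yield from results
    return itertools.islice(all_items(), limit)


first_25 = list(iterate_items(fetch_page, 25))
print(len(first_25), first_25[-1])  # 25 tweet 24
```

Because the generator is lazy, asking for 25 items only triggers two page requests here, which is the same reason a small `n` keeps the number of API calls low.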

In [8]:
# Collect one row per tweet, keeping only the fields we need.
# Requires pandas < 2.0: DataFrame.append was removed in pandas 2.0.
tw = pd.DataFrame(columns=['id', 'text', 'time', 'likes', 'shares'])
for tweet in tweepy.Cursor(api.user_timeline, screen_name=user).items(n):
    serie = dict(id=tweet.id, text=tweet.text, time=tweet.created_at)
    serie.update(dict(likes=tweet.favorite_count, shares=tweet.retweet_count))
    tw = tw.append(serie, ignore_index=True)
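
Since `DataFrame.append` is no longer available in pandas 2.0 and later, a possible alternative is to collect plain dictionaries and build the DataFrame once at the end. The sketch below uses fake tweet objects so that it runs without network access. `tweets_to_frame`, `fake_tweets`, and `tw_demo` are illustrative names; with the real API you would pass the items yielded by the cursor instead.

```python
from datetime import datetime
from types import SimpleNamespace

import pandas as pd


def tweets_to_frame(tweets):
    """Build a DataFrame with one row per tweet from tweet-like objects."""
    records = [
        dict(id=t.id, text=t.text, time=t.created_at,
             likes=t.favorite_count, shares=t.retweet_count)
        for t in tweets
    ]
    return pd.DataFrame(records, columns=['id', 'text', 'time', 'likes', 'shares'])


# Fake tweets exposing the same attribute names as the loop above uses.
fake_tweets = [
    SimpleNamespace(id=1, text='first', created_at=datetime(2017, 10, 5, 7, 8),
                    favorite_count=4, retweet_count=3),
    SimpleNamespace(id=2, text='second', created_at=datetime(2017, 10, 4, 14, 21),
                    favorite_count=14, retweet_count=4),
]
tw_demo = tweets_to_frame(fake_tweets)
print(tw_demo.dtypes)
```

Building the frame in one go also avoids copying the whole table on every iteration, and the numeric columns are inferred as integers directly.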

In [9]:
tw.dtypes

id                object
text              object
time      datetime64[ns]
likes             object
shares            object
dtype: object

In [10]:
tw.id = tw.id.astype(np.int64)
tw.likes = tw.likes.astype(np.int64)
tw.shares = tw.shares.astype(np.int64)
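
The three casts can also be written as a single call by passing a column-to-dtype mapping to `astype`. The sketch below uses a small frame (`demo`, a name chosen for this example) holding two rows of the data shown in this demo, stored as generic objects to mimic the situation above.

```python
import numpy as np
import pandas as pd

# Values stored as generic Python objects, as in the frame above.
demo = pd.DataFrame({
    'id': [915836109430693888, 915582684235206656],
    'likes': [4, 14],
    'shares': [3, 4],
}, dtype=object)

# Cast several columns at once with a {column: dtype} mapping.
demo = demo.astype({'id': np.int64, 'likes': np.int64, 'shares': np.int64})
print(demo.dtypes)
```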

In [11]:
tw.dtypes

id                 int64
text              object
time      datetime64[ns]
likes              int64
shares             int64
dtype: object

In [12]:
tw.head()

| | id | text | time | likes | shares |
|---|---|---|---|---|---|
| 0 | 915836109430693888 | Two intelligent vehicles are better than one 🚗... | 2017-10-05 07:08:38 | 4 | 3 |
| 1 | 915582684235206656 | Congratulations to @EPFL_en start-up @lunaphor... | 2017-10-04 14:21:37 | 14 | 4 |
| 2 | 915515076660129792 | Our warmest congratulations to our neighbor Ja... | 2017-10-04 09:52:58 | 121 | 60 |
| 3 | 914817738543173632 | Our President @MartinVetterli in good company ... | 2017-10-02 11:42:00 | 40 | 19 |
| 4 | 914794744605265920 | "Switzerland can and must be a leader in the d... | 2017-10-02 10:10:37 | 24 | 11 |


## Data Cleaning

Problems come in two flavours:

1. Missing data, i.e. unknown values.
1. Errors in data, i.e. wrong values.
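
As a small illustration of the two flavours, the sketch below builds a made-up frame (`raw`) containing one missing value and one implausible value, then flags both with pandas. The validity rule (counts cannot be negative) is an assumption chosen for the example.

```python
import pandas as pd

raw = pd.DataFrame({
    'text': ['hello', None, 'world'],
    'likes': [4, 14, -1],  # -1 is an impossible count, i.e. an error
})

# 1. Missing data: count unknown values per column.
print(raw.isna().sum())

# 2. Errors in data: flag rows violating a validity rule.
invalid = raw['likes'] < 0
print(raw[invalid])

# One possible action: drop incomplete rows and discard invalid ones.
clean = raw.dropna(subset=['text'])
clean = clean[clean['likes'] >= 0]
```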

The actions to be taken in each case are highly **data and problem specific**.

For instance, some tweets are just retweets without any additional information. Should they be collected?
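
If you decide to exclude them, one possible heuristic is to drop tweets whose text starts with `RT @`, the prefix conventionally used for retweets. This is an assumption about the text format rather than a guarantee of the API, and the frame `sample` below is made up.

```python
import pandas as pd

sample = pd.DataFrame({
    'text': ['RT @EPFL_en: some news', 'An original tweet', 'RT @someone: hello'],
    'likes': [0, 12, 0],
})

# Heuristic: retweets conventionally start with 'RT @'.
is_retweet = sample['text'].str.startswith('RT @')
originals = sample[~is_retweet].reset_index(drop=True)
print(originals)
```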

Now, it is time for you to start collecting data from Twitter! Have fun!

In [13]:
# Your code here.