{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wrangle Report\n",
    "\n",
    "Wrangling of WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This document describes the efforts we did for wrangling and data analysis for the WeRateDogs data. Indeed the purpose of those efforts was to lead us to interesting insights regarding those data.  \n",
    "This document is like an internal document for the wrangling effort. It shows the path we followed and shares all the key technical point during this work."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data sources and gathering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We gather data from 3 sources :\n",
    "1. The WeRateDogs twitter archive, made available for us as `twitter_archive_enhanced.csv`\n",
    "2. The tweets images predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network, available at the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv\n",
    "3. Twitter, especially the WeRateDogs account."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. WeRateDogs twitter archive\n",
    "We load the csv file - as a dataframe - using read_csv() function from pandas."
   ]
  },
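  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of this loading step, assuming the file sits in the working directory (the `df_archive` name is ours, for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Load the WeRateDogs archive into a DataFrame\n",
    "df_archive = pd.read_csv('twitter_archive_enhanced.csv')\n",
    "df_archive.head()"
   ]
  },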
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Tweets images predictions\n",
    "First we use the python `requests` library to get the file from the provided URL and write it as a local file.  \n",
    "Then, we load this local tsv file - as a dataframe - using pandas read_csv() fonction, with the option `sep='\\t'`. "
   ]
  },
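  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the download-and-load step (the `df_image` name and the local file name are ours, for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import pandas as pd\n",
    "\n",
    "url = ('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/'\n",
    "       '599fd2ad_image-predictions/image-predictions.tsv')\n",
    "\n",
    "# Download the file and write it locally\n",
    "response = requests.get(url)\n",
    "with open('image-predictions.tsv', 'wb') as f:\n",
    "    f.write(response.content)\n",
    "\n",
    "# Load the local TSV file as a DataFrame\n",
    "df_image = pd.read_csv('image-predictions.tsv', sep='\\t')"
   ]
  },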
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Gather additional informations from twitter\n",
    "\"retweet count\" and \"favorite count\" were missing from the WeRateDogs twitter archive provided to us. \n",
    "We gather this additional information using twitter API. Using the tweet IDs from the WeRateDogs twitter archive, we gather all this missing info through queries towards Twitter's API - we used tweepy.  \n",
    "\n",
    "To use the twitter API to connec to twitter, it was necessary to create a twitter account and request a developer account here : https://developer.twitter.com/en/docs/basics/developer-portal/overview\n",
    "\n",
    "The gathering process was the following:\n",
    "* For each tweet ID in the WeRateDogs archive (`df_archive_clean`),\n",
    "* We query the Twitter API and get a tweet JSON data\n",
    "* We store the entire set of JSON data in a file called `tweet_json.txt`, on a new line\n",
    "* Once `tweet_json.txt` is completed, we read it line by line into a pandas DataFrame (at a minimum with tweet ID, retweet count, favorite count)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Twitter API creation, howto :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tweepy\n",
    "\n",
    "consumer_key = 'YOUR CONSUMER KEY'\n",
    "consumer_secret = 'YOUR CONSUMER SECRET'\n",
    "access_token = 'YOUR ACCESS TOKEN'\n",
    "access_secret = 'YOUR ACCESS SECRET'\n",
    "\n",
    "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n",
    "auth.set_access_token(access_token, access_secret)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following parameters were important during the API object creation :\n",
    "1. set the wait_on_rate_limit to True to automatically wait for rate limits to refill\n",
    "2. set wait_on_rate_limit_notify to True to print a notification when Tweepy is waiting for rate limits to refill\n",
    "The code is like the following :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Due to those rate limits, the json file create takes more or less 30 minutes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've got a basic understanding of the tweet json object from here : https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html. So we choose to collect \"created_at\", \"id\", \"retweet_count\", \"favorite_count\" and \"full_text\"."
   ]
  },
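  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A condensed sketch of the gathering loop and the read-back step, assuming the `api` object created above and the archive `df_archive_clean` (error handling is trimmed, and the `df_twitter` name is ours, for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import tweepy\n",
    "import pandas as pd\n",
    "\n",
    "# Query the API for each tweet ID and dump its JSON, one tweet per line\n",
    "with open('tweet_json.txt', 'w') as f:\n",
    "    for tweet_id in df_archive_clean['tweet_id']:\n",
    "        try:\n",
    "            tweet = api.get_status(tweet_id, tweet_mode='extended')\n",
    "            json.dump(tweet._json, f)\n",
    "            f.write('\\n')\n",
    "        except tweepy.TweepError:\n",
    "            pass  # e.g. deleted or protected tweets\n",
    "\n",
    "# Read the file back line by line, keeping only the fields we need\n",
    "rows = []\n",
    "with open('tweet_json.txt') as f:\n",
    "    for line in f:\n",
    "        data = json.loads(line)\n",
    "        rows.append({'tweet_id': data['id'],\n",
    "                     'created_at': data['created_at'],\n",
    "                     'retweet_count': data['retweet_count'],\n",
    "                     'favorite_count': data['favorite_count'],\n",
    "                     'full_text': data['full_text']})\n",
    "df_twitter = pd.DataFrame(rows)"
   ]
  },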
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Datasets assessments\n",
    "We used a systematic approach to assess each datasets. During this step by step approach, we raised and numbered each issues we encountered. We focus on quality issues (missing data, format, completness) and tidiness issues.  \n",
    "We focus on the issues according to the objective we had for the insights : Try to understand what makes the success for a tweet on WeRateDogs account.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is the approach we used :\n",
    "* Get a view on the dataset using `head()` or `sample()` functions\n",
    "* Get the dataset size using `shape` function\n",
    "* Find out if we have duplicates with `duplicated()` \n",
    "* Identify any missing information with `isnull().sum()`\n",
    "* Observe the unique values for the most relevant columns for us with `unique()`\n",
    "* Check the columns - most relevant for us - types with `dtypes`. Go deeper with the `type()`function if required."
   ]
  },
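  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For illustration, the assessment calls on the archive DataFrame look like this (in the actual notebook, each ran in its own cell so its output is displayed):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Typical assessment calls, shown here on the archive DataFrame\n",
    "df_archive.sample(5)           # eyeball a few rows\n",
    "df_archive.shape               # dataset size\n",
    "df_archive.duplicated().sum()  # number of duplicated rows\n",
    "df_archive.isnull().sum()      # missing values per column\n",
    "df_archive['name'].unique()    # distinct values of a relevant column\n",
    "df_archive.dtypes              # column types"
   ]
  },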
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this way of walking through, we identified the following issues :"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Quality issues identified :\n",
    "1. The WeRateDgos twitter archive enhanced contains retweets, which could not be part or our study\n",
    "2. The WeRateDgos twitter archive enhanced has 59 missing value from the \"expanded_urls\" column\n",
    "3. REMOVED ! - This quality is no longer part of our study -\n",
    "4. WeRateDgos twitter archive - in some cases, the relevant rating information, consequently rating_numerator and rating_denominator, have not been extracted.\n",
    "5. WeRateDgos twitter archive - We have \"None\" as name even if, sometimes, there is an available name in the text. So few names are missing and could be retrieved.\n",
    "6. WeRateDgos twitter archive - We have missing dog stages, meaning stages not properly extracted. This is the case for \"puppo\", \"doggo\" and \"pupper\".\n",
    "7. WeRateDgos twitter archive - \"expanded_urls\" column contains duplicated urls.\n",
    "8. WeRateDgos twitter archive - timestamp is using string type.\n",
    "9. Tweets images predictions - We have rows which are not dogs images predictions.\n",
    "10. Data retrieved via twitter API - \"created_at\" column type is not timestamp.\n",
    "11. Tweets images predictions - predictions and the associated confidences are spread over several columns."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Tidiness issues identified :\n",
    "1. WeRateDgos twitter archive - Dog stages have been spread as columns.\n",
    "2. Tweets images predictions - predictions and the associated confidences are spread over several columns."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cleaning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Approach\n",
    "1. First, we address the missing data and completness issues (Quality issues 1, 2, 4, 5, 6, 9)\n",
    "2. Then we resolve the tidiness problems we identified.\n",
    "3. Finally we will correct the quality issues (Quality issues 7, 8, 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each issues, we used :\n",
    "1. Define, \n",
    "2. Code\n",
    "3. Test."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While cleaning, we were looking for a way to apply a function to any row in a pandas DataFrame. We found the way to do searching \"function every row pandas\" on a search engine :\n",
    "http://jonathansoma.com/lede/foundations/classes/pandas%20columns%20and%20functions/apply-a-function-to-every-row-in-a-pandas-dataframe/   \n",
    "We used this approach a lot during the cleaning."
   ]
  },
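  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pattern boils down to `DataFrame.apply` with `axis=1`. A hypothetical example (the `has_name` helper is ours, not from the notebook; it relates to quality issue 5):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply a function to every row of the DataFrame\n",
    "def has_name(row):\n",
    "    # Flag rows where a dog name was actually extracted\n",
    "    return row['name'] not in (None, 'None')\n",
    "\n",
    "df_archive_clean['has_name'] = df_archive_clean.apply(has_name, axis=1)"
   ]
  },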
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For regular expressions, we tested with : https://regex101.com/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On how to use it in a python program : https://docs.python.org/3/library/re.html#re.findall"
   ]
  },
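  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A hypothetical example of `re.findall` on a rating like those of quality issue 4 (the sample text is ours, for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Extract a \"numerator/denominator\" rating, allowing decimal numerators\n",
    "text = 'This is Bella. She is 13.5/10 would pet.'\n",
    "matches = re.findall(r'(\\d+\\.?\\d*)/(\\d+)', text)\n",
    "# matches == [('13.5', '10')]"
   ]
  },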
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cleaning of the quality issue 9, was done with the correction of the tidiness issue 2."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the cleaning, we ensure the following requirement \"we do not keep the tweets beyond August 1st 2017, because we won't be able to have the associated images predictions\" by deleting some rows in the twitter's dataframes."
   ]
  },
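  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of this filtering, assuming the timestamp column was already converted to datetime (quality issue 8):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop tweets after August 1st, 2017 - no image predictions exist for them\n",
    "df_archive_clean = df_archive_clean[df_archive_clean['timestamp'] < '2017-08-01']"
   ]
  },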
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Merging all the 3 datasets for a master dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We merge the twitter enhanced archive already cleaned `df_archive_clean` with the data collected via a twitter API, also already cleaned `df_twitter_clean`.   \n",
    "Then we merge the result to the prediction dataframe `df_image_clean`.\n",
    "All on the `tweet_id` column."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We created only one column `rating` for the rating as the `rating_numerator / rating_denominator`."
   ]
  },
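  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the merging and the rating computation (the `df_master` name is ours, for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Merge the three cleaned DataFrames on tweet_id\n",
    "df_master = df_archive_clean.merge(df_twitter_clean, on='tweet_id')\n",
    "df_master = df_master.merge(df_image_clean, on='tweet_id')\n",
    "\n",
    "# Collapse the two rating columns into a single one\n",
    "df_master['rating'] = df_master['rating_numerator'] / df_master['rating_denominator']"
   ]
  },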
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Storing Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. We store the master dataframe into a csv files:  `twitter_archive_master.csv` \n",
    "2. We also store the dataframes into a SQLite database.  \n",
    "We've learnt the way to do from here : https://stackoverflow.com/questions/50803109/how-to-store-pandas-dataframe-in-sqlite-db#50803252"
   ]
  },
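  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of both storage steps (the database file and table names are ours, for illustration):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "\n",
    "# 1. CSV export\n",
    "df_master.to_csv('twitter_archive_master.csv', index=False)\n",
    "\n",
    "# 2. SQLite export, following the Stack Overflow pattern linked above\n",
    "conn = sqlite3.connect('twitter_archive.db')\n",
    "df_master.to_sql('master', conn, if_exists='replace', index=False)\n",
    "conn.close()"
   ]
  },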
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last step was to perform analysis and visualization using seaborn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We tried to develop an understanding of what make a dog picture to be a success.  \n",
    "We use categorical plots, as discovered from here : https://seaborn.pydata.org/tutorial/categorical.html#categorical-scatterplots"
   ]
  },
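  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A hypothetical example of such a categorical plot (the `stage` column name is an assumption based on the tidied dog stages):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Categorical scatterplot: favorite count by dog stage\n",
    "sns.catplot(x='stage', y='favorite_count', data=df_master)\n",
    "plt.show()"
   ]
  },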
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generate the HTML version of this notebook\n",
    "from subprocess import call\n",
    "call(['python', '-m', 'nbconvert', 'wrangle_report.ipynb'])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}