{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://raw.githubusercontent.com/shrutisaxena0617/Machine_Learning/master/images/data-banner-copy.jpg\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1>Data Design and Representation | Final Project</h1>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. [Part 1: Relational Data](#1) <br>\n",
    "2. [Part 2: JSON, Data Streaming](#2)<br>\n",
    "3. [Part 3: Text Data](#3) <br>\n",
    "4. [References](#4) <br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 1 Relational Data <a class=\"anchor\" id=\"1\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary of the business problem\n",
    "Jones Dozers is a construction equipment company that needs to track the equipment it manufactures and the related transactions so that the data can systematically support its business strategy. As consultants to Jones Dozers, our aim is to help the company build a relational database, since relational systems are well suited to storing and processing business transaction data. Our first step is to gather requirements from stakeholders and build a conceptual model. The database will store information on the equipment the company makes, rentals and sales of that equipment, the customers who rent or buy it, and the sales reps who conduct the rental and sale transactions."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"https://github.com/shanxingg/Data_Design_and_Representation_Project/blob/master/.ipynb_checkpoints/part1.1.png?raw=true\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solution details 1: Requirements for the ER diagram for the Jones Dozers Sales and Rentals Database shown above:\n",
    "*\tThe Jones Dozers Sales and Rentals Database (DB) will keep track of the equipment the company makes, rentals and sales of the equipment, customers who rent or buy the equipment and the sales reps who conduct the rent or sale transactions.\n",
    "*\tFor each piece of equipment, the DB will keep track of the unique equipment serial number, the date when the equipment was made and the last inspection date.\n",
    "*\tFor each equipment detail, the DB will keep track of a unique equipment detail identifier and the equipment detail make, type and model.\n",
    "*\tFor each customer, the DB will keep track of a unique customer identifier, the customer name and the customer category.\n",
    "*\tFor each sales rep, the DB will keep track of a unique sales rep identifier, the sales rep name, which is composed of a first and last name, and the sales rep rank.\n",
    "*\tFor each rental, the DB will keep track of a unique rental transaction identifier, the date of the rental and the total price of the rental.\n",
    "*\tFor each sale, the DB will keep track of a unique sales transaction identifier, the sale date and the price.\n",
    "*\tEach piece of equipment has exactly one equipment detail. Each equipment detail applies to at least one piece of equipment, but can apply to many.\n",
    "*\tA piece of equipment is rented via a rental. A piece of equipment may never be rented or may be rented through many rentals (the same equipment can be rented out multiple times), but each rental covers one and only one piece of equipment.\n",
    "*\tA piece of equipment is sold via a sale. A piece of equipment may remain unsold or may be sold only once. Each sale covers one and only one piece of equipment.\n",
    "*\tA customer rents a piece of equipment via a rental. A customer may have multiple rental transactions or none. Each rental transaction is made by one and only one customer.\n",
    "*\tA customer buys a piece of equipment via a sale. A customer may have multiple sale transactions or none. Each sale is made by one and only one customer.\n",
    "*\tA sales rep conducts a rental. A sales rep can conduct multiple rentals or none. Each rental is conducted by one and only one sales rep.\n",
    "*\tA sales rep conducts a sale. A sales rep can conduct multiple sale transactions or none. Each sale transaction is conducted by one and only one sales rep.\n",
    "*\tA sales rep (mentor) can mentor other sales reps (protégés). A sales rep can mentor up to 3 other sales reps or none. A sales rep can have one mentor or none."
   ]
  },
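  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cardinalities listed above can be made concrete as table constraints. The following is only a minimal sketch using Python's built-in sqlite3 module: every table and column name, and the `build_schema` helper, are hypothetical choices made for illustration and are not taken from the ERDPlus schema. Two rules are deliberately left out because a plain column constraint cannot express them: the limit of 3 protégés per mentor, and the rule that every equipment detail applies to at least one piece of equipment.\n",
    "\n",
    "```python\n",
    "import sqlite3\n",
    "\n",
    "# Hypothetical names throughout; only the cardinalities follow the requirements.\n",
    "SCHEMA = '''\n",
    "CREATE TABLE equipment_detail (\n",
    "    detail_id INTEGER PRIMARY KEY,\n",
    "    make TEXT, detail_type TEXT, model TEXT\n",
    ");\n",
    "CREATE TABLE equipment (\n",
    "    serial_no TEXT PRIMARY KEY,\n",
    "    made_on TEXT,\n",
    "    last_inspected_on TEXT,\n",
    "    -- each piece of equipment has exactly one equipment detail\n",
    "    detail_id INTEGER NOT NULL REFERENCES equipment_detail(detail_id)\n",
    ");\n",
    "CREATE TABLE customer (\n",
    "    customer_id INTEGER PRIMARY KEY,\n",
    "    name TEXT, category TEXT\n",
    ");\n",
    "CREATE TABLE sales_rep (\n",
    "    rep_id INTEGER PRIMARY KEY,\n",
    "    first_name TEXT, last_name TEXT, rep_rank TEXT,\n",
    "    -- NULL means no mentor; a rep has at most one mentor\n",
    "    mentor_id INTEGER REFERENCES sales_rep(rep_id)\n",
    ");\n",
    "CREATE TABLE rental (\n",
    "    rental_id INTEGER PRIMARY KEY,\n",
    "    rented_on TEXT, price REAL,\n",
    "    -- exactly one equipment, customer and sales rep per rental\n",
    "    serial_no TEXT NOT NULL REFERENCES equipment(serial_no),\n",
    "    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),\n",
    "    rep_id INTEGER NOT NULL REFERENCES sales_rep(rep_id)\n",
    ");\n",
    "CREATE TABLE sale (\n",
    "    sale_id INTEGER PRIMARY KEY,\n",
    "    sold_on TEXT, price REAL,\n",
    "    -- UNIQUE: a piece of equipment can be sold at most once\n",
    "    serial_no TEXT NOT NULL UNIQUE REFERENCES equipment(serial_no),\n",
    "    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),\n",
    "    rep_id INTEGER NOT NULL REFERENCES sales_rep(rep_id)\n",
    ");\n",
    "'''\n",
    "\n",
    "\n",
    "def build_schema(path=':memory:'):\n",
    "    conn = sqlite3.connect(path)\n",
    "    conn.execute('PRAGMA foreign_keys = ON')  # SQLite leaves FK checks off by default\n",
    "    conn.executescript(SCHEMA)\n",
    "    return conn\n",
    "```\n",
    "\n",
    "In this sketch, `rental.serial_no` is not unique, so the same equipment can appear in many rentals, whereas a second row in `sale` for the same `serial_no` would be rejected with `sqlite3.IntegrityError`."
   ]
  },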
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solution details 2: Relational schema mapped to the ER diagram"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ERDPlus schema file is available at https://github.com/p-sama/Data-Design/blob/master/423-FinalProject-JD.erdplus <br>\n",
    "The schema is also shown as an image below:\n",
    "<img src=\"https://github.com/shanxingg/Data_Design_and_Representation_Project/blob/master/.ipynb_checkpoints/part1.2.png?raw=true\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary of the solution and key highlights:\n",
    "We used Entity-Relationship (ER) modeling to conceptualize the database, working on the online platform ERDplus.com. The first step was to identify each entity to track, its attributes (unique and otherwise) and the relationships between entities (including self-relationships), along with the type of each relationship. Once the ER diagram was built, the next step was to create a relational schema that defines the structure of each table in the actual database. Once the model is finalized, the final step is to choose a database system (MySQL, BigQuery, SQL Server, etc.) and use SQL to insert, update or query the data. Such data models also serve as building blocks for analytical databases that analyze sales, revenue and inventory data to produce business insights and help the company make more data-driven decisions. A dimensional model for a data warehouse can readily be derived from this database."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key learnings:\n",
    "* ERDs help in understanding the data elements involved in a project suited to a relational database system and how they work together\n",
    "* Reverse-engineering the ER diagram can help in the requirements collection process\n",
    "* ERDs model the data in terms of its different types/categories and their relationships to each other\n",
    "* They clarify the level of the data, reveal ambiguities and identify the constraints of the data\n",
    "* They provide a model for the actual database design\n",
    "* They facilitate writing code for easy storage and retrieval of data\n",
    "* They pave the way for analytical projects"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 2 JSON, Data Streaming <a class=\"anchor\" id=\"2\"></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary of the business problem\n",
    "* Find a data streaming service that delivers JSON data (e.g. Twitter Stream; see http://docs.tweepy.org/en/v3.6.0/streaming_how_to.html for reference)\n",
    "* Understand the data model and JSON representation used in the data stream\n",
    "* Write Python code to connect to the data streaming service to receive a limited data feed\n",
    "* Parse the JSON data and load it into a relational database"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solution details: Using Tweepy API to stream real-time tweets from Twitter\n",
    "* **About Tweepy**\n",
    "\n",
    "Current version: 3.6.0\n",
    "\n",
    "Tweepy is an open-source Python library for communicating with the Twitter platform and its various APIs. With Tweepy, it is possible to get any object and use any method that the official Twitter API offers, for instance user, status, etc.\n",
    "\n",
    "\n",
    "In this part of the project, we use Tweepy’s StreamingAPI to capture real-time tweets matching some tags (filters) in an asynchronous call and save the data to a relational database using the sqlite3 library.  \n",
    "\n",
    "   **Authentication Steps:**\n",
    "\n",
    "  * **STEP 1**: Create an app at https://apps.twitter.com/app/new. Refer to https://developer.twitter.com/en/docs/basics/getting-started#get-started-app for more documentation\n",
    "  \n",
    "<center><img src='https://raw.githubusercontent.com/shrutisaxena0617/Machine_Learning/master/images/pic1.png' width = \"400\"/></center>\n",
    "\n",
    "  * **STEP 2**: Once the app is successfully created, get the app key and access token\n",
    "  \n",
    "<center><img src='https://raw.githubusercontent.com/shrutisaxena0617/Machine_Learning/master/images/pic2.png' width = \"400\"/></center>\n",
    "\n",
    "<center><img src='https://raw.githubusercontent.com/shrutisaxena0617/Machine_Learning/master/images/pic3.png' width = \"400\"/></center>\n",
    "\n",
    "  * **STEP 3**: Authentication using OAuth\n",
    "  \n",
    "<center><img src='https://raw.githubusercontent.com/shrutisaxena0617/Machine_Learning/master/images/pic4.png' width = \"500\"/></center>\n",
    "  \n",
    "  \n",
    "* **Fascinating features**\n",
    "\n",
    "The Twitter streaming API is used to download tweets in real time. It is useful for obtaining a high volume of tweets, or for creating a live feed using a site stream or user stream. It differs significantly from the REST API: the REST API is used to pull data from Twitter, whereas the streaming API pushes messages over a persistent session. This allows the streaming API to deliver more data in real time than could be obtained using the REST API.\n",
    "\n",
    "Moreover, in this project we call the StreamingAPI asynchronously, which keeps the main thread free while tweets arrive. Called synchronously, the stream would block the main thread until it finished, and by design it does not disconnect automatically: an external interruption such as a keyboard interrupt would be needed to disconnect, delaying every subsequent step. The asynchronous implementation also allows other processing to run in parallel.\n",
    "\n",
    "* **Data Model and JSON Representation**\n",
    "\n",
    "We studied the Twitter JSON response object and its structure at https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html. The fundamental structure of a Tweet object is shown below.\n",
    "\n",
    "<center><img src='https://raw.githubusercontent.com/shrutisaxena0617/Machine_Learning/master/images/Twitter_object_fundamental.png' width = \"400\"/></center>\n",
    "\n",
    "The JSON object carries several other entities and sub-attributes, for instance name, description, followers_count, location, geo and statuses_count. In this project, we selected a few attributes related to two entities, 'user' and 'tweet', and built the following relational schema:\n",
    "\n",
    "<center><img src='https://raw.githubusercontent.com/shrutisaxena0617/Machine_Learning/master/images/erdplus-diagram.png' width = \"500\"/></center>"
   ]
  },
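  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between a blocking and an asynchronous stream can be illustrated without Tweepy at all. The sketch below uses only the Python standard library; `fake_stream` and `collect` are hypothetical stand-ins and are not part of Tweepy or of this project's code. It runs an endless producer in a background thread, lets the main thread take a limited number of items, and then stops the producer explicitly.\n",
    "\n",
    "```python\n",
    "import queue\n",
    "import threading\n",
    "import time\n",
    "\n",
    "\n",
    "def fake_stream(out, stop):\n",
    "    # Hypothetical stand-in for a blocking stream: emits items until told to stop.\n",
    "    n = 0\n",
    "    while not stop.is_set():\n",
    "        out.put({'id': n})\n",
    "        n += 1\n",
    "        time.sleep(0.01)\n",
    "\n",
    "\n",
    "def collect(limit):\n",
    "    out, stop = queue.Queue(), threading.Event()\n",
    "    worker = threading.Thread(target=fake_stream, args=(out, stop), daemon=True)\n",
    "    worker.start()  # returns immediately, so the main thread is not blocked\n",
    "    items = [out.get(timeout=1) for _ in range(limit)]\n",
    "    stop.set()  # explicit disconnect instead of a keyboard interrupt\n",
    "    worker.join(timeout=1)\n",
    "    return items\n",
    "```\n",
    "\n",
    "Calling `collect(3)` returns the first three items while the producer keeps running in the background until `stop` is set, which is the same shape as receiving a limited feed from a stream that never ends on its own."
   ]
  },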
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Code:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### STEP 1: Loading Dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#!pip3 install tweepy\n",
    "#!pip3 install -U -q PyDrive\n",
    "#!pip3 install gsutil\n",
    "#!pip3 install dataset\n",
    "#!pip3 install textblob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import tweepy\n",
    "import pandas as pd\n",
    "import requests\n",
    "import json\n",
    "import sqlite3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### STEP 2: Create Database and tables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Create twitter.db database and associated tables\n",
    "conn = sqlite3.connect('twitter.db', check_same_thread=False)\n",
    "c = conn.cursor()\n",
    "\n",
    "try:\n",
    "    # create tables\n",
    "    c.execute('''create table if not exists tweets (\n",
    "        tweet_id integer,\n",
    "        created_at text,\n",
    "        tweet_text text,\n",
    "        source text,\n",
    "        user_id integer\n",
    "    )''')\n",
    "    c.execute('''create table if not exists users (\n",
    "        user_id integer,\n",
    "        name text,\n",
    "        description text,\n",
    "        follower_count integer,\n",
    "        statuses_count integer\n",
    "    )''')\n",
    "    conn.commit()\n",
    "except Exception as e:\n",
    "    print(\"Exception occurred while creating tables: %s\" % e)\n",
    "finally:\n",
    "    # Close the connection whether or not table creation succeeded\n",
    "    conn.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### STEP 3: Connecting to Twitter API to receive real time tweets and push the data to database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Access keys to connect to Twitter API.\n",
    "# Read from environment variables rather than hard-coding secrets in the notebook;\n",
    "# the environment variable names below are an assumption, set them before running.\n",
    "import os\n",
    "\n",
    "TWITTER_KEY = os.environ['TWITTER_KEY']\n",
    "TWITTER_SECRET = os.environ['TWITTER_SECRET']\n",
    "TWITTER_APP_KEY = os.environ['TWITTER_APP_KEY']\n",
    "TWITTER_APP_SECRET = os.environ['TWITTER_APP_SECRET']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Class for defining a Tweet\n",
    "class Tweet:\n",
    "\n",
    "    # Data on the tweet\n",
    "    def __init__(self, id, created_at, text, source, user_id):\n",
    "        self.id = id\n",
    "        self.created_at = created_at\n",
    "        self.text = text\n",
    "        self.source = source\n",
    "        self.user_id = user_id\n",
    "\n",
    "    # Inserting that data into the DB; returns the tweet id, or None if the insert failed\n",
    "    def insert_tweet(self, c):\n",
    "        try:\n",
    "            c.execute('INSERT INTO tweets (tweet_id, created_at, tweet_text, source, user_id) VALUES (?, ?, ?, ?, ?)',\n",
    "                (self.id, self.created_at, self.text, self.source, self.user_id))\n",
    "            return self.id\n",
    "        except Exception as e:\n",
    "            print(\"Exception occurred in insert_tweet(): %s\" % e)\n",
    "\n",
    "\n",
    "# Class for defining a User\n",
    "class User:\n",
    "\n",
    "    # Data on the user\n",
    "    def __init__(self, user_id, name, description = None, follower_count = 0, statuses_count = 0):\n",
    "        self.user_id = user_id\n",
    "        self.name = name\n",
    "        self.description = description\n",
    "        self.follower_count = follower_count\n",
    "        self.statuses_count = statuses_count\n",
    "\n",
    "    # Inserting that data into the DB; returns the user id, or None if the insert failed\n",
    "    def insert_user(self, c):\n",
    "        try:\n",
    "            c.execute(\"INSERT INTO users (user_id, name, description, follower_count, statuses_count) VALUES (?, ?, ?, ?, ?)\",\n",
    "                (self.user_id, self.name, self.description, self.follower_count, self.statuses_count))\n",
    "            return self.user_id\n",
    "        except Exception as e:\n",
    "            print(\"Exception occurred in insert_user(): %s\" % e)"
   ]
  },
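  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `Tweet` and `User` classes above expect fields that have already been pulled out of the raw status. A minimal sketch of that extraction step follows, assuming the status has been parsed into a Python dict. `split_status` is a hypothetical helper, not part of Tweepy or of this notebook's pipeline; the keys it reads (`id`, `created_at`, `text`, `source`, `user`, `followers_count`, `statuses_count`) are the ones that appear in Twitter's tweet JSON.\n",
    "\n",
    "```python\n",
    "def split_status(status):\n",
    "    # Return (tweet_row, user_row) in the column order of the tweets and users tables.\n",
    "    user = status.get('user') or {}\n",
    "    tweet_row = (\n",
    "        status.get('id'),\n",
    "        status.get('created_at'),\n",
    "        status.get('text'),\n",
    "        status.get('source'),\n",
    "        user.get('id'),\n",
    "    )\n",
    "    user_row = (\n",
    "        user.get('id'),\n",
    "        user.get('name'),\n",
    "        user.get('description'),\n",
    "        # The JSON key is followers_count; the table column is follower_count.\n",
    "        user.get('followers_count', 0),\n",
    "        user.get('statuses_count', 0),\n",
    "    )\n",
    "    return tweet_row, user_row\n",
    "```\n",
    "\n",
    "Each tuple can be unpacked straight into the matching constructor, for example `Tweet(*tweet_row)` and `User(*user_row)`. Using `dict.get` means a status without a `user` entry yields `None` values instead of raising `KeyError`."
   ]
  },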
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "incoming\n",
      "------------------------------\n",
      "{'created_at': 'Thu May 31 05:17:54 +0000 2018', 'id': 1002056554797977601, 'id_str': '1002056554797977601', 'text': '@namu_ram you need to decide.. https://t.co/WQDW9jdoIj', 'display_text_range': [0, 30], 'source': '<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': 147837724, 'in_reply_to_user_id_str': '147837724', 'in_reply_to_screen_name': 'namu_ram', 'user': {'id': 2556978238, 'id_str': '2556978238', 'name': 'ghansham kamath', 'screen_name': 'GhannuKamath', 'location': 'Bengaluru South, India', 'url': None, 'description': \"fill in the blanks.. yes that's what I am when you know me. I can be good, bad or politician. depends on how and why you know me.\", 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 11, 'friends_count': 40, 'listed_count': 0, 'favourites_count': 9, 'statuses_count': 17, 'created_at': 'Wed May 21 04:56:36 +0000 2014', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'C0DEED', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/844300470097207296/5ZI46dmI_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/844300470097207296/5ZI46dmI_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/2556978238/1490131892', 'default_profile': True, 'default_profile_image': False, 'following': None, 'follow_request_sent': 
None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'quoted_status_id': 994200379800932358, 'quoted_status_id_str': '994200379800932358', 'quoted_status': {'created_at': 'Wed May 09 13:00:16 +0000 2018', 'id': 994200379800932358, 'id_str': '994200379800932358', 'text': 'Lovingly, Wade. #deadpool2 https://t.co/FEPOCVKRDc', 'display_text_range': [0, 26], 'source': '<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 2893511188, 'id_str': '2893511188', 'name': 'Ryan Reynolds', 'screen_name': 'VancityReynolds', 'location': '🇨🇦', 'url': 'http://www.deadpool.com', 'description': 'Owner: @AviationGin', 'translator_type': 'none', 'protected': False, 'verified': True, 'followers_count': 11170720, 'friends_count': 334, 'listed_count': 9213, 'favourites_count': 8572, 'statuses_count': 1003, 'created_at': 'Wed Nov 26 16:12:27 +0000 2014', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'C0DEED', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/741703039355064320/ClVbjlG-_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/741703039355064320/ClVbjlG-_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/2893511188/1462825390', 'default_profile': True, 
'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'quote_count': 4576, 'reply_count': 1201, 'retweet_count': 74443, 'favorite_count': 240490, 'entities': {'hashtags': [{'text': 'deadpool2', 'indices': [16, 26]}], 'urls': [], 'user_mentions': [], 'symbols': [], 'media': [{'id': 994200354039492608, 'id_str': '994200354039492608', 'indices': [27, 50], 'media_url': 'http://pbs.twimg.com/media/Dcwb9FoXcAAE_O4.jpg', 'media_url_https': 'https://pbs.twimg.com/media/Dcwb9FoXcAAE_O4.jpg', 'url': 'https://t.co/FEPOCVKRDc', 'display_url': 'pic.twitter.com/FEPOCVKRDc', 'expanded_url': 'https://twitter.com/VancityReynolds/status/994200379800932358/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 680, 'h': 680, 'resize': 'fit'}, 'large': {'w': 1024, 'h': 1024, 'resize': 'fit'}, 'medium': {'w': 1024, 'h': 1024, 'resize': 'fit'}}}]}, 'extended_entities': {'media': [{'id': 994200354039492608, 'id_str': '994200354039492608', 'indices': [27, 50], 'media_url': 'http://pbs.twimg.com/media/Dcwb9FoXcAAE_O4.jpg', 'media_url_https': 'https://pbs.twimg.com/media/Dcwb9FoXcAAE_O4.jpg', 'url': 'https://t.co/FEPOCVKRDc', 'display_url': 'pic.twitter.com/FEPOCVKRDc', 'expanded_url': 'https://twitter.com/VancityReynolds/status/994200379800932358/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 680, 'h': 680, 'resize': 'fit'}, 'large': {'w': 1024, 'h': 1024, 'resize': 'fit'}, 'medium': {'w': 1024, 'h': 1024, 'resize': 'fit'}}}]}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'en'}, 'quoted_status_permalink': {'url': 'https://t.co/WQDW9jdoIj', 'expanded': 'https://twitter.com/VancityReynolds/status/994200379800932358', 'display': 'twitter.com/VancityReynold…'}, 'is_quote_status': True, 
'quote_count': 0, 'reply_count': 0, 'retweet_count': 0, 'favorite_count': 0, 'entities': {'hashtags': [], 'urls': [{'url': 'https://t.co/WQDW9jdoIj', 'expanded_url': 'https://twitter.com/VancityReynolds/status/994200379800932358', 'display_url': 'twitter.com/VancityReynold…', 'indices': [31, 54]}], 'user_mentions': [{'screen_name': 'namu_ram', 'name': 'Ram', 'id': 147837724, 'id_str': '147837724', 'indices': [0, 9]}], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'en', 'timestamp_ms': '1527743874631'}\n",
      "Inside insert_data\n",
      "Committing :: \n",
      "------------------------------\n",
      "incoming\n",
      "------------------------------\n",
      "{'created_at': 'Thu May 31 05:18:00 +0000 2018', 'id': 1002056578701381632, 'id_str': '1002056578701381632', 'text': '#AvengersInfinityWar problem - \\nWhat happens to CLOTHES in The Snap? Directors finally reveal all (not literally!)… https://t.co/bySDpuMjKa', 'display_text_range': [0, 140], 'source': '<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>', 'truncated': True, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 17895820, 'id_str': '17895820', 'name': 'Daily Express', 'screen_name': 'Daily_Express', 'location': 'London', 'url': 'http://www.express.co.uk', 'description': 'http://Express.co.uk - Home of the Daily and Sunday Express', 'translator_type': 'none', 'protected': False, 'verified': True, 'followers_count': 724622, 'friends_count': 559, 'listed_count': 3070, 'favourites_count': 25, 'statuses_count': 540540, 'created_at': 'Fri Dec 05 12:05:48 +0000 2008', 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '7B6085', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme6/bg.gif', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme6/bg.gif', 'profile_background_tile': False, 'profile_link_color': '13AEB8', 'profile_sidebar_border_color': 'FFFFFF', 'profile_sidebar_fill_color': 'FFFFFF', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/915525294978666496/ALgq4U5j_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/915525294978666496/ALgq4U5j_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/17895820/1475598917', 'default_profile': False, 'default_profile_image': False, 'following': None, 
'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'extended_tweet': {'full_text': '#AvengersInfinityWar problem - \\nWhat happens to CLOTHES in The Snap? Directors finally reveal all (not literally!) https://t.co/ckw2lkH0Xc #InfinityWar https://t.co/SBGKRW5DOi', 'display_text_range': [0, 151], 'entities': {'hashtags': [{'text': 'AvengersInfinityWar', 'indices': [0, 20]}, {'text': 'InfinityWar', 'indices': [139, 151]}], 'urls': [{'url': 'https://t.co/ckw2lkH0Xc', 'expanded_url': 'https://bit.ly/2LEWqIN', 'display_url': 'bit.ly/2LEWqIN', 'indices': [115, 138]}], 'user_mentions': [], 'symbols': [], 'media': [{'id': 1001930943580565505, 'id_str': '1001930943580565505', 'indices': [152, 175], 'media_url': 'http://pbs.twimg.com/media/DeeS4n2XcAEw0o7.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DeeS4n2XcAEw0o7.jpg', 'url': 'https://t.co/SBGKRW5DOi', 'display_url': 'pic.twitter.com/SBGKRW5DOi', 'expanded_url': 'https://twitter.com/Daily_Express/status/1002056578701381632/photo/1', 'type': 'photo', 'sizes': {'small': {'w': 674, 'h': 400, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'large': {'w': 674, 'h': 400, 'resize': 'fit'}, 'medium': {'w': 674, 'h': 400, 'resize': 'fit'}}}]}, 'extended_entities': {'media': [{'id': 1001930943580565505, 'id_str': '1001930943580565505', 'indices': [152, 175], 'media_url': 'http://pbs.twimg.com/media/DeeS4n2XcAEw0o7.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DeeS4n2XcAEw0o7.jpg', 'url': 'https://t.co/SBGKRW5DOi', 'display_url': 'pic.twitter.com/SBGKRW5DOi', 'expanded_url': 'https://twitter.com/Daily_Express/status/1002056578701381632/photo/1', 'type': 'photo', 'sizes': {'small': {'w': 674, 'h': 400, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'large': {'w': 674, 'h': 400, 'resize': 'fit'}, 'medium': {'w': 674, 'h': 400, 'resize': 'fit'}}}]}}, 'quote_count': 0, 'reply_count': 0, 
'retweet_count': 0, 'favorite_count': 0, 'entities': {'hashtags': [{'text': 'AvengersInfinityWar', 'indices': [0, 20]}], 'urls': [{'url': 'https://t.co/bySDpuMjKa', 'expanded_url': 'https://twitter.com/i/web/status/1002056578701381632', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [116, 139]}], 'user_mentions': [], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'en', 'timestamp_ms': '1527743880330'}\n",
      "Inside insert_data\n",
      "Committing :: \n",
      "------------------------------\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "incoming\n",
      "------------------------------\n",
      "{'created_at': 'Thu May 31 05:18:38 +0000 2018', 'id': 1002056737719861249, 'id_str': '1002056737719861249', 'text': 'RT @vivekdahiya08: One of the rare times when you enjoy the Hindi trailer as much or actually more than the English one! Love that the tran…', 'source': '<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1001687938147799041, 'id_str': '1001687938147799041', 'name': 'Piyali', 'screen_name': 'Piyali57721558', 'location': 'कोलकाता, भारत', 'url': None, 'description': 'This is the first step toward becoming better then you are', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 1, 'friends_count': 8, 'listed_count': 0, 'favourites_count': 138, 'statuses_count': 115, 'created_at': 'Wed May 30 04:53:09 +0000 2018', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': '', 'profile_background_image_url_https': '', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1001694923861708800/2d80KdT0_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1001694923861708800/2d80KdT0_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1001687938147799041/1527657650', 'default_profile': True, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'retweeted_status': 
{'created_at': 'Tue Mar 27 06:59:41 +0000 2018', 'id': 978526957947174912, 'id_str': '978526957947174912', 'text': 'One of the rare times when you enjoy the Hindi trailer as much or actually more than the English one! Love that the… https://t.co/ROikLSaTYF', 'display_text_range': [0, 140], 'source': '<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>', 'truncated': True, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 142991009, 'id_str': '142991009', 'name': 'Vivek Dahiya', 'screen_name': 'vivekdahiya08', 'location': 'Mumbai ', 'url': 'http://facebook.com/VivekDahiyaOfficial', 'description': 'You are your biggest objection', 'translator_type': 'none', 'protected': False, 'verified': True, 'followers_count': 92911, 'friends_count': 47, 'listed_count': 41, 'favourites_count': 533, 'statuses_count': 1270, 'created_at': 'Wed May 12 08:31:11 +0000 2010', 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'EBEBEB', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme7/bg.gif', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme7/bg.gif', 'profile_background_tile': False, 'profile_link_color': '3B94D9', 'profile_sidebar_border_color': 'FFFFFF', 'profile_sidebar_fill_color': 'F3F3F3', 'profile_text_color': '333333', 'profile_use_background_image': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/958288994399174657/whOB02V6_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/958288994399174657/whOB02V6_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/142991009/1524024364', 'default_profile': False, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': 
None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'quoted_status_id': 978169017230753792, 'quoted_status_id_str': '978169017230753792', 'quoted_status': {'created_at': 'Mon Mar 26 07:17:21 +0000 2018', 'id': 978169017230753792, 'id_str': '978169017230753792', 'text': 'तैयार हो जाइये एक्शन, हंसी और मनोरंजन के डबल डोस के लिए! पेश हैं @DeadpoolMovie का हिंदी ट्रेलर!\\n\\nमिलिए आपके चहीते… https://t.co/dGLB9uum2h', 'source': '<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>', 'truncated': True, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 110978637, 'id_str': '110978637', 'name': 'Fox Star India', 'screen_name': 'FoxStarIndia', 'location': 'India', 'url': 'https://bookmy.show/Deadpool2Film', 'description': \"Fox Star India's official handle; and the source of all your exclusive & latest updates on our Hollywood releases.\", 'translator_type': 'regular', 'protected': False, 'verified': True, 'followers_count': 47837, 'friends_count': 232, 'listed_count': 96, 'favourites_count': 1764, 'statuses_count': 22941, 'created_at': 'Wed Feb 03 12:12:37 +0000 2010', 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '000000', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': 'FF0000', 'profile_sidebar_border_color': 'FFFFFF', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/876693483519475712/cHkqlsWi_normal.jpg', 'profile_image_url_https': 
'https://pbs.twimg.com/profile_images/876693483519475712/cHkqlsWi_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/110978637/1526971049', 'default_profile': False, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'extended_tweet': {'full_text': 'तैयार हो जाइये एक्शन, हंसी और मनोरंजन के डबल डोस के लिए! पेश हैं @DeadpoolMovie का हिंदी ट्रेलर!\\n\\nमिलिए आपके चहीते सुपरहीरो #Deadpool से अपने नज़दीकी सिनेमा घरों में, 18 May से| #Deadpool2\\n\\nhttps://t.co/UPzf3JwMtH', 'display_text_range': [0, 213], 'entities': {'hashtags': [{'text': 'Deadpool', 'indices': [124, 133]}, {'text': 'Deadpool2', 'indices': [178, 188]}], 'urls': [{'url': 'https://t.co/UPzf3JwMtH', 'expanded_url': 'http://bit.ly/Deadpool2-HindiTrailer', 'display_url': 'bit.ly/Deadpool2-Hind…', 'indices': [190, 213]}], 'user_mentions': [{'screen_name': 'deadpoolmovie', 'name': 'Deadpool Movie', 'id': 2818322000, 'id_str': '2818322000', 'indices': [65, 79]}], 'symbols': []}}, 'quote_count': 33, 'reply_count': 55, 'retweet_count': 106, 'favorite_count': 145, 'entities': {'hashtags': [], 'urls': [{'url': 'https://t.co/dGLB9uum2h', 'expanded_url': 'https://twitter.com/i/web/status/978169017230753792', 'display_url': 'twitter.com/i/web/status/9…', 'indices': [116, 139]}], 'user_mentions': [{'screen_name': 'deadpoolmovie', 'name': 'Deadpool Movie', 'id': 2818322000, 'id_str': '2818322000', 'indices': [65, 79]}], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'hi'}, 'quoted_status_permalink': {'url': 'https://t.co/VX1vimaysl', 'expanded': 'https://twitter.com/foxstarindia/status/978169017230753792', 'display': 'twitter.com/foxstarindia/s…'}, 'is_quote_status': True, 'extended_tweet': {'full_text': 'One of the rare times when you enjoy the Hindi trailer as much or actually 
more than the English one! Love that the translations are relevant to the Indian audience! Looking forward to watching this one in both languages! MarvelFan #Deadpool #SwachhBharatAbhiyan #Sultaan #Dangal https://t.co/VX1vimaysl', 'display_text_range': [0, 279], 'entities': {'hashtags': [{'text': 'Deadpool', 'indices': [232, 241]}, {'text': 'SwachhBharatAbhiyan', 'indices': [242, 262]}, {'text': 'Sultaan', 'indices': [263, 271]}, {'text': 'Dangal', 'indices': [272, 279]}], 'urls': [{'url': 'https://t.co/VX1vimaysl', 'expanded_url': 'https://twitter.com/foxstarindia/status/978169017230753792', 'display_url': 'twitter.com/foxstarindia/s…', 'indices': [280, 303]}], 'user_mentions': [], 'symbols': []}}, 'quote_count': 1, 'reply_count': 3, 'retweet_count': 22, 'favorite_count': 259, 'entities': {'hashtags': [], 'urls': [{'url': 'https://t.co/ROikLSaTYF', 'expanded_url': 'https://twitter.com/i/web/status/978526957947174912', 'display_url': 'twitter.com/i/web/status/9…', 'indices': [117, 140]}], 'user_mentions': [], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'en'}, 'quoted_status_id': 978169017230753792, 'quoted_status_id_str': '978169017230753792', 'quoted_status': {'created_at': 'Mon Mar 26 07:17:21 +0000 2018', 'id': 978169017230753792, 'id_str': '978169017230753792', 'text': 'तैयार हो जाइये एक्शन, हंसी और मनोरंजन के डबल डोस के लिए! 
पेश हैं @DeadpoolMovie का हिंदी ट्रेलर!\\n\\nमिलिए आपके चहीते… https://t.co/dGLB9uum2h', 'source': '<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>', 'truncated': True, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 110978637, 'id_str': '110978637', 'name': 'Fox Star India', 'screen_name': 'FoxStarIndia', 'location': 'India', 'url': 'https://bookmy.show/Deadpool2Film', 'description': \"Fox Star India's official handle; and the source of all your exclusive & latest updates on our Hollywood releases.\", 'translator_type': 'regular', 'protected': False, 'verified': True, 'followers_count': 47837, 'friends_count': 232, 'listed_count': 96, 'favourites_count': 1764, 'statuses_count': 22941, 'created_at': 'Wed Feb 03 12:12:37 +0000 2010', 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '000000', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': 'FF0000', 'profile_sidebar_border_color': 'FFFFFF', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/876693483519475712/cHkqlsWi_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/876693483519475712/cHkqlsWi_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/110978637/1526971049', 'default_profile': False, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'extended_tweet': 
{'full_text': 'तैयार हो जाइये एक्शन, हंसी और मनोरंजन के डबल डोस के लिए! पेश हैं @DeadpoolMovie का हिंदी ट्रेलर!\\n\\nमिलिए आपके चहीते सुपरहीरो #Deadpool से अपने नज़दीकी सिनेमा घरों में, 18 May से| #Deadpool2\\n\\nhttps://t.co/UPzf3JwMtH', 'display_text_range': [0, 213], 'entities': {'hashtags': [{'text': 'Deadpool', 'indices': [124, 133]}, {'text': 'Deadpool2', 'indices': [178, 188]}], 'urls': [{'url': 'https://t.co/UPzf3JwMtH', 'expanded_url': 'http://bit.ly/Deadpool2-HindiTrailer', 'display_url': 'bit.ly/Deadpool2-Hind…', 'indices': [190, 213]}], 'user_mentions': [{'screen_name': 'deadpoolmovie', 'name': 'Deadpool Movie', 'id': 2818322000, 'id_str': '2818322000', 'indices': [65, 79]}], 'symbols': []}}, 'quote_count': 33, 'reply_count': 55, 'retweet_count': 106, 'favorite_count': 145, 'entities': {'hashtags': [], 'urls': [{'url': 'https://t.co/dGLB9uum2h', 'expanded_url': 'https://twitter.com/i/web/status/978169017230753792', 'display_url': 'twitter.com/i/web/status/9…', 'indices': [116, 139]}], 'user_mentions': [{'screen_name': 'deadpoolmovie', 'name': 'Deadpool Movie', 'id': 2818322000, 'id_str': '2818322000', 'indices': [65, 79]}], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'hi'}, 'quoted_status_permalink': {'url': 'https://t.co/VX1vimaysl', 'expanded': 'https://twitter.com/foxstarindia/status/978169017230753792', 'display': 'twitter.com/foxstarindia/s…'}, 'is_quote_status': True, 'quote_count': 0, 'reply_count': 0, 'retweet_count': 0, 'favorite_count': 0, 'entities': {'hashtags': [], 'urls': [], 'user_mentions': [{'screen_name': 'vivekdahiya08', 'name': 'Vivek Dahiya', 'id': 142991009, 'id_str': '142991009', 'indices': [3, 17]}], 'symbols': []}, 'favorited': False, 'retweeted': False, 'filter_level': 'low', 'lang': 'en', 'timestamp_ms': '1527743918243'}\n",
      "Inside insert_data\n",
      "Committing :: \n",
      "------------------------------\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "incoming\n",
      "------------------------------\n",
      "{'created_at': 'Thu May 31 05:18:44 +0000 2018', 'id': 1002056762890117121, 'id_str': '1002056762890117121', 'text': 'NEW EP&gt; #111 Time for #mcmcomiccon #MCMLondonComicCon as we bring panels with @iamtaylorgray @RealKevinConroy… https://t.co/awIKrMiHn9', 'source': '<a href=\"http://www.facebook.com/twitter\" rel=\"nofollow\">Facebook</a>', 'truncated': True, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1308062562, 'id_str': '1308062562', 'name': '365Flicks Podcast', 'screen_name': '365FlicksPod', 'location': 'Berwick-upon-Tweed, England', 'url': 'https://itunes.apple.com/gb/podcast/365flicks-podcast/id1030800334?mt=2', 'description': 'The 365FlicksPodcast where we talk all things Movie, TV, Comics and Games. #PodernFamily #BritPodScene A UK Top20 Film and TV Podcast', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 5112, 'friends_count': 5089, 'listed_count': 354, 'favourites_count': 14967, 'statuses_count': 23808, 'created_at': 'Wed Mar 27 17:22:46 +0000 2013', 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '000000', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': '0084B4', 'profile_sidebar_border_color': '000000', 'profile_sidebar_fill_color': '000000', 'profile_text_color': '000000', 'profile_use_background_image': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1001525640472989701/K7GsI-7z_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1001525640472989701/K7GsI-7z_normal.jpg', 'profile_banner_url': 
'https://pbs.twimg.com/profile_banners/1308062562/1414929651', 'default_profile': False, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'extended_tweet': {'full_text': 'NEW EP&gt; #111 Time for #mcmcomiccon #MCMLondonComicCon as we bring panels with @iamtaylorgray @RealKevinConroy @briannahilde @StefanKapicic @StevenOgg and many more... We also review #Deadpool2 and... https://t.co/Qv9TguKhWq', 'display_text_range': [0, 226], 'entities': {'hashtags': [{'text': 'mcmcomiccon', 'indices': [25, 37]}, {'text': 'MCMLondonComicCon', 'indices': [38, 56]}, {'text': 'Deadpool2', 'indices': [185, 195]}], 'urls': [{'url': 'https://t.co/Qv9TguKhWq', 'expanded_url': 'http://directory.libsyn.com/episode/index/id/6651351/tdest_id/299110', 'display_url': 'directory.libsyn.com/episode/index/…', 'indices': [203, 226]}], 'user_mentions': [{'screen_name': 'iamtaylorgray', 'name': 'Taylor Gray', 'id': 75726298, 'id_str': '75726298', 'indices': [81, 95]}, {'screen_name': 'RealKevinConroy', 'name': 'Kevin Conroy', 'id': 1292288262, 'id_str': '1292288262', 'indices': [96, 112]}, {'screen_name': 'briannahilde', 'name': 'Brianna Hildebrand', 'id': 325661896, 'id_str': '325661896', 'indices': [113, 126]}, {'screen_name': 'StefanKapicic', 'name': 'Stefan Kapičić', 'id': 18497168, 'id_str': '18497168', 'indices': [127, 141]}, {'screen_name': 'StevenOgg', 'name': 'Steven Ogg', 'id': 1416055418, 'id_str': '1416055418', 'indices': [142, 152]}], 'symbols': []}}, 'quote_count': 0, 'reply_count': 0, 'retweet_count': 0, 'favorite_count': 0, 'entities': {'hashtags': [{'text': 'mcmcomiccon', 'indices': [25, 37]}, {'text': 'MCMLondonComicCon', 'indices': [38, 56]}], 'urls': [{'url': 'https://t.co/awIKrMiHn9', 'expanded_url': 'https://twitter.com/i/web/status/1002056762890117121', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [114, 137]}], 
'user_mentions': [{'screen_name': 'iamtaylorgray', 'name': 'Taylor Gray', 'id': 75726298, 'id_str': '75726298', 'indices': [81, 95]}, {'screen_name': 'RealKevinConroy', 'name': 'Kevin Conroy', 'id': 1292288262, 'id_str': '1292288262', 'indices': [96, 112]}], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'en', 'timestamp_ms': '1527743924244'}\n",
      "Inside insert_data\n",
      "Committing :: \n",
      "------------------------------\n",
      "incoming\n",
      "------------------------------\n",
      "{'created_at': 'Thu May 31 05:18:44 +0000 2018', 'id': 1002056765494644736, 'id_str': '1002056765494644736', 'text': '#deadpool2 was just great. Make sure you watch the ending!', 'source': '<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 963663128163749888, 'id_str': '963663128163749888', 'name': 'MarcoCpolo', 'screen_name': 'marco_C_polo', 'location': None, 'url': 'http://twitch.tv/marcoCpolo', 'description': 'Part time streamer, full time dreamer *csgo *pubg *fortnite *hearthstone', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 67, 'friends_count': 20, 'listed_count': 0, 'favourites_count': 127, 'statuses_count': 236, 'created_at': 'Wed Feb 14 06:36:08 +0000 2018', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '000000', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': '1B95E0', 'profile_sidebar_border_color': '000000', 'profile_sidebar_fill_color': '000000', 'profile_text_color': '000000', 'profile_use_background_image': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/963668818085244929/1US4DwmG_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/963668818085244929/1US4DwmG_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/963663128163749888/1518758094', 'default_profile': False, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 
'contributors': None, 'is_quote_status': False, 'quote_count': 0, 'reply_count': 0, 'retweet_count': 0, 'favorite_count': 0, 'entities': {'hashtags': [{'text': 'deadpool2', 'indices': [0, 10]}], 'urls': [], 'user_mentions': [], 'symbols': []}, 'favorited': False, 'retweeted': False, 'filter_level': 'low', 'lang': 'en', 'timestamp_ms': '1527743924865'}\n",
      "Inside insert_data\n",
      "Committing :: \n",
      "------------------------------\n",
      "incoming\n",
      "------------------------------\n",
      "{'created_at': 'Thu May 31 05:18:49 +0000 2018', 'id': 1002056786092830721, 'id_str': '1002056786092830721', 'text': 'RT @sexyaleksandra: Who want free camsex with me?  #Repost and #Like this post, join to my webcam chat https://t.co/uMhZEu07sM   and write…', 'source': '<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Twitter Lite</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1172547452, 'id_str': '1172547452', 'name': 'Dude Thorny', 'screen_name': 'gareroovy', 'location': 'Canada', 'url': None, 'description': \"Married man likes looking at pics of 18 & over beautiful women & you're all beautiful :) Master retweeter!\", 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 198, 'friends_count': 568, 'listed_count': 19, 'favourites_count': 11914, 'statuses_count': 9375, 'created_at': 'Tue Feb 12 17:05:12 +0000 2013', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '000000', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': '3B94D9', 'profile_sidebar_border_color': '000000', 'profile_sidebar_fill_color': '000000', 'profile_text_color': '000000', 'profile_use_background_image': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/3249586041/df59539a9eed86bf9522b753c292adc6_normal.jpeg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/3249586041/df59539a9eed86bf9522b753c292adc6_normal.jpeg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1172547452/1449521019', 'default_profile': False, 'default_profile_image': False, 'following': None, 
'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'retweeted_status': {'created_at': 'Fri May 18 06:17:15 +0000 2018', 'id': 997360448562434048, 'id_str': '997360448562434048', 'text': 'Who want free camsex with me?  #Repost and #Like this post, join to my webcam chat https://t.co/uMhZEu07sM   and wr… https://t.co/McJRiLrUd6', 'display_text_range': [0, 140], 'source': '<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>', 'truncated': True, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 874369139044909056, 'id_str': '874369139044909056', 'name': 'Alexandra', 'screen_name': 'sexyaleksandra', 'location': 'United States', 'url': None, 'description': 'I love to take pictures, dance, movies, series and very much I love to travel😍', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 1702, 'friends_count': 3420, 'listed_count': 4, 'favourites_count': 754, 'statuses_count': 94, 'created_at': 'Mon Jun 12 20:53:42 +0000 2017', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'ru', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': '', 'profile_background_image_url_https': '', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/950255026949193728/pSlFzA5N_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/950255026949193728/pSlFzA5N_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/874369139044909056/1515329672', 'default_profile': True, 'default_profile_image': 
False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'extended_tweet': {'full_text': 'Who want free camsex with me?  #Repost and #Like this post, join to my webcam chat https://t.co/uMhZEu07sM   and write me \"I want camsex\"\\n\\n#Deadpool2 #ShawnOneNightOnly #IDontTrustPeopleThat #DragRace #BuySangriaWineOniTunes #GreysAnatomy #camsex #webcamsex #porn #pornvideo #nude https://t.co/BidrPW4Sk1', 'display_text_range': [0, 280], 'entities': {'hashtags': [{'text': 'Repost', 'indices': [31, 38]}, {'text': 'Like', 'indices': [43, 48]}, {'text': 'Deadpool2', 'indices': [139, 149]}, {'text': 'ShawnOneNightOnly', 'indices': [150, 168]}, {'text': 'IDontTrustPeopleThat', 'indices': [169, 190]}, {'text': 'DragRace', 'indices': [191, 200]}, {'text': 'BuySangriaWineOniTunes', 'indices': [201, 224]}, {'text': 'GreysAnatomy', 'indices': [225, 238]}, {'text': 'camsex', 'indices': [239, 246]}, {'text': 'webcamsex', 'indices': [247, 257]}, {'text': 'porn', 'indices': [258, 263]}, {'text': 'pornvideo', 'indices': [264, 274]}, {'text': 'nude', 'indices': [275, 280]}], 'urls': [{'url': 'https://t.co/uMhZEu07sM', 'expanded_url': 'http://bit.ly/sashacampage', 'display_url': 'bit.ly/sashacampage', 'indices': [83, 106]}], 'user_mentions': [], 'symbols': [], 'media': [{'id': 997360397857501184, 'id_str': '997360397857501184', 'indices': [281, 304], 'media_url': 'http://pbs.twimg.com/media/DddV_2NWsAAFkcY.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DddV_2NWsAAFkcY.jpg', 'url': 'https://t.co/BidrPW4Sk1', 'display_url': 'pic.twitter.com/BidrPW4Sk1', 'expanded_url': 'https://twitter.com/sexyaleksandra/status/997360448562434048/photo/1', 'type': 'photo', 'sizes': {'small': {'w': 498, 'h': 680, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 527, 'h': 720, 'resize': 'fit'}, 'large': {'w': 527, 'h': 720, 'resize': 'fit'}}}]}, 
'extended_entities': {'media': [{'id': 997360397857501184, 'id_str': '997360397857501184', 'indices': [281, 304], 'media_url': 'http://pbs.twimg.com/media/DddV_2NWsAAFkcY.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DddV_2NWsAAFkcY.jpg', 'url': 'https://t.co/BidrPW4Sk1', 'display_url': 'pic.twitter.com/BidrPW4Sk1', 'expanded_url': 'https://twitter.com/sexyaleksandra/status/997360448562434048/photo/1', 'type': 'photo', 'sizes': {'small': {'w': 498, 'h': 680, 'resize': 'fit'}, 'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 527, 'h': 720, 'resize': 'fit'}, 'large': {'w': 527, 'h': 720, 'resize': 'fit'}}}]}}, 'quote_count': 0, 'reply_count': 2, 'retweet_count': 8, 'favorite_count': 32, 'entities': {'hashtags': [{'text': 'Repost', 'indices': [31, 38]}, {'text': 'Like', 'indices': [43, 48]}], 'urls': [{'url': 'https://t.co/uMhZEu07sM', 'expanded_url': 'http://bit.ly/sashacampage', 'display_url': 'bit.ly/sashacampage', 'indices': [83, 106]}, {'url': 'https://t.co/McJRiLrUd6', 'expanded_url': 'https://twitter.com/i/web/status/997360448562434048', 'display_url': 'twitter.com/i/web/status/9…', 'indices': [117, 140]}], 'user_mentions': [], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': True, 'filter_level': 'low', 'lang': 'en'}, 'is_quote_status': False, 'quote_count': 0, 'reply_count': 0, 'retweet_count': 0, 'favorite_count': 0, 'entities': {'hashtags': [{'text': 'Repost', 'indices': [51, 58]}, {'text': 'Like', 'indices': [63, 68]}], 'urls': [{'url': 'https://t.co/uMhZEu07sM', 'expanded_url': 'http://bit.ly/sashacampage', 'display_url': 'bit.ly/sashacampage', 'indices': [103, 126]}], 'user_mentions': [{'screen_name': 'sexyaleksandra', 'name': 'Alexandra', 'id': 874369139044909056, 'id_str': '874369139044909056', 'indices': [3, 18]}], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'en', 'timestamp_ms': '1527743929776'}\n",
      "Inside insert_data\n",
      "Committing :: \n",
      "------------------------------\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "incoming\n",
      "------------------------------\n",
      "{'created_at': 'Thu May 31 05:18:53 +0000 2018', 'id': 1002056803054596097, 'id_str': '1002056803054596097', 'text': 'Who did this?? 😂😂\\n\\n#avengers #InfinityWar #AvengersInfinityWar #thanos #gamora #movie #theater #snacks #howmuch… https://t.co/AlIZYMSIO5', 'display_text_range': [0, 140], 'source': '<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>', 'truncated': True, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 760084764082606080, 'id_str': '760084764082606080', 'name': 'Hyde Hooligans', 'screen_name': 'HydeHooligans', 'location': 'Los Angeles, CA', 'url': 'http://www.hydehooliganfilms.com', 'description': 'Professional Indie Filmmakers specializing in narrative feature films. \"We are the music-makers, And we are the dreamers of dreams.\" -Arthur O\\'Shaughnessy', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 209, 'friends_count': 580, 'listed_count': 9, 'favourites_count': 821, 'statuses_count': 1551, 'created_at': 'Mon Aug 01 12:08:24 +0000 2016', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '000000', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': '000000', 'profile_sidebar_border_color': '000000', 'profile_sidebar_fill_color': '000000', 'profile_text_color': '000000', 'profile_use_background_image': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/966142473771741184/yJ8B3Dh4_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/966142473771741184/yJ8B3Dh4_normal.jpg', 'profile_banner_url': 
'https://pbs.twimg.com/profile_banners/760084764082606080/1519676275', 'default_profile': False, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'extended_tweet': {'full_text': 'Who did this?? 😂😂\\n\\n#avengers #InfinityWar #AvengersInfinityWar #thanos #gamora #movie #theater #snacks #howmuch #humor #hydehooligans https://t.co/QKG14YLjhu', 'display_text_range': [0, 133], 'entities': {'hashtags': [{'text': 'avengers', 'indices': [19, 28]}, {'text': 'InfinityWar', 'indices': [29, 41]}, {'text': 'AvengersInfinityWar', 'indices': [42, 62]}, {'text': 'thanos', 'indices': [63, 70]}, {'text': 'gamora', 'indices': [71, 78]}, {'text': 'movie', 'indices': [79, 85]}, {'text': 'theater', 'indices': [86, 94]}, {'text': 'snacks', 'indices': [95, 102]}, {'text': 'howmuch', 'indices': [103, 111]}, {'text': 'humor', 'indices': [112, 118]}, {'text': 'hydehooligans', 'indices': [119, 133]}], 'urls': [], 'user_mentions': [], 'symbols': [], 'media': [{'id': 1002056798650580992, 'id_str': '1002056798650580992', 'indices': [134, 157], 'media_url': 'http://pbs.twimg.com/media/DegFWWhVQAAujxy.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DegFWWhVQAAujxy.jpg', 'url': 'https://t.co/QKG14YLjhu', 'display_url': 'pic.twitter.com/QKG14YLjhu', 'expanded_url': 'https://twitter.com/HydeHooligans/status/1002056803054596097/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 334, 'h': 680, 'resize': 'fit'}, 'large': {'w': 640, 'h': 1303, 'resize': 'fit'}, 'medium': {'w': 589, 'h': 1200, 'resize': 'fit'}}}]}, 'extended_entities': {'media': [{'id': 1002056798650580992, 'id_str': '1002056798650580992', 'indices': [134, 157], 'media_url': 'http://pbs.twimg.com/media/DegFWWhVQAAujxy.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DegFWWhVQAAujxy.jpg', 'url': 'https://t.co/QKG14YLjhu', 
'display_url': 'pic.twitter.com/QKG14YLjhu', 'expanded_url': 'https://twitter.com/HydeHooligans/status/1002056803054596097/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'small': {'w': 334, 'h': 680, 'resize': 'fit'}, 'large': {'w': 640, 'h': 1303, 'resize': 'fit'}, 'medium': {'w': 589, 'h': 1200, 'resize': 'fit'}}}]}}, 'quote_count': 0, 'reply_count': 0, 'retweet_count': 0, 'favorite_count': 0, 'entities': {'hashtags': [{'text': 'avengers', 'indices': [19, 28]}, {'text': 'InfinityWar', 'indices': [29, 41]}, {'text': 'AvengersInfinityWar', 'indices': [42, 62]}, {'text': 'thanos', 'indices': [63, 70]}, {'text': 'gamora', 'indices': [71, 78]}, {'text': 'movie', 'indices': [79, 85]}, {'text': 'theater', 'indices': [86, 94]}, {'text': 'snacks', 'indices': [95, 102]}, {'text': 'howmuch', 'indices': [103, 111]}], 'urls': [{'url': 'https://t.co/AlIZYMSIO5', 'expanded_url': 'https://twitter.com/i/web/status/1002056803054596097', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [113, 136]}], 'user_mentions': [], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'en', 'timestamp_ms': '1527743933820'}\n",
      "Inside insert_data\n",
      "Committing :: \n",
      "------------------------------\n",
      "incoming\n",
      "------------------------------\n",
      "{'created_at': 'Thu May 31 05:19:21 +0000 2018', 'id': 1002056919341715457, 'id_str': '1002056919341715457', 'text': 'Thread! I will probably squander my twelve bucks and two hours on something other than Deadpool, but the FotF stuff… https://t.co/ANvZ0DKk6a', 'display_text_range': [0, 140], 'source': '<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>', 'truncated': True, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 439309182, 'id_str': '439309182', 'name': 'Zina Petersen', 'screen_name': 'ZinaNPetersen', 'location': None, 'url': None, 'description': 'Triage medievalist: contact for all your emergency medieval needs. The Mormon Sappho.', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 700, 'friends_count': 423, 'listed_count': 26, 'favourites_count': 38794, 'statuses_count': 15720, 'created_at': 'Sat Dec 17 16:30:28 +0000 2011', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': 'C0DEED', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': '1DA1F2', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 'profile_text_color': '333333', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/917959921098022912/QdZTSz5Q_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/917959921098022912/QdZTSz5Q_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/439309182/1507693676', 'default_profile': True, 'default_profile_image': False, 'following': None, 
'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'quoted_status_id': 1002047804255203328, 'quoted_status_id_str': '1002047804255203328', 'quoted_status': {'created_at': 'Thu May 31 04:43:08 +0000 2018', 'id': 1002047804255203328, 'id_str': '1002047804255203328', 'text': 'Expanding on the convo I had with @TheBreeMae earlier today about the @pluggedin review of #Deadpool2, I think that… https://t.co/Dbue6tjoBb', 'source': '<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>', 'truncated': True, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 14769495, 'id_str': '14769495', 'name': 'Kathryn Brightbill', 'screen_name': 'KEBrightbill', 'location': 'Florida, USA', 'url': 'http://kathrynbrightbill.com', 'description': 'Board member & Leg. Policy Analyst @ResponsibleHS | Sec. 
@DemocratManatee | @UFlaw & @CovenantCollege alum | Bylines in @LATimes & @RNS | https://t.co/2vGvAM3q8i', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 6142, 'friends_count': 2024, 'listed_count': 137, 'favourites_count': 50027, 'statuses_count': 107013, 'created_at': 'Wed May 14 05:13:30 +0000 2008', 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '0033FF', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme12/bg.gif', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme12/bg.gif', 'profile_background_tile': False, 'profile_link_color': '130889', 'profile_sidebar_border_color': 'FFFFFF', 'profile_sidebar_fill_color': 'FFF7CC', 'profile_text_color': '0C3E53', 'profile_use_background_image': True, 'profile_image_url': 'http://pbs.twimg.com/profile_images/919326887289282560/j4RSSvB8_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/919326887289282560/j4RSSvB8_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/14769495/1458894514', 'default_profile': False, 'default_profile_image': False, 'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'extended_tweet': {'full_text': \"Expanding on the convo I had with @TheBreeMae earlier today about the @pluggedin review of #Deadpool2, I think that review from @FocusFamily's movie mag shows just how subversive the movie actually is. #Spoilers to commence in the thread. 
https://t.co/s2irbwo0IO\", 'display_text_range': [0, 262], 'entities': {'hashtags': [{'text': 'Deadpool2', 'indices': [91, 101]}, {'text': 'Spoilers', 'indices': [202, 211]}], 'urls': [{'url': 'https://t.co/s2irbwo0IO', 'expanded_url': 'https://www.pluggedin.com/movie-reviews/deadpool-2', 'display_url': 'pluggedin.com/movie-reviews/…', 'indices': [239, 262]}], 'user_mentions': [{'screen_name': 'TheBreeMae', 'name': 'Bree Mae', 'id': 2870058663, 'id_str': '2870058663', 'indices': [34, 45]}, {'screen_name': 'pluggedin', 'name': 'Plugged In', 'id': 137737488, 'id_str': '137737488', 'indices': [70, 80]}, {'screen_name': 'FocusFamily', 'name': 'Focus on the Family', 'id': 111704438, 'id_str': '111704438', 'indices': [128, 140]}], 'symbols': []}}, 'quote_count': 1, 'reply_count': 1, 'retweet_count': 3, 'favorite_count': 3, 'entities': {'hashtags': [{'text': 'Deadpool2', 'indices': [91, 101]}], 'urls': [{'url': 'https://t.co/Dbue6tjoBb', 'expanded_url': 'https://twitter.com/i/web/status/1002047804255203328', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}], 'user_mentions': [{'screen_name': 'TheBreeMae', 'name': 'Bree Mae', 'id': 2870058663, 'id_str': '2870058663', 'indices': [34, 45]}, {'screen_name': 'pluggedin', 'name': 'Plugged In', 'id': 137737488, 'id_str': '137737488', 'indices': [70, 80]}], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'en'}, 'quoted_status_permalink': {'url': 'https://t.co/aatJi7R7ii', 'expanded': 'https://twitter.com/kebrightbill/status/1002047804255203328', 'display': 'twitter.com/kebrightbill/s…'}, 'is_quote_status': True, 'extended_tweet': {'full_text': 'Thread! I will probably squander my twelve bucks and two hours on something other than Deadpool, but the FotF stuff is spot on. 
https://t.co/aatJi7R7ii', 'display_text_range': [0, 127], 'entities': {'hashtags': [], 'urls': [{'url': 'https://t.co/aatJi7R7ii', 'expanded_url': 'https://twitter.com/kebrightbill/status/1002047804255203328', 'display_url': 'twitter.com/kebrightbill/s…', 'indices': [128, 151]}], 'user_mentions': [], 'symbols': []}}, 'quote_count': 0, 'reply_count': 0, 'retweet_count': 0, 'favorite_count': 0, 'entities': {'hashtags': [], 'urls': [{'url': 'https://t.co/ANvZ0DKk6a', 'expanded_url': 'https://twitter.com/i/web/status/1002056919341715457', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [117, 140]}], 'user_mentions': [], 'symbols': []}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'en', 'timestamp_ms': '1527743961545'}\n",
      "Inside insert_data\n",
      "Committing :: \n",
      "------------------------------\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "incoming\n",
      "------------------------------\n",
      "{'created_at': 'Thu May 31 05:19:30 +0000 2018', 'id': 1002056957304307713, 'id_str': '1002056957304307713', 'text': 'RT @osro_o: ★INFINITYY FRIENDS★ #infinitywar #AvengersInfinityWar #Avengers #FANART https://t.co/SseL8F3dsf', 'source': '<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 1000511554213724160, 'id_str': '1000511554213724160', 'name': 'sena @new acc :(', 'screen_name': 'senabbunn', 'location': 'sena | f | 20 | adl, aus', 'url': 'https://www.instagram.com/senabbun/', 'description': 'previously @senabbun but twitter locked me out :( | illustration+animation student | aa/bnha/hq/rwby | vip/army | 💖💜💙', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 28, 'friends_count': 116, 'listed_count': 0, 'favourites_count': 199, 'statuses_count': 49, 'created_at': 'Sat May 26 22:58:37 +0000 2018', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'en', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '000000', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': 'E81C4F', 'profile_sidebar_border_color': '000000', 'profile_sidebar_fill_color': '000000', 'profile_text_color': '000000', 'profile_use_background_image': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1000532188008366085/87ZTRCld_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/1000532188008366085/87ZTRCld_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/1000511554213724160/1527435004', 'default_profile': False, 'default_profile_image': False, 
'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'retweeted_status': {'created_at': 'Tue May 29 15:12:12 +0000 2018', 'id': 1001481337339957248, 'id_str': '1001481337339957248', 'text': '★INFINITYY FRIENDS★ #infinitywar #AvengersInfinityWar #Avengers #FANART https://t.co/SseL8F3dsf', 'display_text_range': [0, 71], 'source': '<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>', 'truncated': False, 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 440730267, 'id_str': '440730267', 'name': 'Seoro O/OSRO', 'screen_name': 'osro_o', 'location': '한국 서울', 'url': 'http://oseoro.tumblr.com/', 'description': '그림그려요 애니만들어요', 'translator_type': 'none', 'protected': False, 'verified': False, 'followers_count': 565, 'friends_count': 15, 'listed_count': 3, 'favourites_count': 0, 'statuses_count': 6, 'created_at': 'Mon Dec 19 10:36:10 +0000 2011', 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'lang': 'ko', 'contributors_enabled': False, 'is_translator': False, 'profile_background_color': '000000', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_link_color': '1B95E0', 'profile_sidebar_border_color': '000000', 'profile_sidebar_fill_color': '000000', 'profile_text_color': '000000', 'profile_use_background_image': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/766206323323043840/b9d8_DAz_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/766206323323043840/b9d8_DAz_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/440730267/1527640802', 'default_profile': False, 'default_profile_image': False, 
'following': None, 'follow_request_sent': None, 'notifications': None}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'quote_count': 10, 'reply_count': 4, 'retweet_count': 388, 'favorite_count': 973, 'entities': {'hashtags': [{'text': 'infinitywar', 'indices': [20, 32]}, {'text': 'AvengersInfinityWar', 'indices': [33, 53]}, {'text': 'Avengers', 'indices': [54, 63]}, {'text': 'FANART', 'indices': [64, 71]}], 'urls': [], 'user_mentions': [], 'symbols': [], 'media': [{'id': 1001480767820607493, 'id_str': '1001480767820607493', 'indices': [72, 95], 'media_url': 'http://pbs.twimg.com/media/DeX5c8dVMAUxFn2.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DeX5c8dVMAUxFn2.jpg', 'url': 'https://t.co/SseL8F3dsf', 'display_url': 'pic.twitter.com/SseL8F3dsf', 'expanded_url': 'https://twitter.com/osro_o/status/1001481337339957248/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 849, 'h': 1200, 'resize': 'fit'}, 'small': {'w': 481, 'h': 680, 'resize': 'fit'}, 'large': {'w': 1280, 'h': 1809, 'resize': 'fit'}}}]}, 'extended_entities': {'media': [{'id': 1001480767820607493, 'id_str': '1001480767820607493', 'indices': [72, 95], 'media_url': 'http://pbs.twimg.com/media/DeX5c8dVMAUxFn2.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DeX5c8dVMAUxFn2.jpg', 'url': 'https://t.co/SseL8F3dsf', 'display_url': 'pic.twitter.com/SseL8F3dsf', 'expanded_url': 'https://twitter.com/osro_o/status/1001481337339957248/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 849, 'h': 1200, 'resize': 'fit'}, 'small': {'w': 481, 'h': 680, 'resize': 'fit'}, 'large': {'w': 1280, 'h': 1809, 'resize': 'fit'}}}]}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 'en'}, 'is_quote_status': False, 'quote_count': 0, 'reply_count': 0, 'retweet_count': 0, 'favorite_count': 0, 'entities': {'hashtags': 
[{'text': 'infinitywar', 'indices': [32, 44]}, {'text': 'AvengersInfinityWar', 'indices': [45, 65]}, {'text': 'Avengers', 'indices': [66, 75]}, {'text': 'FANART', 'indices': [76, 83]}], 'urls': [], 'user_mentions': [{'screen_name': 'osro_o', 'name': 'Seoro O/OSRO', 'id': 440730267, 'id_str': '440730267', 'indices': [3, 10]}], 'symbols': [], 'media': [{'id': 1001480767820607493, 'id_str': '1001480767820607493', 'indices': [84, 107], 'media_url': 'http://pbs.twimg.com/media/DeX5c8dVMAUxFn2.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DeX5c8dVMAUxFn2.jpg', 'url': 'https://t.co/SseL8F3dsf', 'display_url': 'pic.twitter.com/SseL8F3dsf', 'expanded_url': 'https://twitter.com/osro_o/status/1001481337339957248/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 849, 'h': 1200, 'resize': 'fit'}, 'small': {'w': 481, 'h': 680, 'resize': 'fit'}, 'large': {'w': 1280, 'h': 1809, 'resize': 'fit'}}, 'source_status_id': 1001481337339957248, 'source_status_id_str': '1001481337339957248', 'source_user_id': 440730267, 'source_user_id_str': '440730267'}]}, 'extended_entities': {'media': [{'id': 1001480767820607493, 'id_str': '1001480767820607493', 'indices': [84, 107], 'media_url': 'http://pbs.twimg.com/media/DeX5c8dVMAUxFn2.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DeX5c8dVMAUxFn2.jpg', 'url': 'https://t.co/SseL8F3dsf', 'display_url': 'pic.twitter.com/SseL8F3dsf', 'expanded_url': 'https://twitter.com/osro_o/status/1001481337339957248/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 849, 'h': 1200, 'resize': 'fit'}, 'small': {'w': 481, 'h': 680, 'resize': 'fit'}, 'large': {'w': 1280, 'h': 1809, 'resize': 'fit'}}, 'source_status_id': 1001481337339957248, 'source_status_id_str': '1001481337339957248', 'source_user_id': 440730267, 'source_user_id_str': '440730267'}]}, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'filter_level': 'low', 'lang': 
'en', 'timestamp_ms': '1527743970596'}\n",
      "Inside insert_data\n",
      "Committing :: \n",
      "------------------------------\n"
     ]
    }
   ],
   "source": [
    "# Our implementation of Tweepy's StreamListener class\n",
    "class MyListener(tweepy.StreamListener):\n",
    "\n",
    "    # Our function to create database connection\n",
    "    def create_connection(self):\n",
    "        self.conn = sqlite3.connect('twitter.db', check_same_thread=False)\n",
    "        self.c = self.conn.cursor()\n",
    "\n",
    "    # StreamListener class function implementation to display error if a connection issue occurs\n",
    "    def on_error(self, status):\n",
    "        print('error is')\n",
    "        print(status)\n",
    "        if status == 420:\n",
    "            return False\n",
    "\n",
    "    # StreamListener class function implementation to get twwets in real time\n",
    "    def on_status(self, status):\n",
    "        print('incoming')\n",
    "        self.tweet = json.loads(json.dumps(status._json))\n",
    "        print(\"------------------------------\")\n",
    "        print(self.tweet)\n",
    "        self.insert_data(self.tweet)\n",
    "        print(\"------------------------------\")\n",
    "\n",
    "    # Our helper function\n",
    "    def insert_data(self, tweet):\n",
    "        try:\n",
    "            print('Inside insert_data')\n",
    "            tweet_id = self.insert_tweet_data(tweet)\n",
    "            user_id = self.insert_user_data(tweet)\n",
    "            #self.insert_tweet_user_data(tweet_id, user_id)\n",
    "            print(\"Committing :: \")\n",
    "            self.conn.commit()\n",
    "        except Exception as e:\n",
    "            print('Exception occured while inserting records in db :: %s' % e)\n",
    "            self.conn.rollback()\n",
    "\n",
    "    # Our function to insert user data\n",
    "    def insert_tweet_data(self, tweet):\n",
    "        user = tweet.get('user')\n",
    "        if user:\n",
    "            t = Tweet(tweet['id'], tweet['created_at'], tweet['text'], tweet.get('source') or '', user['id'])\n",
    "            return t.insert_tweet(self.c)\n",
    "\n",
    "    # Our function to insert tweet data\n",
    "    def insert_user_data(self, tweet):\n",
    "        user = tweet.get('user')\n",
    "        if user:\n",
    "            u = User(user['id'], user['name'], user.get('description') or '', user.get('followers_count') or 0, user.get('statuses_count') or 0)\n",
    "            return u.insert_user(self.c)\n",
    "\n",
    "    # Our function to close databse connection\n",
    "    def connection_close(self):\n",
    "        self.conn.close()\n",
    "\n",
    "# Authentication steps to connect to the Twitter API using OAuth\n",
    "auth = tweepy.OAuthHandler(TWITTER_APP_KEY, TWITTER_APP_SECRET)\n",
    "auth.set_access_token(TWITTER_KEY, TWITTER_SECRET)\n",
    "\n",
    "api = tweepy.API(auth)\n",
    "mylistener = MyListener()\n",
    "mylistener.create_connection()\n",
    "\n",
    "# Connect to the stream of tweets \n",
    "twitter_stream = tweepy.Stream(auth, mylistener)\n",
    "twitter_stream.filter(track=['avengersinfinitywar', 'deadpool2'], languages=[\"en\"], async = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Close all open connections\n",
    "mylistener.connection_close()\n",
    "twitter_stream.disconnect()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Reopen database connection from outside of class\n",
    "conn = sqlite3.connect('twitter.db')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>tweet_id</th>\n",
       "      <th>created_at</th>\n",
       "      <th>tweet_text</th>\n",
       "      <th>source</th>\n",
       "      <th>user_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1002056554797977601</td>\n",
       "      <td>Thu May 31 05:17:54 +0000 2018</td>\n",
       "      <td>@namu_ram you need to decide.. https://t.co/WQ...</td>\n",
       "      <td>&lt;a href=\"http://twitter.com/download/android\" ...</td>\n",
       "      <td>2556978238</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1002056578701381632</td>\n",
       "      <td>Thu May 31 05:18:00 +0000 2018</td>\n",
       "      <td>#AvengersInfinityWar problem - \\nWhat happens ...</td>\n",
       "      <td>&lt;a href=\"https://about.twitter.com/products/tw...</td>\n",
       "      <td>17895820</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1002056737719861249</td>\n",
       "      <td>Thu May 31 05:18:38 +0000 2018</td>\n",
       "      <td>RT @vivekdahiya08: One of the rare times when ...</td>\n",
       "      <td>&lt;a href=\"http://twitter.com/download/android\" ...</td>\n",
       "      <td>1001687938147799041</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1002056762890117121</td>\n",
       "      <td>Thu May 31 05:18:44 +0000 2018</td>\n",
       "      <td>NEW EP&amp;gt; #111 Time for #mcmcomiccon #MCMLond...</td>\n",
       "      <td>&lt;a href=\"http://www.facebook.com/twitter\" rel=...</td>\n",
       "      <td>1308062562</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1002056765494644736</td>\n",
       "      <td>Thu May 31 05:18:44 +0000 2018</td>\n",
       "      <td>#deadpool2 was just great. Make sure you watch...</td>\n",
       "      <td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td>\n",
       "      <td>963663128163749888</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1002056786092830721</td>\n",
       "      <td>Thu May 31 05:18:49 +0000 2018</td>\n",
       "      <td>RT @sexyaleksandra: Who want free camsex with ...</td>\n",
       "      <td>&lt;a href=\"https://mobile.twitter.com\" rel=\"nofo...</td>\n",
       "      <td>1172547452</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1002056803054596097</td>\n",
       "      <td>Thu May 31 05:18:53 +0000 2018</td>\n",
       "      <td>Who did this?? 😂😂\\n\\n#avengers #InfinityWar #A...</td>\n",
       "      <td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td>\n",
       "      <td>760084764082606080</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1002056919341715457</td>\n",
       "      <td>Thu May 31 05:19:21 +0000 2018</td>\n",
       "      <td>Thread! I will probably squander my twelve buc...</td>\n",
       "      <td>&lt;a href=\"http://twitter.com/download/iphone\" r...</td>\n",
       "      <td>439309182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1002056957304307713</td>\n",
       "      <td>Thu May 31 05:19:30 +0000 2018</td>\n",
       "      <td>RT @osro_o: ★INFINITYY FRIENDS★ #infinitywar #...</td>\n",
       "      <td>&lt;a href=\"http://twitter.com\" rel=\"nofollow\"&gt;Tw...</td>\n",
       "      <td>1000511554213724160</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              tweet_id                      created_at  \\\n",
       "0  1002056554797977601  Thu May 31 05:17:54 +0000 2018   \n",
       "1  1002056578701381632  Thu May 31 05:18:00 +0000 2018   \n",
       "2  1002056737719861249  Thu May 31 05:18:38 +0000 2018   \n",
       "3  1002056762890117121  Thu May 31 05:18:44 +0000 2018   \n",
       "4  1002056765494644736  Thu May 31 05:18:44 +0000 2018   \n",
       "5  1002056786092830721  Thu May 31 05:18:49 +0000 2018   \n",
       "6  1002056803054596097  Thu May 31 05:18:53 +0000 2018   \n",
       "7  1002056919341715457  Thu May 31 05:19:21 +0000 2018   \n",
       "8  1002056957304307713  Thu May 31 05:19:30 +0000 2018   \n",
       "\n",
       "                                          tweet_text  \\\n",
       "0  @namu_ram you need to decide.. https://t.co/WQ...   \n",
       "1  #AvengersInfinityWar problem - \\nWhat happens ...   \n",
       "2  RT @vivekdahiya08: One of the rare times when ...   \n",
       "3  NEW EP&gt; #111 Time for #mcmcomiccon #MCMLond...   \n",
       "4  #deadpool2 was just great. Make sure you watch...   \n",
       "5  RT @sexyaleksandra: Who want free camsex with ...   \n",
       "6  Who did this?? 😂😂\\n\\n#avengers #InfinityWar #A...   \n",
       "7  Thread! I will probably squander my twelve buc...   \n",
       "8  RT @osro_o: ★INFINITYY FRIENDS★ #infinitywar #...   \n",
       "\n",
       "                                              source              user_id  \n",
       "0  <a href=\"http://twitter.com/download/android\" ...           2556978238  \n",
       "1  <a href=\"https://about.twitter.com/products/tw...             17895820  \n",
       "2  <a href=\"http://twitter.com/download/android\" ...  1001687938147799041  \n",
       "3  <a href=\"http://www.facebook.com/twitter\" rel=...           1308062562  \n",
       "4  <a href=\"http://twitter.com/download/iphone\" r...   963663128163749888  \n",
       "5  <a href=\"https://mobile.twitter.com\" rel=\"nofo...           1172547452  \n",
       "6  <a href=\"http://twitter.com/download/iphone\" r...   760084764082606080  \n",
       "7  <a href=\"http://twitter.com/download/iphone\" r...            439309182  \n",
       "8  <a href=\"http://twitter.com\" rel=\"nofollow\">Tw...  1000511554213724160  "
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Query data from tweets table\n",
    "pd.read_sql_query(\"SELECT * from tweets limit 10\", conn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>name</th>\n",
       "      <th>description</th>\n",
       "      <th>follower_count</th>\n",
       "      <th>statuses_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2556978238</td>\n",
       "      <td>ghansham kamath</td>\n",
       "      <td>fill in the blanks.. yes that's what I am when...</td>\n",
       "      <td>11</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>17895820</td>\n",
       "      <td>Daily Express</td>\n",
       "      <td>http://Express.co.uk - Home of the Daily and S...</td>\n",
       "      <td>724622</td>\n",
       "      <td>540540</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1001687938147799041</td>\n",
       "      <td>Piyali</td>\n",
       "      <td>This is the first step toward becoming better ...</td>\n",
       "      <td>1</td>\n",
       "      <td>115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1308062562</td>\n",
       "      <td>365Flicks Podcast</td>\n",
       "      <td>The 365FlicksPodcast where we talk all things ...</td>\n",
       "      <td>5112</td>\n",
       "      <td>23808</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>963663128163749888</td>\n",
       "      <td>MarcoCpolo</td>\n",
       "      <td>Part time streamer, full time dreamer *csgo *p...</td>\n",
       "      <td>67</td>\n",
       "      <td>236</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1172547452</td>\n",
       "      <td>Dude Thorny</td>\n",
       "      <td>Married man likes looking at pics of 18 &amp; over...</td>\n",
       "      <td>198</td>\n",
       "      <td>9375</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>760084764082606080</td>\n",
       "      <td>Hyde Hooligans</td>\n",
       "      <td>Professional Indie Filmmakers specializing in ...</td>\n",
       "      <td>209</td>\n",
       "      <td>1551</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>439309182</td>\n",
       "      <td>Zina Petersen</td>\n",
       "      <td>Triage medievalist: contact for all your emerg...</td>\n",
       "      <td>700</td>\n",
       "      <td>15720</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1000511554213724160</td>\n",
       "      <td>sena @new acc :(</td>\n",
       "      <td>previously @senabbun but twitter locked me out...</td>\n",
       "      <td>28</td>\n",
       "      <td>49</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               user_id               name  \\\n",
       "0           2556978238    ghansham kamath   \n",
       "1             17895820      Daily Express   \n",
       "2  1001687938147799041             Piyali   \n",
       "3           1308062562  365Flicks Podcast   \n",
       "4   963663128163749888         MarcoCpolo   \n",
       "5           1172547452        Dude Thorny   \n",
       "6   760084764082606080     Hyde Hooligans   \n",
       "7            439309182      Zina Petersen   \n",
       "8  1000511554213724160   sena @new acc :(   \n",
       "\n",
       "                                         description  follower_count  \\\n",
       "0  fill in the blanks.. yes that's what I am when...              11   \n",
       "1  http://Express.co.uk - Home of the Daily and S...          724622   \n",
       "2  This is the first step toward becoming better ...               1   \n",
       "3  The 365FlicksPodcast where we talk all things ...            5112   \n",
       "4  Part time streamer, full time dreamer *csgo *p...              67   \n",
       "5  Married man likes looking at pics of 18 & over...             198   \n",
       "6  Professional Indie Filmmakers specializing in ...             209   \n",
       "7  Triage medievalist: contact for all your emerg...             700   \n",
       "8  previously @senabbun but twitter locked me out...              28   \n",
       "\n",
       "   statuses_count  \n",
       "0              17  \n",
       "1          540540  \n",
       "2             115  \n",
       "3           23808  \n",
       "4             236  \n",
       "5            9375  \n",
       "6            1551  \n",
       "7           15720  \n",
       "8              49  "
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Query data from users table\n",
    "pd.read_sql_query(\"SELECT * from users limit 10\", conn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Close database connection\n",
    "conn.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary of the solution and key highlights:\n",
    "\n",
    "**Foundational Work**\n",
    "\n",
    "We began with understanding the Twitter API in depth by reading through its documentation followed by Tweepy. We created a developer application named 'analyzing_tweets_123' and secured the requisite access tokens to connect with the API. We then understood the response json and data model of Twitter going through the object details and attributes, especially the different services and features available to query this data. \n",
    "\n",
    "**Data Model**\n",
    "\n",
    "Based on our understanding of the response JSON, we designed a data model with two entities - 'user' and 'tweet' We used ERDPlus tool to create the relational schema with each table containing select fields from the response json.\n",
    "\n",
    "**Code**\n",
    "\n",
    "We used sqlite3 package to create a SQL database and the two tables. We then created classes for each entity and encapsulated functions with SQL queries to insert the data for class attributes to the database. On the API end, we used StreamListener service by Tweepy to receive real time streaming tweets on two tracks ('Avengers' and 'Deadpool2'). This API call was made asynchronously. Once we received sufficient data, we closed all the connections and re-established a database connection from outside of the class to query the two tables to test the pipeline"
   ]
  },
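  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The insert-then-query pipeline summarized above can be sketched as follows. This is a minimal illustration, not the project's actual classes: the table and column names are taken from the query outputs above, while the column types, the helper names `insert_user` and `insert_tweet`, the use of `INSERT OR IGNORE`, and the sample record are assumptions. It runs against an in-memory database so `twitter.db` is left untouched.\n",
    "\n",
    "```python\n",
    "import sqlite3\n",
    "\n",
    "# In-memory database, so the sketch does not touch twitter.db\n",
    "conn = sqlite3.connect(':memory:')\n",
    "cur = conn.cursor()\n",
    "\n",
    "# Assumed schema: column names from the query outputs, types are guesses\n",
    "cur.execute('CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, '\n",
    "            'description TEXT, follower_count INTEGER, statuses_count INTEGER)')\n",
    "cur.execute('CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, created_at TEXT, '\n",
    "            'tweet_text TEXT, source TEXT, user_id INTEGER)')\n",
    "\n",
    "def insert_user(cur, user):\n",
    "    # Hypothetical helper: OR IGNORE skips a user that is already stored\n",
    "    cur.execute('INSERT OR IGNORE INTO users VALUES (?, ?, ?, ?, ?)',\n",
    "                (user['id'], user['name'], user.get('description') or '',\n",
    "                 user.get('followers_count') or 0, user.get('statuses_count') or 0))\n",
    "\n",
    "def insert_tweet(cur, tweet):\n",
    "    # Hypothetical helper: stores the tweet with a reference to its author\n",
    "    cur.execute('INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?)',\n",
    "                (tweet['id'], tweet['created_at'], tweet['text'],\n",
    "                 tweet.get('source') or '', tweet['user']['id']))\n",
    "\n",
    "# Made-up record shaped like the streaming response\n",
    "sample = {'id': 1, 'created_at': 'Thu May 31 05:19:30 +0000 2018',\n",
    "          'text': 'example tweet', 'source': None,\n",
    "          'user': {'id': 10, 'name': 'example user', 'description': None,\n",
    "                   'followers_count': 28, 'statuses_count': 49}}\n",
    "\n",
    "insert_user(cur, sample['user'])\n",
    "insert_tweet(cur, sample)\n",
    "insert_tweet(cur, sample)  # a repeated delivery is ignored\n",
    "conn.commit()\n",
    "\n",
    "# Join the two tables to check that tweet and author line up\n",
    "rows = cur.execute('SELECT t.tweet_id, t.tweet_text, u.name FROM tweets t '\n",
    "                   'JOIN users u ON u.user_id = t.user_id').fetchall()\n",
    "print(rows)\n",
    "conn.close()\n",
    "```\n",
    "\n",
    "Keying both tables on Twitter's own ids means a repeated insert of the same tweet or user is skipped rather than duplicated."
   ]
  },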
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key learnings:\n",
    "\n",
    "**Data**\n",
    "\n",
    "* There is an enormous potential for analysis and a plethora of use cases based on Twitter's data especially as we learned from the different objects and attributes in the response json\n",
    "* Ingesting the Twitter data in a relational database has many advantages such as structured schema, simplicity of design, vertical scalability though they are slower than NoSQL databases\n",
    "* Even though in this project we pushed the data in a relational database, we could have also used a NoSQL database such as MongoDB or an in memory database such as Redis depending upon the use case\n",
    "\n",
    "**Code**\n",
    "\n",
    "* Tweepy gives a feature to make asynchronous call to the Twitter API and I believe it is a great feature to have especially when streaming high velocity and high volume real time tweets and doing real time analytics on them. It is especially useful to ensure the streaming response works parallely while the main thread continues with the processing of consequent code\n",
    "* Decoupling the code improves the efficiency. For instance, in our code above, even though we created the database connection inside the MyListener class object, we closed it once the data was received and inserted in the database. For querying the data we created another connection outside of the class. This is what happens in the industry as every team / application / task needs to be separate from the other. REST SOAP services architecture is one example for instance. Microservices are more widely used now\n",
    "* Object oriented coding by making classes and using its object instances is an effective approach for problems such as these especially when upscaling and productionizing of the code is essential.\n",
    "\n",
    "**API**\n",
    "\n",
    "* Working with Opensource APIs is became a lot more fun when we encountered a bug in the response json sent by Twitter. Basically, if the retweeted variable is set to False, ideally we should be able to filter out the retweeted logs. However, the response json for retweeted data\n",
    "* Working with twitter API is another world as there are various services and the applications are humungous for instance sentiment analysis, customer profiling, targeted advertising, designing digital marketing campaigns etc."
   ]
  },
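  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The retweet observation above suggests filtering on the embedded `retweeted_status` object instead of the `retweeted` flag. A minimal sketch, assuming tweets are plain dicts shaped like the streaming response; `is_retweet` and the two sample records are hypothetical:\n",
    "\n",
    "```python\n",
    "def is_retweet(tweet):\n",
    "    # A retweet embeds the original tweet under the 'retweeted_status' key\n",
    "    return 'retweeted_status' in tweet\n",
    "\n",
    "# Made-up records: both carry retweeted=False, as in the captured stream\n",
    "original = {'id': 1, 'text': 'hello', 'retweeted': False}\n",
    "retweet = {'id': 2, 'text': 'RT @someone: hello', 'retweeted': False,\n",
    "           'retweeted_status': original}\n",
    "\n",
    "# Keep only tweets that are not retweets\n",
    "kept = [t for t in (original, retweet) if not is_retweet(t)]\n",
    "print([t['id'] for t in kept])\n",
    "```\n",
    "\n",
    "A check like this could run inside `on_status` before `insert_data` if retweets should be kept out of the database."
   ]
  },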
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 3 Text Data <a class=\"anchor\" id=\"3\"></a>\n",
    "Apply the following text representation techniques (and any variations, such as stopword removals in BoW) on the Movies Review dataset (http://ai.stanford.edu/~amaas/data/sentiment/). Consider experimenting with the following: n-grams, stopword removal, punctuation removal, lemmatization, etc."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary of the business problem\n",
    "* Not all data is quantitative. Online reviews, social media posts, and other text datasets can contain valuable information.\n",
    "* We will analyze movie reviews using text representation techniques to convert our dataset before conducting our analysis.\n",
    "* First, we clean the dataset using lemmatization, punctuation and stopword removal to prepare for vectorization\n",
    "* Then we vectorize using Bag of Words and count the frequencies\n",
    "* TF-IDF allows us to weight the frequencies of the word in a particular document by the frequency of the documents containing the word in the corpus\n",
    "* We conducted an analysis using both unigram and bigram tokens\n",
    "* Feature hashing allows for faster processing and simplified representation\n",
    "* Finally, we are able to conduct a sentiment analysis on movie reviews using a multimodal Naive Bayes model, and validate our model, comparing between the data validation results of the different text representation techniques"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Solution details:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from itertools import chain\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.feature_extraction import FeatureHasher\n",
    "from sklearn.feature_extraction import DictVectorizer\n",
    "import nltk\n",
    "import nltk.corpus as corpus\n",
    "from nltk.tokenize import RegexpTokenizer\n",
    "from nltk.stem.wordnet import WordNetLemmatizer\n",
    "import boto3\n",
    "import json\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, precision_score, recall_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# use glob to select all the train data\n",
    "import glob\n",
    "train_neg = glob.glob(\"/Users/shanxing/Desktop/aclImdb/train/neg/*.txt\")\n",
    "train_pos = glob.glob(\"/Users/shanxing/Desktop/aclImdb/train/pos/*.txt\")\n",
    "test_neg = glob.glob(\"/Users/shanxing/Desktop/aclImdb/test/neg/*.txt\")\n",
    "test_pos = glob.glob(\"/Users/shanxing/Desktop/aclImdb/test/pos/*.txt\")\n",
    "train_test_text = train_neg + train_pos + test_neg + test_pos"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# append all train data to variable, list_text\n",
    "list_text = []\n",
    "for file_name in train_test_text:\n",
    "    data = open(file_name).readlines()\n",
    "    list_text.append(data)\n",
    "    \n",
    "list_text = list(chain.from_iterable(list_text))"
   ]
  },
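  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loading step above keeps only the review text. Below is a minimal sketch of how the sentiment labels could be kept alongside it, assuming the aclImdb folder layout (`train/neg`, `train/pos`, `test/neg`, `test/pos`) used in the glob calls. `load_reviews` is a hypothetical helper and is not used elsewhere in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "def load_reviews(root, splits=('train', 'test'), labels=('neg', 'pos')):\n",
    "    # Hypothetical helper: returns (texts, targets) with 0 = negative, 1 = positive.\n",
    "    texts, targets = [], []\n",
    "    for split in splits:\n",
    "        for target, label in enumerate(labels):\n",
    "            # sorted() makes the order reproducible; glob order is otherwise arbitrary\n",
    "            for path in sorted(Path(root, split, label).glob('*.txt')):\n",
    "                texts.append(path.read_text(encoding='utf-8'))\n",
    "                targets.append(target)\n",
    "    return texts, targets"
   ]
  },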
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience.<br /><br />Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form.\",\n",
       " 'Well...tremors I, the original started off in 1990 and i found the movie quite enjoyable to watch. however, they proceeded to make tremors II and III. Trust me, those movies started going downhill right after they finished the first one, i mean, ass blasters??? Now, only God himself is capable of answering the question \"why in Gods name would they create another one of these dumpster dives of a movie?\" Tremors IV cannot be considered a bad movie, in fact it cannot be even considered an epitome of a bad movie, for it lives up to more than that. As i attempted to sit though it, i noticed that my eyes started to bleed, and i hoped profusely that the little girl from the ring would crawl through the TV and kill me. did they really think that dressing the people who had stared in the other movies up as though they we\\'re from the wild west would make the movie (with the exact same occurrences) any better? honestly, i would never suggest buying this movie, i mean, there are cheaper ways to find things that burn well.']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# observe the top 2 document\n",
    "list_text[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Punctuation removal, lemmatization and stopword removal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# remove '<br /><br />'\n",
    "corpus_initial = [document.split('<br /><br />') for document in list_text]\n",
    "# combine lines in each document\n",
    "corpus_combined = []\n",
    "for document in corpus_initial:\n",
    "    corpus_combined.append(' '.join(document))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"Working with one of the best Shakespeare sources, this film manages to be creditable to it's source, whilst still appealing to a wider audience. Branagh steals the film from under Fishburne's nose, and there's a talented cast on good form.\",\n",
       " 'Well...tremors I, the original started off in 1990 and i found the movie quite enjoyable to watch. however, they proceeded to make tremors II and III. Trust me, those movies started going downhill right after they finished the first one, i mean, ass blasters??? Now, only God himself is capable of answering the question \"why in Gods name would they create another one of these dumpster dives of a movie?\" Tremors IV cannot be considered a bad movie, in fact it cannot be even considered an epitome of a bad movie, for it lives up to more than that. As i attempted to sit though it, i noticed that my eyes started to bleed, and i hoped profusely that the little girl from the ring would crawl through the TV and kill me. did they really think that dressing the people who had stared in the other movies up as though they we\\'re from the wild west would make the movie (with the exact same occurrences) any better? honestly, i would never suggest buying this movie, i mean, there are cheaper ways to find things that burn well.']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus_combined[0:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# remove punctuation\n",
    "corpus_punctuation_removed = []\n",
    "\n",
    "tokenizer = RegexpTokenizer(r'\\w+')\n",
    "for document in corpus_combined:\n",
    "    corpus_punctuation_removed.append(' '.join(tokenizer.tokenize(document)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Working with one of the best Shakespeare sources this film manages to be creditable to it s source whilst still appealing to a wider audience Branagh steals the film from under Fishburne s nose and there s a talented cast on good form',\n",
       " 'Well tremors I the original started off in 1990 and i found the movie quite enjoyable to watch however they proceeded to make tremors II and III Trust me those movies started going downhill right after they finished the first one i mean ass blasters Now only God himself is capable of answering the question why in Gods name would they create another one of these dumpster dives of a movie Tremors IV cannot be considered a bad movie in fact it cannot be even considered an epitome of a bad movie for it lives up to more than that As i attempted to sit though it i noticed that my eyes started to bleed and i hoped profusely that the little girl from the ring would crawl through the TV and kill me did they really think that dressing the people who had stared in the other movies up as though they we re from the wild west would make the movie with the exact same occurrences any better honestly i would never suggest buying this movie i mean there are cheaper ways to find things that burn well']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus_punctuation_removed[0:2]"
   ]
  },
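  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output above shows that the `\\w+` pattern also splits contractions (\"it's\" becomes \"it s\"). The sketch below uses the standard-library `re` module as a stand-in for `RegexpTokenizer` and compares that pattern with a hypothetical alternative that keeps an apostrophe inside a word. It is an illustration only; the rest of the notebook keeps the original pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = \"it's source, whilst there's a talented cast\"\n",
    "\n",
    "# Same pattern as the tokenizer above: every run of word characters is a token.\n",
    "plain_tokens = re.findall(r'\\w+', text)\n",
    "\n",
    "# Hypothetical alternative: allow one apostrophe followed by more word characters.\n",
    "contraction_tokens = re.findall(r\"\\w+(?:'\\w+)?\", text)\n",
    "\n",
    "print(plain_tokens)\n",
    "print(contraction_tokens)"
   ]
  },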
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# word lemmatization for noun and verb\n",
    "#nltk.download('wordnet')\n",
    "corpus_lemmatized = []\n",
    "\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "for document in corpus_punctuation_removed:\n",
    "    document_words = []\n",
    "    for word in document.lower().split(' '):\n",
    "        noun = lemmatizer.lemmatize(word, 'n')\n",
    "        verb = lemmatizer.lemmatize(word, 'v')\n",
    "        if word != noun:\n",
    "            document_words.append(noun)\n",
    "        elif word != verb:\n",
    "            document_words.append(verb)\n",
    "        else:\n",
    "            document_words.append(word)\n",
    "    corpus_lemmatized.append(' '.join(document_words))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['work with one of the best shakespeare source this film manage to be creditable to it s source whilst still appeal to a wider audience branagh steal the film from under fishburne s nose and there s a talented cast on good form',\n",
       " 'well tremor i the original start off in 1990 and i find the movie quite enjoyable to watch however they proceed to make tremor ii and iii trust me those movie start go downhill right after they finish the first one i mean as blaster now only god himself be capable of answer the question why in god name would they create another one of these dumpster dive of a movie tremor iv cannot be consider a bad movie in fact it cannot be even consider an epitome of a bad movie for it life up to more than that a i attempt to sit though it i notice that my eye start to bleed and i hop profusely that the little girl from the ring would crawl through the tv and kill me do they really think that dress the people who have star in the other movie up a though they we re from the wild west would make the movie with the exact same occurrence any better honestly i would never suggest buy this movie i mean there be cheaper way to find thing that burn well']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus_lemmatized[0:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# remove stopwords\n",
    "#nltk.download(\"stopwords\")\n",
    "corpus_stop_removed = []\n",
    "\n",
    "stop_words = set(corpus.stopwords.words())\n",
    "for document in corpus_lemmatized:\n",
    "    document_words = []\n",
    "    for word in document.split(' '):\n",
    "        if word not in stop_words:\n",
    "            document_words.append(word)\n",
    "    corpus_stop_removed.append(' '.join(document_words))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['work one best shakespeare source film manage creditable source whilst still appeal wider audience branagh steal film fishburne nose talented cast good form',\n",
       " 'well tremor original start 1990 find movie quite enjoyable watch however proceed make tremor iii trust movie start go downhill right finish first one mean blaster god capable answer question god name would create another one dumpster dive movie tremor iv cannot consider bad movie fact cannot even consider epitome bad movie life attempt though notice eye start bleed hop profusely little girl ring would crawl tv kill really think dress people star movie though wild west would make movie exact occurrence better honestly would never suggest buy movie mean cheaper way find thing burn well']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus_cleaned = corpus_stop_removed.copy()\n",
    "corpus_cleaned[0:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Bag of Words (BoW)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<50000x82081 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 4592463 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# fit the CountVectorizer and transform the text\n",
    "vect1 = CountVectorizer()\n",
    "vect1.fit_transform(corpus_cleaned)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('כרמון', 82080),\n",
       " ('יגאל', 82079),\n",
       " ('żmijewski', 82078),\n",
       " ('þór', 82077),\n",
       " ('þorleifsson', 82076),\n",
       " ('ýs', 82075),\n",
       " ('üzümcü', 82074),\n",
       " ('üvegtigris', 82073),\n",
       " ('ünfaithful', 82072)]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# word frequency (only show the first 10 words)\n",
    "sorted(vect1.vocabulary_.items(), reverse=True)[0:9]"
   ]
  },
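  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`vocabulary_` maps each term to its column index, so it does not hold counts. Below is a minimal sketch of one way to get corpus-wide term frequencies by summing the columns of the count matrix. It uses a made-up three-document corpus, and `toy_corpus`, `toy_vect`, and `term_frequency` are illustrative names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# Made-up toy corpus for illustration only.\n",
    "toy_corpus = ['good movie good cast', 'bad movie', 'good plot']\n",
    "toy_vect = CountVectorizer()\n",
    "toy_counts = toy_vect.fit_transform(toy_corpus)\n",
    "\n",
    "# Sum each column of the sparse matrix to get the total count of every term.\n",
    "totals = np.asarray(toy_counts.sum(axis=0)).ravel()\n",
    "term_frequency = {term: int(totals[col]) for term, col in toy_vect.vocabulary_.items()}\n",
    "\n",
    "# Most frequent terms first; ties are broken alphabetically.\n",
    "top_terms = sorted(term_frequency.items(), key=lambda kv: (-kv[1], kv[0]))\n",
    "print(top_terms[:3])"
   ]
  },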
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# BoW representation for the corpus - create a document-term matrix (only show the first 5 document)\n",
    "bag_of_words = vect1.fit_transform(corpus_cleaned)\n",
    "feature_names = vect1.get_feature_names()\n",
    "dt_matrix = pd.DataFrame(bag_of_words.toarray(), columns=feature_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The matrix has rows * columns = (50000, 82081), below shows the first 5 rows:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>00</th>\n",
       "      <th>000</th>\n",
       "      <th>00000000000</th>\n",
       "      <th>0000000000001</th>\n",
       "      <th>00000001</th>\n",
       "      <th>00001</th>\n",
       "      <th>00015</th>\n",
       "      <th>000dm</th>\n",
       "      <th>000s</th>\n",
       "      <th>001</th>\n",
       "      <th>...</th>\n",
       "      <th>ünel</th>\n",
       "      <th>ünfaithful</th>\n",
       "      <th>üvegtigris</th>\n",
       "      <th>üzümcü</th>\n",
       "      <th>ýs</th>\n",
       "      <th>þorleifsson</th>\n",
       "      <th>þór</th>\n",
       "      <th>żmijewski</th>\n",
       "      <th>יגאל</th>\n",
       "      <th>כרמון</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 82081 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   00  000  00000000000  0000000000001  00000001  00001  00015  000dm  000s  \\\n",
       "0   0    0            0              0         0      0      0      0     0   \n",
       "1   0    0            0              0         0      0      0      0     0   \n",
       "2   0    0            0              0         0      0      0      0     0   \n",
       "3   0    0            0              0         0      0      0      0     0   \n",
       "4   0    0            0              0         0      0      0      0     0   \n",
       "\n",
       "   001  ...    ünel  ünfaithful  üvegtigris  üzümcü  ýs  þorleifsson  þór  \\\n",
       "0    0  ...       0           0           0       0   0            0    0   \n",
       "1    0  ...       0           0           0       0   0            0    0   \n",
       "2    0  ...       0           0           0       0   0            0    0   \n",
       "3    0  ...       0           0           0       0   0            0    0   \n",
       "4    0  ...       0           0           0       0   0            0    0   \n",
       "\n",
       "   żmijewski  יגאל  כרמון  \n",
       "0          0     0      0  \n",
       "1          0     0      0  \n",
       "2          0     0      0  \n",
       "3          0     0      0  \n",
       "4          0     0      0  \n",
       "\n",
       "[5 rows x 82081 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(f'The matrix has rows * columns = {dt_matrix.shape}, below shows the first 5 rows:')\n",
    "dt_matrix.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Term Frequency - Inverse Document Frequency (TF-IDF)"
   ]
  },
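  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the weighting, the sketch below computes TF-IDF by hand with the standard library. It assumes the smoothed formula documented for the default `TfidfVectorizer` settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The three-document corpus and all names are made up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "# Made-up toy corpus, already cleaned and whitespace-tokenized.\n",
    "toy_docs = ['good movie good cast', 'bad movie', 'good plot']\n",
    "tokenized = [doc.split() for doc in toy_docs]\n",
    "n_docs = len(tokenized)\n",
    "\n",
    "# Document frequency: the number of documents in which each term occurs.\n",
    "doc_freq = Counter(term for tokens in tokenized for term in set(tokens))\n",
    "\n",
    "def tfidf(tokens):\n",
    "    # Raw term count times smoothed idf, then L2-normalize the document vector.\n",
    "    counts = Counter(tokens)\n",
    "    weights = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)\n",
    "               for term, count in counts.items()}\n",
    "    norm = math.sqrt(sum(w * w for w in weights.values()))\n",
    "    return {term: w / norm for term, w in weights.items()}\n",
    "\n",
    "weights_doc0 = tfidf(tokenized[0])\n",
    "# 'cast' occurs in one document and 'movie' in two, so 'cast' gets the larger weight.\n",
    "print(sorted(weights_doc0.items(), key=lambda kv: -kv[1]))"
   ]
  },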
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TF-IDF with unigram"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# perform TF-IDF to the cleaned corpus using unigram\n",
    "vect2_unigram = TfidfVectorizer(ngram_range=(1, 1))\n",
    "tfidf_unigram = vect2_unigram.fit_transform(corpus_cleaned)\n",
    "feature_names_unigram = vect2_unigram.get_feature_names()\n",
    "tfidf_matrix_unigram = pd.DataFrame(tfidf_unigram.toarray(), columns=feature_names_unigram)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The matrix has rows * columns = (50000, 82081), below shows the first 5 rows:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>00</th>\n",
       "      <th>000</th>\n",
       "      <th>00000000000</th>\n",
       "      <th>0000000000001</th>\n",
       "      <th>00000001</th>\n",
       "      <th>00001</th>\n",
       "      <th>00015</th>\n",
       "      <th>000dm</th>\n",
       "      <th>000s</th>\n",
       "      <th>001</th>\n",
       "      <th>...</th>\n",
       "      <th>ünel</th>\n",
       "      <th>ünfaithful</th>\n",
       "      <th>üvegtigris</th>\n",
       "      <th>üzümcü</th>\n",
       "      <th>ýs</th>\n",
       "      <th>þorleifsson</th>\n",
       "      <th>þór</th>\n",
       "      <th>żmijewski</th>\n",
       "      <th>יגאל</th>\n",
       "      <th>כרמון</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 82081 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    00  000  00000000000  0000000000001  00000001  00001  00015  000dm  000s  \\\n",
       "0  0.0  0.0          0.0            0.0       0.0    0.0    0.0    0.0   0.0   \n",
       "1  0.0  0.0          0.0            0.0       0.0    0.0    0.0    0.0   0.0   \n",
       "2  0.0  0.0          0.0            0.0       0.0    0.0    0.0    0.0   0.0   \n",
       "3  0.0  0.0          0.0            0.0       0.0    0.0    0.0    0.0   0.0   \n",
       "4  0.0  0.0          0.0            0.0       0.0    0.0    0.0    0.0   0.0   \n",
       "\n",
       "   001  ...    ünel  ünfaithful  üvegtigris  üzümcü   ýs  þorleifsson  þór  \\\n",
       "0  0.0  ...     0.0         0.0         0.0     0.0  0.0          0.0  0.0   \n",
       "1  0.0  ...     0.0         0.0         0.0     0.0  0.0          0.0  0.0   \n",
       "2  0.0  ...     0.0         0.0         0.0     0.0  0.0          0.0  0.0   \n",
       "3  0.0  ...     0.0         0.0         0.0     0.0  0.0          0.0  0.0   \n",
       "4  0.0  ...     0.0         0.0         0.0     0.0  0.0          0.0  0.0   \n",
       "\n",
       "   żmijewski  יגאל  כרמון  \n",
       "0        0.0   0.0    0.0  \n",
       "1        0.0   0.0    0.0  \n",
       "2        0.0   0.0    0.0  \n",
       "3        0.0   0.0    0.0  \n",
       "4        0.0   0.0    0.0  \n",
       "\n",
       "[5 rows x 82081 columns]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(f'The matrix has rows * columns = {tfidf_matrix_unigram.shape}, below shows the first 5 rows:')\n",
    "tfidf_matrix_unigram.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TF-IDF with bigram"
   ]
  },
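  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A bigram is a pair of adjacent tokens, so every distinct word pair becomes its own feature. Below is a minimal sketch with a hypothetical `ngrams` helper; the vectorizer builds these internally, so the helper is for illustration only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def ngrams(tokens, n):\n",
    "    # Return the contiguous n-grams of a token list as space-joined strings.\n",
    "    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]\n",
    "\n",
    "tokens = 'work one best shakespeare source'.split()\n",
    "unigrams = ngrams(tokens, 1)\n",
    "bigrams = ngrams(tokens, 2)\n",
    "print(unigrams)\n",
    "print(bigrams)"
   ]
  },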
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# perform TF-IDF to the cleaned corpus using bigram\n",
    "vect2_bigram = TfidfVectorizer(ngram_range=(2, 2))\n",
    "tfidf_bigram = vect2_bigram.fit_transform(corpus_cleaned)\n",
    "feature_names_bigram = vect2_bigram.get_feature_names()\n",
    "tfidf_matrix_bigram = pd.DataFrame(tfidf_bigram.toarray(), columns=feature_names_bigram)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The matrix has rows * columns = (50000, 2609102), below shows the first 5 rows:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>00 01</th>\n",
       "      <th>00 10</th>\n",
       "      <th>00 19</th>\n",
       "      <th>00 80</th>\n",
       "      <th>00 84</th>\n",
       "      <th>00 90</th>\n",
       "      <th>00 acorn</th>\n",
       "      <th>00 agent</th>\n",
       "      <th>00 air</th>\n",
       "      <th>00 alison</th>\n",
       "      <th>...</th>\n",
       "      <th>ünel documentary</th>\n",
       "      <th>ünfaithful diane</th>\n",
       "      <th>üvegtigris far</th>\n",
       "      <th>üzümcü forensic</th>\n",
       "      <th>ýs one</th>\n",
       "      <th>þorleifsson get</th>\n",
       "      <th>þór director</th>\n",
       "      <th>żmijewski stenka</th>\n",
       "      <th>יגאל כרמון</th>\n",
       "      <th>כרמון president</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 2609102 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   00 01  00 10  00 19  00 80  00 84  00 90  00 acorn  00 agent  00 air  \\\n",
       "0    0.0    0.0    0.0    0.0    0.0    0.0       0.0       0.0     0.0   \n",
       "1    0.0    0.0    0.0    0.0    0.0    0.0       0.0       0.0     0.0   \n",
       "2    0.0    0.0    0.0    0.0    0.0    0.0       0.0       0.0     0.0   \n",
       "3    0.0    0.0    0.0    0.0    0.0    0.0       0.0       0.0     0.0   \n",
       "4    0.0    0.0    0.0    0.0    0.0    0.0       0.0       0.0     0.0   \n",
       "\n",
       "   00 alison       ...         ünel documentary  ünfaithful diane  \\\n",
       "0        0.0       ...                      0.0               0.0   \n",
       "1        0.0       ...                      0.0               0.0   \n",
       "2        0.0       ...                      0.0               0.0   \n",
       "3        0.0       ...                      0.0               0.0   \n",
       "4        0.0       ...                      0.0               0.0   \n",
       "\n",
       "   üvegtigris far  üzümcü forensic  ýs one  þorleifsson get  þór director  \\\n",
       "0             0.0              0.0     0.0              0.0           0.0   \n",
       "1             0.0              0.0     0.0              0.0           0.0   \n",
       "2             0.0              0.0     0.0              0.0           0.0   \n",
       "3             0.0              0.0     0.0              0.0           0.0   \n",
       "4             0.0              0.0     0.0              0.0           0.0   \n",
       "\n",
       "   żmijewski stenka  יגאל כרמון  כרמון president  \n",
       "0               0.0         0.0              0.0  \n",
       "1               0.0         0.0              0.0  \n",
       "2               0.0         0.0              0.0  \n",
       "3               0.0         0.0              0.0  \n",
       "4               0.0         0.0              0.0  \n",
       "\n",
       "[5 rows x 2609102 columns]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(f'The matrix has rows * columns = {tfidf_matrix_bigram.shape}, below shows the first 5 rows:')\n",
    "tfidf_matrix_bigram.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Feature hashing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# perform feature hashing to the cleaned corpus and select features of 50 as an example\n",
    "vect3 = FeatureHasher(n_features=50, input_type='string')\n",
    "hasher = vect3.transform(corpus_cleaned)\n",
    "feature_names = ['feature' + num for num in list(np.arange(1,51,1).astype(str))]\n",
    "hasher_df = pd.DataFrame(abs(hasher).toarray(), columns=feature_names)"
   ]
  },
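  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of feature hashing over word tokens, shown on two hypothetical toy documents rather than the project corpus. With `input_type='string'`, `FeatureHasher` treats each sample as an iterable of string tokens, so each document is split into words first. The values `n_features=8` and `alternate_sign=False` are illustrative assumptions (the latter keeps every count non-negative and assumes scikit-learn 0.19 or later).\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction import FeatureHasher\n",
    "\n",
    "# hypothetical toy documents, not taken from the project corpus\n",
    "docs = ['great movie great cast', 'boring plot']\n",
    "\n",
    "# split each document into word tokens so that words, not characters, are hashed\n",
    "tokens = [doc.split() for doc in docs]\n",
    "\n",
    "# 8 buckets is an arbitrary illustrative size\n",
    "toy_hasher = FeatureHasher(n_features=8, input_type='string', alternate_sign=False)\n",
    "X_toy = toy_hasher.transform(tokens)\n",
    "\n",
    "print(X_toy.shape)\n",
    "# each row sums to the number of tokens in that document\n",
    "print(X_toy.toarray().sum(axis=1))\n",
    "```\n",
    "\n",
    "Hashing only redistributes token counts across buckets, so each row total equals the document's token count regardless of collisions."
   ]
  },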
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The matrix has rows * columns = (50000, 50), below shows the first 5 rows:\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature1</th>\n",
       "      <th>feature2</th>\n",
       "      <th>feature3</th>\n",
       "      <th>feature4</th>\n",
       "      <th>feature5</th>\n",
       "      <th>feature6</th>\n",
       "      <th>feature7</th>\n",
       "      <th>feature8</th>\n",
       "      <th>feature9</th>\n",
       "      <th>feature10</th>\n",
       "      <th>...</th>\n",
       "      <th>feature41</th>\n",
       "      <th>feature42</th>\n",
       "      <th>feature43</th>\n",
       "      <th>feature44</th>\n",
       "      <th>feature45</th>\n",
       "      <th>feature46</th>\n",
       "      <th>feature47</th>\n",
       "      <th>feature48</th>\n",
       "      <th>feature49</th>\n",
       "      <th>feature50</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>13.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>24.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>13.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>27.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>86.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>43.0</td>\n",
       "      <td>41.0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>20.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>20.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>35.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>43.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>21.0</td>\n",
       "      <td>17.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>15.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>70.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>139.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>73.0</td>\n",
       "      <td>75.0</td>\n",
       "      <td>59.0</td>\n",
       "      <td>30.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>39.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>200.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>376.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>220.0</td>\n",
       "      <td>167.0</td>\n",
       "      <td>149.0</td>\n",
       "      <td>130.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>126.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 50 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   feature1  feature2  feature3  feature4  feature5  feature6  feature7  \\\n",
       "0      13.0       0.0       0.0       0.0       0.0       0.0       0.0   \n",
       "1      27.0       0.0       0.0       0.0       0.0       0.0       0.0   \n",
       "2      35.0       0.0       0.0       0.0       0.0       0.0       0.0   \n",
       "3      70.0       0.0       0.0       0.0       0.0       2.0       0.0   \n",
       "4     200.0       0.0       0.0       0.0       0.0       4.0       0.0   \n",
       "\n",
       "   feature8  feature9  feature10    ...      feature41  feature42  feature43  \\\n",
       "0       0.0       0.0       24.0    ...            0.0        0.0       13.0   \n",
       "1       1.0       0.0       86.0    ...            0.0        1.0       43.0   \n",
       "2       0.0       0.0       43.0    ...            0.0        2.0       26.0   \n",
       "3       3.0       0.0      139.0    ...            0.0        0.0       73.0   \n",
       "4       0.0       0.0      376.0    ...            0.0        4.0      220.0   \n",
       "\n",
       "   feature44  feature45  feature46  feature47  feature48  feature49  feature50  \n",
       "0        8.0        7.0        5.0        0.0        0.0        0.0       11.0  \n",
       "1       41.0       29.0       20.0        0.0        0.0        0.0       20.0  \n",
       "2       29.0       21.0       17.0        0.0        0.0        0.0       15.0  \n",
       "3       75.0       59.0       30.0        0.0        0.0        0.0       39.0  \n",
       "4      167.0      149.0      130.0        0.0        0.0        0.0      126.0  \n",
       "\n",
       "[5 rows x 50 columns]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(f'The matrix has rows * columns = {hasher_df.shape}, below shows the first 5 rows:')\n",
    "hasher_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Sentiment Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define labels, using 0 represents negative and 1 represents positive\n",
    "neg = 0\n",
    "pos = 1\n",
    "train_num_neg = len(train_neg)\n",
    "train_num_pos = len(train_pos)\n",
    "test_num_neg = len(test_neg)\n",
    "test_num_pos = len(test_pos)\n",
    "train_label = [neg]*train_num_neg + [pos]*train_num_pos\n",
    "test_label = [neg]*test_num_neg + [pos]*test_num_pos"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Training and testing a Naive Bayes classifier using *BoW* text representation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define train and test dataset\n",
    "X_train = dt_matrix[:len(train_label)]\n",
    "y_train = train_label\n",
    "X_test = dt_matrix[len(train_label):]\n",
    "y_test = test_label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# train a Multimoda Naive Bayes classifier\n",
    "clf_BoW = MultinomialNB()\n",
    "clf_BoW.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.81903999999999999"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# predicting the Test set results, find accuracy\n",
    "y_pred = clf_BoW.predict(X_test)\n",
    "accuracy_score(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[10968,  1532],\n",
       "       [ 2992,  9508]])"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# making the Confusion Matrix\n",
    "conf_matrix = confusion_matrix(y_test, y_pred)\n",
    "conf_matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Precision score: 0.861231884057971\n",
      "Recall score: 0.76064\n",
      "F1 score: 0.8078164825828379\n",
      "Accuracy of positive class: 0.76064\n",
      "Accuracy of negative class: 0.87744\n"
     ]
    }
   ],
   "source": [
    "# Precision score\n",
    "precision = precision_score(y_test, y_pred)\n",
    "print(f'Precision score: {precision}')\n",
    "\n",
    "# recall score\n",
    "recall = recall_score(y_test, y_pred)\n",
    "print(f'Recall score: {recall}')\n",
    "\n",
    "# F1 score\n",
    "f1 = f1_score(y_test, y_pred)\n",
    "print(f'F1 score: {f1}')\n",
    "\n",
    "# accuracy of two classes\n",
    "acc_p = conf_matrix[1,1]/(conf_matrix[1,0]+conf_matrix[1,1])\n",
    "acc_n = conf_matrix[0,0]/(conf_matrix[0,0]+conf_matrix[0,1])\n",
    "print(\"Accuracy of positive class:\",acc_p)\n",
    "print(\"Accuracy of negative class:\",acc_n)"
   ]
  },
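  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a cross-check, the BoW scores printed above can be recomputed by hand from the BoW confusion matrix. The sketch below only assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with the positive class labelled 1.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# BoW confusion matrix reported above (rows: true class, columns: predicted class)\n",
    "conf = np.array([[10968, 1532],\n",
    "                 [2992, 9508]])\n",
    "tn, fp, fn, tp = conf.ravel()\n",
    "\n",
    "precision_check = tp / (tp + fp)  # share of predicted positives that are truly positive\n",
    "recall_check = tp / (tp + fn)     # share of true positives that were found\n",
    "f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)\n",
    "accuracy_check = (tp + tn) / conf.sum()\n",
    "\n",
    "print(precision_check, recall_check, f1_check, accuracy_check)\n",
    "```\n",
    "\n",
    "The recall of the positive class is the same quantity as the \"accuracy of positive class\" computed from the second row of the matrix."
   ]
  },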
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Training and testing a Naive Bayes classifier using *UNIGRAM TF-IDF* text representation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define train and test dataset\n",
    "X_train = tfidf_matrix_unigram[:len(train_label)]\n",
    "y_train = train_label\n",
    "X_test = tfidf_matrix_unigram[len(train_label):]\n",
    "y_test = test_label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# train a Multimoda Naive Bayes classifier\n",
    "clf_tfidf = MultinomialNB()\n",
    "clf_tfidf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.82667999999999997"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# predicting the Test set results, find accuracy\n",
    "y_pred = clf_tfidf.predict(X_test)\n",
    "accuracy_score(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[10979,  1521],\n",
       "       [ 2812,  9688]])"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# making the Confusion Matrix\n",
    "conf_matrix = confusion_matrix(y_test, y_pred)\n",
    "conf_matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Precision score: 0.8643054688196985\n",
      "Recall score: 0.77504\n",
      "F1 score: 0.8172423974018305\n",
      "Accuracy of positive class: 0.77504\n",
      "Accuracy of negative class: 0.87832\n"
     ]
    }
   ],
   "source": [
    "# Precision score\n",
    "precision = precision_score(y_test, y_pred)\n",
    "print(f'Precision score: {precision}')\n",
    "\n",
    "# recall score\n",
    "recall = recall_score(y_test, y_pred)\n",
    "print(f'Recall score: {recall}')\n",
    "\n",
    "# F1 score\n",
    "f1 = f1_score(y_test, y_pred)\n",
    "print(f'F1 score: {f1}')\n",
    "\n",
    "# accuracy of two classes\n",
    "acc_p = conf_matrix[1,1]/(conf_matrix[1,0]+conf_matrix[1,1])\n",
    "acc_n = conf_matrix[0,0]/(conf_matrix[0,0]+conf_matrix[0,1])\n",
    "print(\"Accuracy of positive class:\",acc_p)\n",
    "print(\"Accuracy of negative class:\",acc_n)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Training and testing a Naive Bayes classifier using *Feature Hashing* text representation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define train and test dataset\n",
    "X_train = hasher_df[:len(train_label)]\n",
    "y_train = train_label\n",
    "X_test = hasher_df[len(train_label):]\n",
    "y_test = test_label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# train a Multimoda Naive Bayes classifier\n",
    "clf_fh = MultinomialNB()\n",
    "clf_fh.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.58896000000000004"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# predicting the Test set results, find accuracy\n",
    "y_pred = clf_fh.predict(X_test)\n",
    "accuracy_score(y_test, y_pred)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[7650, 4850],\n",
       "       [5426, 7074]])"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# making the Confusion Matrix\n",
    "conf_matrix = confusion_matrix(y_test, y_pred)\n",
    "conf_matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Precision score: 0.5932572962093258\n",
      "Recall score: 0.56592\n",
      "F1 score: 0.5792662954471013\n",
      "Accuracy of positive class: 0.56592\n",
      "Accuracy of negative class: 0.612\n"
     ]
    }
   ],
   "source": [
    "# Precision score\n",
    "precision = precision_score(y_test, y_pred)\n",
    "print(f'Precision score: {precision}')\n",
    "\n",
    "# recall score\n",
    "recall = recall_score(y_test, y_pred)\n",
    "print(f'Recall score: {recall}')\n",
    "\n",
    "# F1 score\n",
    "f1 = f1_score(y_test, y_pred)\n",
    "print(f'F1 score: {f1}')\n",
    "\n",
    "# accuracy of two classes\n",
    "acc_p = conf_matrix[1,1]/(conf_matrix[1,0]+conf_matrix[1,1])\n",
    "acc_n = conf_matrix[0,0]/(conf_matrix[0,0]+conf_matrix[0,1])\n",
    "print(\"Accuracy of positive class:\",acc_p)\n",
    "print(\"Accuracy of negative class:\",acc_n)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the Naive Bayes classifier, the unigram TF-IDF text representation beat feature hashing (50 features) and BoW. <br>**Let's test this winning classifier on some IMDB movie reviews.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6*. Sentiment analysis on IMDB movie reviews"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Naive Bayes classifier with unigram TF-IDF text representation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# movie reviews of Harry Potter and the Sorcerer's Stone (2001) in IMDB\n",
    "reviews = [\"\"\"This film was to put it simply rubbish. The child actors couldn't act, as can be seen by Harry's supposed surprise on learning he's a wizard. \"I'm a wizard!\" is said with such indifference you'd think he's not surprised at all. I've never read the books and this film did nothing to make me want to read them. The only spell this cast over me was one to get me out of the room as quickly as possible every time I have seen this on after my first viewing of it in the cinema. If you want to see a decent book made film watch Lord of the Rings or possibly the sequel to this which I thought was actually good. 1/10\"\"\",\n",
    "          \"\"\"When I knew the film was being made, I thought how could they make a film that would be up to the standard of such a perfect book. But they did! Sure they missed bits out but they captured the essence of the book brilliantly. One member of the cast was mis-cast for me but my children disagreed.I even found myself believing they were flying and not wondering \"how are they doing that?\" So 10 out 10 Warner Brothers. Bring on the next one!\"\"\",\n",
    "          \"\"\"This film is simply awful, the acting is a joke. Being a fan of the book I decided to buy the DVD without seeing at the cinema or renting it before, but I was extremely dissapointed. Daniel Radcliff's acting is terrible, in my honest opinion he was only hired because he looks like the illustrations of Harry Potter and he is definatly what I had pictured in my mind but his acting makes a mockery out of the film and turns it into a spoof of the book, instead of bringing it to life. I definatly won't be rushing to the shops to get the second film too soon.\"\"\",\n",
    "          \"\"\"It's a movie made for people who think that magic is cute and fun. If you're a fantasy fan like I am, and believe in magic (in a fictional sense) then you may very well detest this movie, as I did. Magic comes easily to little Harry Potter, who has no personality to speak of except that he is occasionally sad that he never met his mum and dad. Years of abuse and neglect have left not a mark on him, so when he gets invited to Hogwarts Academy of Magic, he has nowhere for his character to go. There he flies around on his broomstick and gets to be a big hero quite by accident without ever confronting any serious obstacles or anything resembling a plot. In all this, the film is an excellent adaptation of the book, which also thought that magic is cutesy wootesy.\"\"\",\n",
    "          \"\"\"2 1/2 hours of Boredom. Half the audience fell asleep, including most of the kiddies. Beautiful to look at, but that does not make for a interesting film. Rather spend your money on Lord of the Rings.\"\"\",\n",
    "          \"\"\"Ya~ When my first grade elementary school, my mom took me to watch the Harry Potter movie. Harry Potter series of movies really accompanied I grow up.\"\"\",\n",
    "          \"\"\"I really love this movie. There is something very magical about it, and gives me such nostalgic feelings seeing it as a child (and reading the books too) and I really think they succeeded with the movie adaption from the book.\"\"\",\n",
    "          \"\"\"That's all there is to say. It'll never be as good as the books, but it came damn close. They picked three great kids for the leads, and did a good job with the other roles as well. As far as how the story translated to the big screen, well, obviously, some parts recived more or less attention than in the book, and the same applies to characters. But overall, it was great!\"\"\",\n",
    "          \"\"\"Harry Potter and the Sorcerers Stone was an excellent film, in a series of Harry Potter films. Although not as good as Harry Potter and the Chamber of Secrets it was very high up in its ratings. The acting in this movie was wonderful. It had great special effects and dialog.I personally felt it was amazing that the whole movie was done by kids who have just started acting, but was still an amazing movie. Emma Watson was perfect in her acting as Hermione.Good Job.\"\"\",\n",
    "          \"\"\"This movie was so corny and unimaginative, a real Chris Colombus piece of hack work. The great cast saves the film from being utter dreck, but otherwise this is pretty dim, formulaic movie-making. Compare it to its announced competition, movies like The Wizard of Oz, Willie Wonka and the Chocolate Factory, The Never Ending Story, or even more recent fare like The Secret Garden or Photographing Fairies. C'mon, you've got to admit this flick is mediocre at best. Stop lapping up the hype and buying into every stupid trend they throw at you. Discriminate a little.\"\"\"]\n",
    "\n",
    "review_labels = [0, 1, 0 ,0, 0, 1, 1, 1, 1, 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define text data preprocessing function: Punctuation removal, lemmatization and stopword removal\n",
    "def preprocessing(text_data):\n",
    "    # remove punctuation\n",
    "    corpus_punctuation_removed = []\n",
    "    \n",
    "    tokenizer = RegexpTokenizer(r'\\w+')\n",
    "    for document in text_data:\n",
    "        corpus_punctuation_removed.append(' '.join(tokenizer.tokenize(document)))\n",
    "    \n",
    "    # word lemmatization for noun and verb\n",
    "    #nltk.download('wordnet')\n",
    "    corpus_lemmatized = []\n",
    "\n",
    "    lemmatizer = WordNetLemmatizer()\n",
    "    for document in corpus_punctuation_removed:\n",
    "        document_words = []\n",
    "        for word in document.lower().split(' '):\n",
    "            noun = lemmatizer.lemmatize(word, 'n')\n",
    "            verb = lemmatizer.lemmatize(word, 'v')\n",
    "            if word != noun:\n",
    "                document_words.append(noun)\n",
    "            elif word != verb:\n",
    "                document_words.append(verb)\n",
    "            else:\n",
    "                document_words.append(word)\n",
    "        corpus_lemmatized.append(' '.join(document_words))\n",
    "    \n",
    "    # remove stopwords\n",
    "    #nltk.download(\"stopwords\")\n",
    "    corpus_stop_removed = []\n",
    "\n",
    "    stop_words = set(corpus.stopwords.words())\n",
    "    for document in corpus_lemmatized:\n",
    "        document_words = []\n",
    "        for word in document.split(' '):\n",
    "            if word not in stop_words:\n",
    "                document_words.append(word)\n",
    "        corpus_stop_removed.append(' '.join(document_words))\n",
    "    \n",
    "    return(corpus_stop_removed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# clean the review data\n",
    "cleaned_reviews = preprocessing(reviews)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# transform to unigram TF-IDF\n",
    "review_tfidf_unigram = vect2_unigram.transform(cleaned_reviews)\n",
    "review_tfidf_matrix_unigram = review_tfidf_unigram.toarray()"
   ]
  },
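  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reviews are passed through `transform` rather than `fit_transform` so that they are mapped onto the vocabulary learned from the training corpus, which is what the trained classifier expects. A minimal sketch of that behaviour on hypothetical toy documents (the names `toy_vect`, `toy_train` and `toy_new` are illustrative only):\n",
    "\n",
    "```python\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "# hypothetical toy data, not taken from the project corpus\n",
    "toy_train = ['good film', 'bad film']\n",
    "toy_new = ['good good unseen']\n",
    "\n",
    "toy_vect = TfidfVectorizer()\n",
    "X_toy_train = toy_vect.fit_transform(toy_train)  # learns the vocabulary: bad, film, good\n",
    "X_toy_new = toy_vect.transform(toy_new)          # reuses it; 'unseen' is ignored\n",
    "\n",
    "# both matrices share the same columns, so a model fitted on one can score the other\n",
    "print(X_toy_train.shape, X_toy_new.shape)\n",
    "```\n",
    "\n",
    "Words that never appeared in the training corpus contribute nothing to the new feature vector."
   ]
  },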
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.90000000000000002"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# predicting the Test set results, find accuracy\n",
    "y_pred = clf_tfidf.predict(review_tfidf_matrix_unigram)\n",
    "accuracy_score(review_labels, y_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The trained classifier achieves 90% accuracy on real IMDB data!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### AWS Sentiment Analysis API"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. In AWS IAM, add a user to a group that has the `AdministratorAccess` policy attached.\n",
    "2. Create an access key for this user.\n",
    "3. In a terminal, run the commands below and answer the prompts:\n",
    "    - `pip3 install awscli --upgrade --user`\n",
    "    - `~/Library/Python/3.6/bin/aws configure`\n",
    "    - Enter your access key ID and secret access key.\n",
    "    - Default region name: `us-east-1`\n",
    "    - Default output format: `json`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "comprehend = boto3.client(service_name='comprehend')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Calling DetectSentiment:\n",
      "\n",
      "NEGATIVE\n",
      "POSITIVE\n",
      "NEGATIVE\n",
      "POSITIVE\n",
      "MIXED\n",
      "NEUTRAL\n",
      "POSITIVE\n",
      "POSITIVE\n",
      "POSITIVE\n",
      "NEGATIVE\n",
      "\n",
      "End of DetectSentiment\n"
     ]
    }
   ],
   "source": [
    "# perform Sentiment Analysis using AWS API\n",
    "print('Calling DetectSentiment:\\n')\n",
    "for movie_rev in reviews:\n",
    "    print(comprehend.detect_sentiment(Text=movie_rev, LanguageCode='en')['Sentiment'])\n",
    "print('\\nEnd of DetectSentiment')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary of the solution and key highlights:\n",
    "* First, we used glob to import all the text files, loading them in the order train negative, train positive, test negative, test positive. This fixed order lets us split the dataset into train and test sets, and into negative and positive labels, in later steps.\n",
    "* Cleaned the raw review text with punctuation removal, English lemmatization and stopword removal. The NLP tools used in this project come from the Python package nltk.\n",
    "* Punctuation removal: removes punctuation such as apostrophes, brackets, colons and commas from the text.\n",
    "* English lemmatization: removes inflectional endings only and returns the base or dictionary form of a word.\n",
    "* Stopword removal: removes stopwords such as \"the\", \"for\" and \"a\" from the text.\n",
    "* Performed Bag of Words (BoW) to build the document-term frequency table.\n",
    "* Performed Term Frequency - Inverse Document Frequency (TF-IDF) and tried both unigram and bigram methods.\n",
    "* Performed feature hashing to reduce the number of features (reducing to 50 features as an example).\n",
    "* Sentiment analysis: trained and tested a Naive Bayes classifier using the BoW text representation (accuracy score = 0.819).\n",
    "* Sentiment analysis: trained and tested a Naive Bayes classifier using the unigram TF-IDF text representation (accuracy score = 0.827).\n",
    "* Sentiment analysis: trained and tested a Naive Bayes classifier using the feature hashing text representation (accuracy score = 0.589).\n",
    "* Chose the Naive Bayes classifier with the unigram TF-IDF text representation, which had the highest accuracy score, and tested this winning classifier on some IMDB movie reviews.\n",
    "* The trained classifier achieved 90% accuracy on this small sample of real IMDB reviews.\n",
    "* Explored sentiment analysis using the AWS Comprehend API."
   ]
  },
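  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The feature hashing step summarized above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in this notebook: the helper names `bucket_index`, `hash_features` and `count_collisions` are hypothetical, and MD5 with unsigned counts stands in for the signed MurmurHash3 that scikit-learn's `FeatureHasher` uses.\n",
    "\n",
    "```python\n",
    "import hashlib\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def bucket_index(token, n_features):\n",
    "    '''Map a token to a bucket; MD5 keeps the result stable across runs.'''\n",
    "    digest = hashlib.md5(token.encode('utf-8')).hexdigest()\n",
    "    return int(digest, 16) % n_features\n",
    "\n",
    "\n",
    "def hash_features(tokens, n_features=50):\n",
    "    '''Turn a list of tokens into a fixed-length count vector.'''\n",
    "    vector = [0] * n_features\n",
    "    for token in tokens:\n",
    "        vector[bucket_index(token, n_features)] += 1\n",
    "    return vector\n",
    "\n",
    "\n",
    "def count_collisions(vocabulary, n_features):\n",
    "    '''Count distinct words that share a bucket with another word.'''\n",
    "    buckets = Counter(bucket_index(w, n_features) for w in set(vocabulary))\n",
    "    return sum(size for size in buckets.values() if size > 1)\n",
    "\n",
    "\n",
    "tokens = 'great movie great cast weak plot'.split()\n",
    "vector = hash_features(tokens, n_features=50)\n",
    "print(len(vector), sum(vector))  # 50 6\n",
    "```\n",
    "\n",
    "The vector length is fixed by `n_features` regardless of vocabulary size. Fewer buckets mean more words collide in the same position, so `count_collisions` can be used to compare candidate sizes before training."
   ]
  },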
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key learnings:\n",
    "* When working with machine learning algorithms, the most important thing to keep in mind is to preprocess the data before feeding it to the model. Prediction accuracy depends directly on the training data set, which in turn depends on data preprocessing and data wrangling.\n",
    "* Punctuation removal, lemmatization and stopword removal are key steps in preprocessing text data. They can reduce noise to a large extent.\n",
    "* TF-IDF gives high weight to words that appear frequently in a particular document but rarely in the overall corpus. The method is powerful when the problem fits this pattern, but some problems have different characteristics.\n",
    "* Feature hashing is a simple but powerful way to reduce the dimensionality of the data, but choosing the number of features is an important and tricky problem.\n",
    "* N-grams (such as bigrams) often perform better than unigrams; however, they significantly increase the computational resources required.\n",
    "* APIs are very powerful in many cases, and we should make use of them!"
   ]
  },
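  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The TF-IDF learning above can be made concrete with the textbook weighting. This is an illustrative sketch: the symbols $N$ and $\\mathrm{df}(t)$ and the toy numbers are assumptions introduced here, not values from this notebook.\n",
    "\n",
    "$$\\mathrm{tfidf}(t, d) = \\mathrm{tf}(t, d) \\cdot \\mathrm{idf}(t), \\qquad \\mathrm{idf}(t) = \\ln \\frac{N}{\\mathrm{df}(t)}$$\n",
    "\n",
    "Here $\\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\\mathrm{df}(t)$ is the number of documents that contain $t$. With $N = 1000$, a word that occurs 3 times in a review and appears in only 10 reviews scores $3 \\ln 100 \\approx 13.8$, while a word that occurs 3 times but appears in 900 reviews scores $3 \\ln(1000/900) \\approx 0.32$. Note that scikit-learn's `TfidfVectorizer` by default uses the smoothed form $\\ln \\frac{1 + N}{1 + \\mathrm{df}(t)} + 1$ and then L2-normalizes each document vector, so its values differ from this sketch."
   ]
  },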
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References <a class=\"anchor\" id=\"4\"></a>\n",
    "\n",
    "http://docs.tweepy.org/en/v3.5.0/streaming_how_to.html\n",
    "\n",
    "https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object\n",
    "\n",
    "https://www.dataquest.io/blog/streaming-data-python/\n",
    "\n",
    "https://github.com/dataquestio/twitter-scrape/blob/master/scraper.py\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "165px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}