{ "metadata": { "name": "", "signature": "sha256:fdc85388b4ed5740646d4ef62c0789d598261117e8baf30f601b74e3dd27e569" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%%html\n", "" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "" ], "metadata": {}, "output_type": "display_data", "text": [ "" ] } ], "prompt_number": 1 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "REST, JSON, and text" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Class Objectives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Understand what REST and JSON are, and how to use them in Python\n", "* Review string parsing functionality and dummy variables in pandas (with merge, join, and concat)\n", "* Review some of the useful functionality in `nltk`\n", "* Review concepts of text handling (counting, stopwords, corpora, and dictionaries)" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Data Goals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Given Twitter data, can we build some understanding of what news to expect from a hashtag?\n", "2. Is there any difference in sentiment across hashtags about the same topic?\n", "3. In what creative ways can we engineer features in order to predict a hashtag?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## To Start:\n", "\n", "We'll talk about the difficulty of text parsing by looking at this website first:\n", "\n", "http://translationparty.com/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## REST and JSON\n", "\n", "REST stands for Representational State Transfer. It's a simplified architectural style for networked applications. 
Web frameworks such as Rails and Django encourage a RESTful design for its ease of use compared to other architectures, such as CORBA or [SOAP](http://www.w3schools.com/webservices/ws_soap_example.asp).\n", "\n", "RESTful applications do four primary things around \"resources\":\n", "\n", "* `GET`: Retrieve the collection of data\n", "* `PUT`: Replace or update the collection of data\n", "* `POST`: Create a new collection of data\n", "* `DELETE`: Delete the collection of data\n", "\n", "REST is also the basis for [AJAX](http://api.jquery.com/jquery.ajax/) requests, and it is the primary way your client (your computer) and the web server (where the web application resides) talk to each other. These AJAX calls primarily use JSON to send data between the server and client.\n", "\n", "JSON stands for JavaScript Object Notation. In Python, it essentially looks like a stringified version of a dictionary:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "valid_json = '{ \"some_kinda_key\": \"some_kinda_value\" }'\n", "\n", "also_valid_json = '[{ \"some_kinda_key\": \"some_kinda_value\" }, { \"some_other_key\": \"some_other_value\" }]'\n", "\n", "actually_a_dictionary = {'some_kinda_key': 'some_kinda_value'}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since JSON uses a hash/dictionary-like format, it is less rigidly structured than a CSV file. That said, when you interact with an API (Application Programming Interface) and `GET` a series of items from the same collection, the items will generally share the same fields. A typical JSON response would look something like this:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```javascript\n", "{\n", "    \"response\": 200,\n", "    \"params\": {\n", "        \"q\": \"dinosaurs\",\n", "        \"count\": 50\n", "    },\n", "    \"results\": [{ ... 
}]\n", "} \n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A response like this is likely returned for a URL request that looks like this:\n", "\n", "```\n", "https://api.awebsite.com/v1/search/?q=dinosaurs&count=50\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To load JSON into Python, we use the `json` module. To find some JSON, we'll dig it out of the network traffic of a website (Petfinder will do, in this case)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import json\n", "\n", "# use a context manager so the file is closed once it has been parsed\n", "with open('../data/petfinder.json') as f:\n", "    pets = json.load(f)\n", "\n", "print type(pets)\n", "print pets.keys()\n", "print type(pets['results'])\n", "print len(pets['results'])\n", "print json.dumps(pets['meta_data'])\n", "print type(json.dumps(pets))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "<type 'dict'>\n", "[u'meta_data', u'results', u'links']\n", "<type 'list'>\n", "15\n", "{\"total_results\": 346182, \"query\": {\"status\": \"Adoptable\", \"lon\": \"-73.9892\", \"page_size\": \"15\", \"page_number\": \"0\", \"location\": \"10003\", \"lat\": \"40.7316\", \"uri_prefix\": \"http://www.petfinder\"}, \"rows\": 15, \"page_number\": 0, \"query_id\": \"2A0BD832-546C-11E4-9E25-F1C9FFEE616C\"}\n", "<type 'str'>\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above JSON, it looks like Petfinder returns a list of 15 animals by default.\n", "\n", "Petfinder also has an API, so you can get an API key and build requests using the `requests` package in Python. You should get similar results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with Text: Pandas\n", "\n", "The `results` field in the JSON contains a list of records. 
Pandas can turn this list into a DataFrame directly with the `pd.DataFrame()` constructor:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "\n", "petdf = pd.DataFrame(pets['results'])\n", "petdf" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ " age animal contact_name coords country_code \\\n", "0 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "1 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "2 Adult Dog Samantha Brody 40.7316,-73.9892 US \n", "3 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "4 Adult Dog Samantha Brody 40.7316,-73.9892 US \n", "5 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "6 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "7 Young Cat Samantha Brody 40.7316,-73.9892 US \n", "8 Adult Dog Samantha Brody 40.7316,-73.9892 US \n", "9 Young Dog Dimitra Bennett 40.7316,-73.9892 US \n", "10 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "11 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "12 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "13 Baby Dog Dimitra Bennett 40.7316,-73.9892 US \n", "14 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "\n", " date_updated description \\\n", "0 2014-10-08T18:34:07Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "1 2014-08-20T20:56:44Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "2 2014-03-07T20:09:56Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "3 2014-07-15T01:11:37Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "4 2014-10-13T13:14:10Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "5 2014-07-28T15:09:15Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "6 2014-04-27T19:30:38Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "7 2014-06-06T13:59:06Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "8 2014-07-21T23:29:08Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "9 2014-09-20T23:27:52Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "10 2014-09-09T16:09:07Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "11 2014-09-15T15:27:49Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "12 2014-09-16T15:32:26Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... 
\n", "13 2014-10-12T04:56:18Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "14 2014-09-08T18:09:12Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "\n", " email_address export export_api \\\n", "0 samantha@socialteesnyc.org True True \n", "1 samantha@socialteesnyc.org True True \n", "2 samantha@socialteesnyc.org True True \n", "3 samantha@socialteesnyc.org True True \n", "4 samantha@socialteesnyc.org True True \n", "5 samantha@socialteesnyc.org True True \n", "6 samantha@socialteesnyc.org True True \n", "7 samantha@socialteesnyc.org True True \n", "8 samantha@socialteesnyc.org True True \n", "9 dimitra@socialteesnyc.org True True \n", "10 samantha@socialteesnyc.org True True \n", "11 samantha@socialteesnyc.org True True \n", "12 samantha@socialteesnyc.org True True \n", "13 dimitra@socialteesnyc.org True True \n", "14 samantha@socialteesnyc.org True True \n", "\n", " ... primary_breed \\\n", "0 ... Boxer \n", "1 ... Manchester Terrier \n", "2 ... Chihuahua \n", "3 ... Boston Terrier \n", "4 ... Shih Tzu \n", "5 ... Chihuahua \n", "6 ... Shih Tzu \n", "7 ... Tortoiseshell \n", "8 ... Chihuahua \n", "9 ... Labrador Retriever \n", "10 ... Black and Tan Coonhound \n", "11 ... Catahoula Leopard Dog \n", "12 ... Shih Tzu \n", "13 ... Weimaraner \n", "14 ... Maltese \n", "\n", " record_nav_links region_code \\\n", "0 {u'next': u'http://www.petfinder/2A0BD832-546C... NY \n", "1 {u'previous': u'http://www.petfinder/2A0BD832-... NY \n", "2 {u'next': u'http://www.petfinder/2A0BD832-546C... NY \n", "3 {u'previous': u'http://www.petfinder/2A0BD832-... NY \n", "4 {u'previous': u'http://www.petfinder/2A0BD832-... NY \n", "5 {u'previous': u'http://www.petfinder/2A0BD832-... NY \n", "6 {u'next': u'http://www.petfinder/2A0BD832-546C... NY \n", "7 {u'next': u'http://www.petfinder/2A0BD832-546C... NY \n", "8 {u'previous': u'http://www.petfinder/2A0BD832-... NY \n", "9 {u'next': u'http://www.petfinder/2A0BD832-546C... 
NY \n", "10 {u'next': u'http://www.petfinder/2A0BD832-546C... NY \n", "11 {u'next': u'http://www.petfinder/2A0BD832-546C... NY \n", "12 {u'previous': u'http://www.petfinder/2A0BD832-... NY \n", "13 {u'next': u'http://www.petfinder/2A0BD832-546C... NY \n", "14 {u'next': u'http://www.petfinder/2A0BD832-546C... NY \n", "\n", " secondary_breed shelter_id \\\n", "0 Staffordshire Bull Terrier NY835 \n", "1 NaN NY835 \n", "2 NaN NY835 \n", "3 Chihuahua NY835 \n", "4 Jack Russell Terrier NY835 \n", "5 Terrier NY835 \n", "6 NaN NY835 \n", "7 NaN NY835 \n", "8 Terrier NY835 \n", "9 Husky NY835 \n", "10 Hound NY835 \n", "11 Pit Bull Terrier NY835 \n", "12 NaN NY835 \n", "13 NaN NY835 \n", "14 Poodle NY835 \n", "\n", " shelter_name size species status \\\n", "0 Social Tees Animal Rescue Foundation Medium Dog Adoptable \n", "1 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "2 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "3 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "4 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "5 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "6 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "7 Social Tees Animal Rescue Foundation Medium Cat Adoptable \n", "8 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "9 Social Tees Animal Rescue Foundation Medium Dog Adoptable \n", "10 Social Tees Animal Rescue Foundation Medium Dog Adoptable \n", "11 Social Tees Animal Rescue Foundation Medium Dog Adoptable \n", "12 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "13 Social Tees Animal Rescue Foundation Medium Dog Adoptable \n", "14 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "\n", " street_address \n", "0 please email dimitra.socialtees@gmail.com \n", "1 please email dimitra.socialtees@gmail.com \n", "2 please email dimitra.socialtees@gmail.com \n", "3 please email dimitra.socialtees@gmail.com \n", "4 please email 
dimitra.socialtees@gmail.com \n", "5 please email dimitra.socialtees@gmail.com \n", "6 please email dimitra.socialtees@gmail.com \n", "7 please email dimitra.socialtees@gmail.com \n", "8 please email dimitra.socialtees@gmail.com \n", "9 please email dimitra.socialtees@gmail.com \n", "10 please email dimitra.socialtees@gmail.com \n", "11 please email dimitra.socialtees@gmail.com \n", "12 please email dimitra.socialtees@gmail.com \n", "13 please email dimitra.socialtees@gmail.com \n", "14 please email dimitra.socialtees@gmail.com \n", "\n", "[15 rows x 33 columns]" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pulling data from text\n", "\n", "One technique for working with strings in pandas is matching substrings. Imagine we wanted to create a feature called \"is_terrier.\" Our steps would be to:\n", "\n", "1. Use the two breed fields\n", "2. Check whether either field includes the text \"Terrier\"\n", "3. Set a 1 if either field has \"Terrier\" or a 0 if it does not.\n", "\n", "Sample Code:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "petdf['is_terrier'] = 0 # let's default to 0, so we only need to update as we recognize values\n", "petdf.loc[petdf.primary_breed.str.contains('Terrier'), 'is_terrier'] = 1\n", "# na=False treats a missing secondary breed as \"no match\"\n", "petdf.loc[petdf.secondary_breed.str.contains('Terrier', na=False), 'is_terrier'] = 1\n", "\n", "print petdf.groupby('is_terrier').age.count()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "is_terrier\n", "0 8\n", "1 7\n", "Name: age, dtype: int64\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Learn more about additional string methods [here](http://pandas.pydata.org/pandas-docs/dev/api.html#string-handling).\n", "\n", "### Creating dummy variables\n", "\n", "Another task we'll commonly need to handle is converting categorical data into numerical data. 
While some domain-specific languages (such as R) have built-in ways to encode categorical data, in Python we need to convert everything to numbers ourselves.\n", "\n", "After creating a dummy matrix, we'll want to use a join to bring the data back together. Note the variety of ways to combine data in pandas:\n", "\n", "function | description\n", "---------|------------\n", "append | Works much like `list.append()`: takes one data frame and appends another to the bottom of it (by rows).\n", "concat | Takes a list of data frames, where the first is the \"primary\" and all the others are `append`ed in order, like a queue.\n", "merge | Links two data frames by matching rows wherever the columns chosen to merge on are equal.\n", "join | Works much like merge, but matches on the indexes. Very common to use alongside `get_dummies`.\n", "\n", "Below, we'll create a new data frame that \"dummies\" the size of each animal, and then attaches the sizes as new columns."
] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "\n", "# simple proof of concept on how concat works; otherwise, ignore\n", "newdf = pd.concat([petdf[:7], petdf[7:]])\n", "print newdf == petdf\n", "# NaN is always false on equality (python gotcha!)\n", "print\n", "print 'numpy nan truth check!'\n", "print np.nan == np.nan\n", "print\n", "sizes = pd.get_dummies(petdf['size'])\n", "print sizes.head()\n", "\n", "petdf_wsizes = petdf.join(sizes)\n", "petdf_wsizes.head()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " age animal contact_name coords country_code date_updated description \\\n", "0 True True True True True True True \n", "1 True True True True True True True \n", "2 True True True True True True True \n", "3 True True True True True True True \n", "4 True True True True True True True \n", "5 True True True True True True True \n", "6 True True True True True True True \n", "7 True True True True True True True \n", "8 True True True True True True True \n", "9 True True True True True True True \n", "10 True True True True True True True \n", "11 True True True True True True True \n", "12 True True True True True True True \n", "13 True True True True True True True \n", "14 True True True True True True True \n", "\n", " email_address export export_api ... record_nav_links region_code \\\n", "0 True True True ... True True \n", "1 True True True ... True True \n", "2 True True True ... True True \n", "3 True True True ... True True \n", "4 True True True ... True True \n", "5 True True True ... True True \n", "6 True True True ... True True \n", "7 True True True ... True True \n", "8 True True True ... True True \n", "9 True True True ... True True \n", "10 True True True ... True True \n", "11 True True True ... True True \n", "12 True True True ... True True \n", "13 True True True ... True True \n", "14 True True True ... 
True True \n", "\n", " secondary_breed shelter_id shelter_name size species status \\\n", "0 True True True True True True \n", "1 False True True True True True \n", "2 False True True True True True \n", "3 True True True True True True \n", "4 True True True True True True \n", "5 True True True True True True \n", "6 False True True True True True \n", "7 False True True True True True \n", "8 True True True True True True \n", "9 True True True True True True \n", "10 True True True True True True \n", "11 True True True True True True \n", "12 False True True True True True \n", "13 False True True True True True \n", "14 True True True True True True \n", "\n", " street_address is_terrier \n", "0 True True \n", "1 True True \n", "2 True True \n", "3 True True \n", "4 True True \n", "5 True True \n", "6 True True \n", "7 True True \n", "8 True True \n", "9 True True \n", "10 True True \n", "11 True True \n", "12 True True \n", "13 True True \n", "14 True True \n", "\n", "[15 rows x 34 columns]\n", "\n", "numpy nan truth check!\n", "False\n", "\n", " Medium Small\n", "0 1 0\n", "1 0 1\n", "2 0 1\n", "3 0 1\n", "4 0 1\n" ] }, { "html": [ "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ " age animal contact_name coords country_code \\\n", "0 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "1 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "2 Adult Dog Samantha Brody 40.7316,-73.9892 US \n", "3 Young Dog Samantha Brody 40.7316,-73.9892 US \n", "4 Adult Dog Samantha Brody 40.7316,-73.9892 US \n", "\n", " date_updated description \\\n", "0 2014-10-08T18:34:07Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "1 2014-08-20T20:56:44Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "2 2014-03-07T20:09:56Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "3 2014-07-15T01:11:37Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "4 2014-10-13T13:14:10Z Please VISIT OUR FACEBOOK PAGE AND WEBSITE for... \n", "\n", " email_address export export_api ... \\\n", "0 samantha@socialteesnyc.org True True ... \n", "1 samantha@socialteesnyc.org True True ... \n", "2 samantha@socialteesnyc.org True True ... \n", "3 samantha@socialteesnyc.org True True ... \n", "4 samantha@socialteesnyc.org True True ... 
\n", "\n", " secondary_breed shelter_id \\\n", "0 Staffordshire Bull Terrier NY835 \n", "1 NaN NY835 \n", "2 NaN NY835 \n", "3 Chihuahua NY835 \n", "4 Jack Russell Terrier NY835 \n", "\n", " shelter_name size species status \\\n", "0 Social Tees Animal Rescue Foundation Medium Dog Adoptable \n", "1 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "2 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "3 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "4 Social Tees Animal Rescue Foundation Small Dog Adoptable \n", "\n", " street_address is_terrier Medium Small \n", "0 please email dimitra.socialtees@gmail.com 1 1 0 \n", "1 please email dimitra.socialtees@gmail.com 1 0 1 \n", "2 please email dimitra.socialtees@gmail.com 0 0 1 \n", "3 please email dimitra.socialtees@gmail.com 1 0 1 \n", "4 please email dimitra.socialtees@gmail.com 1 0 1 \n", "\n", "[5 rows x 36 columns]" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick Note: modules\n", "\n", "The next section requires us to use a config module to store our twitter api configuration.
\n", "Modules, for the most part, are just Python files. You'll see an empty file called `twitterconfig.py` in the class folder. If you're working directly on this notebook, we'll edit the file. Otherwise, make sure you have a `twitterconfig.py` file in the directory where you started the notebook! The file should look something like this:\n", "\n", "```python\n", "TOKEN = ''\n", "TOKEN_SECRET = ''\n", "CONSUMER_KEY = ''\n", "CONSUMER_SECRET = ''\n", "```\n", "\n", "Using an import statement (`import twitterconfig`) loads the file and places any names defined in it into a namespace called \"twitterconfig\". You access those names with dot notation. So with the above, you'd have access to:\n", "\n", "```python\n", "twitterconfig.TOKEN\n", "twitterconfig.TOKEN_SECRET\n", "twitterconfig.CONSUMER_KEY\n", "twitterconfig.CONSUMER_SECRET\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Libraries that do the work for us\n", "\n", "Building URL requests, while not (usually) incredibly complicated, is still a chore. A Python library can do the work for you [automagically](https://github.com/gtaylor/petfinder-api).\n", "\n", "Since we will be working with text data, a good source of live text is [Twitter](https://github.com/bear/python-twitter). 
We can dissect the code below:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# coding=utf-8\n", "from twitter import *\n", "import twitterconfig as config\n", "\n", "\"\"\"\n", "There would be code here to set up a dictionary including everything required to load the authentication for twitter.\n", "What are the advantages of setting up a dictionary here instead of just defining it below?\n", "\"\"\"\n", "t = Twitter(\n", "    auth=OAuth(\n", "        token=config.TOKEN,\n", "        token_secret=config.TOKEN_SECRET,\n", "        consumer_key=config.CONSUMER_KEY,\n", "        consumer_secret=config.CONSUMER_SECRET)\n", "    )\n", "\n", "hashtags=[\n", "    u'hongkong',\n", "    u'occupycentral',\n", "    u'umbrellarevolution',\n", "    u'china',\n", "    u'hk',\n", "    u'admiralty',\n", "    u'occupyhk',\n", "]\n", "\n", "all_results = []\n", "for hashtag in hashtags:\n", "    # Search for each hashtag and print one CSV-style row per tweet: id, text, hashtag\n", "    results = t.search.tweets(q='#'+hashtag, count=100, result_type='mixed')\n", "    for r in results['statuses']:\n", "        try:\n", "            # Flatten newlines and swap double quotes so each tweet stays on one quoted line\n", "            clean_tweet = unicode(r['text']).encode('utf-8').replace('\\n', ' ').replace('\\r', ' ').replace('\"', \"'\")\n", "            print u','.join([unicode(r['id']),'\"'+clean_tweet+'\"', '\"'+hashtag+'\"'])\n", "        except UnicodeDecodeError:\n", "            # Tweets containing non-ASCII characters fail the join above and are skipped\n", "            pass\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "572306942825140224,\"The very famous #HongKong celebrity I just interviewed, a cat called Cream Brother. Stay tuned for the story! 
http://t.co/Yd2K1tUNtF\",\"hongkong\"\n", "572380843739463681,\"The irresistibly romantic Victoria Harbor in #HongKong: http://t.co/XxnC77E7h3 http://t.co/pfEEwNg9Fo\",\"hongkong\"\n", "572060408552423425,\"Arrests made in #HongKong as #Protesters clash with Police http://t.co/LXEQNh7cR5 #1u http://t.co/OTAALGob5c\",\"hongkong\"\n", "572501639157833728,\"Manufacturing Jobs in Hong Kong - http://t.co/Ag2UryCD0Q #jobs #HongKong #vacancy #jobsinHongKong #career #oppertunity #jobsearch\",\"hongkong\"\n", "572501411822350337,\"Ancient #Chinese vase set to fetch $7.7 mn in #HongKong http://t.co/5apmOU7F2B #ZippedNews http://t.co/AOsXPZKokw\",\"hongkong\"\n", "572501395288555520,\"Better #property deals on the horizon in #HongKong http://t.co/GuYOGNiODV #realestate http://t.co/xnYNhcSRHw\",\"hongkong\"\n", "572501210940362752,\"iTunes Hong Kong:Mac App:top free mac apps:Blackmagic Disk Speed Test - Blackmagic Design ... http://t.co/Hg7891Bdj6 #followme #hongkong\",\"hongkong\"\n", "572500398902263808,\"yayyyy just brought my first 4 steps for my impending @ElCaminoB bracelet!!!!! #hongkong #israel #macedonia #slovenia\",\"hongkong\"\n", "572499956583440385,\"#EXO #hongkong #FROMLOSTPLANET #20140601 #luhan http://t.co/S1Tl9ZXhWH\",\"hongkong\"\n", "572499853088960512,\"RT @pinmoralesjr: I thought this was a perfect spot. #disney #orlando #hongkong #holidays #lovetravel https://t.co/7FbiTDcErO\",\"hongkong\"\n", "572499804187717632,\"I thought this was a perfect spot. 
#disney #orlando #hongkong #holidays #lovetravel https://t.co/7FbiTDcErO\",\"hongkong\"\n", "572499520556167168,\"Hong Kong woman jailed for six years for abusing Indonesian maid - Reuters #hongkong http://t.co/pZrOpHN52W\",\"hongkong\"\n", "572497552878120960,\"Just Back from #Klia2 with MrAzryn / Sending #Shahdihah >> #Holidays > #HongKong http://t.co/w4ouIWLyDt\",\"hongkong\"\n", "572497487078019072,\"Exciting journey ahead #HongKong #jewellery #display#trends#new#designs\",\"hongkong\"\n", "572497440185577472,\"Genetics & Rheumatoid Arthritis http://t.co/1efRLBGJ0C #HongKong #News\",\"hongkong\"\n", "572497436104507392,\"Retinal Tears http://t.co/sqSGPqjrOs #HongKong #News\",\"hongkong\"\n", "572497430517706752,\"Home Fitness http://t.co/tcfxHlB1Ey #HongKong #News\",\"hongkong\"\n", "572497425769742336,\"Report: England footballer in child sex case http://t.co/ly99eMJLhg #HongKong #News\",\"hongkong\"\n", "572497422024232960,\"Boris Nemtsov: Defiant in the face of death http://t.co/zXj4sNlkbY #HongKong #News\",\"hongkong\"\n", "572497417326632960,\"'Walking Dead' town for sale http://t.co/Fn72cWFmya #HongKong #News\",\"hongkong\"\n", "572497328415870976,\"#China #HongKong #stocks: #Markets: Sunny Optical Technology : Mar 2015 growth predicted as downcast. http://t.co/ZBpQMDdftw\",\"hongkong\"\n", "572497120210653185,\"Sup, no referer #proxy from #HongKong: https://t.co/dhBw2PxREG\",\"hongkong\"\n", "572496075321298944,\"#ProjectMgmt #Job in #HongKong: Senior Business Specialist at Manulife http://t.co/tEUwFk1nTM #Jobs #TweetMyJobs\",\"hongkong\"\n", "572495152759095297,\"RT @ajam: Hong Kong arrests 38 people in protests over mainland Ch... 
http://t.co/Ib3k5vgzNG #Hongkong #Umhk #Occupyhk #HK via @PhotoRaptor\",\"hongkong\"\n", "572494777930911744,\"RT @ajam: #HongKong arrests 38 people in protests over mainland Chinese shoppers http://t.co/HD71Om4KlW http://t.co/9b2JtKlLo2 #China\",\"hongkong\"\n", "572494034591211520,\"RT @sophiedee: Light show not as good as the last few years. But the view still looks amazing ! #hongkong https://t.co/81mNOIAEJD\",\"hongkong\"\n", "572493555505053696,\"#exo #HongKong #LOSTPLANET #kai #jongin http://t.co/ZUtGlKdlfr\",\"hongkong\"\n", "572493469077413888,\"So looking forward to visiting #HongKong next week. Experiencing #culture #fashion #food Visitors tips welcome\",\"hongkong\"\n", "572493452182597632,\"Menginap Di Disney Hollywood Hotel; #Hongkong Disneyland Trip - #BacaYuk #Jalan2Yuk http://t.co/lm20Opzb0O #LuarNegeri\",\"hongkong\"\n", "572492802883514368,\"#China #HongKong #stocks: #Markets: Tomorrow International : Mar 2015 growth predicted as concrete. http://t.co/5QFP9BvOSa\",\"hongkong\"\n", "572492095526596608,\"' anders sein ' http://t.co/0977bmpZrl #tokyo #berlin #HongKong #usa #startup #losangeles #London hj #newyork http://t.co/6hOJitO0Kr .\",\"hongkong\"\n", "572491640495054848,\"RT @hongkong_agent: Opinion: Hong Kong's rich-government, poor-growth economy - MarketWatch http://t.co/od2zXyXFHa #hongkong\",\"hongkong\"\n", "572491622283382784,\"#HongKong Rolls Out Measures to Cool Booming #property Market http://t.co/c0k3TyNMOv #realestate\",\"hongkong\"\n", "572491549390581760,\"Opinion: Hong Kong's rich-government, poor-growth economy - MarketWatch http://t.co/od2zXyXFHa #hongkong\",\"hongkong\"\n", "572491548010672129,\"At least three arrested at Hong Kong anti-China protest - euronews http://t.co/jO0v6Lk78W #hongkong\",\"hongkong\"\n", "572491545942855680,\"2015 Forbes Billionaires: Ten New Faces Help Lift Hong Kong Membership To 55 - Forbes http://t.co/GImXRbgb3F #hongkong\",\"hongkong\"\n", "572491543891865600,\"Chinese Shoppers Latest Target of 
Hong Kong Protest Anger - ABC News http://t.co/0PLjpAi0k5 #hongkong\",\"hongkong\"\n", "572489507494215680,\"Hold a baby and a gun? There's a class for that http://t.co/ZZNoLlE7Ir #HongKong #News\",\"hongkong\"\n", "572489503144689664,\"Russian opposition 'afraid,' but vows to push on http://t.co/3P9tkZ5059 #HongKong #News\",\"hongkong\"\n", "572489498367426561,\"No opposition in Russia? Putin ally calls it 'propaganda' http://t.co/eroDXYY07s #HongKong #News\",\"hongkong\"\n", "572489493866942464,\"Putin holds Ukraine pilot's fate, says her lawyer http://t.co/l5TrcNSXdf #HongKong #News\",\"hongkong\"\n", "572488313539465216,\"wsj: #Alibaba to Invest $316M in #Taiwan #Startup Fund http://t.co/MSxWXJ8vhg #hongkong #taiwan #startups #entrepreneur #vc #venturecapital\",\"hongkong\"\n", "572487838056357888,\"#hongkong Hong Kong Film Festival Unveils Lineup: Sylvia Chang's 'Murmur of the Heart... http://t.co/uWGAQJSlwt - http://t.co/UIQIpYp9sm\",\"hongkong\"\n", "572487126840967168,\"#hongkong https://t.co/za3NXn6jDR\",\"hongkong\"\n", "572486952106135553,\"Jsuis trop degoutee elle est pourrie la fin #serie #hongkong #raisingbar\",\"hongkong\"\n", "572485113335943168,\"Mei Ho House: The Housing History of Hong Kong http://t.co/ezaAQobVqo #asia #hongkong #travel\",\"hongkong\"\n", "572485028279537664,\"' anders sein ' http://t.co/0977bmpZrl #tokyo #berlin #HongKong #usa #startup #losangeles #London ^ #newyork http://t.co/6hOJitO0Kr .\",\"hongkong\"\n", "572330514423648256,\"Protest against Chinese SHOPPERS breaks out in clashes http://t.co/PevTyjR3EV #occupyhk #occupycentral #HK https://t.co/RT4AHuu6Uj\",\"occupycentral\"" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "572442084260638721,\"Independent filmmaker making a documentary video for the movement http://t.co/7e8kIE1dMa #umbrellamovement #occupyhongkong #occupycentral\",\"occupycentral\"\n", "572362621627858944,\"RT @aqchiu: Arrests after HK shopping protest via @BBC featured news 
#OccupyCentral http://t.co/3D6OmxMG5X\",\"occupycentral\"\n", "572351103930540032,\"Arrests after HK shopping protest via @BBC featured news #OccupyCentral http://t.co/3D6OmxMG5X\",\"occupycentral\"\n", "572350490110926848,\"RT @gavinrmorgan: #Retail #property sales up despite #OccupyCentral protests in #HongKong http://t.co/259fpxvC7D\",\"occupycentral\"\n", "572345651788427264,\"RT @gavinrmorgan: #Retail #property sales up despite #OccupyCentral protests in #HongKong http://t.co/259fpxvC7D\",\"occupycentral\"\n", "572345621891440640,\"RT @wolfpeng: More than 30 arrested as Hk anti-#China protesters scuffle w police http://t.co/uC0xWbYpub #occupyhk #occupycentral #HK\",\"occupycentral\"\n", "572345594892685312,\"RT @RoyCCNg: @HKFS1958 Two small-circle elected kids demanded universal suffrage. Do they know what they're doing!? #occupycentral #occupyhk\",\"occupycentral\"\n", "572345261399379968,\"Protest against Chinese SHOPPERS breaks out in clashes http://t.co/ZeDzOhrRUJ #occupyhk #occupycentral #HK http://t.co/ehr3247wyB\",\"occupycentral\"\n", "572345207259340801,\"RT @gavinrmorgan: #Retail #property sales up despite #OccupyCentral protests in #HongKong http://t.co/259fpxvC7D\",\"occupycentral\"\n", "572330514423648256,\"Protest against Chinese SHOPPERS breaks out in clashes http://t.co/PevTyjR3EV #occupyhk #occupycentral #HK https://t.co/RT4AHuu6Uj\",\"occupycentral\"\n", "572319528555843584,\"Hong Kong lawmakers http://t.co/C6LmNrWZt7 #hongkong #umhk #occupyhk #hk #umbrellamovement #occupycentral #umbrellarevolution #hkindigenous\",\"occupycentral\"\n", "572316777100677120,\"#Retail #property sales up despite #OccupyCentral protests in #HongKong http://t.co/259fpxvC7D\",\"occupycentral\"\n", "572293104260276224,\"themselves police protests http://t.co/C6LmNrWZt7 #hongkong #umhk #occupyhk #hk #umbrellamovement #occupycentral #umbrellarevolution\",\"occupycentral\"\n", "572274119879753728,\"@HKFS1958 Two small-circle elected kids demanded universal suffrage. 
Do they know what they're doing!? #occupycentral #occupyhk\",\"occupycentral\"\n", "572241083859529728,\"More than 30 arrested as Hong Kong anti-#China protesters scuffle with police http://t.co/vOt8e85D9D #occupyhk #occupycentral #HK #freedom\",\"occupycentral\"\n", "572236307897847808,\"HongKong pro-democracy lawmakers hand themselves in for being a part of #OccupyCentral http://t.co/REYiOaZEka http://t.co/ox4A7G1p1S\",\"occupycentral\"\n", "572233844851515392,\"#HongKong pro-democracy lawmakers hand themselves in for being a part of #OccupyCentral http://t.co/X544AbR8cZ http://t.co/Sh9LZyQUQk\",\"occupycentral\"\n", "572231225915531265,\"RT @STForeignDesk: #HongKong lawmakers hand themselves in to police over #OccupyCentral protests http://t.co/3wAdhvYepu\",\"occupycentral\"\n", "572231113072115712,\"RT @STForeignDesk: #HongKong lawmakers hand themselves in to police over #OccupyCentral protests http://t.co/3wAdhvYepu\",\"occupycentral\"\n", "572231098601746434,\"RT @STcom: RT @STForeignDesk: #HongKong lawmakers hand themselves in to police over #OccupyCentral protests http://t.co/6Oc69e7dW3\",\"occupycentral\"\n", "572228962593611776,\"RT @STForeignDesk: #HongKong lawmakers hand themselves in to police over #OccupyCentral protests http://t.co/6Oc69e7dW3\",\"occupycentral\"\n", "572228937922703360,\"#HongKong lawmakers hand themselves in to police over #OccupyCentral protests http://t.co/3wAdhvYepu\",\"occupycentral\"\n", "572172013432930305,\"RT @nbeffemellevw: Non-violence does not mean non-action. Let your voices be heard.#OccupyCentral\",\"occupycentral\"\n", "572144388920356864,\"Non-violence does not mean non-action. Let your voices be heard.#OccupyCentral\",\"occupycentral\"\n", "572142164072906752,\"RT @LQejackso1bo: Non-violence does not mean non-action. Let your voices be heard.#OccupyCentral\",\"occupycentral\"\n", "572137672627630083,\"Non-violence does not mean non-action. 
Let your voices be heard.#OccupyCentral\",\"occupycentral\"\n", "572100748114325504,\"RT @HIPERBOREUN: @CWynnykWilson #HKisnotChina #OccupyHK @SocRECorg #OccupyCentral #HKDemocracy @george_chen #9wu @2legit2trip @PRHacks #9wu\",\"occupycentral\"\n", "572096278953369600,\"@CWynnykWilson #HKisnotChina #OccupyHK @SocRECorg #OccupyCentral #HKDemocracy @george_chen #9wu @2legit2trip @PRHacks #9wu\",\"occupycentral\"\n", "572342223796375552,\"Is it just me or does #Jesus look like he's under a yellow umbrella? #umhk #UmbrellaMovement #UmbrellaRevolution #HK http://t.co/GGkN9NcBCH\",\"umbrellarevolution\"" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "569717515695362048,\"#Common 'For people fighting for democracy in #HongKong, the bridge is built on hope' #umbrellarevolution #Oscars2015 http://t.co/WRshNhlV5T\",\"umbrellarevolution\"\n", "572486677261918210,\"Even the good Doctor hates rain! >>> http://t.co/vJnfS5b3yO #MondayMustBuys #umbrellarevolution #RainyDay http://t.co/OH7AcRax6b\",\"umbrellarevolution\"\n", "572445822887043072,\"@pallavfnz94 PPL from #HongKong need ur #Help Pls sign the online petition to #UNSC http://t.co/ZYKcMzSFn1 #UmbrellaRevolution #thankyou\",\"umbrellarevolution\"\n", "572445806034362368,\"@jonasanbrown PPL from #HongKong need ur #Help Pls sign the online petition to #UNSC http://t.co/ZYKcMzSFn1 #UmbrellaRevolution #thankyou\",\"umbrellarevolution\"\n", "572429576107765760,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572424359094063105,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572423805974548480,\"RT @OpNewWorldOrder: @sinonomeaa #UmbrellaLove #umbrellamovement #UmbrellaRevolution how is it going? 
http://t.co/OhgsTmXR0T\",\"umbrellarevolution\"\n", "572414017811161091,\"@sinonomeaa #UmbrellaLove #umbrellamovement #UmbrellaRevolution how is it going? http://t.co/OhgsTmXR0T\",\"umbrellarevolution\"\n", "572413162512879616,\"@UnitedAvrill PPL from #HongKong need ur #Help Pls sign the online petition to #UNSC http://t.co/ZYKcMzSFn1 #UmbrellaRevolution #thankyou\",\"umbrellarevolution\"\n", "572413143705653250,\"@HiringScreenRH PPL from #HongKong need ur #Help Pls sign the online petition to #UNSC http://t.co/ZYKcMzSFn1 #UmbrellaRevolution #thankyou\",\"umbrellarevolution\"\n", "572413132678828033,\"@hgc81221 PPL from #HongKong need ur #Help Pls sign the online petition to #UNSC http://t.co/ZYKcMzSFn1 #UmbrellaRevolution #thankyou\",\"umbrellarevolution\"\n", "572413121962373120,\"@lukebryanllyric PPL from #HongKong need ur #Help Pls sign the online petition to #UNSC http://t.co/ZYKcMzSFn1 #UmbrellaRevolution #thankyou\",\"umbrellarevolution\"\n", "572412944232935424,\"@guiltyberryChoo PPL from #HongKong need ur #Help Pls sign the online petition to #UNSC http://t.co/ZYKcMzSFn1 #UmbrellaRevolution #thankyou\",\"umbrellarevolution\"\n", "572412931029266432,\"@ahmed_azhan PPL from #HongKong need ur #Help Pls sign the online petition to #UNSC http://t.co/ZYKcMzSFn1 #UmbrellaRevolution #thankyou\",\"umbrellarevolution\"\n", "572390643437326336,\"RT @enjolrasthewimp: There is a flame that never dies. (For once, I'm not talking about my hair.) #UmbrellaRevolution http://t.co/My9t9nXNhJ\",\"umbrellarevolution\"\n", "572370252983508993,\"RT @alejandroriano: That whiteshirt cop landed three baton hits on the one protestor. #hk #umbrellarevolution https://t.co/ZJmLcgs1ae\",\"umbrellarevolution\"\n", "572354192733560832,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572352884446244864,\"RT @alejandroriano: Life at the front. 
#HK #umbrellarevolution continua https://t.co/D6PSHBAnjA\",\"umbrellarevolution\"\n", "572350423123685377,\"RT @alejandroriano: Life at the front. #HK #umbrellarevolution continua https://t.co/D6PSHBAnjA\",\"umbrellarevolution\"\n", "572350346359521280,\"RT @alejandroriano: That whiteshirt cop landed three baton hits on the one protestor. #hk #umbrellarevolution https://t.co/ZJmLcgs1ae\",\"umbrellarevolution\"\n", "572349846130192384,\"R @alejandroriano Protest against Chinese SHOPPERS breaks out in clashes http://t.co/0O0SVta7EV #umbrellarevolution https://t.co/gCD6MPz8dd\",\"umbrellarevolution\"\n", "572348656449277952,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572347936799006720,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572346579543146496,\"RT @alejandroriano: That whiteshirt cop landed three baton hits on the one protestor. #hk #umbrellarevolution https://t.co/ZJmLcgs1ae\",\"umbrellarevolution\"\n", "572346557913120769,\"RT @alejandroriano: Life at the front. #HK #umbrellarevolution continua https://t.co/D6PSHBAnjA\",\"umbrellarevolution\"\n", "572346489416126464,\"RT @therosefox: China is like the US in the 60s... minus the drugs... and with communism. #cleanairacts #UmbrellaRevolution\",\"umbrellarevolution\"\n", "572345963324579840,\"RT @therosefox: China is like the US in the 60s... minus the drugs... and with communism. 
#cleanairacts #UmbrellaRevolution\",\"umbrellarevolution\"\n", "572345924921368576,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572345624223453184,\"RT @alejandroriano: @OccuWorld #UmbrellaRevolution @ProtestPin #OccupyHK\",\"umbrellarevolution\"\n", "572345524558438400,\"RT @alejandroriano: That whiteshirt cop landed three baton hits on the one protestor. #hk #umbrellarevolution https://t.co/ZJmLcgs1ae\",\"umbrellarevolution\"\n", "572345505411416064,\"RT @alejandroriano: Life at the front. #HK #umbrellarevolution continua https://t.co/D6PSHBAnjA\",\"umbrellarevolution\"\n", "572345483471003648,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572342223796375552,\"Is it just me or does #Jesus look like he's under a yellow umbrella? #umhk #UmbrellaMovement #UmbrellaRevolution #HK http://t.co/GGkN9NcBCH\",\"umbrellarevolution\"\n", "572339667279990784,\"RT @alejandroriano: @OccuWorld #UmbrellaRevolution @ProtestPin #OccupyHK\",\"umbrellarevolution\"\n", "572334432952184833,\"@OccuWorld #UmbrellaRevolution @ProtestPin #OccupyHK\",\"umbrellarevolution\"\n", "572333697313062912,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572330688957030400,\"Life at the front. #HK #umbrellarevolution continua https://t.co/D6PSHBAnjA\",\"umbrellarevolution\"\n", "572329930651906049,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572329411384635392,\"That whiteshirt cop landed three baton hits on the one protestor. 
#hk #umbrellarevolution https://t.co/ZJmLcgs1ae\",\"umbrellarevolution\"\n", "572328427883249664,\"Protest against Chinese SHOPPERS breaks out in clashes http://t.co/vS62AA7VQO #umbrellarevolution https://t.co/rHdgB7Hjuy\",\"umbrellarevolution\"\n", "572322713139691521,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572320992145506304,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572319528555843584,\"Hong Kong lawmakers http://t.co/C6LmNrWZt7 #hongkong #umhk #occupyhk #hk #umbrellamovement #occupycentral #umbrellarevolution #hkindigenous\",\"umbrellarevolution\"\n", "572318249276190720,\"RT @2legit2trip: So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572318227667107840,\"China is like the US in the 60s... minus the drugs... and with communism. #cleanairacts #UmbrellaRevolution\",\"umbrellarevolution\"\n", "572318202455187456,\"So uh I saw this on a taxi earlier in Mei Foo #umbrellamovement #umbrellarevolution #D7689 http://t.co/Y760AaVtpL\",\"umbrellarevolution\"\n", "572293104260276224,\"themselves police protests http://t.co/C6LmNrWZt7 #hongkong #umhk #occupyhk #hk #umbrellamovement #occupycentral #umbrellarevolution\",\"umbrellarevolution\"\n", "572288228041965568,\"Student Front, created soon after #UmbrellaRevolution, announced that it is formally dissolved. Student Front... 
http://t.co/zf2FI5PfLl\",\"umbrellarevolution\"\n", "572501782116610048,\"#Russia Tops #China as Principal #Cyber Threat to #US | The Diplomat http://t.co/Cj0a0mxAko\",\"china\"" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "572501777456619521,\"RT @lajornadaonline: Documental sobre el #esmog en #China se vuelve viral -> http://t.co/FibTTZ7euL http://t.co/s82hqzDXHg\",\"china\"\n", "572501776378814465,\"Dionysus - creating unique events in China #China http://t.co/bVV1SlHnGu\",\"china\"\n", "572501771345661955,\".@zanu_pf In bed with #China to feed your lavish lifestyle while children of Zimbabwe suffer #SaveAfricanAnimals http://t.co/ryZEcP96zp\",\"china\"\n", "572501732430880768,\"Weird Things in #China: Sleeping in Public http://t.co/3X6r8JNiKV #Travel #Expatlife #Asia\",\"china\"\n", "572501711232753665,\"#Internacional Sismo en suroeste de #China deja 32 heridos y 67,000 afectados. http://t.co/1IzFXXbeQG http://t.co/wL7vtGHYSx\",\"china\"\n", "572501705402667009,\"New post: 'Glorious and Majestic old Warships' http://t.co/r8BVfV6lyC #china #military\",\"china\"\n", "572501697492361217,\"#China systematically destroying #Tibet's Buddhist, today #devil #Shugden Society working hand in hand with Chinese. #DalaiLama @Khandarohi\",\"china\"\n", "572501649903771649,\"Hello, privacy rules #proxy from #China: https://t.co/gwah9v9DQS\",\"china\"\n", "572501568882401281,\"#Russia Tops #China as Principal Cyber Threat to US!! ~Virus~ http://t.co/EiOC38JSpZ\",\"china\"\n", "572501564037996544,\".@Hon_Kasukuwere Minister? Your lies & corrupt GREED #China make you UNFIT for position #SaveAfricanAnimals http://t.co/7kxjSfuq72\",\"china\"\n", "572501501119234048,\"VERY INTERESTING! http://t.co/dDXy2HaoNq Five emerging startup hubs for opportunities to launch or expand internationally #ECONOMY #China\",\"china\"\n", "572501439521693696,\"Watch #Netflix from anywhere in the World! #Japan #Australia #China... Anywhere! 
http://t.co/rft8lyCiUC\",\"china\"\n", "572501378372935680,\".@george_chen #China MUST STOP looting #Africa #zimbabweelephants captured 4 BRUTAL #Chimelong #SaveAfricanAnimals http://t.co/PJZNMiEnwB\",\"china\"\n", "572501337277132800,\".@TheBricsPost On what grounds is #China continuing to rape #Africa? We are NOT for sale! #SaveAfricanAnimals http://t.co/SzMxD8dqM8\",\"china\"\n", "572501252975820800,\"RT @CFASocietiesAus: @Nouriel sees robust US growth, a bumpy landing for #China, and no #Grexit http://t.co/ovjKjLN4uW #FutureFinance #CFA\",\"china\"\n", "572501151700140032,\"#Viralands - A #New #World Found Underneath #China! It looks like a #ScienceFiction! | #WorldNews #Science #Discovery http://t.co/EMVsWSwRWw\",\"china\"\n", "572501098726080514,\".@VivaIFP WARN'S #China new economic colonialist !! SAY'S #Africa is NOT for sale ? #SaveAfricanAnimals http://t.co/LxiTPJMZbA\",\"china\"\n", "572501086285774848,\"#China and #Russia vs. the United States? | The Diplomat #US http://t.co/X6JuCpbIPl\",\"china\"\n", "572501073455280128,\"Prince William in China and Japan #PHOTOS #VIDEO http://t.co/sovNN0Q7aY via @examinercom #china #japan #PrinceWilliam #travel #asia\",\"china\"\n", "572501049879212032,\"#China #News: theChinaGap: 2015 #Chinese social media #trends http://t.co/gbGwNCONQs #China http://t.co/dfD86ssQ8V\",\"china\"\n", "572501032611278849,\"RT @lajornadaonline: Documental sobre el #esmog en #China se vuelve viral -> http://t.co/FibTTZ7euL http://t.co/s82hqzDXHg\",\"china\"\n", "572500988562546688,\"RT @lajornadaonline: Documental sobre el #esmog en #China se vuelve viral -> http://t.co/FibTTZ7euL http://t.co/s82hqzDXHg\",\"china\"\n", "572500944715423745,\".@HeraldZimbabwe who counted #zimbabweelephants? NOT ZANUPF they submit false figures 2 feed #China Bloody Ivory! 
http://t.co/5I4GnnKoyW\",\"china\"\n", "572500944371511296,\"#silkroadart #gallery #chineseart #chinesepainting #China #artist #art #yale #newhaven at http://t.co/KuF1hYUapw\",\"china\"\n", "572500939678081024,\"#NightCruise with @norinme and @joeyzaza22 NP-#china by @NaetoC @phynofino @jaysleek #HotFM983at10\",\"china\"\n", "572500888960548865,\"Our Mother #Africa is UNDER SIEGE threatened by GREED #China stripping her bare !! #SaveAfricanAnimals http://t.co/OLNAraxfpK\",\"china\"\n", "572500653110530048,\"Hardware news!: epigalul... http://t.co/QqPnJku4Th #manufacturing #hardware #China #steel #trailer #fencing\",\"china\"\n", "572500484222808064,\"RT @LizDylan123: 5 elderly women take care of 1,300 stray #dogs in #China http://t.co/gqIfoxf8K4 @TACN_Official #AnimalWelfare\",\"china\"\n", "572500432322473984,\"Dionysus is your Partner to get your Brand into Far East #china http://t.co/bVV1SlHnGu\",\"china\"\n", "572500426685358082,\"EDITORIAL: Manon Leloup x L'Officiel #China, March '15. http://t.co/5srlArJJsO http://t.co/KzaPnjqNps\",\"china\"\n", "572500414689644546,\"Wheel of Fortune -- ' I have come full circle' -- William Shakespeare #photooftheday #bnw #china http://t.co/VDFvfFAVVN\",\"china\"\n", "572500401834094592,\"@Gabriele_Corno #cool #china #rain #umbrella #BridgesOfLove\",\"china\"\n", "572500391763558400,\"2015 #Chinese social media #trends http://t.co/O5HzQHL1F7 #China http://t.co/bkm9VyvTmU\",\"china\"\n", "572500383228010497,\"RT @lajornadaonline: Documental sobre el #esmog en #China se vuelve viral -> http://t.co/FibTTZ7euL http://t.co/s82hqzDXHg\",\"china\"\n", "572500376911536128,\".@CITESconvention Why is #China getting away with murder? They are stripping #Africa bare! 
#SaveAfricanAnimals http://t.co/3vCJ1KUMvb\",\"china\"\n", "572500337929543680,\"RT @lajornadaonline: Documental sobre el #esmog en #China se vuelve viral -> http://t.co/FibTTZ7euL http://t.co/s82hqzDXHg\",\"china\"\n", "572500270183280640,\"RT @Gabriele_Corno: Rain in an ancient town............. by Chen Li #China http://t.co/WxqigZ1TOE\",\"china\"\n", "572500201728053248,\"Our #Africa is NOT for sale to #China #SaveAfricanAnimals http://t.co/mnQr5bqueH\",\"china\"\n", "572320993781272576,\"#China says stance on #HK constitutional reform remains 'clear and consistent'; says regional chief executive should love country, HK\",\"hk\"" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "572330514423648256,\"Protest against Chinese SHOPPERS breaks out in clashes http://t.co/PevTyjR3EV #occupyhk #occupycentral #HK https://t.co/RT4AHuu6Uj\",\"hk\"\n", "572342223796375552,\"Is it just me or does #Jesus look like he's under a yellow umbrella? #umhk #UmbrellaMovement #UmbrellaRevolution #HK http://t.co/GGkN9NcBCH\",\"hk\"\n", "572495152759095297,\"RT @ajam: Hong Kong arrests 38 people in protests over mainland Ch... 
http://t.co/Ib3k5vgzNG #Hongkong #Umhk #Occupyhk #HK via @PhotoRaptor\",\"hk\"\n", "572489952463745024,\"#HK #China Fixes for the weakest link http://t.co/CMRCO5Ivbl @thestandardhk\",\"hk\"\n", "572489951687798784,\"#HK #China Charge of the smart brigade http://t.co/2QcC15ym7n @thestandardhk\",\"hk\"\n", "572489951004139521,\"#HK #China Java 'can keep MS at bay' http://t.co/pl3Fuqon5B @thestandardhk\",\"hk\"\n", "572489950312046594,\"#HK #China Link between depression and violent crime http://t.co/M4iNwyZjGv @thestandardhk\",\"hk\"\n", "572489949628391424,\"#HK #China Power of play http://t.co/Ux6dMMvo22 @thestandardhk\",\"hk\"\n", "572488009469186048,\"#HK #China #EM #Asia Namibia's leader gets US$5m leadership prize http://t.co/wFfv42KYN8 @SCMP_News\",\"hk\"\n", "572488008726794240,\"#HK #China #EM #Asia A rare glimpse into Saudi prison for convicted terrorists http://t.co/OCTkPiuZKB @SCMP_News\",\"hk\"\n", "572483975802982400,\"Changing #HK\",\"hk\"\n", "572483521081704449,\"Pool A standings. #CWC15 #Cricket #HK http://t.co/83bVcgdETJ\",\"hk\"\n", "572483367989583873,\"Pool B standings. #CWC15 #Cricket #HK http://t.co/XMfVXH0xsD\",\"hk\"\n", "572483221360918528,\"Saeed Ajmal is still at No.1 spot at ICC ODI Ranking. #ICCODIRanking #Cricket #HK http://t.co/2kX0fr1ELa\",\"hk\"\n", "572482963130224641,\"#UK #Escorts #Escort at #HamiltonsLondon #Egypt #Bahrain #Dubai #AbuDhabi #London #Paris #Budapest #UAE #USA #France #Berlin #Italy #HK\",\"hk\"\n", "572482867319713793,\"ICC ODI Ranking. #CWC15 #Cricket #HK http://t.co/gMWOvb42mN\",\"hk\"\n", "572482611832078336,\"Sarfaraz Ahmed is looking to play against UAE tomorrow. #CWC15 #PAKvUAE #Cricket #HK http://t.co/KvIhSr6sbq\",\"hk\"\n", "572481745737674752,\"Pakistan Cricket Team tops Google search charts. #CWC15 #Cricket #HK http://t.co/I2Pe2hyue5\",\"hk\"\n", "572481628779368449,\"Ireland need to win the game against South Africa to reach into the Quarter Finals. 
#CWC15 #SAvIRE #Cricket #HK http://t.co/hP2jqpwtRb\",\"hk\"\n", "572481276650897409,\"Misbah's reaction after Sohaib Maqsood dismissal! #CWC15 #PakvZim #Cricket #HK http://t.co/PfHxvFkUjp\",\"hk\"\n", "572480842863398912,\"Ireland will face South Africa today at Canberra, watch at 8:30 am PST. #CWC15 #Cricket #SAvIRE #HK http://t.co/wKrvenZToQ\",\"hk\"\n", "572480549752864769,\"Chris Gayle Taking Autograph from AB De Villiers after #SAvWI Match. #CWC15 #Cricket #HK http://t.co/fdO4D057Eb\",\"hk\"\n", "572480409973460992,\"Latest #Selfie of Kevin Pietersen. #CWC15 #Cricket #HK http://t.co/3AFMyWj3lB\",\"hk\"\n", "572478851563069440,\"#moraineave #flying over the #rockymountains #cathaypacific from #Hk #hongkong #nyc #newyorkcity by 6dust http://t.co/5MnfkSrUR8\",\"hk\"\n", "572474071499313152,\"5 Misconceptions about Hong Kong http://t.co/7GdAH8Ttat #expat #hongkong #hk #china #travel #ttot\",\"hk\"\n", "572470185984204802,\"RT @ColombianLoveMo: @Prestige_HK RT @AlphaSphere based in #hk crowd funding campaign https://t.co/QtcRy4bZlV #musictech #innovation\",\"hk\"\n", "572467439319027713,\"Benieuwd naar openingstijden van onze vestigingen in #HK, #Hem, #Enkhuizen? Bekijk ze hier: http://t.co/4R2DVBDtbL http://t.co/QnzGG1docL\",\"hk\"\n", "572464639147159553,\"RT @HKwalls: #pantone is coming to #HK for #HKwalls! http://t.co/ClCsn47Ox7\",\"hk\"\n", "572461513606352896,\"Occupy Central was an attempt at colour revolution: PLA general #... http://t.co/Ib3k5vgzNG #Hongkong #Umhk #Occupyhk #HK via @FollowHKNews\",\"hk\"\n", "572460648023654402,\"#IceHockey #Livescore @ScoresPro: (SVK-1L) #HC 07 Detva vs #Hk Spis N. Ves: 3-0 ...\",\"hk\"\n", "572459489300897792,\"arahkeun panah anjeun ka cermin, salila ieu, musuh anu diteang2 di kahirupan anjeun bakalan aya didinya. #HK\",\"hk\"\n", "572459385953386496,\"#IceHockey #Livescore @ScoresPro: (SVK-1L) #HC 07 Detva vs #Hk Spis N. 
Ves: 2-0 ...\",\"hk\"\n", "572459269657767936,\"#HK #China #EM #Asia Cross-border tensions overshadow yearly national meetings in Beijing http://t.co/fnidmFO0Gz @SCMP_News\",\"hk\"\n", "572459268370137088,\"#HK #China #EM #Asia Occupy Central was an attempt at colour revolution: PLA general http://t.co/7bdx5o2p7s @SCMP_News\",\"hk\"\n", "572456877168185345,\"#IceHockey #Livescore @ScoresPro: (SVK-1L) #HC 07 Detva vs #Hk Spis N. Ves: 1-0 ...\",\"hk\"\n", "572455477671559168,\"Professionals in creative design #hk http://t.co/NhC5x2RZBg http://t.co/kfoHldRP5q\",\"hk\"\n", "572453094656741376,\"#IceHockey #Livescore @ScoresPro: (SVK-LIG) #HKM Zvolen vs #HK Poprad: 3-4 ...\",\"hk\"\n", "572451835686084608,\"#IceHockey #Livescore @ScoresPro: (SVK-LIG) #HKM Zvolen vs #HK Poprad: 3-3 ...\",\"hk\"\n", "572450035427770368,\"#HK #China #EM #Asia Businessman Victor Chu is a champion for education and the oceans http://t.co/bpFacujO4V @SCMP_News\",\"hk\"\n", "572450034748243968,\"#HK #China #EM #Asia Hong Kong is a driving force, if we keep our eyes open, Victor Chu says http://t.co/em0p2IllNB @SCMP_News\",\"hk\"\n", "572450034102321152,\"#HK #China #EM #Asia Hong Kong mortgage rules tightened again as cooling measures shown effective http://t.co/ck00NGdqwI @SCMP_News\",\"hk\"\n", "572450033435471872,\"#HK #China #EM #Asia China's young entrepreneurs to drive its innovation economy http://t.co/WEHA1GREQo @SCMP_News\",\"hk\"\n", "572450032772784128,\"#HK #China #EM #Asia Taiwan sets compliance deadline for Alibaba's Singapore offshoot http://t.co/ctNmyDZu8V @SCMP_News\",\"hk\"\n", "572449498070429696,\"#IceHockey #Livescore ScoresPro: (SVK-1L) #HC 07 Detva vs #Hk Spis N. 
Ves: 0-0: http://t.co/8rQaV6KYpO\",\"hk\"\n", "572449412946927617,\"@Prestige_HK RT @AlphaSphere based in #hk crowd funding campaign https://t.co/QtcRy4bZlV #musictech #innovation\",\"hk\"\n", "572445393805438976,\"#HK #China #Asia London Metal Exchange aims to further cut warehouse logjams http://t.co/bmd3vvsT6r @SCMP_News\",\"hk\"\n", "572445393054666752,\"#HK #China #Asia China's natural gas price cut aimed at linking domestic market to overseas prices http://t.co/jKcTVRMWkL @SCMP_News\",\"hk\"\n", "572445392375164928,\"#HK #China #Asia No short selling of A shares recorded on first day http://t.co/k3iQS9hTJT @SCMP_News\",\"hk\"\n", "572444287444652033,\"#IceHockey #Livescore @ScoresPro: (SVK-LIG) #HKM Zvolen vs #HK Poprad: 3-2 ...\",\"hk\"\n", "572443030772428801,\"#IceHockey #Livescore @ScoresPro: (SVK-1L) #HC 07 Detva vs #Hk Spis N. Ves: 0-0 ...\",\"hk\"\n", "572443028184539136,\"#IceHockey #Livescore @ScoresPro: (SVK-LIG) #HKM Zvolen vs #HK Poprad: 2-2 ...\",\"hk\"\n", "572442360250011648,\"March 22-28th #WeDoThis4PHun #AlphaWeek #JustWaitOnIt #RoadToAlphageddon #Alphageddon #HK #LaTechAlphas https://t.co/KZkGcUzhgd\",\"hk\"\n", "572440499858100224,\"#IceHockey #Livescore @ScoresPro: (SVK-LIG) #HC 05 B. Bystrica vs #HK Dukla Trencin: 1-0 ...\",\"hk\"\n", "572440497639301122,\"#IceHockey #Livescore @ScoresPro: (SVK-LIG) #HKM Zvolen vs #HK Poprad: 2-1 ...\",\"hk\"\n", "572439420164448259,\"RT @Amberbrella: I thought pepper spray were used to disperse protesters...#occupyHK #HK #occupy #hongkong http://t.co/BvzHHcQW4D\",\"hk\"\n", "572438990986477568,\"#HK http://t.co/t3UxtuXpKp\",\"hk\"\n", "572438205443788800,\"#HK ranks 2nd in world for #women #entrepreneurs. @SCMP_News explains the @BNPP_Wealth Report http://t.co/OGELeGsqYq http://t.co/nIkeFcMJgF\",\"hk\"\n", "572437766707023873,\"RT @SCMP_News: CITY BEAT: Dangers emerge in Hong Kong over anti-Ma... 
http://t.co/Ib3k5vgzNG #Hongkong #Umhk #Occupyhk #HK via @hinhung0119\",\"hk\"\n", "572433601888948224,\"West Indies all-rounder Andre Russell posing for a selfie with a fan. #CWC15 #Cricket #HK http://t.co/sau11SE3Bg\",\"hk\"\n", "572433367523704832,\"#HK #China #EM #Asia 'Blank-vote' idea unlikely in Hong Kong's 2017 election plan, Carrie Lam says http://t.co/e9JOcXpdzR @SCMP_News\",\"hk\"\n", "572433366772916225,\"#HK #China #EM #Asia Triple appointments of IT adviser provoke scepticism http://t.co/LdD5tORT77 @SCMP_News\",\"hk\"\n", "572433366131195904,\"#HK #China #EM #Asia No place for Yuen Long protest bigots http://t.co/TiX3LIbAbl @SCMP_News\",\"hk\"\n", "572433365401387008,\"#HK #China #EM #Asia 5 more universities face calls to quit Hong Kong's union of student unions http://t.co/rGJVuRfW7u @SCMP_News\",\"hk\"\n", "572433364705136640,\"#HK #China #EM #Asia How travel broadened the Hong Kong Federation of Students' wallet http://t.co/kSOyVsuFmt @SCMP_News\",\"hk\"\n", "572433287202934784,\"Suresh Raina took 6 Catches is this World Cup - Most by any player in 2015 World Cup so far. #CWC15 #Cricket #HK http://t.co/upGSBdthJn\",\"hk\"\n", "572433123058839553,\"On this Day in 2011 , Kevin O'Brien Scored Fastest World Cup Century. #CWCStats #CWC15 #Cricket #HK http://t.co/2echpmkStC\",\"hk\"\n", "572432984156078080,\"Shoaib Akthar & Harbhajan Singh on the Sets of Comedy Nights with Kapil. #Cricket #HK http://t.co/4tfDKVzq5j\",\"hk\"\n", "572432950496759808,\"#IceHockey #Livescore @ScoresPro: (SVK-LIG) #HKM Zvolen vs #HK Poprad: 1-1 ...\",\"hk\"\n", "572432836277485568,\"Chris Gayle having Fun Time in Last Night Party at Perth. #CWC15 #Cricket #HK http://t.co/LbyD15aP3m\",\"hk\"\n", "572432393971351552,\"World Cup 2015: Australian Pacer Pat Cummins Set to Miss Afghanistan Match. #AusvAfg #CWC15 #Cricket #HK http://t.co/gigZoqd6mb\",\"hk\"\n", "572432051896496128,\"Beautiful sight of Wellington & Westpac Stadium from Aeroplane. 
#CWC15 #Cricket #HK http://t.co/VvkNCsHArT\",\"hk\"\n", "572431754780401664,\"Happy Birthday Andrew Strauss. #HBD #Cricket #HK http://t.co/kIS1oFvvxv\",\"hk\"\n", "572431690347520000,\"#IceHockey #Livescore @ScoresPro: (SVK-LIG) #HKM Zvolen vs #HK Poprad: 0-1 ...\",\"hk\"\n", "572431632994594816,\"Rahul Dravid with his Family. #Cricket #HK http://t.co/SWeqxoKots\",\"hk\"\n", "572431505198321665,\"MS Dhoni & Co. Are the Complete Team in World Cup: Clive Lloyd. #CWC15 #Cricket #HK http://t.co/ZDCeobWucS\",\"hk\"\n", "571325508576681985,\"Hello @milanluthria what's your top pick for #invitationcup2015 weekend? #BeSafe? #DancingPrances? #Admiralty? #ifyounevergo\",\"admiralty\"" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "571296128261443585,\"Roll bet your 30% investment on #eternalflame to #Admiralty #rwitc\",\"admiralty\"\n", "571288802045636608,\"Happy Friday... Knackered... #hk #hkig #hongkong #admiralty #thaibasil #pacificplace @ Thai Basil https://t.co/J7WwzJLAL0\",\"admiralty\"\n", "570890981802799105,\"@YeodieHk #tbt Umbrella Revolution 30/9/14 #urhk #occupyhk #REVOLUTION #Wanchai #Admiralty #ILOVEHK http://t.co/4w5rLbRxKr\",\"admiralty\"\n", "570190777210544128,\"#parrot #HKPark #aviary #bird #admiralty @ Edward Youde Aviary https://t.co/jChJsQ0hVQ\",\"admiralty\"\n", "569858300029767680,\"RT @fdknight: West Coast #Maritime labor dispute resolved with 5 year deal. #Admiralty https://t.co/KwKxrdxuV9\",\"admiralty\"\n", "569857493746057217,\"West Coast #Maritime labor dispute resolved with 5 year deal. #Admiralty https://t.co/KwKxrdxuV9\",\"admiralty\"\n", "572248950134079488,\"Helena Wong: these arrests are political persecution to scare people from fighting for democracy. #OccupyHk http://t.co/KeHIezkWsY\",\"occupyhk\"" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "571929949839335424,\"I am on the way to Yuen Long for a Live English Stream of todays protest. 
#OccupyHK #UmbrellaMovement Pic@2legit2trip http://t.co/WMceBUJyL2\",\"occupyhk\"\n", "572209201621032960,\"Audrey Eu: the 'appointment arrests' feel like a show to me. Why haven't they arrested the '7 policemen'? #OccupyHK http://t.co/wFZJz5IXa1\",\"occupyhk\"\n", "572495152759095297,\"RT @ajam: Hong Kong arrests 38 people in protests over mainland Ch... http://t.co/Ib3k5vgzNG #Hongkong #Umhk #Occupyhk #HK via @PhotoRaptor\",\"occupyhk\"\n", "572488328903200768,\"Views sharply divided over mainland Chinese shoppers visiting Hong Kong http://t.co/BCFMwoEi9m #OccupyHK http://t.co/Rr1naE13Oo\",\"occupyhk\"\n", "572478563305172992,\"@NoTalk2014 @SpyEast ~Hong Kong has always been Hong Kong,will always be Hong Kong--a unique place &city-state! #UmbrellaMovement #OccupyHK\",\"occupyhk\"\n", "572461513606352896,\"Occupy Central was an attempt at colour revolution: PLA general #... http://t.co/Ib3k5vgzNG #Hongkong #Umhk #Occupyhk #HK via @FollowHKNews\",\"occupyhk\"\n", "572443107528077314,\"RT @MikeGJW: Here they come. #occupyHK http://t.co/QKEbfNfffS\",\"occupyhk\"\n", "572440845951107072,\"RT @MikeGJW: Here they come. #occupyHK http://t.co/QKEbfNfffS\",\"occupyhk\"\n", "572439713358856193,\"RT @lostdutchhk: Pepperspray in use #occupyhk #yuenlong http://t.co/FL201nGdCv\",\"occupyhk\"\n", "572439420164448259,\"RT @Amberbrella: I thought pepper spray were used to disperse protesters...#occupyHK #HK #occupy #hongkong http://t.co/BvzHHcQW4D\",\"occupyhk\"\n", "572437766707023873,\"RT @SCMP_News: CITY BEAT: Dangers emerge in Hong Kong over anti-Ma... 
http://t.co/Ib3k5vgzNG #Hongkong #Umhk #Occupyhk #HK via @hinhung0119\",\"occupyhk\"\n", "572434778751819778,\"RT @tomgrundy: #occupyhk artist 'Mr & Ms HK People' https://t.co/HWtBDCwG9W reacts to CY poll: http://t.co/SnsdfQ6nCH http://t.co/f3rkyaetN8\",\"occupyhk\"\n", "572416160395894784,\"Dozens arrested in Hong Kong amid protests over #China mainland shoppers http://t.co/FLGKQc2HTW via @guardian #OccupyHK #HK @tomgrundy\",\"occupyhk\"\n", "572408833940844544,\"Sigh, anger escalated @ahpei0311 Fresh violence agai... http://t.co/Ib3k5vgzNG #Hongkong #Umhk #Occupyhk #HK #Umbrellamovement via @cycypea\",\"occupyhk\"\n", "572408614364717057,\"RT @tomgrundy: #occupyhk artist 'Mr & Ms HK People' https://t.co/HWtBDCwG9W reacts to CY poll: http://t.co/SnsdfQ6nCH http://t.co/f3rkyaetN8\",\"occupyhk\"\n", "572406668283154432,\"RT @tomgrundy: #occupyhk artist 'Mr & Ms HK People' https://t.co/HWtBDCwG9W reacts to CY poll: http://t.co/SnsdfQ6nCH http://t.co/f3rkyaetN8\",\"occupyhk\"\n", "572405660178104320,\"RT @tomgrundy: #occupyhk artist 'Mr & Ms HK People' https://t.co/HWtBDCwG9W reacts to CY poll: http://t.co/SnsdfQ6nCH http://t.co/f3rkyaetN8\",\"occupyhk\"\n", "572402321855922176,\"RT @tomgrundy: #occupyhk artist 'Mr & Ms HK People' https://t.co/HWtBDCwG9W reacts to CY poll: http://t.co/SnsdfQ6nCH http://t.co/f3rkyaetN8\",\"occupyhk\"\n", "572401438304808962,\"RT @tomgrundy: #occupyhk artist 'Mr & Ms HK People' https://t.co/HWtBDCwG9W reacts to CY poll: http://t.co/SnsdfQ6nCH http://t.co/f3rkyaetN8\",\"occupyhk\"\n", "572401260046913537,\"#occupyhk artist 'Mr & Ms HK People' https://t.co/HWtBDCwG9W reacts to CY poll: http://t.co/SnsdfQ6nCH http://t.co/f3rkyaetN8\",\"occupyhk\"\n", "572398333009928193,\"Constitutional development must reflect 'public sentiment' http://t.co/B5fxEdDb7p says Lam,who refused further meetings w/#Occupyhk leaders\",\"occupyhk\"\n", "572368168456482817,\"Protesters in Hong Kong Pepper-Sprayed By Police http://t.co/Ib3k5vgzNG #Hongkong #Umhk 
#Occupyhk #HK #Umbrellamovement via @AngelaNBC6\",\"occupyhk\"\n", "572361251063537664,\"RT @harbourtimes: Martin Lee's lawyer: Martin faces 8 counts of participating in unauthorized assembly. #occupyHk\",\"occupyhk\"\n", "572359479951376385,\"RT @harbourtimes: Martin Lee's lawyer: Martin faces 8 counts of participating in unauthorized assembly. #occupyHk\",\"occupyhk\"\n", "572359146818945024,\"#SkidRow shooting, protests in #HongKong & a wax #PrinceWilliam... http://t.co/Ib3k5vgzNG #Umhk #Occupyhk #HK #Umbrellamovement via @BBCOS\",\"occupyhk\"\n", "572355587218022400,\"RT @harbourtimes: Martin Lee leaves without saying a word to the press. #OccupyHK http://t.co/9f9kamP1q5\",\"occupyhk\"\n", "572355506045620224,\"RT @harbourtimes: Martin Lee's lawyer: Martin faces 8 counts of participating in unauthorized assembly. #occupyHk\",\"occupyhk\"\n", "572350882475610112,\"Hong Kong protest sees violence, pepper spray and arrests, but triad... http://t.co/Ib3k5vgzNG #Hongkong #Umhk #Occupyhk #HK via @SCMP_News\",\"occupyhk\"\n", "572345624223453184,\"RT @alejandroriano: @OccuWorld #UmbrellaRevolution @ProtestPin #OccupyHK\",\"occupyhk\"\n", "572345621891440640,\"RT @wolfpeng: More than 30 arrested as Hk anti-#China protesters scuffle w police http://t.co/uC0xWbYpub #occupyhk #occupycentral #HK\",\"occupyhk\"\n", "572345594892685312,\"RT @RoyCCNg: @HKFS1958 Two small-circle elected kids demanded universal suffrage. Do they know what they're doing!? 
#occupycentral #occupyhk\",\"occupyhk\"\n", "572345261399379968,\"Protest against Chinese SHOPPERS breaks out in clashes http://t.co/ZeDzOhrRUJ #occupyhk #occupycentral #HK http://t.co/ehr3247wyB\",\"occupyhk\"\n", "572344821219758081,\"RT @bcmagazinehk: Draw, Create, Express Yourselves Freely @ Tim Mei Art Village!: http://t.co/tRPO35zQRl #umhk #umbrellamovement #occupyhk\",\"occupyhk\"\n", "572339667279990784,\"RT @alejandroriano: @OccuWorld #UmbrellaRevolution @ProtestPin #OccupyHK\",\"occupyhk\"\n", "572339657117073408,\"@FreeMindTH in the greater scheme of things, less of a worry than a repeat of #occupyhk\",\"occupyhk\"\n", "572335597525721090,\"RT @bcmagazinehk: Draw, Create, Express Yourselves Freely @ Tim Mei Art Village!: http://t.co/tRPO35zQRl #umhk #umbrellamovement #occupyhk\",\"occupyhk\"\n", "572334432952184833,\"@OccuWorld #UmbrellaRevolution @ProtestPin #OccupyHK\",\"occupyhk\"\n" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Your turn: API\n", "\n", "1. Read through the [search API](https://dev.twitter.com/rest/reference/get/search/tweets). What additional parameters can be set?\n", "2. How could the script above be changed so that it can run on a regular schedule and return only distinct tweets each time it runs?\n", "3. Experimental design: Given the additional parameters that could be set, and the above code, what are some interesting questions to ask and then explore? Example: \"Comparing tweets about HK in Hong Kong vs Beijing\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with text and using NLTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`nltk` stands for Natural Language Toolkit. 
Its primary purpose in our data science toolkit centers on the following functionality:\n", "\n", "* parsing text objects into lists of tokens\n", "* providing context and similarity between text\n", "* containers for large amounts of text (various texts from Project Gutenberg are included for download)\n", "* tagging words\n", "* building predictive models using text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below we'll load some previously collected Twitter data into a pandas DataFrame. With some integration work, it wouldn't be a stretch to stream tweets directly into a DataFrame, or into a SQL database that we could later query with pandas." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "\n", "tweets = pd.read_csv('../data/twitter5.csv')\n", "\n", "print tweets" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ " id tweet \\\n", "0 522205943074160640 Is this #HongKong 's Rodney King? Police need ... \n", "1 521669229188501504 'We won't move and I'm ready to get arrested',... \n", "2 522228472786067456 RT @stanyee: Footage of beating prompts #HongK... \n", "3 522228386002108418 What is happening in Hong Kong is something th... \n", "4 522228373964480512 #Funding:#HongKong #travel #startup @KlookTrav... \n", "5 522228351231344640 HK police use pepper spray on protesters, beat... \n", "6 522228250450206720 RT @stanyee: Footage of beating prompts #HongK... \n", "7 522228232330817536 RT @stanyee: Footage of beating prompts #HongK... \n", "8 522228119952822272 Squeezed. #vscocam #vsco_hub #vscogang #vscogr... \n", "9 522228020069679104 RT @stanyee: Footage of beating prompts #HongK... \n", "10 522227916659105792 BREAKING: #HongKong security chief says 6 poli... \n", "11 522227892198334464 Trendingnews: #hongkong #demonstranten | #aang... 
\n", "12 522227878755184642 @BBCBreaking In #HongKong #UmbrellaMovement, p... \n", "13 522227788095303680 @cnnbrk In #HongKong #UmbrellaMovement, police... \n", "14 522227771502653440 Map of the underpass in #hongkong where police... \n", "15 522227564555673601 @nytimes In #HongKong #UmbrellaMovement, polic... \n", "16 522227525598994432 RT @CoconutsHK: 37 men and 8 women arrested la... \n", "17 522227458012368897 #Hongkong today http://t.co/QWD1odyMdS \n", "18 522227399069802496 RT @SauloCorona: ICYMI: #HongKong tensions ris... \n", "19 522227204743524353 RT @stanyee: Footage of beating prompts #HongK... \n", "20 522227036161839105 Los 7 goles que le hicimos a #HongKong son los... \n", "21 522227008042840064 #EXO #hongkong #FROMLOSTPLANET #20140601 #cha... \n", "22 522227000887361536 @WSJAsia @WSJ This was how the #HongKong polic... \n", "23 522226702475218944 Statement by #HongKong Police about the video ... \n", "24 522226690244628482 #hongkong Oil prices rebound... but not for lo... \n", "25 522226686138388480 #hongkong Oil prices rebound... but not for lo... \n", "26 522226684573933568 #hongkong Oil prices rebound... but not for lo... \n", "27 522226680367034369 #HongKong RT @george_chen: First tear gas, no ... \n", "28 522226675892092928 Footage of beating prompts #HongKong police to... \n", "29 522226672771530754 the underpass in #hongkong where police used p... \n", "... ... ... \n", "1466 522495130905763841 RT @krislc: Legislator Leung Kwok-hung aka Lon... \n", "1467 522495102971699200 RT @krislc: Audrey Eu giving out free eggtarts... \n", "1468 522494858507071488 RT @tomgrundy: Big Pictures from @TheAtlantic ... \n", "1469 522494785152512000 RT @krislc: Mong Kok. 100 people. but seems ba... \n", "1470 522494728147726337 RT @krislc: earlier: the 1 dollar coins that c... \n", "1471 522494722938388480 RT @desiree_fa: 8 Police vans roll out from wa... \n", "1472 522494481552011264 Mong Kok. 100 people. but seems barricades cou... 
\n", "1473 522493948908941312 Sign the petition: Stand in solidarity with pr... \n", "1474 522493690120388608 RT @krislc: Banner now #OccupyHK http://t.co/u... \n", "1475 522493482980896768 RT @krislc: .@klustout here #OccupyHK http://t... \n", "1476 522493408976596993 RT @cpjasia: Five HK Press Unions United in Co... \n", "1477 522493263543300096 RT @krislc: 9:11pm. Rodney Street packed #Occu... \n", "1478 522493200016359424 RT @krislc: packed around the stage; but loose... \n", "1479 522493095821074432 RT @krislc: now #OccupyHK http://t.co/RJ3eedsXUY \n", "1480 522492944020807680 RT @AgnesBun: Meanwhile in Sheung Wan... #Occu... \n", "1481 522492841759875072 RT @krislc: This is the post. started 10:30am.... \n", "1482 522492750152077312 RT @HKFS1958: Urgently need face masks, saline... \n", "1483 522492727682809856 RT @tomgrundy: Barriers being reinforced in tu... \n", "1484 522492628680839168 RT @krislc: occupied area stretches from legis... \n", "1485 522492375919120384 police can't stand the verbal abuse from prote... \n", "1486 522492265512845312 RT @krislc: barricades also building up westbo... \n", "1487 522492212412571649 Album of close up shots of the pepper spraying... \n", "1488 522492047266443264 RT @krislc: calling wooden plates here to stre... \n", "1489 522492008359665665 Sign the petition: Stand in solidarity with pr... \n", "1490 522492004996239361 RT @krislc: Lung Wo Rd. What? source: https://... \n", "1491 522491590443806720 RT @tomgrundy: Big Pictures from @TheAtlantic ... \n", "1492 522491470146969600 RT @krislc: now & signing off. #OccupyHK h... \n", "1493 522491449322258434 RT @krislc: this amount of people will stay at... \n", "1494 522490688298967040 Sign the petition: Stand in solidarity with pr... \n", "1495 522490643483217920 RT @fion_li: Hong Kong Police Battle Protester... 
\n", "\n", " hashtag \n", "0 hongkong \n", "1 hongkong \n", "2 hongkong \n", "3 hongkong \n", "4 hongkong \n", "5 hongkong \n", "6 hongkong \n", "7 hongkong \n", "8 hongkong \n", "9 hongkong \n", "10 hongkong \n", "11 hongkong \n", "12 hongkong \n", "13 hongkong \n", "14 hongkong \n", "15 hongkong \n", "16 hongkong \n", "17 hongkong \n", "18 hongkong \n", "19 hongkong \n", "20 hongkong \n", "21 hongkong \n", "22 hongkong \n", "23 hongkong \n", "24 hongkong \n", "25 hongkong \n", "26 hongkong \n", "27 hongkong \n", "28 hongkong \n", "29 hongkong \n", "... ... \n", "1466 occupyhk \n", "1467 occupyhk \n", "1468 occupyhk \n", "1469 occupyhk \n", "1470 occupyhk \n", "1471 occupyhk \n", "1472 occupyhk \n", "1473 occupyhk \n", "1474 occupyhk \n", "1475 occupyhk \n", "1476 occupyhk \n", "1477 occupyhk \n", "1478 occupyhk \n", "1479 occupyhk \n", "1480 occupyhk \n", "1481 occupyhk \n", "1482 occupyhk \n", "1483 occupyhk \n", "1484 occupyhk \n", "1485 occupyhk \n", "1486 occupyhk \n", "1487 occupyhk \n", "1488 occupyhk \n", "1489 occupyhk \n", "1490 occupyhk \n", "1491 occupyhk \n", "1492 occupyhk \n", "1493 occupyhk \n", "1494 occupyhk \n", "1495 occupyhk \n", "\n", "[1496 rows x 3 columns]\n" ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding uniqueness of tweets\n", "\n", "One important thing to do would be to find the uniqueness of a dataset. Here, we should measure uniqueness as number of unique tweets / number of tweets in the data set." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "def tweet_uniqueness(series):\n", "    # Ratio of unique values in a column to the total number of values.\n", "    # Returns a number between 0 and 1 (float() avoids Python 2 integer division).\n", "    return float(series.nunique()) / len(series)\n", "\n", "# shows that the collection script above didn't return completely unique tweets.\n", "print tweet_uniqueness(tweets.id)\n", "\n", "# drop duplicates based on the id column alone.\n", "unique_tweets = tweets.drop_duplicates(subset=['id'])\n", "\n", "print len(unique_tweets)\n", "print tweet_uniqueness(unique_tweets.id)\n", "print tweet_uniqueness(unique_tweets.tweet)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.761363636364\n", "1139\n", "1.0\n", "0.779631255487\n" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even after removing the duplicate rows, a significant number of tweets are still identical, likely due to retweets.\n", "\n", "How could we confirm this hypothesis with pandas? Based on this limited dataset, which hashtag seems to be retweeted the most, and which the least?"
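, "\n",
"\n",
"One way to check the retweet hypothesis, as a minimal sketch: assume a retweet is any tweet whose text starts with `RT @` (a simplification that misses quoted or manually copied tweets). The `demo` frame and its rows are made up for illustration; they only mirror the `tweet` and `hashtag` columns used above.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up rows for illustration; the real data lives in `unique_tweets`.\n",
"demo = pd.DataFrame({\n",
"    'tweet': ['RT @a: hello #hk', 'hello #hk', 'RT @b: news #china', 'RT @a: hello #hk'],\n",
"    'hashtag': ['hk', 'hk', 'china', 'hk'],\n",
"})\n",
"\n",
"# Assumption: a retweet is any tweet whose text starts with 'RT @'.\n",
"demo['is_retweet'] = demo.tweet.str.startswith('RT @')\n",
"\n",
"# Share of retweets per hashtag (the mean of a boolean column).\n",
"retweet_share = demo.groupby('hashtag')['is_retweet'].mean()\n",
"```\n",
"\n",
"Applied to `unique_tweets` instead of `demo`, hashtags with a high retweet share should be the ones with a low uniqueness score if the hypothesis holds."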
] }, { "cell_type": "code", "collapsed": false, "input": [ "for i in unique_tweets.hashtag.unique():\n", "    print i, tweet_uniqueness(unique_tweets[unique_tweets.hashtag == i].tweet)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "hongkong 0.863636363636\n", "occupycentral 0.719047619048\n", "umbrellarevolution 0.630872483221\n", "china 0.950980392157\n", "hk 0.970588235294\n", "admiralty 0.716981132075\n", "occupyhk 0.610778443114\n" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll use nltk to tokenize the tweets. Tokens are the individual units, usually words, that a text is split into when it is parsed.\n", "\n", "To use many of nltk's features, we first need to download additional data:\n", "\n", "```python\n", "# in an ipython qtconsole:\n", "import nltk\n", "nltk.download()\n", "```\n", "\n", "This should generate a popup.
\n", "Please download \"all\", then close the window. The nltk book (http://www.nltk.org/book_1ed/) is also worth reading through for fun.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "import string\n", "\n", "import nltk\n", "\n", "def tokenize_tweet(t, remove_stop=True, remove_hashtag=False):\n", "    tweet = t.lower()\n", "    tweet = re.sub('@\\w+', 'TWITTER_HANDLE', tweet)\n", "    tweet = re.sub('(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w \\.-]*)*\\/?', 'URL', tweet)\n", "    if remove_hashtag:\n", "        # One possible answer: drop every '#tag' before punctuation is stripped.\n", "        tweet = re.sub('#\\w+', '', tweet)\n", "    tweet = tweet.translate(string.maketrans(\"\",\"\"), string.punctuation)\n", "    words = nltk.tokenize.wordpunct_tokenize(tweet)\n", "    if remove_stop:\n", "        # Keep only the words that are not in nltk's English stopwords corpus.\n", "        stopwords_filter = set(nltk.corpus.stopwords.words('english'))\n", "        words = [w for w in words if w not in stopwords_filter]\n", "\n", "    return words\n", "\n", "\n", "unique_tweets['tokens'] = unique_tweets.tweet.apply(tokenize_tweet, remove_stop=True)\n", "unique_tweets['tokens_w_stopwords'] = unique_tweets.tweet.apply(tokenize_tweet, remove_stop=False)\n", "\n", "print unique_tweets['tokens_w_stopwords']" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0 [is, this, hongkong, s, rodney, king, police, ...\n", "1 [we, wont, move, and, im, ready, to, get, arre...\n", "2 [rt, TWITTERHANDLE, footage, of, beating, prom...\n", "3 [what, is, happening, in, hong, kong, is, some...\n", "4 [fundinghongkong, travel, startup, TWITTERHAND...\n", "5 [hk, police, use, pepper, spray, on, protester...\n", "6 [rt, TWITTERHANDLE, footage, of, beating, prom...\n", "7 [rt, TWITTERHANDLE, footage, of, beating, prom...\n", "8 [squeezed, vscocam, vscohub, vscogang, vscogra...\n", "9 [rt, TWITTERHANDLE, footage, of, beating, prom...\n", "10 [breaking, hongkong, security, chief, says, 6,...\n", "11 [trendingnews, hongkong, demonstranten, aangek...\n", "12 
[TWITTERHANDLE, in, hongkong, umbrellamovement...\n", "13 [TWITTERHANDLE, in, hongkong, umbrellamovement...\n", "14 [map, of, the, underpass, in, hongkong, where,...\n", "...\n", "1476 [rt, TWITTERHANDLE, five, hk, press, unions, u...\n", "1477 [rt, TWITTERHANDLE, 911pm, rodney, street, pac...\n", "1478 [rt, TWITTERHANDLE, packed, around, the, stage...\n", "1479 [rt, TWITTERHANDLE, now, occupyhk, URL]\n", "1481 [rt, TWITTERHANDLE, this, is, the, post, start...\n", "1482 [rt, TWITTERHANDLE, urgently, need, face, mask...\n", "1484 [rt, TWITTERHANDLE, occupied, area, stretches,...\n", "1485 [police, cant, stand, the, verbal, abuse, from...\n", "1486 [rt, TWITTERHANDLE, barricades, also, building...\n", "1487 [album, of, close, up, shots, of, the, pepper,...\n", "1488 [rt, TWITTERHANDLE, calling, wooden, plates, h...\n", "1490 [rt, TWITTERHANDLE, lung, wo, rd, what, source...\n", "1491 [rt, TWITTERHANDLE, big, pictures, from, TWITT...\n", "1492 [rt, TWITTERHANDLE, now, amp, signing, off, oc...\n", "1493 [rt, TWITTERHANDLE, this, amount, of, people, ...\n", "Name: tokens_w_stopwords, Length: 1139, dtype: object\n" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the following question:\n", "\n", "> Do different audiences use different hashtags when tweeting about the Hong Kong protests?\n", "\n", "We can use nltk to break the tweets into words and see how word choice varies by hashtag.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# pos_tag is a part-of-speech tagger: it labels each token in the text it is given.\n", "# It needs some sentence structure to work well, so we'll use the tokens with stopwords.\n", "# While it's not built for Twitter data, we can try it out and see how accurate it is.\n", "unique_tweets['pos'] = unique_tweets['tokens_w_stopwords'].apply(nltk.pos_tag)\n", "\n", "# Collect all words that come back as adjectives (JJ):\n", "def 
find_all_adj(series):\n", " bag_of_words = [j[0] for j in series if j[1] == 'JJ']\n", " return bag_of_words if bag_of_words else []\n", " \n", "adjectives = unique_tweets.pos.apply(find_all_adj)\n", "\n", "final_list = []\n", "for i in list(adjectives):\n", " final_list.extend(list(set(i)))\n", "\n", "final_list = list(set(final_list))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "print final_list" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['heutejournal', 'cute', 'malicious', 'excessive', 'chinese', 'global', 'displayed', 'yellow', 'kong', 'violent', 'own', 'asian', 'human', 'innovative', 'japan', 'hate', 'suisheng', '116th', 'le', 'police', 'jian', 'leung', 'much', 'hurtful', 'extra', 'young', 'only', 'rich', 'rabble', 'cardinal', 'te', 'worth', 'real', 'good', 'animal', 'big', 'stop', 'possible', 'finish', 'dark', 'joint', 'traffic', 'shameful', 'anonymous', 'international', 'front', 'viral', 'trouble', 'occupyhongkong', 'helpful', 'cable', 'tear', 'shy', 'unnecessary', 'large', 'bad', 'small', 'passionate', 'insane', 'r', 'financial', 'fair', 'tepid', 'national', 'easy', 'dead', 'breakthrough', 'likely', 'economic', '3nsailing', 'complete', 'agricultural', 'corrupt', 'close', 'sexual', 'special', 'outrageous', 'tungchung', 'symbiotic', 'beaten', 'legal', 'creative', 'current', 'outside', 'indian', 'various', 'new', 'adorable', 'public', '3d', 'available', 'pepper', 'full', 'christian', 'pathetic', 'wan', 'sixth', 'fugitive', 'free', 'huang', 'strong', 'soundcloud', 'jubilant', 'revolutionary', 'prodemocratic', 'great', 'bible', 'central', 'many', 'aguus', 'equal', 'urged', 'foreign', 'greatwall', 'protestorsoccupycentral', 'social', 'military', 'weird', 'whole', 'top', 'first', 'major', 'industrial', 'key', 'vid', 'civic', 'civil', 'sweet', 'irresponsible', 'powerful', 'arsenal', 'private', 
'supporthongkong', 'lung', 'female', 'spanish', 'doubleedged', 'open', 'legislative', 'east', 'additive', 'silent', 'ophongkong', 'illegal', 'angry', 'live', 'handcuffed', 'long', 'next', 'few', 'genetic', 'occupycentral', 'crush', 'futile', 'basic', 'palestine', 'australian', '3dprinting', 'low', 'website', 'british', 'recipient', 'general', 'gear', 'successful', 'peaceful', 'hk', 'alive', 'icable', 'last', 'comfortable', 'true', 'former', 'tonight', 'main', 'dangerous', 'past', 'handicapped', 'sous', 'future', 'obvious', 'dirty', 'itinerary', 'fiveyear', 'property', 'average', 'ive', 'gratuitous', 'chaotic', 'pic', 'halloween', 'site', 'overzealous', 'sad', 'high', 'brutal', 'gordonmcqueen', 'ready', 'wonderful', 'david', 'occupied', 'id', 'physical', 'opportunityfinancial', 'hong', 'hongkong', 'mic', 'responsible', 'american', 'eric', 'hot', 'other', 'republic', 'fourth', 'automotive', 'widespread', 'several', 'poor', 'independent', 'apparent', 'normal', 'impartial', 'polite', 'evil', 'sigue', 'wrong', 'serious', 'hair', 'yuan', 'recent', 'early', 'natural', 'third', 'clear', 'proud', 'drive', 'concrete', 'verbal', 'detained', 'english', 'professional', 'democratic', 'unspeakable', 'western']\n" ] } ], "prompt_number": 38 }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are some of the challenges that this tagger seems to face?\n", "\n", "Above it's clear it does... okay. `pos_tag` relies on a pre-trained tagger rather than learning from the tweets it is given, so in theory we could train a tagger on a corpus closer to how people write on Twitter, and it would be better calibrated. 
For today, let's work with a smaller list of the words nltk tagged as adjectives, each scored -1, 0, or 1 for sentiment.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "adjective_list = {\n", "    'industrial': 0,\n", "    'excessive': -1,\n", "    'gratuitous': 1,\n", "    'chaotic': -1,\n", "    'national': 1,\n", "    'young': 1,\n", "    'yellow': 0,\n", "    'high': 1,\n", "    'middle': 0,\n", "    'likely': 1,\n", "    'economic': 0,\n", "    'creative': 1,\n", "    'open': 1,\n", "    'physical': 0,\n", "    'symbiotic': 1,\n", "    'legal': 1,\n", "    'next': 1,\n", "    'genetic': 0,\n", "    'angry': -1,\n", "    'strong': 1,\n", "    'peaceful': 1,\n", "    'new': 1,\n", "    'widespread': 1,\n", "    'real': 1,\n", "    'good': 1,\n", "    'normal': 0,\n", "    'successful': 1,\n", "    'big': 1,\n", "    'basic': -1,\n", "    'hate': -1,\n", "    'private': -1,\n", "    'front': 0,\n", "    'central': 0,\n", "    'comfortable': 1,\n", "    'last': 0,\n", "    'helpful': 1,\n", "    'third': 0,\n", "    'many': 1,\n", "    'clear': 1,\n", "    'proud': 1,\n", "    'brutal': -1,\n", "    'large': 1,\n", "    'dirty': -1,\n", "    'professional': 1,\n", "    'first': 0,\n", "}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 39 }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Goal: Using the adjectives and some measure of sentiment, predict a given hashtag.**\n", "\n", "We'll need to build a few more pieces:\n", "\n", "1. Write a function that creates a sentiment score for each tweet based on the adjectives above.\n", "2. Set the targets of the model. We can do this in two ways:\n", "    1. Use the `hashtag` column, which records the Twitter search that returned each tweet.\n", "    2. Create several target columns from the tweet text itself, using a regex match for each hashtag.\n", "3. Finally, build a logistic regression with scikit-learn's `LogisticRegression`, using our sentiment score as the only feature."
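, "\n",
"\n",
"The second way of setting targets (one column per hashtag, matched against the tweet text) could be sketched as follows. This is only an illustration under assumptions: the `demo` frame, the `tags` list, and the `has_` column prefix are made-up names, not part of the notebook's data.\n",
"\n",
"```python\n",
"import re\n",
"import pandas as pd\n",
"\n",
"# Made-up rows for illustration; the real text lives in `unique_tweets.tweet`.\n",
"demo = pd.DataFrame({'tweet': ['Police in #HongKong #OccupyHK', 'Markets in #China today']})\n",
"\n",
"# Hashtags of interest, taken from the values of the `hashtag` column.\n",
"tags = ['hongkong', 'china', 'occupyhk']\n",
"\n",
"for tag in tags:\n",
"    # \\b keeps '#hk' from also matching '#hkwalls'; matching ignores case.\n",
"    pattern = r'#' + re.escape(tag) + r'\\b'\n",
"    demo['has_' + tag] = demo.tweet.str.contains(pattern, flags=re.IGNORECASE).astype(int)\n",
"```\n",
"\n",
"Each `has_` column is then a separate 0/1 target, so a tweet that carries several hashtags can count toward all of them instead of only the one it was searched under."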
] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import seaborn as sns\n", "\n", "# First function: create a sentiment score column.\n", "# Takes a list of words and returns the mean score of the words\n", "# found in adjective_list. This is one possible implementation: tweets with\n", "# no scored adjectives get 0.0 (an assumption, so the model sees no NaNs).\n", "def measure_sentiment(words):\n", "    scores = [adjective_list[w] for w in words if w in adjective_list]\n", "    return float(sum(scores)) / len(scores) if scores else 0.0\n", "\n", "\n", "# Second function: create a numeric target column.\n", "def numeric_hashtag(tag):\n", "    # a dictionary similar to the one above maps hashtags to numeric values.\n", "    targets = {\n", "        u'hongkong': 0,\n", "        u'occupycentral': 1,\n", "        u'umbrellarevolution': 2,\n", "        u'china': 3,\n", "        u'hk': 4,\n", "        u'admiralty': 5,\n", "        u'occupyhk': 6,\n", "    }\n", "    return targets[tag]\n", "\n", "unique_tweets['sentiment'] = unique_tweets.tokens.apply(measure_sentiment)\n", "print unique_tweets.sentiment.hist()\n", "\n", "unique_tweets['target'] = unique_tweets.hashtag.apply(numeric_hashtag)\n", "\n", "# Build a logistic regression using the sentiment feature and the numeric hashtags\n", "from sklearn import linear_model as lm\n", "\n", "lmfit = lm.LogisticRegression()\n", "lmfit.fit(unique_tweets[['sentiment']], unique_tweets['target'])\n", "print lmfit.score(unique_tweets[['sentiment']], unique_tweets['target'])" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Axes(0.125,0.125;0.775x0.775)\n", "0.197541703248" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] }, { "metadata": {}, "output_type": "display_data", "png": 
"iVBORw0KGgoAAAANSUhEUgAAAX8AAAEACAYAAABbMHZzAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAE8xJREFUeJzt3X+MZfVd//Hn68uC2to6kpplWZDBb5fUNdVtTaHRmo7a\nEqxfARMFmlgZiz8iaqt/GHdbIzbxS2i/MWLzDY1W6Wy/Kft11UpoRGTBjvYPw1opiF0Q+GNSdnUX\nrd2qNaa78PaPObtchtnh7J0795y59/lINnM+555z72cP73nvua9z7iVVhSRpuvyPricgSRo/m78k\nTSGbvyRNIZu/JE0hm78kTSGbvyRNoTWbf5I7kxxL8tjAuv+T5PEkjyb5ZJJvGHhsT5KnkjyR5MqB\n9d+Z5LHmsd/emL+KJKmtlzvz/xhw1Yp19wPfVlXfATwJ7AFIshO4HtjZ7HNHkjT7fAS4qap2ADuS\nrHxOSdIYrdn8q+ozwJdWrDtQVc83w4eAi5rla4B9VXWiqpaAp4ErkmwDXlVVB5vtPg5cO6L5S5KG\nsN7M/93Avc3yhcDhgccOA9tXWX+kWS9J6sjQzT/J+4GvVtVdI5yPJGkMtgyzU5J54B3A9w+sPgJc\nPDC+iOUz/iO8EA2dWn/kDM/rFw1J0hCqKi+/1QvO+sy/uVj7y8A1VfVfAw/dA9yQ5LwklwI7gINV\ndRT4tyRXNBeA3wXcvcZfwD8j+nPLLbd0PodJ+eOx9Hj2+c8w1jzzT7IPeCvwmiTPALewfHfPecCB\n5maev66qm6vqUJL9wCHgJHBzvTCrm4EF4OuAe6vqvqFmq7OytLTU9RQmhsdytDye3Vuz+VfVO1dZ\nfeca298K3LrK+r8FXn/Ws5MkbQg/4TvB5ufnu57CxPBYjpbHs3sZNi/aCEmqT/ORpM0gCbXRF3y1\neSwuLnY9hYnhsRwtj2f3bP6SNIWMfSRpkzP2kSS1YvOfYOaqo+OxHC2PZ/ds/pI0hcz8JWmTM/OX\nJLVi859g5qqj47EcLY9n92z+kjSFzPwlaZMz85cktWLzn2DmqqPjsRwtj2f3bP6SNIXM/CVpkzPz\nlyS1YvOfYOaqo+OxHC2PZ/ds/pI0hcz8JWmTM/OXJLVi859g5qqj47EcLY9n92z+kjSFzPwlaZMz\n85cktWLzn2DmqqOTpDd/JoG12b0tXU9A2jz6EElORvNX99bM/JPcCfwg8GxVvb5Zdz7wB8AlwBJw\nXVUdbx7bA7wbeA54T1Xd36z/TmAB+Frg3qp67xlez8xfvbR8xt2H2gz+jmiljcj8PwZctWLdbuBA\nVV0GPNiMSbITuB7Y2exzR154j/oR4Kaq2gHsSLLyOSVJY7Rm86+qzwBfWrH6amBvs7wXuLZZvgbY\nV1UnqmoJeBq4Isk24FVVdbDZ7uMD+2gDmauqr6zN7g1zwXdrVR1rlo8BW5vlC4HDA9sdBravsv5I\ns16S1JF13e3TBPQGkD01NzfX9RSkVVmb3Rvmbp9jSS6oqqNNpPNss/4IcPHAdhexfMZ/pFkeXH/k\nTE8+Pz/P7OwsADMzM+zatet0oZx6q+jYcRdjWGx+dj2m1XwdT+54cXGRhYUFgNP98my97Cd8k8wC\nnxq42+dDwBer6oNJdgMzVbW7ueB7F3A5y7HOA8Brq6qSPAS8BzgI/Cnw4aq6b5XX8m6fEVpcXBxo\nXFoP7/YZLWtztIa522fNM/8k+4C3Aq9J8gzwa8BtwP4kN9Hc6glQVYeS7AcOASeBmwc6+c0s3+r5\ndSzf6vmSxi9JGh+/20dqwTN/9Znf7SNJasXmP8FOXSCS+sba7J7NX5KmkJm/1IKZv/rMzF+S1IrN\nf4KZq6qvrM3u2fwlaQqZ+UstmPmrz8z8JUmt2PwnmLmq+sra7J7NX5KmkJm/1IKZv/rMzF+S1IrN\nf4KZq6qvrM3u2fwlaQqZ+UstmPmrz8z8JUmt2PwnmLmq+sra7
J7NX5KmkJm/1IKZv/rMzF+S1IrN\nf4KZq6qvrM3u2fwlaQqZ+UstmPmrz8z8JUmt2PwnmLmq+sra7J7NX5KmkJm/1IKZv/rMzF+S1MrQ\nzT/JniSfT/JYkruSfE2S85McSPJkkvuTzKzY/qkkTyS5cjTT11rMVdVX1mb3hmr+SWaBnwLeWFWv\nB84BbgB2Aweq6jLgwWZMkp3A9cBO4CrgjiS+65CkjgzbgP8NOAG8IskW4BXAPwJXA3ubbfYC1zbL\n1wD7qupEVS0BTwOXDztptTM3N9f1FKRVWZvdG6r5V9W/Ar8JfIHlpn+8qg4AW6vqWLPZMWBrs3wh\ncHjgKQ4D24easSRp3bYMs1OS/wn8IjALfBn4wyQ/NrhNVVWStW5LWPWx+fl5ZmdnAZiZmWHXrl2n\nzxJO5YSO241vv/12j98Ix7DY/Ox6TKv59nk8mPn3YT6bbby4uMjCwgLA6X55toa61TPJ9cDbq+on\nm/G7gDcD3wd8b1UdTbIN+HRVvS7JboCquq3Z/j7glqp6aMXzeqvnCC0uLg40Lq2Ht3qOlrU5WsPc\n6jls8/8O4BPAm4D/AhaAg8AlwBer6oNNw5+pqt3NBd+7WM75twMPAK9d2elt/uorm7/6bJjmP1Ts\nU1WPJvk48FngeeBh4HeBVwH7k9wELAHXNdsfSrIfOAScBG62y0tSd/yE7wTzrfXoeOY/WtbmaPkJ\nX0lSK575Sy145q8+88xfktSKzX+CDd5LLfWJtdk9m78kTSEzf6kFM3/1mZm/JKkVm/8EM1dVX1mb\n3bP5S9IUMvOXWjDzV5+Z+UuSWrH5TzBzVfWVtdk9m78kTSEzf6kFM3/1mZm/JKkVm/8EM1dVX1mb\n3bP5S9IUMvOXWjDzV5+Z+UuSWrH5TzBzVfWVtdk9m78kTSEzf6kFM3/1mZm/JKkVm/8EM1dVX1mb\n3bP5S9IUMvOXWjDzV5+Z+UuSWrH5TzBzVfWVtdm9oZt/kpkkf5Tk8SSHklyR5PwkB5I8meT+JDMD\n2+9J8lSSJ5JcOZrpS5KGMXTmn2Qv8JdVdWeSLcArgfcD/1JVH0ryK8A3VtXuJDuBu4A3AduBB4DL\nqur5Fc9p5q9eMvNXn40t80/yDcD3VNWdAFV1sqq+DFwN7G022wtc2yxfA+yrqhNVtQQ8DVw+zGtL\nktZv2NjnUuCfk3wsycNJPprklcDWqjrWbHMM2NosXwgcHtj/MMvvALSBzFXVV9Zm97asY783Aj9f\nVX+T5HZg9+AGVVVJ1np/uupj8/PzzM7OAjAzM8OuXbuYm5sDXigYx+3GjzzySK/ms9nHsNj87HpM\nq/k6ntzx4uIiCwsLAKf75dkaKvNPcgHw11V1aTN+C7AH+Bbge6vqaJJtwKer6nVJdgNU1W3N9vcB\nt1TVQyue18xfvWTmrz4bW+ZfVUeBZ5Jc1qx6G/B54FPAjc26G4G7m+V7gBuSnJfkUmAHcHCY15Yk\nrd967vP/BeATSR4Fvh3438BtwNuTPAl8XzOmqg4B+4FDwJ8BN3uKv/FOvU2U+sba7N6wmT9V9SjL\nt26u9LYzbH8rcOuwrydJGh2/20dqwcxffeZ3+0iSWrH5TzBzVfWVtdk9m78kTSEzf6kFM3/1mZm/\nJKkVm/8EM1dVX1mb3bP5S9IUMvOXWjDzV5+Z+UuSWrH5TzBzVfWVtdk9m78kTSEzf6kFM3/1mZm/\nJKkVm/8EM1dVX1mb3bP5S9IUMvOXWjDzV5+Z+UuSWrH5TzBzVfWVtdk9m78kTSEzf6kFM3/1mZm/\nJKkVm/8EM1dVX1mb3bP5S9IUMvOXWjDzV5+Z+UuSWrH5TzBzVfWVtdk9m78kTaF1Zf5JzgE+Cxyu\nqh9Kcj7wB8AlwBJwXVUdb7bdA7wbeA54T1Xdv8rzmfmrl8z81WddZP7vBQ7xwm/FbuBAVV0GPNiM\nSbITuB7YCVwF3JHEdx2S1
JGhG3CSi4B3AL8HnPoX52pgb7O8F7i2Wb4G2FdVJ6pqCXgauHzY11Y7\n5qrqK2uze+s5+/4t4JeB5wfWba2qY83yMWBrs3whcHhgu8PA9nW8tiRpHbYMs1OS/wU8W1WfSzK3\n2jZVVUnWCidXfWx+fp7Z2VkAZmZm2LVrF3Nzyy9x6mzBcbvxqXV9mc9mH8Ni87PrMa3m2+fx3Nxc\nr+az2caLi4ssLCwAnO6XZ2uoC75JbgXeBZwEvhZ4NfBJ4E3AXFUdTbIN+HRVvS7JboCquq3Z/z7g\nlqp6aMXzesFXveQFX/XZ2C74VtX7quriqroUuAH4i6p6F3APcGOz2Y3A3c3yPcANSc5LcimwAzg4\nzGurvVNnClLfWJvdGyr2WcWpU5HbgP1JbqK51ROgqg4l2c/ynUEngZs9xZek7vjdPlILxj7qM7/b\nR5LUis1/gpmrqq+sze7Z/CVpCpn5Sy2Y+avPzPwlSa3Y/CeYuar6ytrsns1fkqaQmb/Ugpm/+szM\nX5LUis1/gpmrqq+sze7Z/CVpCpn5Sy2Y+avPzPwlSa3Y/CeYuar6ytrsns1fkqaQmb/Ugpm/+szM\nX5LUis1/gpmrqq+sze7Z/CVpCpn5Sy2Y+avPzPwlSa3Y/CeYuar6ytrsns1fkqaQmb/Ugpm/+szM\nX5LUis1/gpmrqq+sze7Z/CVpCpn5Sy2Y+avPxpb5J7k4yaeTfD7J3yd5T7P+/CQHkjyZ5P4kMwP7\n7EnyVJInklw5zOtKkkZj2NjnBPBLVfVtwJuBn0vyrcBu4EBVXQY82IxJshO4HtgJXAXckcTIaYOZ\nq6qvrM3uDdWAq+poVT3SLP8H8DiwHbga2Ntsthe4tlm+BthXVSeqagl4Grh8HfOWJK3Dus++k8wC\nbwAeArZW1bHmoWPA1mb5QuDwwG6HWf7HQhtobm6u6ylIq7I2u7eu5p/k64E/Bt5bVf8++Fhz5Xat\nK1NetZKkjmwZdsck57Lc+P9fVd3drD6W5IKqOppkG/Bss/4IcPHA7hc1615ifn6e2dlZAGZmZti1\na9fps4RTOaHjduPbb7/d4zfCMSw2P7se02q+fR4PZv59mM9mGy8uLrKwsABwul+eraFu9czyfW97\ngS9W1S8NrP9Qs+6DSXYDM1W1u7ngexfLOf924AHgtSvv6/RWz9FaXFwcaFxaD2/1HC1rc7SGudVz\n2Ob/FuCvgL/jhd+IPcBBYD/wzcAScF1VHW/2eR/wbuAkyzHRn6/yvDZ/9ZLNX302tua/UWz+6iub\nv/rML3bTiwzmqlKfWJvds/lL0hQy9pFaMPZRnxn7SJJasflPMHNV9ZW12T2bvyRNITN/qQUzf/WZ\nmb8kqRWb/wQzV1VfWZvds/lL0hQy85daMPNXn5n5S5JasflPMHNV9ZW12T2bvyRNITN/qQUzf/WZ\nmb8kqRWb/wQzV1VfWZvdG/p/4C5J0245DtyczPylFsz8tZqe1YWZvyRpbTb/CWauqr6yNrtn85ek\nKWTmL7XQs2y360mo0bO6MPOXJK3N5j/BzFXVV9Zm92z+kjSFzPylFnqW7XY9CTV6Vhdm/pKktY21\n+Se5KskTSZ5K8ivjfO1pZK6qvrI2uze27/ZJcg7wf4G3AUeAv0lyT1U9PrjdF77whXFNaVXnnnsu\n27Zt63QOo/LII48wNzfX9TSkl7A2uzfOL3a7HHi6qpYAkvx/4BrgRc1/5863jHFKL/b881/lkksu\n5PHHH+5sDqN0/Pjxrqcgrcra7N44m/924JmB8WHgipUbfeUrXZ75P8xXv/qTHb6+JI3HOJt/q0vi\nr371D230PM7oueeOc845nb38yC0tLXU9BW2Azfw1woM+8IEPrPs5vPNpeGO71TPJm4Ffr6qrmvEe\n4Pmq+uDANv6XlKQhnO2tnuNs/luAfwC+H/hH4CDwzpUXfCVJG29ssU9VnUzy88CfA+cAv2/
jl6Ru\n9OoTvpKk8ej0E75JfjTJ55M8l+SNa2znh8NeRpLzkxxI8mSS+5PMnGG7pSR/l+RzSQ6Oe55916bW\nkny4efzRJG8Y9xw3k5c7nknmkny5qcfPJfnVLua5GSS5M8mxJI+tsU3r2uz66x0eA34Y+KszbTDw\n4bCrgJ3AO5N863imt6nsBg5U1WXAg814NQXMVdUbqurysc1uE2hTa0neAby2qnYAPw18ZOwT3STO\n4nf3L5t6fENV/cZYJ7m5fIzlY7mqs63NTpt/VT1RVU++zGanPxxWVSeAUx8O04tdDextlvcC166x\n7WTcKzh6bWrt9HGuqoeAmSRbxzvNTaPt76712EJVfQb40hqbnFVtdn3m38ZqHw7b3tFc+mxrVR1r\nlo8BZ/qPXsADST6b5KfGM7VNo02trbbNRRs8r82qzfEs4LuamOLeJDvHNrvJc1a1ueF3+yQ5AFyw\nykPvq6pPtXgKr0g31jiW7x8cVFWt8ZmJ766qf0ryTcCBJE80ZxRqX2srz1St0dW1OS4PAxdX1X8m\n+QHgbuCyjZ3WRGtdmxve/Kvq7et8iiPAxQPji1n+F23qrHUsmwtBF1TV0STbgGfP8Bz/1Pz85yR/\nwvJbc5v/sja1tnKbi5p1eqmXPZ5V9e8Dy3+W5I4k51fVv45pjpPkrGqzT7HPmXK/zwI7kswmOQ+4\nHrhnfNPaNO4BbmyWb2T5DOpFkrwiyaua5VcCV7J80V3L2tTaPcCPw+lPrR8fiNv0Yi97PJNsTfN9\nFUkuZ/n2cxv/cM6qNsf53T4vkeSHgQ8DrwH+NMnnquoHklwIfLSqftAPh7V2G7A/yU3AEnAdwOCx\nZDky+mTzu7YF+ERV3d/NdPvnTLWW5Geax3+nqu5N8o4kTwNfAX6iwyn3WpvjCfwI8LNJTgL/CdzQ\n2YR7Lsk+4K3Aa5I8A9wCnAvD1aYf8pKkKdSn2EeSNCY2f0maQjZ/SZpCNn9JmkI2f0maQjZ/SZpC\nNn9JmkI2f0maQv8NiaD3qfuWJKoAAAAASUVORK5CYII=\n", "text": [ "" ] } ], "prompt_number": 12 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Some things we didn't talk about" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common approach to handling text data is working with term frequencies and inverse document frequencies. sk-learn actually has a TF-IDF vectorizer that would be useful to interact with. A TF vectorizer would also be a fine start when working with larger texts.\n", "\n", "These matrices (TF and TF-IDF) are also strong indicators for classification.\n", "Consider the usage of stopwords. Stopswords can actually be incredibly useful in predicting! 
(http://rforwork.info/2012/12/27/intro-to-mult-classification/)" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "On Your Own" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Continue iterating on the Twitter data set and the data goals for class. What new features could you engineer that could be predictive of a hashtag?\n", "\n", "Consider:\n", "\n", "1. Length of a tweet, or number of tokens in a tweet\n", "2. Finding more words to include in the sentiment dictionary\n", "3. Finding words most commonly used alongside a hashtag\n", "\n", "Add to the code, and build a better model using your new features.\n", "\n", "**OR**\n", "\n", "Follow a different pursuit with the data, given that you can work with the Twitter API, which will provide you with much more data than is included in the master CSV file here." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Review/Next Steps/Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [Natural Language Processing with Python](http://www.nltk.org/book/): free online book to go in-depth with NLTK\n", "* [NLP online course](https://www.coursera.org/course/nlp): no sessions are available, but [video lectures](https://class.coursera.org/nlp/lecture) and [slides](http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html) are still accessible\n", "* [Brief slides](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) on the major task areas of NLP\n", "* [Detailed slides](https://github.com/ga-students/DAT_SF_9/blob/master/16_Text_Mining/DAT9_lec16_Text_Mining.pdf) on a lot of NLP terminology\n", "* [A visual survey of text visualization techniques](http://textvis.lnu.se/): for exploration and inspiration\n", "* [Stanford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml): suite of tools if you want to get serious about NLP\n", "* Getting started with regex: [Python introductory 
lesson](https://developers.google.com/edu/python/regular-expressions) and [reference guide](https://github.com/justmarkham/DAT3/blob/master/code/99_regex_reference.py), [real-time regex tester](https://regex101.com/#python), [in-depth tutorials](http://www.rexegg.com/)\n", "* [spaCy](http://honnibal.github.io/spaCy/): a new NLP package" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Next Class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next class we will learn about Bayesian probability and the associated algorithm, and compare logistic regression with Bayes on a significantly larger text data set. There will be some light math concepts." ] } ], "metadata": {} } ] }