{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Import the libraries we will be using" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import random\n", "import nltk\n", "import gensim\n", "import numpy as np\n", "import pandas as pd\n", "import scipy\n", "\n", "from nltk.stem.porter import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we download the necessary resources for NLTK" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nltk.download('punkt')\n", "nltk.download('tagsets')\n", "nltk.download('averaged_perceptron_tagger')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Part 1: Tokenization" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Tokenization involves segmenting text into tokens. It is a common preprocessing step in many NLP applications" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "text = \"Are you crazy? I don't know.\"" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "A simple method is just to split based on white space. Note that this doesn't work for many other languages (like Chinese)!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "text.split()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We will now explore tokenization as provided by two NLP tools. First, look at how NLTK tokenizes words:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "nltk.word_tokenize(text)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Now, let's look at how Gensim is handling tokenization" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "list(gensim.utils.tokenize(text))" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "It often makes sense to lowercase the text, as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "\"HELLO world\".lower()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
**Question:** Try tokenizing some other texts as well. Which of these methods do you prefer, and why?
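To see the differences concretely, here is a small comparison sketch; the sample sentence below is our own invention, so feel free to substitute your own texts.

```python
# A made-up sample; punctuation, abbreviations and hyphens expose the differences
sample = 'U.S. politics in mid-2016 was, e.g., rather heated!'
print(sample.split())
print(nltk.word_tokenize(sample))
print(list(gensim.utils.tokenize(sample)))
```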
" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
**Todo:** We will soon start counting words. First, decide which tokenization method you would like to use.
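In case you are unsure, one possible (purely illustrative) way to fill in the `tokenize` function below is lowercasing followed by NLTK word tokenization:

```python
# One reasonable default, not the only option: lowercase, then NLTK tokenization
def tokenize(text):
    return nltk.word_tokenize(text.lower())
```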
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "def tokenize(text):\n", " return #TODO: fill in your preferred method to tokenize the text. Perhaps add some more preprocessing, like lowercasing etc.\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Make sure the tokenize method works as expected." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "tokenize(\"Hello word!\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
**Optional:** Explore how the tokenization methods deal with
\n", "\n", "- hyphenation (e.g., co-operative, thirty-three)
\n", "- non-standard language (e.g., tweets. Take a look at TweetTokenizer from NLTK)
\n", "- languages other than English\n", "
" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Part 2: Preprocessing\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We will now start looking at Reddit data. We will focus on the politics subreddit. First, let's load in the data from October 2016." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "file_name = 'reddit_discussion_network_2016_10.csv';\n", "reddit_df = pd.read_csv('../../../data/reddit/' + file_name);" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Which columns does this dataset have?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "reddit_df.columns" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The first post:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "reddit_df.head(1)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "This method returns the tokens of the Reddit dataset based on your tokenization method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "def iter_reddit():\n", " for index, row in reddit_df.iterrows():\n", " yield tokenize(str(row[\"comment\"])) # Convert to string, there are some weird entries (NaN)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Count how often each word occurs. Applying to the whole dataset might take some time, so we will only process 10,000 documents." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import itertools\n", "\n", "num_documents = 10000\n", "counts = {}\n", "for tokens in itertools.islice(iter_reddit(), num_documents):\n", " for token in tokens:\n", " if token not in counts:\n", " counts[token] = 0\n", " counts[token] = counts[token] + 1" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Print out the top 25 most frequent words:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "for w in sorted(counts, key=counts.get, reverse=True)[:25]:\n", " print \"%s\\t%s\" % (w, counts[w]) " ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "For some applications, stemming the words can be helpful" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "stemmer = PorterStemmer()\n", "tokens = ['politics', 'agreed', 'trump', 'clinton', 'replied', 'meeting']\n", "print [stemmer.stem(token) for token in tokens]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
**Optional:** Try out different preprocessing options. Perhaps modify your tokenization function. How does this influence the statistics? How would filtering infrequent words influence the vocabulary size?
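As a quick illustration, the sketch below applies a (hypothetical) minimum-count threshold to the `counts` dictionary computed above and reports how much the vocabulary shrinks.

```python
# Hypothetical threshold; vary it and watch the vocabulary size change
min_count = 5
filtered_counts = {w: c for w, c in counts.items() if c >= min_count}
print(len(counts))
print(len(filtered_counts))
```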
" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Part 3: Part of speech tagging\n" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We now look at an example of part of speech tagging using NLTK" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "sentence = \"WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement.\"\n", "\n", "pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))\n", "print(pos_sentence)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "NLTK provides a method to retrieve more information about a tag. For example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "nltk.help.upenn_tagset('NNP')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
**Optional:** Try out POS tagging on texts from different sources (like Facebook posts, tweets, etc.). What goes well? What goes wrong?
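For example, here is the tagger applied to a made-up informal message; inspect the output and see which tags look off.

```python
# A made-up informal text; non-standard spelling often confuses the tagger
informal = 'lol u gotta c this... sooo crazy #election'
print(nltk.pos_tag(nltk.word_tokenize(informal)))
```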
" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Part 4: Sentiment analysis" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We will do sentiment analysis using Empath (empath.stanford.edu), which is a dictionary tool that counts\n", "words in various categories (e.g., positive sentiment). The dictionary is created by first expanding manually provided seed words automatically, and then having crowdworkers filter out incorrect words. \n", "First, import the library and create a lexicon." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from empath import Empath\n", "lexicon = Empath()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Let's start analyzing a sentence. With setting normalize to True, the counts are normalized according to sentence length." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [], "source": [ "lexicon.analyze(tokenize(\"Bullshit, you can't even post FACTS on this sub- like Clinton lying about sniper fire.\"), normalize=True)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Another sentence" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "lexicon.analyze(tokenize(\"Totally agree. Planning to beat your opponent is not a sign of corruption. That's politics. \"), normalize=True)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
**Optional:** Explore the tool with some more examples. What happens in cases of sarcasm, negation, or very informal text?
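As a hint, negation is a known weak spot of dictionary-based counting: the two made-up sentences below differ only in the word 'not', yet will likely trigger much the same categories.

```python
# Made-up examples: word counting has no notion of negation
print(lexicon.analyze(tokenize('I am happy about this election'), normalize=True))
print(lexicon.analyze(tokenize('I am not happy about this election'), normalize=True))
```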
" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Part 5: Topic modeling" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We will now look at topic modeling. The code below is a utility class to help process the reddit data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "class RedditCorpus(object):\n", " def __init__(self, dictionary):\n", " \"\"\"\n", " Parse the data. \n", " Yield each document in turn, as a list of tokens.\n", " \n", " \"\"\"\n", " self.dictionary = dictionary\n", " \n", " def __iter__(self):\n", " for tokens in iter_reddit():\n", " yield self.dictionary.doc2bow(tokens)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We will be using the gensim library for topic modeling. The first step involves constructing a dictionary, which is a mapping from identifiers to words. " ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
**Note:** The following may take some time. If it takes too long, skip the next two steps and load the dictionary directly from the provided file.
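If you would still like to build a dictionary yourself without waiting for the full pass, one (hypothetical) middle ground is to construct it from a sample of the documents only.

```python
import itertools

# Sketch: build a dictionary from the first 10,000 documents only
id2word_sample = gensim.corpora.Dictionary(itertools.islice(iter_reddit(), 10000))
```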
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "id2word_reddit = gensim.corpora.Dictionary(iter_reddit())" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Save the full dictionary to a file" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "id2word_reddit.save(\"full_reddit.dict\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Load a dictionary from file (continue here if you skipped constructing the dictionary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "id2word_reddit = gensim.corpora.dictionary.Dictionary.load(\"../data/full_reddit.dict\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "How big is the dictionary?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "len(id2word_reddit)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The first word in the dictionary with identifier 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "id2word_reddit[0]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "It often helps to remove very infrequent and very frequent words. It will also help speed up the process (which we need - otherwise it will take a long time to train a topic model)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "id2word_reddit.filter_extremes(no_below=50, no_above=0.05)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "How big is the dictionary now?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "len(id2word_reddit)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Save the pruned dictionary" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "id2word_reddit.save(\"pruned_reddit.dict\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Load the pruned dictionary here in case something went wrong with the previous steps" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "id2word_reddit = gensim.corpora.dictionary.Dictionary.load(\"../data/pruned_reddit.dict\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We will now start building a topic model. The pretrained model uses 15 topics, but feel free to explore other settings when training your own model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "NUM_TOPICS = 15" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
**Optional:** Now, let's train a topic model. This takes a lot of time, so consider training the model during one of the breaks. Skip the next two steps to continue with exploring an already trained topic model.
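If a full run does not fit in a break, one (hypothetical) compromise is to fit the model on a sample of the corpus; the topics will be rougher, but training finishes much faster.

```python
import itertools

# Sketch: train on the first 10,000 bag-of-words documents only
sample_corpus = list(itertools.islice(iter(RedditCorpus(id2word_reddit)), 10000))
lda_sample = gensim.models.LdaModel(sample_corpus, num_topics=NUM_TOPICS, id2word=id2word_reddit, passes=2)
```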
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "reddit_corpus = RedditCorpus(id2word_reddit)\n", "lda_model_reddit = gensim.models.LdaModel(reddit_corpus, num_topics=NUM_TOPICS, id2word=id2word_reddit, passes=2, update_every=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "lda_model_reddit.save('reddit_lda.lda')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Load in a trained model if you didn't train a model yourself" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "lda_model_reddit = gensim.models.ldamodel.LdaModel.load('../data/reddit_lda.lda')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Print out the topics. For each topic, the top words and their probability are shown." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [], "source": [ "lda_model_reddit.print_topics(-1)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Get the topics for a particular text. If minimum_probability is not specified, only topics with a high probability are shown." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "lda_model_reddit.get_document_topics(id2word_reddit.doc2bow(tokenize(\"Just because you are selfish and don't want to pay taxes for services that you may/may not use, does not mean you don't have to pay them. \")), minimum_probability=0.0)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }