{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Modern NLP in Python\n", "### _- Or -_\n", "## What you can learn about food by analyzing a million Yelp reviews" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Before we get started...\n", "__whois?__\n", "- Patrick Harrison\n", "- Lead Data Scientist @ S&P Global Market Intelligence - _**we are hiring**_\n", "- University of Virginia — Systems Engineering\n", "- patrick@skipgram.io / @skipgram\n", "\n", "__Join Charlottesville Data Science!__\n", "- On Meetup.com ... http://www.meetup.com/CharlottesvilleDataScience\n", "- On Slack ... __https://cville.typeform.com/to/UEzMVh__\n", " - _link invites you to join the Cville team on Slack. Join Cville, then join channel __#datascience__._\n", " \n", "_Note: I presented this notebook as a tutorial during the [PyData DC 2016 conference](http://pydata.org/dc2016/schedule/presentation/11/). To view the video of the presentation on YouTube, see [here](https://www.youtube.com/watch?v=6zm9NC9uRkk)._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Our Trail Map\n", "This tutorial features an end-to-end data science & natural language processing pipeline, starting with **raw data** and running through **preparing**, **modeling**, **visualizing**, and **analyzing** the data. We'll touch on the following points:\n", "1. A tour of the dataset\n", "1. Introduction to text processing with spaCy\n", "1. Automatic phrase modeling\n", "1. Topic modeling with LDA\n", "1. Visualizing topic models with pyLDAvis\n", "1. Word vector models with word2vec\n", "1. Visualizing word2vec with t-SNE\n", "\n", "...and we might even learn a thing or two about Python along the way.\n", "\n", "Let's get started!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Yelp Dataset\n", "[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [Yelp](http://yelp.com) for academic research and educational purposes. I really like the Yelp dataset as a subject for machine learning and natural language processing demos, because it's big (but not so big that you need your own data center to process it), well-connected, and anyone can relate to it — it's largely about food, after all!\n", "\n", "**Note:** If you'd like to execute this notebook interactively on your local machine, you'll need to download your own copy of the Yelp dataset. If you're reviewing a static copy of the notebook online, you can skip this step. Here's how to get the dataset:\n", "1. Please visit the Yelp dataset webpage [here](https://www.yelp.com/dataset_challenge/)\n", "1. Click \"Get the Data\"\n", "1. Please review, agree to, and respect Yelp's terms of use!\n", "1. The dataset downloads as a compressed .tgz file; uncompress it\n", "1. Place the uncompressed dataset files (*yelp_academic_dataset_business.json*, etc.) in a directory named *yelp_dataset_challenge_academic_dataset*\n", "1. Place the *yelp_dataset_challenge_academic_dataset* within the *data* directory in the *Modern NLP in Python* project folder\n", "\n", "That's it! 
You're ready to go.\n", "\n", "The current iteration of the Yelp dataset (as of this demo) consists of the following data:\n", "- __552K__ users\n", "- __77K__ businesses\n", "- __2.2M__ user reviews\n", "\n", "When focusing on restaurants alone, there are approximately __22K__ restaurants with approximately __1M__ user reviews written about them.\n", "\n", "The data is provided in a handful of files in _.json_ format. We'll be using the following files for our demo:\n", "- __yelp\\_academic\\_dataset\\_business.json__ — _the records for individual businesses_\n", "- __yelp\\_academic\\_dataset\\_review.json__ — _the records for reviews users wrote about businesses_\n", "\n", "The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. Let's take a look at a few examples." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"business_id\": \"vcNAWiLM4dR7D2nwwJ7nCA\", \"full_address\": \"4840 E Indian School Rd\\nSte 101\\nPhoenix, AZ 85018\", \"hours\": {\"Tuesday\": {\"close\": \"17:00\", \"open\": \"08:00\"}, \"Friday\": {\"close\": \"17:00\", \"open\": \"08:00\"}, \"Monday\": {\"close\": \"17:00\", \"open\": \"08:00\"}, \"Wednesday\": {\"close\": \"17:00\", \"open\": \"08:00\"}, \"Thursday\": {\"close\": \"17:00\", \"open\": \"08:00\"}}, \"open\": true, \"categories\": [\"Doctors\", \"Health & Medical\"], \"city\": \"Phoenix\", \"review_count\": 9, \"name\": \"Eric Goldberg, MD\", \"neighborhoods\": [], \"longitude\": -111.98375799999999, \"state\": \"AZ\", \"stars\": 3.5, \"latitude\": 33.499313000000001, \"attributes\": {\"By Appointment Only\": true}, \"type\": \"business\"}\n", "\n" ] } ], "source": [ "import os\n", "import codecs\n", "\n", "data_directory = os.path.join('..', 'data',\n", " 'yelp_dataset_challenge_academic_dataset')\n", "\n", "businesses_filepath = os.path.join(data_directory,\n", " 'yelp_academic_dataset_business.json')\n", "\n", "with codecs.open(businesses_filepath, encoding='utf_8') as f:\n", " first_business_record = f.readline() \n", "\n", "print first_business_record" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The business records consist of _key, value_ pairs containing information about the particular business. A few attributes we'll be interested in for this demo include:\n", "- __business\\_id__ — _unique identifier for businesses_\n", "- __categories__ — _an array containing relevant category values of businesses_\n", "\n", "The _categories_ attribute is of special interest. This demo will focus on restaurants, which are indicated by the presence of the _Restaurant_ tag in the _categories_ array. In addition, the _categories_ array may contain more detailed information about restaurants, such as the type of food they serve." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The review records are stored in a similar manner — _key, value_ pairs containing information about the reviews." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\"votes\": {\"funny\": 0, \"useful\": 2, \"cool\": 1}, \"user_id\": \"Xqd0DzHaiyRqVH3WRG7hzg\", \"review_id\": \"15SdjuK7DmYqUAj6rjGowg\", \"stars\": 5, \"date\": \"2007-05-17\", \"text\": \"dr. goldberg offers everything i look for in a general practitioner. 
he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.\", \"type\": \"review\", \"business_id\": \"vcNAWiLM4dR7D2nwwJ7nCA\"}\n", "\n" ] } ], "source": [ "review_json_filepath = os.path.join(data_directory,\n", " 'yelp_academic_dataset_review.json')\n", "\n", "with codecs.open(review_json_filepath, encoding='utf_8') as f:\n", " first_review_record = f.readline()\n", " \n", "print first_review_record" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few attributes of note on the review records:\n", "- __business\\_id__ — _indicates which business the review is about_\n", "- __text__ — _the natural language text the user wrote_\n", "\n", "The _text_ attribute will be our focus today!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_json_ is a handy file format for data interchange, but it's typically not the most usable for any sort of modeling work. Let's do a bit more data preparation to get our data in a more usable format. Our next code block will do the following:\n", "1. Read in each business record and convert it to a Python `dict`\n", "2. Filter out business records that aren't about restaurants (i.e., not in the \"Restaurant\" category)\n", "3. Create a `frozenset` of the business IDs for restaurants, which we'll use in the next step" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "21,892 restaurants in the dataset.\n" ] } ], "source": [ "import json\n", "\n", "restaurant_ids = set()\n", "\n", "# open the businesses file\n", "with codecs.open(businesses_filepath, encoding='utf_8') as f:\n", " \n", " # iterate through each line (json record) in the file\n", " for business_json in f:\n", " \n", " # convert the json record to a Python dict\n", " business = json.loads(business_json)\n", " \n", " # if this business is not a restaurant, skip to the next one\n", " if u'Restaurants' not in business[u'categories']:\n", " continue\n", " \n", " # add the restaurant business id to our restaurant_ids set\n", " restaurant_ids.add(business[u'business_id'])\n", "\n", "# turn restaurant_ids into a frozenset, as we don't need to change it anymore\n", "restaurant_ids = frozenset(restaurant_ids)\n", "\n", "# print the number of unique restaurant ids in the dataset\n", "print '{:,}'.format(len(restaurant_ids)), u'restaurants in the dataset.'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will create a new file that contains only the text from reviews about restaurants, with one review per line in the file." 
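] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we do that, a quick optional sanity check: which other category tags co-occur with _Restaurants_? The sketch below reuses `businesses_filepath` and the same parsing logic as the previous cell; nothing downstream depends on it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from collections import Counter\n", "\n", "# optional sanity check (a quick sketch; not needed later):\n", "# tally the category tags that co-occur with 'Restaurants'\n", "restaurant_categories = Counter()\n", "\n", "with codecs.open(businesses_filepath, encoding='utf_8') as f:\n", "    for business_json in f:\n", "        business = json.loads(business_json)\n", "        if u'Restaurants' not in business[u'categories']:\n", "            continue\n", "        restaurant_categories.update(business[u'categories'])\n", "\n", "# show the ten most common tags among restaurants\n", "for category, count in restaurant_categories.most_common(10):\n", "    print u'{:,}'.format(count), category"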
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "intermediate_directory = os.path.join('..', 'intermediate')\n", "\n", "review_txt_filepath = os.path.join(intermediate_directory,\n", " 'review_text_all.txt')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Text from 991,714 restaurant reviews in the txt file.\n", "CPU times: user 26.7 s, sys: 1.21 s, total: 27.9 s\n", "Wall time: 28.1 s\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to execute data prep yourself.\n", "if 0 == 1:\n", " \n", " review_count = 0\n", "\n", " # create & open a new file in write mode\n", " with codecs.open(review_txt_filepath, 'w', encoding='utf_8') as review_txt_file:\n", "\n", " # open the existing review json file\n", " with codecs.open(review_json_filepath, encoding='utf_8') as review_json_file:\n", "\n", " # loop through all reviews in the existing file and convert to dict\n", " for review_json in review_json_file:\n", " review = json.loads(review_json)\n", "\n", " # if this review is not about a restaurant, skip to the next one\n", " if review[u'business_id'] not in restaurant_ids:\n", " continue\n", "\n", " # write the restaurant review as a line in the new file\n", " # escape newline characters in the original review text\n", " review_txt_file.write(review[u'text'].replace('\\n', '\\\\n') + '\\n')\n", " review_count += 1\n", "\n", " print u'''Text from {:,} restaurant reviews\n", " written to the new txt file.'''.format(review_count)\n", " \n", "else:\n", " \n", " with codecs.open(review_txt_filepath, encoding='utf_8') as review_txt_file:\n", " for review_count, line in enumerate(review_txt_file):\n", " pass\n", " \n", " print u'Text from {:,} restaurant reviews in the txt file.'.format(review_count + 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## spaCy — Industrial-Strength NLP in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![spaCy](https://s3.amazonaws.com/skipgram-images/spaCy.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[**spaCy**](https://spacy.io) is an industrial-strength natural language processing (_NLP_) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.\n", "\n", "spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:\n", "- Tokenization\n", "- Text normalization, such as lowercasing, stemming/lemmatization\n", "- Part-of-speech tagging\n", "- Syntactic dependency parsing\n", "- Sentence boundary detection\n", "- Named entity recognition and annotation\n", "\n", "In the \"batteries included\" Python tradition, spaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:\n", "- Large English vocabulary, including stopword lists\n", "- Token \"probabilities\"\n", "- Word vectors\n", "\n", "spaCy is written in optimized Cython, which means it's _fast_. According to a few independent sources, it's the fastest syntactic parser available in any language. Key pieces of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the _GIL_)." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import spacy\n", "import pandas as pd\n", "import itertools as it\n", "\n", "nlp = spacy.load('en')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's grab a sample review to play with." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "After a morning of Thrift Store hunting, a friend and I were thinking of lunch, and he suggested Emil's after he'd seen Chris Sebak do a bit on it and had tried it a time or two before, and I had not. He said they had a decent Reuben, but to be prepared to step back in time.\n", "\n", "Well, seeing as how I'm kind of addicted to late 40's and early 50's, and the whole Rat Pack scene, stepping back in time is a welcomed change in da burgh...as long as it doesn't involve 1979, which I can see all around me every day.\n", "\n", "And yet another shot at finding a decent Reuben in da burgh...well, that's like hunting the Holy Grail. So looking under one more bush certainly wouldn't hurt.\n", "\n", "So off we go right at lunchtime in the middle of...where exactly were we? At first I thought we were lost, driving around a handful of very rather dismal looking blocks in what looked like a neighborhood that had been blighted by the building of a highway. And then...AHA! Here it is! And yep, there it was. This little unassuming building with an add-on entrance with what looked like a very old hand painted sign stating quite simply 'Emil's. \n", "\n", "We walked in the front door, and entered another world. Another time, and another place. Oh, and any Big Burrito/Sousa foodies might as well stop reading now. I wouldn't want to see you walk in, roll your eyes and say 'Reaaaaaalllly?'\n", "\n", "This is about as old world bar/lounge/restaurant as it gets. Plain, with a dark wood bar on one side, plain white walls with no yinzer pics, good sturdy chairs and actual white linens on the tables. This is the kind of neighborhood dive that I could see Frank and Dino pulling a few tables together for some poker, a fish sammich, and some cheap scotch. And THAT is exactly what I love.\n", "\n", "Oh...but good food counts too. \n", "\n", "We each had a Reuben, and my friend had a side of fries. The Reubens were decent, but not NY awesome. A little too thick on the bread, but overall, tasty and definitely filling. Not too skimpy on the meat. I seriously CRAVE a true, good NY Reuben, but since I can't afford to travel right now, what I find in da burgh will have to do. But as we sat and ate, burgers came out to an adjoining table. Those were some big thick burgers. A steak went past for the table behind us. That was HUGE! And when we asked about it, the waitress said 'Yeah, it's huge and really good, and he only charges $12.99 for it, ain't that nuts?' Another table of five came in, and wham. Fish sandwiches PILED with breaded fish that looked amazing. Yeah, I want that, that, that and THAT!\n", "\n", "My friend also mentioned that they have a Chicken Parm special one day of the week that is only served UNTIL 4 pm, and that it is fantastic. If only I could GET there on that week day before 4...\n", "\n", "The waitress did a good job, especially since there was quite a growing crowd at lunchtime on a Saturday, and only one of her. She kept up and was very friendly. 
\n", "\n", "They only have Pepsi products, so I had a brewed iced tea, which was very fresh, and she did pop by to ask about refills as often as she could. As the lunch hour went on, they were getting busy.\n", "\n", "Emil's is no frills, good portions, very reasonable prices, VERY comfortable neighborhood hole in the wall...kind of like Cheers, but in a blue collar neighborhood in the 1950's. Fan-freakin-tastic! I could feel at home here.\n", "\n", "You definitely want to hit Mapquest or plug in your GPS though. I am not sure that I could find it again on my own...it really is a hidden gem. I will be making my friend take me back until I can memorize where the heck it is.\n", "\n", "Addendum: 2nd visit for the fish sandwich. Excellent. Truly. A pound of fish on a fish-shaped bun (as opposed to da burgh's seemingly popular hamburger bun). The fish was flavorful, the batter excellent, and for just $8. This may have been the best fish sandwich I've yet to have in da burgh.\n", "\n" ] } ], "source": [ "with codecs.open(review_txt_filepath, encoding='utf_8') as f:\n", " sample_review = list(it.islice(f, 8, 9))[0]\n", " sample_review = sample_review.replace('\\\\n', '\\n')\n", " \n", "print sample_review" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hand the review text to spaCy, and be prepared to wait..." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 222 ms, sys: 11.6 ms, total: 234 ms\n", "Wall time: 251 ms\n" ] } ], "source": [ "%%time\n", "parsed_review = nlp(sample_review)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...1/20th of a second or so. Let's take a look at what we got during that time..." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "After a morning of Thrift Store hunting, a friend and I were thinking of lunch, and he suggested Emil's after he'd seen Chris Sebak do a bit on it and had tried it a time or two before, and I had not. He said they had a decent Reuben, but to be prepared to step back in time.\n", "\n", "Well, seeing as how I'm kind of addicted to late 40's and early 50's, and the whole Rat Pack scene, stepping back in time is a welcomed change in da burgh...as long as it doesn't involve 1979, which I can see all around me every day.\n", "\n", "And yet another shot at finding a decent Reuben in da burgh...well, that's like hunting the Holy Grail. So looking under one more bush certainly wouldn't hurt.\n", "\n", "So off we go right at lunchtime in the middle of...where exactly were we? At first I thought we were lost, driving around a handful of very rather dismal looking blocks in what looked like a neighborhood that had been blighted by the building of a highway. And then...AHA! Here it is! And yep, there it was. This little unassuming building with an add-on entrance with what looked like a very old hand painted sign stating quite simply 'Emil's. \n", "\n", "We walked in the front door, and entered another world. Another time, and another place. Oh, and any Big Burrito/Sousa foodies might as well stop reading now. I wouldn't want to see you walk in, roll your eyes and say 'Reaaaaaalllly?'\n", "\n", "This is about as old world bar/lounge/restaurant as it gets. Plain, with a dark wood bar on one side, plain white walls with no yinzer pics, good sturdy chairs and actual white linens on the tables. 
This is the kind of neighborhood dive that I could see Frank and Dino pulling a few tables together for some poker, a fish sammich, and some cheap scotch. And THAT is exactly what I love.\n", "\n", "Oh...but good food counts too. \n", "\n", "We each had a Reuben, and my friend had a side of fries. The Reubens were decent, but not NY awesome. A little too thick on the bread, but overall, tasty and definitely filling. Not too skimpy on the meat. I seriously CRAVE a true, good NY Reuben, but since I can't afford to travel right now, what I find in da burgh will have to do. But as we sat and ate, burgers came out to an adjoining table. Those were some big thick burgers. A steak went past for the table behind us. That was HUGE! And when we asked about it, the waitress said 'Yeah, it's huge and really good, and he only charges $12.99 for it, ain't that nuts?' Another table of five came in, and wham. Fish sandwiches PILED with breaded fish that looked amazing. Yeah, I want that, that, that and THAT!\n", "\n", "My friend also mentioned that they have a Chicken Parm special one day of the week that is only served UNTIL 4 pm, and that it is fantastic. If only I could GET there on that week day before 4...\n", "\n", "The waitress did a good job, especially since there was quite a growing crowd at lunchtime on a Saturday, and only one of her. She kept up and was very friendly. \n", "\n", "They only have Pepsi products, so I had a brewed iced tea, which was very fresh, and she did pop by to ask about refills as often as she could. As the lunch hour went on, they were getting busy.\n", "\n", "Emil's is no frills, good portions, very reasonable prices, VERY comfortable neighborhood hole in the wall...kind of like Cheers, but in a blue collar neighborhood in the 1950's. Fan-freakin-tastic! I could feel at home here.\n", "\n", "You definitely want to hit Mapquest or plug in your GPS though. I am not sure that I could find it again on my own...it really is a hidden gem. I will be making my friend take me back until I can memorize where the heck it is.\n", "\n", "Addendum: 2nd visit for the fish sandwich. Excellent. Truly. A pound of fish on a fish-shaped bun (as opposed to da burgh's seemingly popular hamburger bun). The fish was flavorful, the batter excellent, and for just $8. This may have been the best fish sandwich I've yet to have in da burgh.\n", "\n" ] } ], "source": [ "print parsed_review" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks the same! What happened under the hood?\n", "\n", "What about sentence detection and segmentation?" 
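] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before looking at sentences, a quick peek under the hood: printing the result reproduces the review text, but `parsed_review` is no longer a plain string. It's a spaCy `Doc`, a sequence of annotated `Token` objects." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# a quick illustrative peek: the parse returned a spaCy Doc --\n", "# a container of annotated tokens -- not a plain string\n", "print type(parsed_review)\n", "print u'{:,} tokens'.format(len(parsed_review))"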
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sentence 1:\n", "After a morning of Thrift Store hunting, a friend and I were thinking of lunch, and he suggested Emil's after he'd seen Chris Sebak do a bit on it and had tried it a time or two before, and I had not.\n", "\n", "Sentence 2:\n", "He said they had a decent Reuben, but to be prepared to step back in time.\n", "\n", "\n", "\n", "Sentence 3:\n", "Well, seeing as how I'm kind of addicted to late 40's and early 50's, and the whole Rat Pack scene, stepping back in time is a welcomed change in da burgh...as long as it doesn't involve 1979, which I can see all around me every day.\n", "\n", "\n", "\n", "Sentence 4:\n", "And yet another shot at finding a decent Reuben in da burgh...\n", "\n", "Sentence 5:\n", "well, that's like hunting the Holy Grail.\n", "\n", "Sentence 6:\n", "So looking under one more bush certainly wouldn't hurt.\n", "\n", "\n", "\n", "Sentence 7:\n", "So off we go right at lunchtime in the middle of...where exactly were we?\n", "\n", "Sentence 8:\n", "At first I thought we were lost, driving around a handful of very rather dismal looking blocks in what looked like a neighborhood that had been blighted by the building of a highway.\n", "\n", "Sentence 9:\n", "And then...AHA!\n", "\n", "Sentence 10:\n", "Here it is!\n", "\n", "Sentence 11:\n", "And yep, there it was.\n", "\n", "Sentence 12:\n", "This little unassuming building with an add-on entrance with what looked like a very old hand painted sign stating quite simply 'Emil's. \n", "\n", "\n", "\n", "Sentence 13:\n", "We walked in the front door, and entered another world.\n", "\n", "Sentence 14:\n", "Another time, and another place.\n", "\n", "Sentence 15:\n", "Oh, and any Big Burrito/Sousa foodies might as well stop reading now.\n", "\n", "Sentence 16:\n", "I wouldn't want to see you walk in, roll your eyes and say 'Reaaaaaalllly?'\n", "\n", "\n", "\n", "Sentence 17:\n", "This is about as old world bar/lounge/restaurant as it gets.\n", "\n", "Sentence 18:\n", "Plain, with a dark wood bar on one side, plain white walls with no yinzer pics, good sturdy chairs and actual white linens on the tables.\n", "\n", "Sentence 19:\n", "This is the kind of neighborhood dive that I could see Frank and Dino pulling a few tables together for some poker, a fish sammich, and some cheap scotch.\n", "\n", "Sentence 20:\n", "And THAT is exactly what I love.\n", "\n", "\n", "\n", "Sentence 21:\n", "Oh...but good food counts too. 
\n", "\n", "\n", "\n", "Sentence 22:\n", "We each had a Reuben, and my friend had a side of fries.\n", "\n", "Sentence 23:\n", "The Reubens were decent, but not NY awesome.\n", "\n", "Sentence 24:\n", "A little too thick on the bread, but overall, tasty and definitely filling.\n", "\n", "Sentence 25:\n", "Not too skimpy on the meat.\n", "\n", "Sentence 26:\n", "I seriously CRAVE a true, good NY Reuben, but since I can't afford to travel right now, what I find in da burgh will have to do.\n", "\n", "Sentence 27:\n", "But as we sat and ate, burgers came out to an adjoining table.\n", "\n", "Sentence 28:\n", "Those were some big thick burgers.\n", "\n", "Sentence 29:\n", "A steak went past for the table behind us.\n", "\n", "Sentence 30:\n", "That was HUGE!\n", "\n", "Sentence 31:\n", "And when we asked about it, the waitress said 'Yeah, it's huge and really good, and he only charges $12.99 for it, ain't that nuts?'\n", "\n", "Sentence 32:\n", "Another table of five came in, and wham.\n", "\n", "Sentence 33:\n", "Fish sandwiches PILED with breaded fish that looked amazing.\n", "\n", "Sentence 34:\n", "Yeah, I want that, that, that and THAT!\n", "\n", "\n", "\n", "Sentence 35:\n", "My friend also mentioned that they have a Chicken Parm special one day of the week that is only served UNTIL 4 pm, and that it is fantastic.\n", "\n", "Sentence 36:\n", "If only I could GET there on that week day before 4...\n", "\n", "\n", "\n", "Sentence 37:\n", "The waitress did a good job, especially since there was quite a growing crowd at lunchtime on a Saturday, and only one of her.\n", "\n", "Sentence 38:\n", "She kept up and was very friendly. \n", "\n", "\n", "\n", "Sentence 39:\n", "They only have Pepsi products, so I had a brewed iced tea, which was very fresh, and she did pop by to ask about refills as often as she could.\n", "\n", "Sentence 40:\n", "As the lunch hour went on, they were getting busy.\n", "\n", "\n", "\n", "Sentence 41:\n", "Emil's is no frills, good portions, very reasonable prices, VERY comfortable neighborhood hole in the wall...\n", "\n", "Sentence 42:\n", "kind of like Cheers, but in a blue collar neighborhood in the 1950's.\n", "\n", "Sentence 43:\n", "Fan-freakin-tastic!\n", "\n", "Sentence 44:\n", "I could feel at home here.\n", "\n", "\n", "\n", "Sentence 45:\n", "You definitely want to hit Mapquest or plug in your GPS though.\n", "\n", "Sentence 46:\n", "I am not sure that I could find it again on my own...it really is a hidden gem.\n", "\n", "Sentence 47:\n", "I will be making my friend take me back until I can memorize where the heck it is.\n", "\n", "\n", "\n", "Sentence 48:\n", "Addendum: 2nd visit for the fish sandwich.\n", "\n", "Sentence 49:\n", "Excellent.\n", "\n", "Sentence 50:\n", "Truly.\n", "\n", "Sentence 51:\n", "A pound of fish on a fish-shaped bun (as opposed to da burgh's seemingly popular hamburger bun).\n", "\n", "Sentence 52:\n", "The fish was flavorful, the batter excellent, and for just $8.\n", "\n", "Sentence 53:\n", "This may have been the best fish sandwich I've yet to have in da burgh.\n", "\n", "\n" ] } ], "source": [ "for num, sentence in enumerate(parsed_review.sents):\n", " print 'Sentence {}:'.format(num + 1)\n", " print sentence\n", " print ''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about named entity detection?" 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Entity 1: Thrift Store - ORG\n", "\n", "Entity 2: Emil - PERSON\n", "\n", "Entity 3: Chris Sebak - PERSON\n", "\n", "Entity 4: two - CARDINAL\n", "\n", "Entity 5: Reuben - PERSON\n", "\n", "Entity 6: Rat Pack - ORG\n", "\n", "Entity 7: 1979 - DATE\n", "\n", "Entity 8: every day - DATE\n", "\n", "Entity 9: Reuben - PERSON\n", "\n", "Entity 10: one - CARDINAL\n", "\n", "Entity 11: Emil - PERSON\n", "\n", "Entity 12: Frank - PERSON\n", "\n", "Entity 13: Dino - PERSON\n", "\n", "Entity 14: Reuben - PERSON\n", "\n", "Entity 15: Reubens - PERSON\n", "\n", "Entity 16: Reuben - PERSON\n", "\n", "Entity 17: HUGE - ORG\n", "\n", "Entity 18: 12.99 - MONEY\n", "\n", "Entity 19: five - CARDINAL\n", "\n", "Entity 20: one day - DATE\n", "\n", "Entity 21: UNTIL - ORG\n", "\n", "Entity 22: 4 pm - TIME\n", "\n", "Entity 23: that week day - DATE\n", "\n", "Entity 24: Saturday - DATE\n", "\n", "Entity 25: only one - CARDINAL\n", "\n", "Entity 26: Pepsi - ORG\n", "\n", "Entity 27: the lunch hour - TIME\n", "\n", "Entity 28: Emil - PERSON\n", "\n", "Entity 29: 1950 - DATE\n", "\n", "Entity 30: Mapquest - LOC\n", "\n", "Entity 31: 2nd - CARDINAL\n", "\n", "Entity 32: Truly - PERSON\n", "\n", "Entity 33: 8 - MONEY\n", "\n" ] } ], "source": [ "for num, entity in enumerate(parsed_review.ents):\n", " print 'Entity {}:'.format(num + 1), entity, '-', entity.label_\n", " print ''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about part of speech tagging?" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
token_textpart_of_speech
0AfterADP
1aDET
2morningNOUN
3ofADP
4ThriftPROPN
5StorePROPN
6huntingNOUN
7,PUNCT
8aDET
9friendNOUN
10andCONJ
11IPRON
12wereVERB
13thinkingVERB
14ofADP
15lunchNOUN
16,PUNCT
17andCONJ
18hePRON
19suggestedVERB
20EmilPROPN
21'sPART
22afterADP
23hePRON
24'dVERB
25seenVERB
26ChrisPROPN
27SebakPROPN
28doVERB
29aDET
.........
855flavorfulADJ
856,PUNCT
857theDET
858batterNOUN
859excellentADJ
860,PUNCT
861andCONJ
862forADP
863justADV
864$SYM
8658NUM
866.PUNCT
867ThisDET
868mayVERB
869haveVERB
870beenVERB
871theDET
872bestADJ
873fishNOUN
874sandwichNOUN
875IPRON
876'veVERB
877yetADV
878toPART
879haveVERB
880inADP
881daPROPN
882burghNOUN
883.PUNCT
884\\nSPACE
\n", "

885 rows × 2 columns

\n", "
" ], "text/plain": [ " token_text part_of_speech\n", "0 After ADP\n", "1 a DET\n", "2 morning NOUN\n", "3 of ADP\n", "4 Thrift PROPN\n", "5 Store PROPN\n", "6 hunting NOUN\n", "7 , PUNCT\n", "8 a DET\n", "9 friend NOUN\n", "10 and CONJ\n", "11 I PRON\n", "12 were VERB\n", "13 thinking VERB\n", "14 of ADP\n", "15 lunch NOUN\n", "16 , PUNCT\n", "17 and CONJ\n", "18 he PRON\n", "19 suggested VERB\n", "20 Emil PROPN\n", "21 's PART\n", "22 after ADP\n", "23 he PRON\n", "24 'd VERB\n", "25 seen VERB\n", "26 Chris PROPN\n", "27 Sebak PROPN\n", "28 do VERB\n", "29 a DET\n", ".. ... ...\n", "855 flavorful ADJ\n", "856 , PUNCT\n", "857 the DET\n", "858 batter NOUN\n", "859 excellent ADJ\n", "860 , PUNCT\n", "861 and CONJ\n", "862 for ADP\n", "863 just ADV\n", "864 $ SYM\n", "865 8 NUM\n", "866 . PUNCT\n", "867 This DET\n", "868 may VERB\n", "869 have VERB\n", "870 been VERB\n", "871 the DET\n", "872 best ADJ\n", "873 fish NOUN\n", "874 sandwich NOUN\n", "875 I PRON\n", "876 've VERB\n", "877 yet ADV\n", "878 to PART\n", "879 have VERB\n", "880 in ADP\n", "881 da PROPN\n", "882 burgh NOUN\n", "883 . PUNCT\n", "884 \\n SPACE\n", "\n", "[885 rows x 2 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token_text = [token.orth_ for token in parsed_review]\n", "token_pos = [token.pos_ for token in parsed_review]\n", "\n", "pd.DataFrame(zip(token_text, token_pos),\n", " columns=['token_text', 'part_of_speech'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about text normalization, like stemming/lemmatization and shape analysis?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
token_texttoken_lemmatoken_shape
0AfterafterXxxxx
1aax
2morningmorningxxxx
3ofofxx
4ThriftthriftXxxxx
5StorestoreXxxxx
6huntinghuntingxxxx
7,,,
8aax
9friendfriendxxxx
10andandxxx
11IiX
12werebexxxx
13thinkingthinkxxxx
14ofofxx
15lunchlunchxxxx
16,,,
17andandxxx
18hehexx
19suggestedsuggestxxxx
20EmilemilXxxx
21's's'x
22afterafterxxxx
23hehexx
24'dwould'x
25seenseexxxx
26ChrischrisXxxxx
27SebaksebakXxxxx
28dodoxx
29aax
............
855flavorfulflavorfulxxxx
856,,,
857thethexxx
858batterbatterxxxx
859excellentexcellentxxxx
860,,,
861andandxxx
862forforxxx
863justjustxxxx
864$$$
86588d
866...
867ThisthisXxxx
868maymayxxx
869havehavexxxx
870beenbexxxx
871thethexxx
872bestbestxxxx
873fishfishxxxx
874sandwichsandwichxxxx
875IiX
876'vehave'xx
877yetyetxxx
878totoxx
879havehavexxxx
880ininxx
881dadaxx
882burghburghxxxx
883...
884\\n\\n\\n
\n", "

885 rows × 3 columns

\n", "
" ], "text/plain": [ " token_text token_lemma token_shape\n", "0 After after Xxxxx\n", "1 a a x\n", "2 morning morning xxxx\n", "3 of of xx\n", "4 Thrift thrift Xxxxx\n", "5 Store store Xxxxx\n", "6 hunting hunting xxxx\n", "7 , , ,\n", "8 a a x\n", "9 friend friend xxxx\n", "10 and and xxx\n", "11 I i X\n", "12 were be xxxx\n", "13 thinking think xxxx\n", "14 of of xx\n", "15 lunch lunch xxxx\n", "16 , , ,\n", "17 and and xxx\n", "18 he he xx\n", "19 suggested suggest xxxx\n", "20 Emil emil Xxxx\n", "21 's 's 'x\n", "22 after after xxxx\n", "23 he he xx\n", "24 'd would 'x\n", "25 seen see xxxx\n", "26 Chris chris Xxxxx\n", "27 Sebak sebak Xxxxx\n", "28 do do xx\n", "29 a a x\n", ".. ... ... ...\n", "855 flavorful flavorful xxxx\n", "856 , , ,\n", "857 the the xxx\n", "858 batter batter xxxx\n", "859 excellent excellent xxxx\n", "860 , , ,\n", "861 and and xxx\n", "862 for for xxx\n", "863 just just xxxx\n", "864 $ $ $\n", "865 8 8 d\n", "866 . . .\n", "867 This this Xxxx\n", "868 may may xxx\n", "869 have have xxxx\n", "870 been be xxxx\n", "871 the the xxx\n", "872 best best xxxx\n", "873 fish fish xxxx\n", "874 sandwich sandwich xxxx\n", "875 I i X\n", "876 've have 'xx\n", "877 yet yet xxx\n", "878 to to xx\n", "879 have have xxxx\n", "880 in in xx\n", "881 da da xx\n", "882 burgh burgh xxxx\n", "883 . . .\n", "884 \\n \\n \\n\n", "\n", "[885 rows x 3 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token_lemma = [token.lemma_ for token in parsed_review]\n", "token_shape = [token.shape_ for token in parsed_review]\n", "\n", "pd.DataFrame(zip(token_text, token_lemma, token_shape),\n", " columns=['token_text', 'token_lemma', 'token_shape'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about token-level entity analysis?" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
token_textentity_typeinside_outside_begin
0AfterO
1aO
2morningO
3ofO
4ThriftORGB
5StoreORGI
6huntingO
7,O
8aO
9friendO
10andO
11IO
12wereO
13thinkingO
14ofO
15lunchO
16,O
17andO
18heO
19suggestedO
20EmilPERSONB
21'sO
22afterO
23heO
24'dO
25seenO
26ChrisPERSONB
27SebakPERSONI
28doO
29aO
............
855flavorfulO
856,O
857theO
858batterO
859excellentO
860,O
861andO
862forO
863justO
864$O
8658MONEYB
866.O
867ThisO
868mayO
869haveO
870beenO
871theO
872bestO
873fishO
874sandwichO
875IO
876'veO
877yetO
878toO
879haveO
880inO
881daO
882burghO
883.O
884\\nO
\n", "

885 rows × 3 columns

\n", "
" ], "text/plain": [ " token_text entity_type inside_outside_begin\n", "0 After O\n", "1 a O\n", "2 morning O\n", "3 of O\n", "4 Thrift ORG B\n", "5 Store ORG I\n", "6 hunting O\n", "7 , O\n", "8 a O\n", "9 friend O\n", "10 and O\n", "11 I O\n", "12 were O\n", "13 thinking O\n", "14 of O\n", "15 lunch O\n", "16 , O\n", "17 and O\n", "18 he O\n", "19 suggested O\n", "20 Emil PERSON B\n", "21 's O\n", "22 after O\n", "23 he O\n", "24 'd O\n", "25 seen O\n", "26 Chris PERSON B\n", "27 Sebak PERSON I\n", "28 do O\n", "29 a O\n", ".. ... ... ...\n", "855 flavorful O\n", "856 , O\n", "857 the O\n", "858 batter O\n", "859 excellent O\n", "860 , O\n", "861 and O\n", "862 for O\n", "863 just O\n", "864 $ O\n", "865 8 MONEY B\n", "866 . O\n", "867 This O\n", "868 may O\n", "869 have O\n", "870 been O\n", "871 the O\n", "872 best O\n", "873 fish O\n", "874 sandwich O\n", "875 I O\n", "876 've O\n", "877 yet O\n", "878 to O\n", "879 have O\n", "880 in O\n", "881 da O\n", "882 burgh O\n", "883 . O\n", "884 \\n O\n", "\n", "[885 rows x 3 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token_entity_type = [token.ent_type_ for token in parsed_review]\n", "token_entity_iob = [token.ent_iob_ for token in parsed_review]\n", "\n", "pd.DataFrame(zip(token_text, token_entity_type, token_entity_iob),\n", " columns=['token_text', 'entity_type', 'inside_outside_begin'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What about a variety of other token-level attributes, such as the relative frequency of tokens, and whether or not a token matches any of these categories?\n", "- stopword\n", "- punctuation\n", "- whitespace\n", "- represents a number\n", "- whether or not the token is included in spaCy's default vocabulary?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
textlog_probabilitystop?punctuation?whitespace?number?out of vocab.?
0After-9.091193Yes
1a-3.929788Yes
2morning-9.529314
3of-4.275874Yes
4Thrift-14.550483
5Store-11.719210
6hunting-10.961483
7,-3.454960Yes
8a-3.929788Yes
9friend-8.210516
10and-4.113108Yes
11I-3.791565Yes
12were-6.673175Yes
13thinking-8.442947
14of-4.275874Yes
15lunch-10.572958
16,-3.454960Yes
17and-4.113108Yes
18he-5.931905Yes
19suggested-10.656719
20Emil-15.862375
21's-4.830559
22after-7.265652Yes
23he-5.931905Yes
24'd-7.075287
25seen-7.973224
26Chris-10.966099
27Sebak-19.502029Yes
28do-5.246997Yes
29a-3.929788Yes
........................
855flavorful-14.094742
856,-3.454960Yes
857the-3.528767Yes
858batter-12.895466
859excellent-10.147964
860,-3.454960Yes
861and-4.113108Yes
862for-4.880109Yes
863just-5.630868Yes
864$-7.450107
8658-8.940966Yes
866.-3.067898Yes
867This-6.783917Yes
868may-7.678495Yes
869have-5.156485Yes
870been-6.670917Yes
871the-3.528767Yes
872best-7.492557
873fish-10.166230
874sandwich-11.186007
875I-3.791565Yes
876've-6.593011
877yet-8.229137Yes
878to-3.856022Yes
879have-5.156485Yes
880in-4.619072Yes
881da-10.829142
882burgh-16.942732
883.-3.067898Yes
884\\n-6.050651Yes
\n", "

885 rows × 7 columns

\n", "
" ], "text/plain": [ " text log_probability stop? punctuation? whitespace? number? \\\n", "0 After -9.091193 Yes \n", "1 a -3.929788 Yes \n", "2 morning -9.529314 \n", "3 of -4.275874 Yes \n", "4 Thrift -14.550483 \n", "5 Store -11.719210 \n", "6 hunting -10.961483 \n", "7 , -3.454960 Yes \n", "8 a -3.929788 Yes \n", "9 friend -8.210516 \n", "10 and -4.113108 Yes \n", "11 I -3.791565 Yes \n", "12 were -6.673175 Yes \n", "13 thinking -8.442947 \n", "14 of -4.275874 Yes \n", "15 lunch -10.572958 \n", "16 , -3.454960 Yes \n", "17 and -4.113108 Yes \n", "18 he -5.931905 Yes \n", "19 suggested -10.656719 \n", "20 Emil -15.862375 \n", "21 's -4.830559 \n", "22 after -7.265652 Yes \n", "23 he -5.931905 Yes \n", "24 'd -7.075287 \n", "25 seen -7.973224 \n", "26 Chris -10.966099 \n", "27 Sebak -19.502029 \n", "28 do -5.246997 Yes \n", "29 a -3.929788 Yes \n", ".. ... ... ... ... ... ... \n", "855 flavorful -14.094742 \n", "856 , -3.454960 Yes \n", "857 the -3.528767 Yes \n", "858 batter -12.895466 \n", "859 excellent -10.147964 \n", "860 , -3.454960 Yes \n", "861 and -4.113108 Yes \n", "862 for -4.880109 Yes \n", "863 just -5.630868 Yes \n", "864 $ -7.450107 \n", "865 8 -8.940966 Yes \n", "866 . -3.067898 Yes \n", "867 This -6.783917 Yes \n", "868 may -7.678495 Yes \n", "869 have -5.156485 Yes \n", "870 been -6.670917 Yes \n", "871 the -3.528767 Yes \n", "872 best -7.492557 \n", "873 fish -10.166230 \n", "874 sandwich -11.186007 \n", "875 I -3.791565 Yes \n", "876 've -6.593011 \n", "877 yet -8.229137 Yes \n", "878 to -3.856022 Yes \n", "879 have -5.156485 Yes \n", "880 in -4.619072 Yes \n", "881 da -10.829142 \n", "882 burgh -16.942732 \n", "883 . -3.067898 Yes \n", "884 \\n -6.050651 Yes \n", "\n", " out of vocab.? \n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "5 \n", "6 \n", "7 \n", "8 \n", "9 \n", "10 \n", "11 \n", "12 \n", "13 \n", "14 \n", "15 \n", "16 \n", "17 \n", "18 \n", "19 \n", "20 \n", "21 \n", "22 \n", "23 \n", "24 \n", "25 \n", "26 \n", "27 Yes \n", "28 \n", "29 \n", ".. ... \n", "855 \n", "856 \n", "857 \n", "858 \n", "859 \n", "860 \n", "861 \n", "862 \n", "863 \n", "864 \n", "865 \n", "866 \n", "867 \n", "868 \n", "869 \n", "870 \n", "871 \n", "872 \n", "873 \n", "874 \n", "875 \n", "876 \n", "877 \n", "878 \n", "879 \n", "880 \n", "881 \n", "882 \n", "883 \n", "884 \n", "\n", "[885 rows x 7 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "token_attributes = [(token.orth_,\n", " token.prob,\n", " token.is_stop,\n", " token.is_punct,\n", " token.is_space,\n", " token.like_num,\n", " token.is_oov)\n", " for token in parsed_review]\n", "\n", "df = pd.DataFrame(token_attributes,\n", " columns=['text',\n", " 'log_probability',\n", " 'stop?',\n", " 'punctuation?',\n", " 'whitespace?',\n", " 'number?',\n", " 'out of vocab.?'])\n", "\n", "df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']\n", " .applymap(lambda x: u'Yes' if x else u''))\n", " \n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the text you'd like to process is general-purpose English language text (i.e., not domain-specific, like medical literature), spaCy is ready to use out-of-the-box.\n", "\n", "I think it will eventually become a core part of the Python data science ecosystem — it will do for natural language computing what other great libraries have done for numerical computing." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Phrase Modeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Phrase modeling_ is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our reviews and looking for words that _co-occur_ (i.e., appear one after another) together much more frequently than you would expect them to by random chance. The formula our phrase models will use to determine whether two tokens $A$ and $B$ constitute a phrase is:\n", "\n", "$$\\frac{count(A\\ B) - count_{min}}{count(A) * count(B)} * N > threshold$$\n", "\n", "...where:\n", "* $count(A)$ is the number of times token $A$ appears in the corpus\n", "* $count(B)$ is the number of times token $B$ appears in the corpus\n", "* $count(A\\ B)$ is the number of times the tokens $A\\ B$ appear in the corpus *in order*\n", "* $N$ is the total size of the corpus vocabulary\n", "* $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times\n", "* $threshold$ is a user-defined parameter to control how strong a relationship between two tokens the model requires before accepting them as a phrase\n", "\n", "Once our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge them into a single new token.\n", "\n", "Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so _new york_ would become *new\\_york*). But you would also expect multi-word expressions that represent common concepts but aren't specifically named entities (such as _happy hour_) to become phrases in the model.\n", "\n", "We turn to the indispensable [**gensim**](https://radimrehurek.com/gensim/index.html) library to help us with phrase modeling — the [**Phrases**](https://radimrehurek.com/gensim/models/phrases.html) class in particular."
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from gensim.models import Phrases\n", "from gensim.models.word2vec import LineSentence"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:\n", "\n", "1. Segment text of complete reviews into sentences & normalize text\n", "1. First-order phrase modeling $\\rightarrow$ _apply first-order phrase model to transform sentences_\n", "1. Second-order phrase modeling $\\rightarrow$ _apply second-order phrase model to transform sentences_\n", "1. Apply text normalization and second-order phrase model to text of complete reviews\n", "\n", "We'll use this transformed data as the input for some higher-level modeling approaches in the following sections."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's define a few helper functions that we'll use for text normalization. In particular, the `lemmatized_sentence_corpus` generator function will use spaCy to:\n", "- Iterate over the 1M reviews in the `review_text_all.txt` file we created before\n", "- Segment the reviews into individual sentences\n", "- Remove punctuation and excess whitespace\n", "- Lemmatize the text\n", "\n", "... and do so efficiently in parallel, thanks to spaCy's `nlp.pipe()` function."
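] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before defining those helpers, here's a toy illustration of the phrase-scoring formula introduced above, using made-up counts rather than numbers from our corpus. The `count_min` and `threshold` values shown are gensim's defaults for the `Phrases` class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# toy illustration of the phrase-scoring formula above -- all counts are\n", "# made up for the sake of the example, not taken from the Yelp corpus\n", "count_ab = 1000       # times the bigram 'happy hour' appears, in order\n", "count_a = 2000        # times 'happy' appears anywhere\n", "count_b = 3000        # times 'hour' appears anywhere\n", "count_min = 5         # gensim's default min_count\n", "threshold = 10.0      # gensim's default threshold\n", "vocab_size = 100000   # assumed size of the corpus vocabulary (N)\n", "\n", "score = (count_ab - count_min) / float(count_a * count_b) * vocab_size\n", "\n", "print u'score = {:.2f}'.format(score)\n", "print u'accepted as a phrase?', score > threshold"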
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def punct_space(token):\n", " \"\"\"\n", " helper function to eliminate tokens\n", " that are pure punctuation or whitespace\n", " \"\"\"\n", " \n", " return token.is_punct or token.is_space\n", "\n", "def line_review(filename):\n", " \"\"\"\n", " generator function to read in reviews from the file\n", " and un-escape the original line breaks in the text\n", " \"\"\"\n", " \n", " with codecs.open(filename, encoding='utf_8') as f:\n", " for review in f:\n", " yield review.replace('\\\\n', '\\n')\n", " \n", "def lemmatized_sentence_corpus(filename):\n", " \"\"\"\n", " generator function to use spaCy to parse reviews,\n", " lemmatize the text, and yield sentences\n", " \"\"\"\n", " \n", " for parsed_review in nlp.pipe(line_review(filename),\n", " batch_size=10000, n_threads=4):\n", " \n", " for sent in parsed_review.sents:\n", " yield u' '.join([token.lemma_ for token in sent\n", " if not punct_space(token)])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "unigram_sentences_filepath = os.path.join(intermediate_directory,\n", " 'unigram_sentences_all.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use the `lemmatized_sentence_corpus` generator to loop over the original review text, segmenting the reviews into individual sentences and normalizing the text. We'll write this data back out to a new file (`unigram_sentences_all`), with one normalized sentence per line. We'll use this data for learning our phrase models." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 11 µs, sys: 12 µs, total: 23 µs\n", "Wall time: 36 µs\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to execute data prep yourself.\n", "if 0 == 1:\n", "\n", " with codecs.open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:\n", " for sentence in lemmatized_sentence_corpus(review_txt_filepath):\n", " f.write(sentence + '\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If your data is organized like our `unigram_sentences_all` file now is — a large text file with one document/sentence per line — gensim's [**LineSentence**](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence) class provides a convenient iterator for working with other gensim components. It *streams* the documents/sentences from disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "unigram_sentences = LineSentence(unigram_sentences_filepath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at a few sample sentences in our new, transformed file." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "no it be not the best food in the world but the service greatly help the perception and it do not taste bad\n", "\n", "so back in the late 90 there use to be this super kick as cinnamon ice cream like an apple pie ice cream without the apple or the pie crust\n", "\n", "so delicious\n", "\n", "however now there be some shit tastic replacement that taste like vanilla ice cream with last year 's red hot in the middle totally gross\n", "\n", "fortunately our server be nice enough to warn me about the change and bring me a sample so i only have to suffer the death of a childhood memory rather than also have to pay for it\n", "\n", "the portion be big and fill just do not come for the ice cream\n", "\n", "i have pretty much be eat at various king pretty regularly since i be a child when my parent would take my sister and i into the fox chapel location often\n", "\n", "lately me and my girl have be visit the heidelburg location\n", "\n", "i love the food it really taste homemade much like something a grandmother would make complete with gob of butter and side dish\n", "\n", "price be low selection be great but do not expect fine dining by any mean\n", "\n" ] } ], "source": [ "for unigram_sentence in it.islice(unigram_sentences, 230, 240):\n", " print u' '.join(unigram_sentence)\n", " print u''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll learn a phrase model that will link individual words into two-word phrases. We'd expect words that together represent a specific concept, like \"`ice cream`\", to be linked together to form a new, single token: \"`ice_cream`\"." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5.91 s, sys: 3.14 s, total: 9.05 s\n", "Wall time: 11 s\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to execute modeling yourself.\n", "if 0 == 1:\n", "\n", " bigram_model = Phrases(unigram_sentences)\n", "\n", " bigram_model.save(bigram_model_filepath)\n", " \n", "# load the finished model from disk\n", "bigram_model = Phrases.load(bigram_model_filepath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a trained phrase model for word pairs, let's apply it to the review sentences data and explore the results." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "bigram_sentences_filepath = os.path.join(intermediate_directory,\n", " 'bigram_sentences_all.txt')" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs\n", "Wall time: 8.11 µs\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to execute data prep yourself.\n", "if 0 == 1:\n", "\n", " with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:\n", " \n", " for unigram_sentence in unigram_sentences:\n", " \n", " bigram_sentence = u' '.join(bigram_model[unigram_sentence])\n", " \n", " f.write(bigram_sentence + '\\n')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "bigram_sentences = LineSentence(bigram_sentences_filepath)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "no it be not the best food in the world but the service greatly help the perception and it do not taste bad\n", "\n", "so back in the late 90 there use to be this super kick as cinnamon ice_cream like an apple_pie ice_cream without the apple or the pie crust\n", "\n", "so delicious\n", "\n", "however now there be some shit tastic replacement that taste like vanilla_ice cream with last year 's red hot in the middle totally gross\n", "\n", "fortunately our server be nice enough to warn me about the change and bring me a sample so i only have to suffer the death of a childhood_memory rather_than also have to pay for it\n", "\n", "the portion be big and fill just do not come for the ice_cream\n", "\n", "i have pretty much be eat at various king pretty regularly since i be a child when my parent would take my sister and i into the fox_chapel location often\n", "\n", "lately me and my girl have be visit the heidelburg location\n", "\n", "i love the food it really taste homemade much like something a grandmother would make complete with gob of butter and side dish\n", "\n", "price be low selection be great but do not expect fine_dining by any mean\n", "\n" ] } ], "source": [ "for bigram_sentence in it.islice(bigram_sentences, 230, 240):\n", " print u' '.join(bigram_sentence)\n", " print u''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like the phrase modeling worked! We now see two-word phrases, such as \"`ice_cream`\" and \"`apple_pie`\", linked together in the text as a single token. Next, we'll train a _second-order_ phrase model. We'll apply the second-order phrase model on top of the already-transformed data, so that incomplete word combinations like \"`vanilla_ice cream`\" will become fully joined to \"`vanilla_ice_cream`\". No disrespect intended to [Vanilla Ice](https://www.youtube.com/watch?v=rog8ou-ZepE), of course." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": true }, "outputs": [], "source": [ "trigram_model_filepath = os.path.join(intermediate_directory,\n", " 'trigram_model_all')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.85 s, sys: 3.17 s, total: 8.02 s\n", "Wall time: 9.58 s\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to execute modeling yourself.\n", "if 0 == 1:\n", "\n", " trigram_model = Phrases(bigram_sentences)\n", "\n", " trigram_model.save(trigram_model_filepath)\n", " \n", "# load the finished model from disk\n", "trigram_model = Phrases.load(trigram_model_filepath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll apply our trained second-order phrase model to our first-order transformed sentences, write the results out to a new file, and explore a few of the second-order transformed sentences." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "trigram_sentences_filepath = os.path.join(intermediate_directory,\n", " 'trigram_sentences_all.txt')" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 8 µs, sys: 4 µs, total: 12 µs\n", "Wall time: 21.9 µs\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to execute data prep yourself.\n", "if 0 == 1:\n", "\n", " with codecs.open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:\n", " \n", " for bigram_sentence in bigram_sentences:\n", " \n", " trigram_sentence = u' '.join(trigram_model[bigram_sentence])\n", " \n", " f.write(trigram_sentence + '\\n')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "trigram_sentences = LineSentence(trigram_sentences_filepath)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "no it be not the best food in the world but the service greatly help the perception and it do not taste bad\n", "\n", "so back in the late 90 there use to be this super kick as cinnamon_ice_cream like an apple_pie ice_cream without the apple or the pie crust\n", "\n", "so delicious\n", "\n", "however now there be some shit tastic replacement that taste like vanilla_ice_cream with last year 's red hot in the middle totally gross\n", "\n", "fortunately our server be nice enough to warn me about the change and bring me a sample so i only have to suffer the death of a childhood_memory rather_than also have to pay for it\n", "\n", "the portion be big and fill just do not come for the ice_cream\n", "\n", "i have pretty much be eat at various king pretty regularly since i be a child when my parent would take my sister and i into the fox_chapel location often\n", "\n", "lately me and my girl have be visit the heidelburg location\n", "\n", "i love the food it really taste homemade much like something a grandmother would make complete with gob of butter and side dish\n", "\n", "price be low selection be great but do not expect fine_dining by any mean\n", "\n" ] } ], "source": [ "for trigram_sentence in it.islice(trigram_sentences, 230, 240):\n", " print u' 
'.join(trigram_sentence)\n", " print u''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like the second-order phrase model was successful. We're now seeing three-word phrases, such as \"`vanilla_ice_cream`\" and \"`cinnamon_ice_cream`\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final step of our text preparation process circles back to the complete text of the reviews. We're going to run the complete text of the reviews through a pipeline that applies our text normalization and phrase models.\n", "\n", "In addition, we'll remove stopwords at this point. _Stopwords_ are very common words, like _a_, _the_, _and_, and so on, that serve functional roles in natural language, but typically don't contribute to the overall meaning of text. Filtering stopwords is a common procedure that allows higher-level NLP modeling techniques to focus on the words that carry more semantic weight.\n", "\n", "Finally, we'll write the transformed text out to a new file, with one review per line." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "trigram_reviews_filepath = os.path.join(intermediate_directory,\n", " 'trigram_transformed_reviews_all.txt')" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 5 µs, sys: 1e+03 ns, total: 6 µs\n", "Wall time: 11.9 µs\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to execute data prep yourself.\n", "if 0 == 1:\n", "\n", " with codecs.open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:\n", " \n", " for parsed_review in nlp.pipe(line_review(review_txt_filepath),\n", " batch_size=10000, n_threads=4):\n", " \n", " # lemmatize the text, removing punctuation and whitespace\n", " unigram_review = [token.lemma_ for token in parsed_review\n", " if not punct_space(token)]\n", " \n", " # apply the first-order and second-order phrase models\n", " bigram_review = bigram_model[unigram_review]\n", " trigram_review = trigram_model[bigram_review]\n", " \n", " # remove any remaining stopwords\n", " trigram_review = [term for term in trigram_review\n", " if term not in spacy.en.STOPWORDS]\n", " \n", " # write the transformed review as a line in the new file\n", " trigram_review = u' '.join(trigram_review)\n", " f.write(trigram_review + '\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's preview the results. We'll grab one review from the file with the original, untransformed text, grab the same review from the file with the normalized and transformed text, and compare the two." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original:\n", "\n", "A great townie bar with tasty food and an interesting clientele. I went to check this place out on the way home from the airport one Friday night and it didn't disappoint. It is refreshing to walk into a townie bar and not feel like the music stops and everyone in the place is staring at you - I'm guessing the mixed crowd of older hockey fans, young men in collared shirts, and thirtysomethings have probably seen it all during their time at this place. 
\n", "\n", "The staff was top notch - the orders were somewhat overwhelming as they appeared short-staffed for the night, but my waitress tried to keep a positive attitude for my entire visit. The other waiter was wearing a hooded cardigan, and I wanted to steal it from him due to my difficulty in finding such a quality article of clothing.\n", "\n", "We ordered a white pizza - large in size, engulfed in cheese, full of garlic flavor, flavorful hot sausage. An overall delicious pizza, aside from 2 things: 1, way too much grease (I know this comes with the territory, but still, it is sometimes unbearable); 2, CANNED MUSHROOMS - the worst thing to come out of a can. Ever. I would rather eat canned Alpo than canned mushrooms. And if the mushrooms weren't canned, they were just the worst mushrooms I've ever consumed. The mushroom debacle is enough to lower the review by an entire star - disgusting!\n", "\n", "My advice for the place is keep everything awesome - random music from the jukebox, tasty food, great prices, good crowd and staff - and get some decent mushrooms; why they spoil an otherwise above average pie with such inferior crap, I'll never know.\n", "\n", "----\n", "\n", "Transformed:\n", "\n", "great townie bar tasty food interesting clientele check place way home airport friday_night disappoint refresh walk townie bar feel like music stop place star guess mixed crowd old hockey_fan young_man collared_shirt thirtysomethings probably time place staff top_notch order somewhat overwhelming appear short staff night waitress try positive_attitude entire visit waiter wear hooded cardigan want steal difficulty quality article clothing order white pizza large size engulf cheese garlic flavor flavorful hot sausage overall delicious pizza aside_from 2 thing 1 way grease know come territory unbearable 2 canned mushrooms bad thing come eat alpo canned_mushroom mushroom bad mushroom consume mushroom debacle lower review entire star disgusting advice place awesome random music jukebox tasty food great price good crowd staff decent mushroom spoil above_average pie inferior crap know\n", "\n" ] } ], "source": [ "print u'Original:' + u'\\n'\n", "\n", "for review in it.islice(line_review(review_txt_filepath), 11, 12):\n", " print review\n", "\n", "print u'----' + u'\\n'\n", "print u'Transformed:' + u'\\n'\n", "\n", "with codecs.open(trigram_reviews_filepath, encoding='utf_8') as f:\n", " for review in it.islice(f, 11, 12):\n", " print review" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that most of the grammatical structure has been scrubbed from the text — capitalization, articles/conjunctions, punctuation, spacing, etc. However, much of the general semantic *meaning* is still present. Also, multi-word concepts such as \"`friday_night`\" and \"`above_average`\" have been joined into single tokens, as expected. The review text is now ready for higher-level modeling. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Topic Modeling with Latent Dirichlet Allocation (_LDA_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Topic modeling* is family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent \"topics\". 
For this demo, we'll be using [*Latent Dirichlet Allocation*](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) or LDA, a popular approach to topic modeling.\n", "\n", "In many conventional NLP applications, documents are represented as a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a *vector* of token counts. There are two layers in this model — documents and tokens — and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:\n", "* Document vectors tend to be large (one dimension for each token $\\Rightarrow$ lots of dimensions)\n", "* They also tend to be very sparse. Any given document only contains a small fraction of all tokens in the vocabulary, so most values in the document's token vector are 0.\n", "* The dimensions are fully independent from each other — there's no sense of connection between related tokens, such as _knife_ and _fork_.\n", "\n", "LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of *topics*, and the *topics* are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow [*Dirichlet*](https://en.wikipedia.org/wiki/Dirichlet_distribution) probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of tokens." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![LDA](https://s3.amazonaws.com/skipgram-images/LDA.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LDA is fully unsupervised. The topics are \"discovered\" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.\n", "\n", "We'll again turn to gensim to assist with data preparation and modeling. In particular, gensim offers a high-performance parallelized implementation of LDA with its [**LdaMulticore**](https://radimrehurek.com/gensim/models/ldamulticore.html) class." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim.corpora import Dictionary, MmCorpus\n", "from gensim.models.ldamulticore import LdaMulticore\n", "\n", "import pyLDAvis\n", "import pyLDAvis.gensim\n", "import warnings\n", "import cPickle as pickle" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's [**Dictionary**](https://radimrehurek.com/gensim/corpora/dictionary.html) class for this."
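] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the Dictionary's job concrete, here's a toy sketch on two invented 'documents'. The tokens are made up for illustration and have nothing to do with the Yelp corpus." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# two tiny, invented 'documents', each a list of tokens\n", "toy_docs = [[u'pizza', u'great', u'crust'],\n", "            [u'pizza', u'soggy', u'crust', u'crust']]\n", "\n", "# the Dictionary assigns a unique integer id to every distinct token\n", "toy_dictionary = Dictionary(toy_docs)\n", "print toy_dictionary.token2id\n", "\n", "# doc2bow converts a token list into sparse (token id, count) pairs\n", "print toy_dictionary.doc2bow(toy_docs[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the real dictionary, learned over all of the transformed review text."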
] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "trigram_dictionary_filepath = os.path.join(intermediate_directory,\n", " 'trigram_dict_all.dict')" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 50.8 ms, sys: 10.7 ms, total: 61.5 ms\n", "Wall time: 65.5 ms\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to learn the dictionary yourself.\n", "if 0 == 1:\n", "\n", " trigram_reviews = LineSentence(trigram_reviews_filepath)\n", "\n", " # learn the dictionary by iterating over all of the reviews\n", " trigram_dictionary = Dictionary(trigram_reviews)\n", " \n", " # filter tokens that are very rare or too common from\n", " # the dictionary (filter_extremes) and reassign integer ids (compactify)\n", " trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)\n", " trigram_dictionary.compactify()\n", "\n", " trigram_dictionary.save(trigram_dictionary_filepath)\n", " \n", "# load the finished dictionary from disk\n", "trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Like many NLP techniques, LDA uses a simplifying assumption known as the [*bag-of-words* model](https://en.wikipedia.org/wiki/Bag-of-words_model). In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded. \n", "\n", "Using the gensim Dictionary we learned to generate a bag-of-words representation for each review. The `trigram_bow_generator` function implements this. We'll save the resulting bag-of-words reviews as a matrix.\n", "\n", "In the following code, \"bag-of-words\" is abbreviated as `bow`." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "trigram_bow_filepath = os.path.join(intermediate_directory,\n", " 'trigram_bow_corpus_all.mm')" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def trigram_bow_generator(filepath):\n", " \"\"\"\n", " generator function to read reviews from a file\n", " and yield a bag-of-words representation\n", " \"\"\"\n", " \n", " for review in LineSentence(filepath):\n", " yield trigram_dictionary.doc2bow(review)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 143 ms, sys: 25.7 ms, total: 169 ms\n", "Wall time: 172 ms\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to build the bag-of-words corpus yourself.\n", "if 0 == 1:\n", "\n", " # generate bag-of-words representations for\n", " # all reviews and save them as a matrix\n", " MmCorpus.serialize(trigram_bow_filepath,\n", " trigram_bow_generator(trigram_reviews_filepath))\n", " \n", "# load the finished bag-of-words corpus from disk\n", "trigram_bow_corpus = MmCorpus(trigram_bow_filepath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the bag-of-words corpus, we're finally ready to learn our topic model from the reviews. 
We simply need to pass the bag-of-words matrix and Dictionary from our previous steps to `LdaMulticore` as inputs, along with the number of topics the model should learn. For this demo, we're asking for 50 topics." ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 130 ms, sys: 182 ms, total: 312 ms\n", "Wall time: 337 ms\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to train the LDA model yourself.\n", "if 0 == 1:\n", "\n", " with warnings.catch_warnings():\n", " warnings.simplefilter('ignore')\n", " \n", " # workers => sets the parallelism, and should be\n", " # set to your number of physical cores minus one\n", " lda = LdaMulticore(trigram_bow_corpus,\n", " num_topics=50,\n", " id2word=trigram_dictionary,\n", " workers=3)\n", " \n", " lda.save(lda_model_filepath)\n", " \n", "# load the finished LDA model from disk\n", "lda = LdaMulticore.load(lda_model_filepath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our topic model is now trained and ready to use! Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data." ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def explore_topic(topic_number, topn=25):\n", " \"\"\"\n", " accept a user-supplied topic number and\n", " print out a formatted list of the top terms\n", " \"\"\"\n", " \n", " print u'{:20} {}'.format(u'term', u'frequency') + u'\\n'\n", "\n", " for term, frequency in lda.show_topic(topic_number, topn=25):\n", " print u'{:20} {:.3f}'.format(term, round(frequency, 3))" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "term frequency\n", "\n", "taco 0.053\n", "salsa 0.029\n", "chip 0.027\n", "mexican 0.027\n", "burrito 0.020\n", "order 0.016\n", "like 0.013\n", "try 0.012\n", "margarita 0.011\n", "guacamole 0.010\n", "come 0.009\n", "fresh 0.009\n", "bean 0.009\n", "cheese 0.008\n", "rice 0.008\n", "chicken 0.008\n", "meat 0.008\n", "tortilla 0.007\n", "flavor 0.007\n", "nacho 0.007\n", "' 0.007\n", "fish_taco 0.007\n", "chipotle 0.006\n", "little 0.006\n", "sauce 0.006\n" ] } ], "source": [ "explore_topic(topic_number=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first topic has strong associations with words like *taco*, *salsa*, *chip*, *burrito*, and *margarita*, as well as a handful of more general words. You might call this the **Mexican food** topic!\n", "\n", "It's possible to go through and inspect each topic in the same way, and try to assign a human-interpretable label that captures the essence of each one. I've given it a shot for all 50 topics below." 
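] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you're doing this labeling yourself, gensim's `show_topics` method gives a more compact survey than calling `explore_topic` fifty times. A quick sketch:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# sketch: a compact, formatted summary of a handful of topics at once\n", "for topic_summary in lda.show_topics(num_topics=5, num_words=10):\n", "    print topic_summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the topic names I settled on."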
] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": true }, "outputs": [], "source": [ "topic_names = {0: u'mexican',\n", " 1: u'menu',\n", " 2: u'thai',\n", " 3: u'steak',\n", " 4: u'donuts & appetizers',\n", " 5: u'specials',\n", " 6: u'soup',\n", " 7: u'wings, sports bar',\n", " 8: u'foreign language',\n", " 9: u'las vegas',\n", " 10: u'chicken',\n", " 11: u'aria buffet',\n", " 12: u'noodles',\n", " 13: u'ambience & seating',\n", " 14: u'sushi',\n", " 15: u'arizona',\n", " 16: u'family',\n", " 17: u'price',\n", " 18: u'sweet',\n", " 19: u'waiting',\n", " 20: u'general',\n", " 21: u'tapas',\n", " 22: u'dirty',\n", " 23: u'customer service',\n", " 24: u'restrooms',\n", " 25: u'chinese',\n", " 26: u'gluten free',\n", " 27: u'pizza',\n", " 28: u'seafood',\n", " 29: u'amazing',\n", " 30: u'eat, like, know, want',\n", " 31: u'bars',\n", " 32: u'breakfast',\n", " 33: u'location & time',\n", " 34: u'italian',\n", " 35: u'barbecue',\n", " 36: u'arizona',\n", " 37: u'indian',\n", " 38: u'latin & cajun',\n", " 39: u'burger & fries',\n", " 40: u'vegetarian',\n", " 41: u'lunch buffet',\n", " 42: u'customer service',\n", " 43: u'taco, ice cream',\n", " 44: u'high cuisine',\n", " 45: u'healthy',\n", " 46: u'salad & sandwich',\n", " 47: u'greek',\n", " 48: u'poor experience',\n", " 49: u'wine & dine'}" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [], "source": [ "topic_names_filepath = os.path.join(intermediate_directory, 'topic_names.pkl')\n", "\n", "with open(topic_names_filepath, 'w') as f:\n", " pickle.dump(topic_names, f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see that, along with **mexican**, there are a variety of topics related to different styles of food, such as **thai**, **steak**, **sushi**, **pizza**, and so on. In addition, there are topics that are more related to the overall restaurant *experience*, like **ambience & seating**, **good service**, **waiting**, and **price**.\n", "\n", "Beyond these two categories, there are still some topics that are difficult to apply a meaningful human interpretation to, such as topic 30 and 43.\n", "\n", "Manually reviewing the top terms for each topic is a helpful exercise, but to get a deeper understanding of the topics and how they relate to each other, we need to visualize the data — preferably in an interactive format. Fortunately, we have the fantastic [**pyLDAvis**](https://pyldavis.readthedocs.io/en/latest/readme.html) library to help with that!\n", "\n", "pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization." 
] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": true }, "outputs": [], "source": [ "LDAvis_data_filepath = os.path.join(intermediate_directory, 'ldavis_prepared')" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 442 ms, sys: 28.4 ms, total: 471 ms\n", "Wall time: 526 ms\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to execute data prep yourself.\n", "if 0 == 1:\n", "\n", " LDAvis_prepared = pyLDAvis.gensim.prepare(lda, trigram_bow_corpus,\n", " trigram_dictionary)\n", "\n", " with open(LDAvis_data_filepath, 'w') as f:\n", " pickle.dump(LDAvis_prepared, f)\n", " \n", "# load the pre-prepared pyLDAvis data from disk\n", "with open(LDAvis_data_filepath) as f:\n", " LDAvis_prepared = pickle.load(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`pyLDAvis.display(...)` displays the topic model visualization in-line in the notebook." ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyLDAvis.display(LDAvis_prepared)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wait, what am I looking at again?\n", "There are a lot of moving parts in the visualization. Here's a brief summary:\n", "\n", "* On the left, there is a plot of the \"distance\" between all of the topics (labeled as the _Intertopic Distance Map_)\n", " * The plot is rendered in two dimensions according a [*multidimensional scaling (MDS)*](https://en.wikipedia.org/wiki/Multidimensional_scaling) algorithm. Topics that are generally similar should be appear close together on the plot, while *dis*similar topics should appear far apart.\n", " * The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.\n", " * An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the \"selected topic\" box in the upper-left.\n", "* On the right, there is a bar chart showing top terms.\n", " * When no topic is selected in the plot on the left, the bar chart shows the top-30 most \"salient\" terms in the corpus. A term's *saliency* is a measure of both how frequent the term is in the corpus and how \"distinctive\" it is in distinguishing between different topics.\n", " * When a particular topic is selected, the bar chart changes to show the top-30 most \"relevant\" terms for the selected topic. The relevance metric is controlled by the parameter $\\lambda$, which can be adjusted with a slider above the bar chart.\n", " * Setting the $\\lambda$ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.\n", " * Setting $\\lambda$ close to 0.0 will rank the terms solely according to their \"distinctiveness\" or \"exclusivity\" within the topic — i.e., terms that occur *only* in this topic, and do not occur in other topics.\n", " * Setting $\\lambda$ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.\n", "* Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.\n", "\n", "A more detailed explanation of the pyLDAvis visualization can be found [here](https://cran.r-project.org/web/packages/LDAvis/vignettes/details.pdf). Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's `LdaMulticore` object and pyLDAvis' visualization, you have to dig through the terms manually." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analyzing our LDA model\n", "The interactive visualization pyLDAvis produces is helpful for both:\n", "1. Better understanding and interpreting individual topics, and\n", "1. Better understanding the relationships between the topics.\n", "\n", "For (1), you can manually select each topic to view its top most freqeuent and/or \"relevant\" terms, using different values of the $\\lambda$ parameter. 
This can help when you're trying to assign a human interpretable name or \"meaning\" to each topic.\n", "\n", "For (2), exploring the _Intertopic Distance Plot_ can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.\n", "\n", "In our plot, there is a stark divide along the x-axis, with two topics far to the left and most of the remaining 48 far to the right. Inspecting the two outlier topics provides a plausible explanation: both topics contain many non-English words, while most of the rest of the topics are in English. So, one of the main attributes that distinguish the reviews in the dataset from one another is their language.\n", "\n", "This finding isn't entirely a surprise. In addition to English-speaking cities, the Yelp dataset includes reviews of businesses in Montreal and Karlsruhe, Germany, often written in French and German, respectively. Multiple languages isn't a problem for our demo, but for a real NLP application, you might need to ensure that the text you're processing is written in English (or is at least tagged for language) before passing it along to some downstream processing. If that were the case, the divide along the x-axis in the topic plot would immediately alert you to a potential data quality issue.\n", "\n", "The y-axis separates two large groups of topics — let's call them \"super-topics\" — one in the upper-right quadrant and the other in the lower-right quadrant. These super-topics correlate reasonably well with the pattern we'd noticed while naming the topics:\n", "* The super-topic in the *lower*-right tends to be about *food*. It groups together the **burger & fries**, **breakfast**, **sushi**, **barbecue**, and **greek** topics, among others.\n", "* The super-topic in the *upper*-right tends to be about other elements of the *restaurant experience*. It groups together the **ambience & seating**, **location & time**, **family**, and **customer service** topics, among others.\n", "\n", "So, in addition to the 50 direct topics the model has learned, our analysis suggests a higher-level pattern in the data. Restaurant reviewers in the Yelp dataset talk about two main things in their reviews, in general: (1) the food, and (2) their overall restaurant experience. For this dataset, this is a very intuitive result, and we probably didn't need a sophisticated modeling technique to tell it to us. When working with datasets from other domains, though, such high-level patterns may be much less obvious from the outset — and that's where topic modeling can help." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Describing text with LDA\n", "Beyond data exploration, one of the key uses for an LDA model is providing a compact, quantitative description of natural language text. Once an LDA model has been trained, it can be used to represent free text as a mixture of the topics the model learned from the original corpus. This mixture can be interpreted as a probability distribution across the topics, so the LDA representation of a paragraph of text might look like 50% _Topic A_, 20% _Topic B_, 20% _Topic C_, and 10% _Topic D_.\n", "\n", "To use an LDA model to generate a vector representation of new text, you'll need to apply any text preprocessing steps you used on the model's training corpus to the new text, too. For our model, the preprocessing steps we used include:\n", "1. Using spaCy to remove punctuation and lemmatize the text\n", "1. 
Applying our first-order phrase model to join word pairs\n", "1. Applying our second-order phrase model to join longer phrases\n", "1. Removing stopwords\n", "1. Creating a bag-of-words representation\n", "\n", "Once you've applied these preprocessing steps to the new text, it's ready to pass directly to the model to create an LDA representation. The `lda_description(...)` function will perform all these steps for us, including printing the resulting topical description of the input text." ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_sample_review(review_number):\n", " \"\"\"\n", " retrieve a particular review index\n", " from the reviews file and return it\n", " \"\"\"\n", " \n", " return list(it.islice(line_review(review_txt_filepath),\n", " review_number, review_number+1))[0]" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def lda_description(review_text, min_topic_freq=0.05):\n", " \"\"\"\n", " accept the original text of a review and (1) parse it with spaCy,\n", " (2) apply text pre-proccessing steps, (3) create a bag-of-words\n", " representation, (4) create an LDA representation, and\n", " (5) print a sorted list of the top topics in the LDA representation\n", " \"\"\"\n", " \n", " # parse the review text with spaCy\n", " parsed_review = nlp(review_text)\n", " \n", " # lemmatize the text and remove punctuation and whitespace\n", " unigram_review = [token.lemma_ for token in parsed_review\n", " if not punct_space(token)]\n", " \n", " # apply the first-order and secord-order phrase models\n", " bigram_review = bigram_model[unigram_review]\n", " trigram_review = trigram_model[bigram_review]\n", " \n", " # remove any remaining stopwords\n", " trigram_review = [term for term in trigram_review\n", " if not term in spacy.en.STOPWORDS]\n", " \n", " # create a bag-of-words representation\n", " review_bow = trigram_dictionary.doc2bow(trigram_review)\n", " \n", " # create an LDA representation\n", " review_lda = lda[review_bow]\n", " \n", " # sort with the most highly related topics first\n", " review_lda = sorted(review_lda, key=lambda (topic_number, freq): -freq)\n", " \n", " for topic_number, freq in review_lda:\n", " if freq < min_topic_freq:\n", " break\n", " \n", " # print the most highly related topic names and frequencies\n", " print '{:25} {}'.format(topic_names[topic_number],\n", " round(freq, 3))" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best French toast ever!! Love the friendly atmosphere, and especially the breakfast. Never been disappointed. You have to try French toast with raisin bread too... yummy!\n", "\n" ] } ], "source": [ "sample_review = get_sample_review(50)\n", "print sample_review" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "breakfast 0.4\n", "amazing 0.341\n", "customer service 0.192\n" ] } ], "source": [ "lda_description(sample_review)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I went there for dinner last night with a client. This is second time I visited. I had a scotch and he had a Guiness. The (-1) is for drink selection. 
Just stock some better beers and higher end scotch and you're five stars.\n", "\n", "We started with the meatballs covered with Provolone and other blessed goodness. These did not disappoint. I had a four cheese pizza with sweet sausage and garlic. It was fantastic. They have so many good dishes, but I wanted the pizza last night. I couldn't finish the pizza - way to go big medium pizza.\n", "\n", "I finished up with a coffee. The parking can be a bit of a challenge on the street, but it's a small town atmosphere in Carnegie, PA. I love the downtown there.\n", "\n" ] } ], "source": [ "sample_review = get_sample_review(100)\n", "print sample_review" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pizza 0.28\n", "bars 0.193\n", "amazing 0.172\n", "wine & dine 0.149\n" ] } ], "source": [ "lda_description(sample_review)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word Vector Embedding with Word2Vec" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pop quiz! Can you complete this text snippet?\n", "\n", "

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![word2vec quiz](https://s3.amazonaws.com/skipgram-images/word2vec-1.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "


\n", "You just demonstrated the core machine learning concept behind word vector embedding models!\n", "


" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![word2vec quiz 2](https://s3.amazonaws.com/skipgram-images/word2vec-2.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of *word vector embedding models*, or *word vector models* for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the *meaning* or *concept* the term represents, and the relationship between it and other terms in the vocabulary. Word vector models are also fully unsupervised — they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.\n", "\n", "Perhaps the best-known word vector model is [word2vec](https://arxiv.org/pdf/1301.3781v3.pdf), originally proposed in 2013. The general idea of word2vec is, for a given *focus word*, to use the *context* of the word — i.e., the other words immediately before and after it — to provide hints about what the focus word might mean. To do this, word2vec uses a *sliding window* technique, where it considers snippets of text only a few tokens long at a time.\n", "\n", "At the start of the learning process, the model initializes random vectors for all terms in the corpus vocabulary. The model then slides the window across every snippet of text in the corpus, with each word taking turns as the focus word. Each time the model considers a new snippet, it tries to learn some information about the focus word based on the surrouding context, and it \"nudges\" the words' vector representations accordingly. One complete pass sliding the window across all of the corpus text is known as a training *epoch*. It's common to train a word2vec model for multiple passes/epochs over the corpus. Over time, the model rearranges the terms' vector representations such that terms that frequently appear in similar contexts have vector representations that are *close* to each other in vector space.\n", "\n", "For a deeper dive into word2vec's machine learning process, see [here](https://arxiv.org/pdf/1411.2738v4.pdf).\n", "\n", "Word2vec has a number of user-defined hyperparameters, including:\n", "- The dimensionality of the vectors. Typical choices include a few dozen to several hundred.\n", "- The width of the sliding window, in tokens. Five is a common default choice, but narrower and wider windows are possible.\n", "- The number of training epochs.\n", "\n", "For using word2vec in Python, [gensim](https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/) comes to the rescue again! It offers a [highly-optimized](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), [parallelized](https://rare-technologies.com/parallelizing-word2vec-in-python/) implementation of the word2vec algorithm with its [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) class." ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from gensim.models import Word2Vec\n", "\n", "trigram_sentences = LineSentence(trigram_sentences_filepath)\n", "word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors, and set up our training process to run for twelve epochs." 
] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "12 training epochs so far.\n", "CPU times: user 5.43 s, sys: 891 ms, total: 6.32 s\n", "Wall time: 7.12 s\n" ] } ], "source": [ "%%time\n", "\n", "# this is a bit time consuming - make the if statement True\n", "# if you want to train the word2vec model yourself.\n", "if 0 == 1:\n", "\n", " # initiate the model and perform the first epoch of training\n", " food2vec = Word2Vec(trigram_sentences, size=100, window=5,\n", " min_count=20, sg=1, workers=4)\n", " \n", " food2vec.save(word2vec_filepath)\n", "\n", " # perform another 11 epochs of training\n", " for i in range(1,12):\n", "\n", " food2vec.train(trigram_sentences)\n", " food2vec.save(word2vec_filepath)\n", " \n", "# load the finished model from disk\n", "food2vec = Word2Vec.load(word2vec_filepath)\n", "food2vec.init_sims()\n", "\n", "print u'{} training epochs so far.'.format(food2vec.train_count)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On my four-core machine, each epoch over all the text in the ~1 million Yelp reviews takes about 5-10 minutes." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "50,835 terms in the food2vec vocabulary.\n" ] } ], "source": [ "print u'{:,} terms in the food2vec vocabulary.'.format(len(food2vec.vocab))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a peek at the word vectors our model has learned. We'll create a pandas DataFrame with the terms as the row labels, and the 100 dimensions of the word vector model as the columns." ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
(HTML table omitted: the word_vectors DataFrame, 50835 rows × 100 columns; see the text/plain rendering below)
" ], "text/plain": [ " 0 1 2 3 4 \\\n", "the -0.035762 -0.173890 -0.035782 -0.007144 0.032371 \n", "be -0.074780 -0.049524 0.085974 -0.098892 0.141556 \n", "and -0.070505 -0.026918 0.028344 -0.099909 0.127974 \n", "i -0.161238 0.050831 -0.081706 -0.084479 0.053073 \n", "a -0.083491 -0.033712 -0.124125 -0.110776 -0.033046 \n", "to -0.012082 0.033135 -0.063183 -0.057252 -0.018721 \n", "it 0.025022 0.081581 0.127987 -0.188015 0.041450 \n", "have -0.140812 -0.070552 0.022102 0.001077 0.109890 \n", "of -0.036341 -0.054903 0.000644 -0.010602 0.168195 \n", "not -0.075276 0.109047 0.055135 0.052251 0.209437 \n", "for -0.102976 0.001369 -0.069402 -0.122936 0.028278 \n", "in -0.053390 -0.175599 -0.091688 -0.153791 0.003205 \n", "we -0.015929 -0.019187 -0.186680 -0.240963 0.077926 \n", "that -0.026609 0.085940 0.118164 0.011576 0.156952 \n", "but -0.063436 0.089140 -0.057425 -0.093110 0.066531 \n", "with 0.042318 -0.186670 -0.230563 0.076302 0.216593 \n", "my -0.002067 -0.023159 0.035879 0.036316 -0.110738 \n", "this 0.139739 0.184600 0.137359 -0.109916 0.021484 \n", "you -0.171262 -0.119866 0.063801 -0.087287 -0.061923 \n", "on 0.057481 0.044937 -0.063766 -0.007839 0.161119 \n", "they -0.235321 -0.026314 0.143165 -0.170460 0.042189 \n", "food -0.164133 0.007745 0.058311 -0.169839 -0.042278 \n", "do 0.091428 -0.132115 0.105080 0.135949 0.038100 \n", "good -0.239592 -0.232940 -0.005036 -0.028226 0.149816 \n", "place 0.025479 0.130311 0.119834 -0.096365 0.013793 \n", "so 0.021455 0.079794 0.192058 -0.093809 -0.094279 \n", "get 0.009313 -0.101684 -0.163864 -0.159002 0.018936 \n", "go 0.031094 -0.126839 -0.054429 -0.221885 -0.063464 \n", "at 0.102501 -0.095756 -0.216304 -0.107230 -0.112544 \n", "as 0.141091 0.073669 0.109637 -0.112564 -0.167600 \n", "... ... ... ... ... ... 
\n", "hard_boiled_egg 0.025154 0.060949 -0.064816 0.071975 0.087870 \n", "poached_quail_egg 0.060616 0.001519 -0.052908 0.013590 0.031732 \n", "egg_foo 0.022816 0.077288 -0.122898 -0.022765 -0.137979 \n", "sherri -0.089536 -0.030370 0.011311 -0.094766 -0.121970 \n", "hindrance -0.033325 0.011520 0.027370 0.240223 -0.004098 \n", "eggdrop_soup -0.028369 0.040039 -0.005378 -0.088736 0.015203 \n", "arbitrarily 0.203714 -0.047405 -0.045261 0.000302 -0.074105 \n", "faisant 0.046053 0.083873 0.057943 0.174203 -0.121259 \n", "marian -0.035330 0.146843 -0.173594 -0.010971 -0.150150 \n", "9:30p 0.119010 0.002159 -0.000148 -0.102635 -0.016918 \n", "extra_chasu -0.225889 -0.131615 0.046431 0.017999 0.119188 \n", "lavosh_wrap 0.134420 -0.032055 0.012240 0.024420 0.031334 \n", "dum_dum -0.002741 -0.137371 0.030704 -0.030365 -0.134645 \n", "triplet -0.010495 0.057432 -0.019535 -0.044881 0.042409 \n", "nantucket -0.124356 0.141918 -0.038579 0.035650 -0.157662 \n", "gurl -0.119794 0.025898 -0.070130 -0.027929 0.113244 \n", "nordstroms -0.030410 -0.026861 -0.016836 0.097363 -0.098189 \n", "eau_de -0.146450 0.000788 -0.094391 0.146833 -0.051392 \n", "extra_.5 -0.124972 -0.032596 -0.017800 0.106415 -0.086728 \n", "hazelnut_crunch 0.055723 -0.098708 0.013225 0.098075 -0.021967 \n", "bella_notte 0.209296 -0.102068 -0.059274 -0.061223 0.032078 \n", "homebrew 0.054390 -0.050279 -0.181006 0.001028 -0.064048 \n", "conveyor_belt_oven -0.099600 -0.127618 0.070998 -0.040632 -0.022066 \n", "meekly -0.087812 0.020389 0.040041 -0.046436 0.084847 \n", "foccaccia 0.118130 0.028629 -0.063642 0.006247 0.072471 \n", "clyde 0.025336 -0.044486 -0.081030 -0.049451 -0.215602 \n", "original_g_spicy 0.058115 -0.036521 -0.119183 -0.040159 0.163193 \n", "potaotes 0.009098 0.057424 -0.156997 -0.057388 0.030169 \n", "desert_botanical_gardens 0.073653 0.200249 -0.088580 -0.032873 -0.161853 \n", "mi_match 0.025482 0.122178 0.062693 0.150734 -0.028056 \n", "\n", " 5 6 7 8 9 \\\n", "the -0.065272 -0.219383 -0.064665 0.002739 0.025802 \n", "be 0.024878 -0.011119 -0.175374 0.005410 -0.110996 \n", "and -0.058155 -0.056091 -0.028973 0.197281 -0.040528 \n", "i -0.102327 -0.108607 -0.001920 -0.057367 -0.050715 \n", "a -0.089950 0.025416 -0.052321 -0.059281 0.074985 \n", "to -0.017931 -0.027784 0.112110 0.020549 -0.174336 \n", "it -0.126222 0.172725 -0.149931 -0.069566 -0.036031 \n", "have -0.061365 0.046450 0.003073 0.113845 -0.038957 \n", "of -0.058505 -0.052342 0.039159 -0.053572 -0.160039 \n", "not 0.084334 -0.122419 -0.193307 0.000699 -0.099067 \n", "for -0.074256 -0.013786 -0.147065 0.204125 -0.033473 \n", "in 0.013146 -0.013261 0.162506 -0.036985 -0.123813 \n", "we -0.122313 -0.183584 -0.038707 0.067121 -0.108626 \n", "that -0.061402 -0.068207 0.008184 -0.169472 0.051105 \n", "but -0.079715 -0.049745 -0.161346 0.097094 0.035439 \n", "with -0.056183 0.004471 -0.087819 0.073513 -0.219137 \n", "my -0.033034 -0.100291 -0.039403 0.109342 0.024952 \n", "this -0.018423 -0.027546 -0.055886 -0.137625 -0.058589 \n", "you 0.023105 -0.196524 -0.043654 -0.003327 -0.078496 \n", "on -0.047322 -0.024250 -0.038904 0.085989 0.036280 \n", "they -0.019444 -0.171945 -0.087666 0.005467 -0.034397 \n", "food 0.004095 0.203732 -0.021252 -0.084491 -0.016372 \n", "do 0.066993 -0.046825 -0.165575 -0.087334 0.068053 \n", "good -0.133312 -0.034164 -0.130310 -0.013757 0.008618 \n", "place 0.074431 -0.063780 0.063191 -0.004273 0.111458 \n", "so -0.147522 -0.066564 -0.073133 0.009708 0.050529 \n", "get -0.056202 -0.074619 -0.127081 0.182303 -0.001993 \n", "go 0.024554 
0.060154 -0.011108 -0.020744 0.038979 \n", "at -0.036979 -0.066605 -0.016080 0.046475 -0.128300 \n", "as -0.059139 -0.122552 -0.137383 0.093218 0.096284 \n", "... ... ... ... ... ... \n", "hard_boiled_egg -0.034552 0.046470 -0.074013 0.048614 -0.027098 \n", "poached_quail_egg -0.020164 0.067209 0.015895 0.034758 -0.247553 \n", "egg_foo -0.131516 -0.025750 0.071284 -0.130913 -0.041167 \n", "sherri 0.143681 -0.035314 -0.060650 0.005531 -0.034962 \n", "hindrance -0.047842 -0.008730 -0.086408 -0.155469 0.036887 \n", "eggdrop_soup -0.150734 -0.008286 0.052836 0.041498 0.061759 \n", "arbitrarily -0.011420 0.006737 0.068212 -0.010306 0.162682 \n", "faisant -0.043806 -0.069513 -0.037047 -0.026478 -0.119066 \n", "marian 0.082224 0.036275 -0.028033 -0.076082 0.051976 \n", "9:30p -0.075822 -0.016008 0.012505 -0.068003 -0.066301 \n", "extra_chasu -0.075226 -0.140749 -0.054829 0.210201 -0.098395 \n", "lavosh_wrap -0.002414 0.022550 -0.024545 0.054083 0.101219 \n", "dum_dum -0.036521 -0.019889 -0.191169 0.034061 0.156001 \n", "triplet -0.094355 0.111214 -0.141414 -0.102281 0.013674 \n", "nantucket 0.048110 -0.006915 0.049056 0.191926 0.001897 \n", "gurl 0.076868 0.084859 -0.000508 -0.008275 0.026478 \n", "nordstroms 0.080675 0.005855 0.065331 -0.047586 -0.083942 \n", "eau_de 0.026519 0.022764 -0.161101 -0.106951 0.018758 \n", "extra_.5 -0.039636 -0.048088 -0.016923 -0.079315 0.078559 \n", "hazelnut_crunch -0.046137 0.083640 0.011891 -0.034513 0.159220 \n", "bella_notte 0.042433 0.060573 -0.263855 0.026291 -0.082852 \n", "homebrew 0.029383 0.007203 -0.207681 -0.026785 0.004940 \n", "conveyor_belt_oven -0.021832 -0.066087 -0.130704 -0.057886 0.000509 \n", "meekly 0.022525 0.122033 -0.047317 0.005451 0.035133 \n", "foccaccia -0.086754 0.078307 -0.052333 -0.051110 0.034550 \n", "clyde -0.004157 0.048990 -0.149425 -0.003808 -0.043461 \n", "original_g_spicy 0.043903 0.047436 -0.050745 -0.071357 0.027666 \n", "potaotes -0.095243 0.110111 0.049034 0.202477 0.018903 \n", "desert_botanical_gardens 0.066677 0.162242 -0.057449 -0.014113 0.114148 \n", "mi_match 0.091268 -0.173603 0.027926 -0.232217 -0.054804 \n", "\n", " ... 90 91 92 93 \\\n", "the ... 0.050136 0.044030 0.145281 -0.020442 \n", "be ... -0.199047 -0.081284 -0.198344 0.007257 \n", "and ... -0.049051 -0.212434 -0.042576 0.055731 \n", "i ... 0.028528 -0.016578 -0.179229 0.053357 \n", "a ... -0.101939 0.022392 0.057049 0.015819 \n", "to ... -0.017111 -0.067532 -0.022149 0.154788 \n", "it ... 0.045720 0.094828 0.089329 0.051623 \n", "have ... -0.051071 -0.090922 -0.022011 0.157082 \n", "of ... 0.085908 -0.211464 -0.084990 0.082315 \n", "not ... -0.150619 -0.060446 0.181940 -0.118538 \n", "for ... 0.032123 0.013365 0.008156 -0.021331 \n", "in ... -0.101997 -0.025117 0.101147 0.002555 \n", "we ... 0.054799 -0.029601 -0.197221 -0.081994 \n", "that ... -0.088909 0.062827 -0.114507 0.007300 \n", "but ... -0.129163 -0.022460 -0.200731 0.079950 \n", "with ... 0.036514 0.135176 -0.056771 -0.020261 \n", "my ... -0.051417 0.220378 -0.106171 0.159718 \n", "this ... -0.035842 -0.075413 -0.068598 0.122231 \n", "you ... 0.094615 0.029243 0.020553 -0.101657 \n", "on ... 0.084352 -0.119525 0.076835 -0.010369 \n", "they ... -0.122975 -0.054745 0.022250 -0.068428 \n", "food ... -0.096869 0.060159 -0.133541 0.166804 \n", "do ... -0.101949 -0.037880 -0.187836 0.037602 \n", "good ... -0.034984 -0.135347 -0.112965 0.056312 \n", "place ... -0.076251 -0.076574 -0.086146 -0.023936 \n", "so ... -0.050627 -0.008651 -0.034267 0.045445 \n", "get ... 
-0.040294 -0.038149 -0.180993 -0.143341 \n", "go ... -0.074458 -0.172092 -0.123518 0.006400 \n", "at ... -0.058233 -0.046821 0.042406 0.178607 \n", "as ... -0.052156 -0.106116 -0.088926 -0.079129 \n", "... ... ... ... ... ... \n", "hard_boiled_egg ... 0.161784 -0.011800 0.036288 0.102444 \n", "poached_quail_egg ... 0.074396 0.051672 0.038299 -0.001248 \n", "egg_foo ... 0.044335 0.100112 0.124033 -0.012426 \n", "sherri ... -0.010114 0.059706 0.068094 0.158161 \n", "hindrance ... 0.028251 0.136807 0.079745 -0.107322 \n", "eggdrop_soup ... 0.058192 0.170658 -0.069061 -0.043249 \n", "arbitrarily ... -0.165282 -0.078750 -0.051527 0.129190 \n", "faisant ... -0.063223 0.018042 -0.107165 0.028304 \n", "marian ... 0.042506 0.160951 0.003764 0.168254 \n", "9:30p ... 0.027208 -0.110685 0.139527 -0.031720 \n", "extra_chasu ... 0.049070 -0.028876 -0.173717 0.074353 \n", "lavosh_wrap ... 0.092470 -0.109932 0.019652 -0.090741 \n", "dum_dum ... 0.072851 0.016341 -0.009848 0.038048 \n", "triplet ... 0.009309 -0.071423 0.029983 -0.079734 \n", "nantucket ... -0.076790 -0.067047 0.020261 0.088759 \n", "gurl ... 0.038612 0.010746 0.052930 0.236658 \n", "nordstroms ... -0.029618 -0.259797 0.078861 0.081747 \n", "eau_de ... -0.051030 -0.074207 -0.017950 -0.091572 \n", "extra_.5 ... -0.180088 0.053834 -0.146145 0.114461 \n", "hazelnut_crunch ... 0.098174 0.030037 -0.028975 0.043462 \n", "bella_notte ... -0.089840 -0.048790 -0.040089 -0.112727 \n", "homebrew ... -0.138264 -0.097649 0.139629 0.101149 \n", "conveyor_belt_oven ... 0.020469 -0.251385 0.096959 0.079115 \n", "meekly ... 0.059881 0.080178 0.009784 0.125705 \n", "foccaccia ... 0.045899 -0.123955 -0.082976 0.117231 \n", "clyde ... -0.116701 0.082913 -0.020653 -0.006104 \n", "original_g_spicy ... 0.018813 0.062281 -0.057308 0.022598 \n", "potaotes ... -0.080152 0.054386 0.005011 -0.024257 \n", "desert_botanical_gardens ... -0.128910 -0.128824 0.168945 0.011161 \n", "mi_match ... 
-0.034624 0.001389 0.182999 -0.010098 \n", "\n", " 94 95 96 97 98 \\\n", "the 0.128879 -0.076461 0.075532 -0.012841 0.024710 \n", "be 0.075339 0.070266 -0.008326 -0.127542 -0.046246 \n", "and 0.117097 -0.206737 0.055435 -0.065056 0.052316 \n", "i 0.070913 0.036893 -0.000544 -0.007254 -0.056005 \n", "a -0.001798 0.001103 0.003096 0.037175 -0.074279 \n", "to -0.093789 -0.020456 0.065478 0.075484 -0.053530 \n", "it -0.108989 -0.145476 0.068617 0.090687 -0.101725 \n", "have -0.082406 -0.010306 -0.063481 -0.098728 -0.064020 \n", "of 0.223018 -0.142501 0.280647 0.003435 -0.037710 \n", "not -0.002879 0.018827 0.084586 0.040437 0.070277 \n", "for 0.025385 0.105075 0.184737 0.087325 -0.230621 \n", "in -0.075434 -0.031021 0.170358 -0.070997 -0.143472 \n", "we 0.114129 0.127746 0.057743 -0.044793 -0.080014 \n", "that -0.075059 -0.202200 0.003658 0.042448 -0.091925 \n", "but -0.002590 -0.113734 0.048470 0.037333 0.111525 \n", "with 0.213735 -0.116074 0.162992 0.015298 -0.152731 \n", "my -0.036391 -0.025573 0.133651 -0.157615 0.010161 \n", "this -0.097841 0.114074 0.111075 0.174843 -0.018743 \n", "you 0.039655 0.059782 -0.073931 -0.002060 -0.068405 \n", "on 0.035561 0.055588 0.119598 0.306402 -0.095085 \n", "they -0.009932 -0.012489 0.102740 0.071282 -0.165166 \n", "food 0.084901 0.109261 0.137871 0.018093 -0.158754 \n", "do -0.094156 -0.040069 -0.013014 -0.013038 -0.033346 \n", "good 0.055106 -0.026181 -0.135510 0.087664 0.009934 \n", "place 0.136419 -0.001543 -0.084301 0.016356 -0.148379 \n", "so -0.104442 -0.012076 -0.118052 -0.015163 -0.006679 \n", "get 0.140279 0.181399 0.054530 -0.152596 0.028443 \n", "go -0.085149 0.157569 -0.048633 0.017931 0.111066 \n", "at 0.181424 -0.120113 0.029031 0.113648 -0.107441 \n", "as 0.072921 -0.009605 -0.001447 0.068642 -0.022845 \n", "... ... ... ... ... ... 
\n", "hard_boiled_egg -0.036660 -0.259496 0.093633 -0.055128 0.099477 \n", "poached_quail_egg 0.074726 -0.081252 -0.078778 0.008109 0.023891 \n", "egg_foo 0.037161 0.058809 -0.033203 -0.105034 0.207285 \n", "sherri -0.076122 0.115804 -0.133826 0.022022 -0.115043 \n", "hindrance -0.092331 0.151693 0.126414 -0.036642 0.043212 \n", "eggdrop_soup 0.080981 -0.074859 0.026031 -0.048331 0.196082 \n", "arbitrarily -0.088332 0.000339 0.021954 0.224878 -0.020637 \n", "faisant -0.141706 -0.084532 0.097593 -0.029115 -0.016920 \n", "marian -0.057206 -0.067292 -0.254453 0.049995 -0.097059 \n", "9:30p -0.101919 0.105827 0.034134 0.083859 0.089299 \n", "extra_chasu -0.078363 -0.166292 -0.007546 0.116509 0.073052 \n", "lavosh_wrap -0.008825 0.052382 0.012688 -0.035351 -0.093695 \n", "dum_dum -0.026917 -0.035949 -0.022561 -0.000162 0.026049 \n", "triplet 0.017278 0.049596 0.000595 -0.111090 0.125764 \n", "nantucket 0.029744 0.020393 -0.033682 0.150856 0.276557 \n", "gurl 0.021199 -0.092340 -0.143270 0.038394 0.005431 \n", "nordstroms -0.147309 0.084849 -0.121320 0.219587 -0.045757 \n", "eau_de -0.142917 -0.220402 0.008315 -0.086124 0.101774 \n", "extra_.5 0.028400 0.033901 0.040767 0.131177 0.154335 \n", "hazelnut_crunch 0.046888 -0.187656 0.048226 -0.086743 0.060892 \n", "bella_notte -0.008889 -0.095856 -0.047674 0.194045 -0.118062 \n", "homebrew -0.145391 0.143334 0.121490 -0.018492 0.134825 \n", "conveyor_belt_oven 0.054866 -0.097657 0.223597 0.031170 0.070630 \n", "meekly 0.010245 0.068661 -0.171023 0.034595 0.104936 \n", "foccaccia -0.052423 -0.274294 0.093353 0.015437 -0.108372 \n", "clyde 0.045135 0.094833 -0.134000 0.077121 0.069405 \n", "original_g_spicy -0.043096 0.110582 0.028887 -0.102838 0.020102 \n", "potaotes 0.078248 -0.135774 -0.086026 0.067570 0.036091 \n", "desert_botanical_gardens -0.043418 0.006576 -0.059663 0.228896 -0.001921 \n", "mi_match -0.074026 -0.003859 0.082882 -0.061745 0.132040 \n", "\n", " 99 \n", "the -0.067555 \n", "be 0.110279 \n", "and -0.078666 \n", "i 0.106345 \n", "a 0.001683 \n", "to -0.005314 \n", "it 0.090377 \n", "have 0.153466 \n", "of -0.145140 \n", "not -0.047521 \n", "for 0.075051 \n", "in -0.039543 \n", "we -0.001816 \n", "that 0.045213 \n", "but -0.001558 \n", "with 0.070306 \n", "my -0.172925 \n", "this 0.087721 \n", "you -0.246893 \n", "on 0.053575 \n", "they 0.126805 \n", "food -0.042917 \n", "do -0.056112 \n", "good -0.111619 \n", "place -0.016498 \n", "so -0.074553 \n", "get -0.030319 \n", "go 0.040107 \n", "at -0.005374 \n", "as 0.197407 \n", "... ... 
\n", "hard_boiled_egg 0.069598 \n", "poached_quail_egg 0.075863 \n", "egg_foo -0.094321 \n", "sherri 0.043500 \n", "hindrance -0.016060 \n", "eggdrop_soup 0.050968 \n", "arbitrarily 0.019025 \n", "faisant -0.027668 \n", "marian 0.106862 \n", "9:30p 0.053762 \n", "extra_chasu 0.090262 \n", "lavosh_wrap -0.105152 \n", "dum_dum 0.074344 \n", "triplet -0.020409 \n", "nantucket -0.086213 \n", "gurl 0.177151 \n", "nordstroms -0.032065 \n", "eau_de 0.051059 \n", "extra_.5 0.170376 \n", "hazelnut_crunch 0.115703 \n", "bella_notte 0.139320 \n", "homebrew 0.120686 \n", "conveyor_belt_oven -0.136683 \n", "meekly 0.051597 \n", "foccaccia -0.065421 \n", "clyde 0.039477 \n", "original_g_spicy -0.127496 \n", "potaotes -0.236710 \n", "desert_botanical_gardens 0.073541 \n", "mi_match 0.038518 \n", "\n", "[50835 rows x 100 columns]" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# build a list of the terms, integer indices,\n", "# and term counts from the food2vec model vocabulary\n", "ordered_vocab = [(term, voc.index, voc.count)\n", " for term, voc in food2vec.vocab.iteritems()]\n", "\n", "# sort by the term counts, so the most common terms appear first\n", "ordered_vocab = sorted(ordered_vocab, key=lambda (term, index, count): -count)\n", "\n", "# unzip the terms, integer indices, and counts into separate lists\n", "ordered_terms, term_indices, term_counts = zip(*ordered_vocab)\n", "\n", "# create a DataFrame with the food2vec vectors as data,\n", "# and the terms as row labels\n", "word_vectors = pd.DataFrame(food2vec.syn0norm[term_indices, :],\n", " index=ordered_terms)\n", "\n", "word_vectors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Holy wall of numbers! This DataFrame has 50,835 rows — one for each term in the vocabulary — and 100 colums. Our model has learned a quantitative vector representation for each term, as expected.\n", "\n", "Put another way, our model has \"embedded\" the terms into a 100-dimensional vector space." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### So... what can we do with all these numbers?\n", "The first thing we can use them for is to simply look up related words and phrases for a given term of interest." ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_related_terms(token, topn=10):\n", " \"\"\"\n", " look up the topn most similar terms to token\n", " and print them as a formatted list\n", " \"\"\"\n", "\n", " for word, similarity in food2vec.most_similar(positive=[token], topn=topn):\n", "\n", " print u'{:20} {}'.format(word, round(similarity, 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What things are like Burger King?" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mcdonalds 0.895\n", "wendy_'s 0.855\n", "mcd_'s 0.853\n", "mcdonald_'s 0.852\n", "denny_'s 0.816\n", "bk 0.808\n", "carl_'s_jr. 0.8\n", "red_robin 0.792\n", "mickey_d_'s 0.771\n", "sonic 0.765\n" ] } ], "source": [ "get_related_terms(u'burger_king')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model has learned that fast food restaurants are similar to each other! In particular, *mcdonalds* and *wendy's* are the most similar to Burger King, according to this dataset. 
In addition, the model has found that alternate spellings for the same entities are probably related, such as *mcdonalds*, *mcdonald's* and *mcd's*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When is happy hour?" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hh 0.874\n", "reverse_happy_hour 0.801\n", "happy_hr 0.771\n", "during_happy_hour 0.672\n", "mon_fri 0.634\n", "3_6pm 0.632\n", "hh. 0.631\n", "4_7pm 0.625\n", "special 0.621\n", "happy_hour_3_6pm 0.618\n" ] } ], "source": [ "get_related_terms(u'happy_hour')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model has noticed several alternate spellings for happy hour, such as *hh* and *happy hr*, and assesses them as highly related. If you were looking for reviews about happy hour, such alternate spellings would be very helpful to know.\n", "\n", "Taking a deeper look — the model has turned up phrases like *3-6pm*, *4-7pm*, and *mon-fri*, too. This is especially interesting, because the model has no advance knowledge at all about what happy hour is, and what time of day it should be. But simply by scanning through restaurant reviews, the model has discovered that the concept of happy hour has something very important to do with that block of time around 3-7pm on weekdays." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's make pasta tonight. Which style do you want?" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lasagna 0.798\n", "spaghetti 0.773\n", "bolognese 0.757\n", "fettuccine 0.748\n", "penne 0.745\n", "rigatoni 0.743\n", "angel_hair 0.721\n", "linguine 0.716\n", "angel_hair_pasta 0.712\n", "penne_pasta 0.712\n", "carbonara 0.705\n", "fettucini 0.704\n", "alfredo 0.703\n", "tortellini 0.703\n", "manicotti 0.7\n", "ziti 0.698\n", "gnocci 0.694\n", "ravioli 0.694\n", "linguini 0.693\n", "risotto 0.691\n" ] } ], "source": [ "get_related_terms(u'pasta', topn=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word algebra!\n", "No self-respecting word2vec demo would be complete without a healthy dose of *word algebra*, also known as *analogy completion*.\n", "\n", "The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure goes like this:\n", "1. Provide a set of words or phrases that you'd like to add or subtract.\n", "1. Look up the vectors that represent those terms in the word vector model.\n", "1. Add and subtract those vectors to produce a new, combined vector.\n", "1. Look up the most similar vector(s) to this new, combined vector via cosine similarity.\n", "1. Return the word(s) associated with the similar vector(s).\n", "\n", "But more generally, you can think of the vectors that represent each word as encoding some information about the *meaning* or *concepts* of the word. What happens when you ask the model to combine the meaning and concepts of words in new ways? Let's see." 
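, "\n", "\n",
"Before we pretty-print the matches, note that *get_related_terms* is just a thin wrapper: *most_similar* itself returns plain *(term, similarity)* tuples, so the same lookup can feed downstream code directly (for example, as extra keywords when searching the reviews). A minimal sketch, assuming the *food2vec* model loaded above:\n", "\n",
"```python\n",
"# most_similar returns (term, cosine similarity) tuples -- keep just the terms\n",
"pasta_styles = [term for term, similarity\n",
"                in food2vec.most_similar(positive=[u'pasta'], topn=20)]\n",
"```\n", "\n",
"The formatted list below comes from essentially the same call, via *get_related_terms*."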
] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def word_algebra(add=[], subtract=[], topn=1):\n", " \"\"\"\n", " combine the vectors associated with the words provided\n", " in add= and subtract=, look up the topn most similar\n", " terms to the combined vector, and print the result(s)\n", " \"\"\"\n", " answers = food2vec.most_similar(positive=add, negative=subtract, topn=topn)\n", " \n", " for term, similarity in answers:\n", " print term" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### breakfast + lunch = ?\n", "Let's start with a softball." ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "brunch\n" ] } ], "source": [ "word_algebra(add=[u'breakfast', u'lunch'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, so the model knows that *brunch* is a combination of *breakfast* and *lunch*. What else?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### lunch - day + night = ?" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dinner\n" ] } ], "source": [ "word_algebra(add=[u'lunch', u'night'], subtract=[u'day'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're getting a bit more nuanced. The model has discovered that:\n", "- Both *lunch* and *dinner* are meals\n", "- The main difference between them is time of day\n", "- Day and night are times of day\n", "- Lunch is associated with day, and dinner is associated with night\n", "\n", "What else?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### taco - mexican + chinese = ?" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dumpling\n" ] } ], "source": [ "word_algebra(add=[u'taco', u'chinese'], subtract=[u'mexican'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an entirely new and different type of relationship that the model has learned.\n", "- It knows that tacos are a characteristic example of Mexican food\n", "- It knows that Mexican and Chinese are both styles of food\n", "- If you subtract *Mexican* from *taco*, you're left with something like the concept of a *\"characteristic type of food\"*, which is represented as a new vector\n", "- If you add that new *\"characteristic type of food\"* vector to Chinese, you get *dumpling*.\n", "\n", "What else?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### bun - american + mexican = ?" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tortilla\n" ] } ], "source": [ "word_algebra(add=[u'bun', u'mexican'], subtract=[u'american'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model knows that both *buns* and *tortillas* are the doughy thing that goes on the outside of your real food, and that the primary difference between them is the style of food they're associated with.\n", "\n", "What else?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### filet mignon - beef + seafood = ?" 
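, "\n", "\n",
"Before we ask the model, here is a rough sketch of the lookup that *word_algebra* performs under the hood, written directly against the *word_vectors* DataFrame from earlier. It is a simplified approximation of gensim's *most_similar*, and it relies on the rows of *word_vectors* being unit-length (they are, since they came from *syn0norm*):\n", "\n",
"```python\n",
"import numpy as np\n",
"\n",
"def word_algebra_by_hand(add, subtract=(), topn=1):\n",
"    # steps 1-3: look up the unit-length vectors and combine them\n",
"    combined = sum(word_vectors.loc[term].values for term in add)\n",
"    combined = combined - sum(word_vectors.loc[term].values for term in subtract)\n",
"    combined = combined / np.linalg.norm(combined)\n",
"\n",
"    # step 4: cosine similarity against unit vectors is just a dot product\n",
"    similarities = word_vectors.values.dot(combined)\n",
"\n",
"    # step 5: return the closest terms, skipping the query terms themselves\n",
"    query_terms = set(add) | set(subtract)\n",
"    ranked = (word_vectors.index[i] for i in similarities.argsort()[::-1])\n",
"    return [term for term in ranked if term not in query_terms][:topn]\n",
"```\n", "\n",
"Calling word_algebra_by_hand(add=[u'taco', u'chinese'], subtract=[u'mexican']) should agree with the gensim answer below."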
] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "raw_oyster\n" ] } ], "source": [ "word_algebra(add=[u'filet_mignon', u'seafood'], subtract=[u'beef'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model has learned a concept of *delicacy*. If you take filet mignon and subtract beef from it, you're left with a vector that roughly corresponds to delicacy. If you add the delicacy vector to *seafood*, you get *raw oyster*.\n", "\n", "What else?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### coffee - drink + snack = ?" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pastry\n" ] } ], "source": [ "word_algebra(add=[u'coffee', u'snack'], subtract=[u'drink'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model knows that if you're on your coffee break, but instead of drinking something, you're eating something... that thing is most likely a pastry.\n", "\n", "What else?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Burger King + fine dining = ?" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "denny_'s\n" ] } ], "source": [ "word_algebra(add=[u'burger_king', u'fine_dining'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Touché. It makes sense, though. The model has learned that both Burger King and Denny's are large chains, and that both serve fast, casual, American-style food. But Denny's has some elements that are slightly more upscale, such as printed menus and table service. Fine dining, indeed.\n", "\n", "*What if we keep going?*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Denny's + fine dining = ?" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "applebee_'s\n" ] } ], "source": [ "word_algebra(add=[u\"denny_'s\", u'fine_dining'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This seems like a good place to land... what if we explore the vector space around *Applebee's* a bit, in a few different directions? Let's see what we find.\n", "\n", "#### Applebee's + italian = ?" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "olive_garden\n" ] } ], "source": [ "word_algebra(add=[u\"applebee_'s\", u'italian'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Applebee's + pancakes = ?" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ihop\n" ] } ], "source": [ "word_algebra(add=[u\"applebee_'s\", u'pancakes'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Applebee's + pizza = ?" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pizza_hut\n" ] } ], "source": [ "word_algebra(add=[u\"applebee_'s\", u'pizza'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You could do this all day. One last analogy before we move on..." 
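, "\n", "\n",
"(Or let a loop do the exploring for you. A minimal sketch; the cuisine list is arbitrary, and of these only the *italian* result appears in the outputs above.)\n", "\n",
"```python\n",
"# sweep one anchor restaurant across several cuisines\n",
"for cuisine in [u'italian', u'mexican', u'chinese', u'japanese']:\n",
"    answer = food2vec.most_similar(positive=[u\"applebee_'s\", cuisine], topn=1)[0][0]\n",
"    print u'{:10} {}'.format(cuisine, answer)\n",
"```"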
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### wine - grapes + barley = ?" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "beer\n" ] } ], "source": [ "word_algebra(add=[u'wine', u'barley'], subtract=[u'grapes'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word Vector Visualization with t-SNE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[t-Distributed Stochastic Neighbor Embedding](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf), or *t-SNE* for short, is a dimensionality reduction technique to assist with visualizing high-dimensional datasets. It attempts to map high-dimensional data onto a low two- or three-dimensional representation such that the relative distances between points are preserved as closely as possible in both high-dimensional and low-dimensional space.\n", "\n", "scikit-learn provides a convenient implementation of the t-SNE algorithm with its [TSNE](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) class." ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.manifold import TSNE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our input for t-SNE will be the DataFrame of word vectors we created before. Let's first:\n", "1. Drop stopwords — it's probably not too interesting to visualize *the*, *of*, *or*, and so on\n", "1. Take only the 5,000 most frequent terms in the vocabulary — no need to visualize all ~50,000 terms right now." ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tsne_input = word_vectors.drop(spacy.en.STOPWORDS, errors=u'ignore')\n", "tsne_input = tsne_input.head(5000)" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
(HTML table omitted: tsne_input.head(), 5 rows × 100 columns; see the text/plain rendering below)
" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "food -0.164133 0.007745 0.058311 -0.169839 -0.042278 0.004095 0.203732 \n", "good -0.239592 -0.232940 -0.005036 -0.028226 0.149816 -0.133312 -0.034164 \n", "place 0.025479 0.130311 0.119834 -0.096365 0.013793 0.074431 -0.063780 \n", "order 0.045996 -0.035101 -0.045906 -0.280336 0.157393 -0.146304 -0.064311 \n", "great -0.189524 -0.259744 -0.041182 0.021664 0.132266 -0.030005 -0.078055 \n", "\n", " 7 8 9 ... 90 91 92 \\\n", "food -0.021252 -0.084491 -0.016372 ... -0.096869 0.060159 -0.133541 \n", "good -0.130310 -0.013757 0.008618 ... -0.034984 -0.135347 -0.112965 \n", "place 0.063191 -0.004273 0.111458 ... -0.076251 -0.076574 -0.086146 \n", "order -0.082754 0.124693 -0.194072 ... -0.063306 0.125928 -0.194433 \n", "great -0.004111 -0.016184 0.054305 ... -0.052275 -0.122914 -0.075555 \n", "\n", " 93 94 95 96 97 98 99 \n", "food 0.166804 0.084901 0.109261 0.137871 0.018093 -0.158754 -0.042917 \n", "good 0.056312 0.055106 -0.026181 -0.135510 0.087664 0.009934 -0.111619 \n", "place -0.023936 0.136419 -0.001543 -0.084301 0.016356 -0.148379 -0.016498 \n", "order -0.129551 0.039680 0.058868 -0.023189 -0.153715 0.152482 -0.003842 \n", "great 0.193383 0.017008 -0.117156 -0.041065 0.140430 -0.067349 -0.040291 \n", "\n", "[5 rows x 100 columns]" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tsne_input.head()" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tsne_filepath = os.path.join(intermediate_directory,\n", " u'tsne_model')\n", "\n", "tsne_vectors_filepath = os.path.join(intermediate_directory,\n", " u'tsne_vectors.npy')" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 31.3 ms, sys: 2.47 ms, total: 33.8 ms\n", "Wall time: 32.9 ms\n" ] } ], "source": [ "%%time\n", "\n", "if 0 == 1:\n", " \n", " tsne = TSNE()\n", " tsne_vectors = tsne.fit_transform(tsne_input.values)\n", " \n", " with open(tsne_filepath, 'w') as f:\n", " pickle.dump(tsne, f)\n", "\n", " pd.np.save(tsne_vectors_filepath, tsne_vectors)\n", " \n", "with open(tsne_filepath) as f:\n", " tsne = pickle.load(f)\n", " \n", "tsne_vectors = pd.np.load(tsne_vectors_filepath)\n", "\n", "tsne_vectors = pd.DataFrame(tsne_vectors,\n", " index=pd.Index(tsne_input.index),\n", " columns=[u'x_coord', u'y_coord'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a two-dimensional representation of our data! Let's take a look." ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
(HTML table omitted: tsne_vectors.head() with x_coord and y_coord columns; see the text/plain rendering below)
" ], "text/plain": [ " x_coord y_coord\n", "food 2.313886 6.475995\n", "good 8.763030 4.633407\n", "place -8.942178 2.221976\n", "order -2.876029 -2.300830\n", "great 9.515772 5.076319" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tsne_vectors.head()" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tsne_vectors[u'word'] = tsne_vectors.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plotting with Bokeh" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
Loading BokehJS ..." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "/* BokehJS 0.12.2 autoload script emitted by output_notebook() omitted */" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from bokeh.plotting import figure, show, output_notebook\n", "from bokeh.models import HoverTool, ColumnDataSource, value\n", "\n", "output_notebook()" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "(interactive Bokeh scatter plot of the t-SNE word embeddings rendered here)
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# add our DataFrame as a ColumnDataSource for Bokeh\n", "plot_data = ColumnDataSource(tsne_vectors)\n", "\n", "# create the plot and configure the\n", "# title, dimensions, and tools\n", "tsne_plot = figure(title=u't-SNE Word Embeddings',\n", " plot_width = 800,\n", " plot_height = 800,\n", " tools= (u'pan, wheel_zoom, box_zoom,'\n", " u'box_select, resize, reset'),\n", " active_scroll=u'wheel_zoom')\n", "\n", "# add a hover tool to display words on roll-over\n", "tsne_plot.add_tools( HoverTool(tooltips = u'@word') )\n", "\n", "# draw the words as circles on the plot\n", "tsne_plot.circle(u'x_coord', u'y_coord', source=plot_data,\n", " color=u'blue', line_alpha=0.2, fill_alpha=0.1,\n", " size=10, hover_line_color=u'black')\n", "\n", "# configure visual elements of the plot\n", "tsne_plot.title.text_font_size = value(u'16pt')\n", "tsne_plot.xaxis.visible = False\n", "tsne_plot.yaxis.visible = False\n", "tsne_plot.grid.grid_line_color = None\n", "tsne_plot.outline_line_color = None\n", "\n", "# engage!\n", "show(tsne_plot);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whew! Let's round up the major components that we've seen:\n", "1. Text processing with **spaCy**\n", "1. Automated **phrase modeling**\n", "1. Topic modeling with **LDA** $\\ \\longrightarrow\\ $ visualization with **pyLDAvis**\n", "1. Word vector modeling with **word2vec** $\\ \\longrightarrow\\ $ visualization with **t-SNE**\n", "\n", "#### Why use these models?\n", "Dense vector representations for text like LDA and word2vec can greatly improve performance for a number of common, text-heavy problems like:\n", "- Text classification\n", "- Search\n", "- Recommendations\n", "- Question answering\n", "\n", "...and more generally are a powerful way machines can help humans make sense of what's in a giant pile of text. They're also often useful as a pre-processing step for many other downstream machine learning applications." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Science @ S&P Global — *we are hiring!*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }