{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Supervised sentiment: Overview of the Stanford Sentiment Treebank" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts\"\n", "__version__ = \"CS224u, Stanford, Spring 2018 term\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "## Contents\n", "\n", "0. [Overview of this unit](#Overview-of-this-unit)\n", "0. [Paths through the material](#Paths-through-the-material)\n", "0. [Overview of this notebook](#Overview-of-this-notebook)\n", "0. [The complexity of sentiment analysis](#The-complexity-of-sentiment-analysis)\n", "0. [Set-up](#Set-up)\n", "0. [Data readers](#Data-readers)\n", " 0. [Main readers](#Main-readers)\n", " 0. [All-nodes readers](#All-nodes-readers)\n", " 0. [Methodological notes](#Methodological-notes)\n", "0. [Modeling the SST labels](#Modeling-the-SST-labels)\n", " 0. [Train label distributions](#Train-label-distributions)\n", " 0. [Dev label distributions](#Dev-label-distributions)\n", "0. 
[Additional sentiment resources](#Additional-sentiment-resources)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Overview of this unit\n", "\n", "We have a few interrelated goals for this unit:\n", "\n", "* Provide a basic introduction to supervised learning in the context of a problem that has long been central to academic research and industry applications: __sentiment analysis__.\n", "\n", "* Explore and evaluate a diverse array of methods for modeling sentiment:\n", " * Hand-built feature functions with (mostly linear) classifiers\n", " * Dense feature representations derived from VSMs as we built them in the previous unit\n", " * Recurrent neural networks (RNNs)\n", " * Tree-structured neural networks\n", " \n", "* Begin discussing and implementing responsible methods for __hyperparameter optimization__ and __classifier assessment and comparison__.\n", "\n", "The unit is built around the [Stanford Sentiment Treebank (SST)](http://nlp.stanford.edu/sentiment/), a widely used resource for evaluating supervised NLU models, and one that provides rich linguistic representations." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Paths through the material\n", "\n", "* If you're relatively new to supervised learning, we suggest studying the details of this notebook closely and following the links to [additional resources](#Additional-sentiment-resources). \n", "\n", "* If you're familiar with supervised learning, then you can focus right away on innovative feature representations and modeling. \n", "\n", "* As of this writing, the state of the art for the SST seems to be around 88% accuracy for the binary problem and 48% accuracy for the five-class problem. Perhaps you can best these numbers!" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Overview of this notebook\n", "\n", "This is the first notebook in this unit. It does two things:\n", "\n", "* Introduces sentiment analysis as a task.\n", "* Introduces the SST and our tools for reading that corpus. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The complexity of sentiment analysis\n", "\n", "Sentiment analysis seems simple at first but turns out to exhibit all of the complexity of full natural language understanding. To see this, consider how your intuitions about the sentiment of the following sentences can change depending on perspective, social relationships, tone of voice, and other aspects of the context of utterance:\n", "\n", "1. There was an earthquake in LA.\n", "1. The team failed the physical challenge. (We win/lose!)\n", "1. They said it would be great. They were right/wrong.\n", "1. Many consider the masterpiece bewildering, boring, slow-moving or annoying.\n", "1. The party fat-cats are sipping their expensive, imported wines.\n", "1. Oh, you're terrible!\n", "\n", "SST mostly steers around these challenges by including only focused, evaluative texts (sentences from movie reviews), but you should have them in mind if you consider new domains and applications for the ideas." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Set-up\n", "\n", "* Make sure your environment includes all the requirements for [the cs224u repository](https://github.com/cgpotts/cs224u).\n", "\n", "* Download [the train/dev/test Stanford Sentiment Treebank distribution](http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip), unzip it, and put the resulting folder in the same directory as this notebook. It will be called `trees`.\n", "\n", "* Make sure you still have the `vsmdata` directory and its contents. 
([Here's a link in case you need to redownload it.](http://web.stanford.edu/class/cs224u/data/vsmdata.zip)) In addition, you might want [the Wikipedia 2014 + Gigaword 5 distribution of the pretrained GloVe vectors](http://nlp.stanford.edu/data/glove.6B.zip). This might already be in `vsmdata`, depending on what kind of work you did as part of the VSM unit." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from nltk.tree import Tree\n", "import pandas as pd\n", "import sst" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Data readers\n", "\n", "* The train/dev/test SST distribution contains files that are lists of trees where the part-of-speech tags have been replaced with sentiment scores `0...4`:\n", " * `0` and `1` are negative labels.\n", " * `2` is a neutral label.\n", " * `3` and `4` are positive labels. \n", "\n", "* Our readers are iterators that yield `(tree, score)` pairs, where `tree` is an [NLTK Tree](http://www.nltk.org/_modules/nltk/tree.html) instance and `score` is a string." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Main readers\n", "\n", "We'll mainly work with `sst.train_reader` and `sst.dev_reader`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "tree, score = next(sst.train_reader())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, `score` is one of the labels. `tree` is an NLTK Tree instance. 
It should render pretty legibly in your browser:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "Tree('S', [Tree('2', [Tree('2', ['The']), Tree('2', ['Rock'])]), Tree('4', [Tree('3', [Tree('2', ['is']), Tree('4', [Tree('2', ['destined']), Tree('2', [Tree('2', [Tree('2', [Tree('2', [Tree('2', ['to']), Tree('2', [Tree('2', ['be']), Tree('2', [Tree('2', ['the']), Tree('2', [Tree('2', ['21st']), Tree('2', [Tree('2', [Tree('2', ['Century']), Tree('2', [\"'s\"])]), Tree('2', [Tree('3', ['new']), Tree('2', [Tree('2', ['``']), Tree('2', ['Conan'])])])])])])])]), Tree('2', [\"''\"])]), Tree('2', ['and'])]), Tree('3', [Tree('2', ['that']), Tree('3', [Tree('2', ['he']), Tree('3', [Tree('2', [\"'s\"]), Tree('3', [Tree('2', ['going']), Tree('3', [Tree('2', ['to']), Tree('4', [Tree('3', [Tree('2', ['make']), Tree('3', [Tree('3', [Tree('2', ['a']), Tree('3', ['splash'])]), Tree('2', [Tree('2', ['even']), Tree('3', ['greater'])])])]), Tree('2', [Tree('2', ['than']), Tree('2', [Tree('2', [Tree('2', [Tree('2', [Tree('1', [Tree('2', ['Arnold']), Tree('2', ['Schwarzenegger'])]), Tree('2', [','])]), Tree('2', [Tree('2', ['Jean-Claud']), Tree('2', [Tree('2', ['Van']), Tree('2', ['Damme'])])])]), Tree('2', ['or'])]), Tree('2', [Tree('2', ['Steven']), Tree('2', ['Segal'])])])])])])])])])])])])]), Tree('2', ['.'])])])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tree" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is what it actually looks like, of course:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Tree('S', [Tree('2', [Tree('2', ['The']), Tree('2', ['Rock'])]), Tree('4', [Tree('3', [Tree('2', ['is']), Tree('4', [Tree('2', ['destined']), Tree('2', [Tree('2', [Tree('2', [Tree('2', [Tree('2', ['to']), Tree('2', [Tree('2', ['be']), Tree('2', [Tree('2', ['the']), Tree('2', [Tree('2', 
['21st']), Tree('2', [Tree('2', [Tree('2', ['Century']), Tree('2', [\"'s\"])]), Tree('2', [Tree('3', ['new']), Tree('2', [Tree('2', ['``']), Tree('2', ['Conan'])])])])])])])]), Tree('2', [\"''\"])]), Tree('2', ['and'])]), Tree('3', [Tree('2', ['that']), Tree('3', [Tree('2', ['he']), Tree('3', [Tree('2', [\"'s\"]), Tree('3', [Tree('2', ['going']), Tree('3', [Tree('2', ['to']), Tree('4', [Tree('3', [Tree('2', ['make']), Tree('3', [Tree('3', [Tree('2', ['a']), Tree('3', ['splash'])]), Tree('2', [Tree('2', ['even']), Tree('3', ['greater'])])])]), Tree('2', [Tree('2', ['than']), Tree('2', [Tree('2', [Tree('2', [Tree('2', [Tree('1', [Tree('2', ['Arnold']), Tree('2', ['Schwarzenegger'])]), Tree('2', [','])]), Tree('2', [Tree('2', ['Jean-Claud']), Tree('2', [Tree('2', ['Van']), Tree('2', ['Damme'])])])]), Tree('2', ['or'])]), Tree('2', [Tree('2', ['Steven']), Tree('2', ['Segal'])])])])])])])])])])])])]), Tree('2', ['.'])])]),)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(tree,)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a smaller example:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAALgAAACMCAIAAABAuvQrAAAACXBIWXMAAA3XAAAN1wFCKJt4AAAAHXRFWHRTb2Z0d2FyZQBHUEwgR2hvc3RzY3JpcHQgOS4xNnO9PXQAAAnVSURBVHic7Z0xbONGFoZnD1cE62tYeNsA3M4ueVtrA5CN3Jpq7TQUkkW6BGSbjoT6AGSTTXcgXR2wbsjCbrMaIM2q88DbWoAGCE5ODsFFVwxujktL1IgSNTPU+ypRIqk3nF9vZjjDX88WiwUCgHX8RXYAgB6AUAAhQCiAECAUQAgQCiCEHkIJgiAIAtlRHDQaCCVJEsuyMMayAzloVBcKpRRj7Lqu7EAOHdWFEkWR53myowDUFgohBCFkWZbsQAD0V9kB1BEEQRzHsqMAEFJZKBhjSmkURWyTEJJlGXRWZKGuUEzT9H2fbxZFYZqmxHgOnGe6zB47jpPnuewoDhelO7OMLMscx8EYwz03iWiTUQC5aJBRABUAoQBCgFAAITQQCn18/Ocvv5DpVHYgB42691EQQvTxMbq+Tm5u/vPnn7/+9pv3+rXf75vHx7LjOkQUHfVwidD53Hv9+qsvvvjHzz/zTZDL/lFOKBWJlDVR8xHQNgoJRVAHIBcpKCGUBnUPctkzkoWyZX2DXPaGNKHssI5BLntAglBaqleQS6vsVSh7qEuQS0vsSSh7rj+Qy85pXSgS6wzkskNaFIoi9aRIGLrTilAUrBsFQ9KLHQtF8fpQPDyV2ZlQNKoDjUJVhx0IRdPrrmnYsthKKB241h0own5oKJSOXd+OFacNmgiFTKd///777l3TslzSr792X72SHZFCNMwowdWV1+t1RiJlmFz8ft94/lx2LAqhxHoUQH00WIUPqAAIBRAChAIIsdlzPRjjKIoopQgh13U77K7GnBPCMJQdiCpskFEopcwrK8/zPM8ppVmWtReZRMCw9CkbCIUQ4vu+YRhs0/f9TgoFDEuXskHTU3FnxBh30isLDEuX0rAzSylNkqTssdYNwLB0FU2EQikdDAae5/FmqDMEQdA99e+Ejd0MWJc2DMPu/ezAsLSGzYRCCImiKAzD7uUSBIal9SyEGY/Htm3PZjP+ThiG4odrh23bskNQiA0mBR3HoZRWckknvV+zLEuSBGPseR7cc2PA7DEgBMz1AEKAUAAhQCiAEEq7QkqBLYX89x9/fGPbnVzr2Ywmndng6grf3+fffttGQHJJbm+DLKPz+d8+++xfv//esdXj29Awo3TPHji5vY2ur8nDg316Gl9cGEdHbEV+cnPjn53BWuuGGSV6927x449tBLR/iskkyDJ8f2+fnvr9vn1ywj9izVD07p1xdMSyy8HK5aD7KMVkEl1fFx8+mC9e5N99V5YIw3j+PDw/93o9Jpfk5uZg5XKgGYVMp6xlMV+88Pt9r9cTP+Qws8vBZZRyfftnZ+H5ueCB5vFxfHHh9/ssu2Tv3wsqrBs0ySjFZOKMRtplFP7EKEJoy5SAP34Msoy1WQcil0MRSnB1xZ9BD113J61GuYvTebl0v+nh496d3xSxT07skxMml+Hbt9H1dXh+3tVH25sLpZhMng4TlCJ7/z64uuK3RlqKtiyXwQ8/PB1jd4NuZhTeKLQqkTJMLix7OaNR9+TSNaGQ6XT400+s3xBfXu653+D1el6v10m5dEco5XHv/iVS5qlc4osL7SeMGiyfvHt4QJeX+YcPu1uRuRWz+dzPMnR5abx542fZbD6XHdH/iW9ujDdv0OWl9/bt3cOD7HCa0ySjqPPjKJtpqTl15/V67qtXfH5R3+nohmtmn3355dLJkX3ClwRocfV19xNsLhSJ/YDKkgCNrrj6KXAVzYWy0UTJrqhZEqAROq5eaDjqsU9P9/87JtOpMxqtWhKgEZXVC3Q+jy8uZAe1Bs2e68EfP1qffy47il3C1gqq33pqJhRAFvC4BiAECAUQ4pPOLCGEEGKaJrd7KIoCIWQYBjO/Y36QCKHyPk/
3fLq5Dd22otTFfvIToWCMi6LAGOd5bhgGIYRtmqYZxzGz9mPeMqZpliuMec6g/ymDH2hZ1pZCYb49aZoyF4UoirpkbsPsJ5MkkR2IAJVb+nme+77v+z5/x/f9PM/55irXEOYpWn6nfJLGjMfjymld193+tCowm808z1toYsSypI9i2zallNneSceyLNu2+WaXrCj1sp9c3pkNw5C1nUrRJStK7ewnlwuFdTWU8hvumBWldvaTK4fHzJiaD3NWsZ/E0zErSm4/GQRBEAR8KKAydXM9nudxL81VlA3jK92atSITpHtWlDraT9bdcGO9WsH6tiyL3TthiB9YD8Z4OByWVbJWu+pjGIZdYid3m9rmk7keQshgMEAImaaZpilCiFL68uXLNE1t2x4MBuwPBSql4saQRVFEUcQ+xRjzmx/b0G0rSo3sJ3c/KVgUhRY/EWAjYPYYEAImBQEhQCiAECAUQAidhFJMJvTxUXYUO6aYTLRwTtRJKM5ohO/vZUexY5zRKLm9lR3FenQSCiAREIpk7NNT2SEIAUIBhAChAEKAUOSjRQ8dhAIIAUIBhAChAEKAUCSj/uPpDBCKZIyjI9khCAFCAYQAocgHJgUBIcjDg+wQ1gNCAYQAoQBCgFAAIUAoktHF3lInofhnZ7rcnhLHPD72z85kR7EeeK4HEEKnjAJIBIQCCKGuUMomlAqieHgVto9WXaEEQVA2X9knSZIEQTAcDmsC2Cg8QgjzzKnZp1XZbX8x1RUKMymV8tXMhMIwjJrK2yg80zTDMKyvquFwuFmUm7D9xVT3PwUVd0zceXitZpTto1VRKDxPLjVtwxizHG4YBvuVLLWgybKMedCx3Xzf5248g8GAOW8z5zT2cxf3/KkJL0kSbszMTPAIIcySiO9QFAWltPylhBDWzDmOw3YzDKN81DZlqYm2/thyWdI0bfLnk/uhYoTMsSxrNpux1+PxeKmbb5qmzOuX71axMTYMI45j/ml557UBrPo0TdOKkXPltPVfusqWuNWyrDq2UpbFYqGfUFzXTdOUb97d3S3dp/JOHMfloyq1srSSNhVK5SSz2az+W0RiWLRcllXHPj2Jik1PPXEcs1EJIcQwjKWtb1EUPI1z9uygz1vGLVGhLEjNPkoNrMfHvTcppY7jjMfjym62bZfb+P1Q6eWwvyrZ/rQqlAWpPDxeCsa4/F8Uq3qgrutWXEZ3VW01eJ43HA6ZlJmD8kaOh6Zplgc+/LUKZUEKTgoyR02EEGtZmBTiOGZpvCiKJEn4+4QQ27aXtj5RFPF/WGBF5bdGBoNB2bEzCIIkSfhmfQAi4bEq930/SRJ2zqdf6jgOxth13TiOWcBsNMe0xQLmH7VRlrXHlssSx7FyQhGB2d0ihMp/vLEU5pFsWZYU1+vhcMgrWwRerqUByy2LlkLRgqIoiqJQ3GZYHM06s4rD8jl7bVmW1ioplyXPc8gogBCajXoAWYBQACFAKIAQIBRAiP8CBCFm1Qc2iiMAAAAASUVORK5CYII=", "text/plain": [ "Tree('4', [Tree('2', ['NLU']), Tree('4', [Tree('2', ['is']), Tree('4', ['enlightening'])])])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Tree.fromstring(\"\"\"(4 (2 NLU) (4 (2 is) (4 enlightening)))\"\"\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### All-nodes readers\n", "\n", "In SST parlance, the __all-nodes task__ trains and assesses, not just 
with the full sentence, but also with all the labeled subtrees. We won't explore this task here, but it's good to know about it, and these readers will give you access to this version of the dataset:\n", " * `sst.allnodes_train_reader`\n", " * `sst.allnodes_dev_reader`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Methodological notes\n", "\n", "* We've deliberately ignored `test` readers. We urge you not to use the `test` set until and unless you are running experiments for a final project or similar. Overuse of test sets corrupts them, since even subtle lessons learned from those runs can be incorporated back into model-building efforts.\n", "\n", "* We actually have mixed feelings about the overuse of `dev` that might result from working with these notebooks! We've tried to encourage using just splits of the training data for assessment most of the time, with only occasional use of `dev`. This will give you a clearer picture of how you will ultimately do on `test`; overuse of `dev` can lead to overfitting on that particular dataset with a resulting loss of performance on `test`." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "slide" } }, "source": [ "## Modeling the SST labels\n", "\n", "Working with the SST involves making decisions about how to handle the raw SST labels. The interpretation of these labels is as follows ([Socher et al., sec. 3](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf)):\n", "\n", "* `'0'`: very negative\n", "* `'1'`: negative\n", "* `'2'`: neutral\n", "* `'3'`: positive\n", "* `'4'`: very positive\n", "\n", "The labels look like they could be treated as totally ordered, even continuous. However, conceptually, they do not form such an order. 
Rather, they consist of three separate classes, with the negative and positive classes being totally ordered in opposite directions:\n", "\n", "* `'0' > '1'`: negative\n", "* `'2'`: neutral\n", "* `'4' > '3'`: positive\n", "\n", "Thus, in this notebook, we'll look mainly at binary (positive/negative) and ternary tasks.\n", "\n", "A related note: the above shows that the __fine-grained sentiment task__ for the SST is particularly punishing as usually formulated, since it ignores the partial-order structure in the categories completely. As a result, mistaking `'0'` for `'1'` is as bad as mistaking `'0'` for `'4'`, though the first error is clearly less severe than the second.\n", "\n", "The functions `sst.binary_class_func` and `sst.ternary_class_func` will convert the labels for you. The recommended usage is to pass them as the `class_func` keyword argument to `train_reader` and `dev_reader`; see the examples below." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Train label distributions\n", "\n", "Check that these numbers all match those reported in [Socher et al. 2013, sec 5.1](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf)." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "train_labels = [y for tree, y in sst.train_reader()]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total train examples: 8,544\n" ] } ], "source": [ "print(\"Total train examples: {:,}\".format(len(train_labels)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Distribution over the full label set:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3 2322\n", "1 2218\n", "2 1624\n", "4 1288\n", "0 1092\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(train_labels).value_counts()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Binary label conversion:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "binary_train_labels = [\n", " y for tree, y in sst.train_reader(class_func=sst.binary_class_func)]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total binary train examples: 6,920\n" ] } ], "source": [ "print(\"Total binary train examples: {:,}\".format(len(binary_train_labels)))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "positive 3610\n", "negative 3310\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(binary_train_labels).value_counts()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Ternary label conversion:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "positive 3610\n", "negative 3310\n", "neutral 
1624\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ternary_train_labels = [\n", " y for tree, y in sst.train_reader(class_func=sst.ternary_class_func)]\n", "\n", "pd.Series(ternary_train_labels).value_counts()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Dev label distributions\n", "\n", "Check that these numbers all match those reported in [Socher et al. 2013, sec 5.1](http://www.aclweb.org/anthology/D/D13/D13-1170.pdf)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "dev_labels = [y for tree, y in sst.dev_reader()]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total dev examples: 1,101\n" ] } ], "source": [ "print(\"Total dev examples: {:,}\".format(len(dev_labels)))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 289\n", "3 279\n", "2 229\n", "4 165\n", "0 139\n", "dtype: int64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(dev_labels).value_counts()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Binary label conversion:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "binary_dev_labels = [\n", " y for tree, y in sst.dev_reader(class_func=sst.binary_class_func)]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total binary dev examples: 872\n" ] } ], "source": [ "print(\"Total binary dev examples: {:,}\".format(len(binary_dev_labels)))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "positive 444\n", "negative 428\n", "dtype: int64" ] 
}, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(binary_dev_labels).value_counts()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Ternary label conversion:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "positive 444\n", "negative 428\n", "neutral 229\n", "dtype: int64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ternary_dev_labels = [\n", " y for tree, y in sst.dev_reader(class_func=sst.ternary_class_func)]\n", "\n", "pd.Series(ternary_dev_labels).value_counts()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Additional sentiment resources\n", "\n", "Here are a few publicly available datasets and other resources; if you decide to work on sentiment analysis, get in touch with the teaching staff — we have a number of other resources that we can point you to.\n", "\n", "* Sentiment lexica: http://sentiment.christopherpotts.net/lexicons.html\n", "* NLTK now has a SentiWordNet module: http://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.sentiwordnet\n", "* Stanford Large Movie Review Dataset: http://ai.stanford.edu/~amaas/data/sentiment/index.html\n", "* SemEval-2013: Sentiment Analysis in Twitter: https://www.cs.york.ac.uk/semeval-2013/task2/\n", "* Starter code for a sentiment-aware tokenizer: http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "widgets": { "state": {}, "version": "1.1.2" } }, 
"nbformat": 4, "nbformat_minor": 1 }