{
 "metadata": {
  "name": "",
  "signature": "sha256:08bf04e871858dc8ea09a0bf5063148bc444d881f598c04af3350380fbd5a606"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Lab 5 Stats and NLP"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this lab we'll review some basic tools for statistical analysis, and then explore NLP with the Stanford Parsing suite. To get started, download <a href=\"https://bcourses.berkeley.edu/files/59900717/download?download_frd=1\">this small data file</a> and save in a lab5 directory. Save this notebook there as well using the icon at top right, and start ipython notebook to view it. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Inferential Statistics"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Inferential statistics is about making inferences about what produced the data: a single process or two? a linear process or a constant one? It contrasts with descriptive statistics which is about measuring properties of the data, and in turn making estimates about a population. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We'll begin by initializing a random seed for Numpy's random number generator. The random numbers generated by computers are actually pseudo-random, meaning they form a repeatable sequence given the seed. With the same seed, you will always get the same sequence of pseudo-random numbers which is good for repeatability or testing correctness of a program. On the other hand, if you execute the other cells of this notebook without setting the seed, you can see how the results vary between random samples. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from pylab import *\n",
      "%matplotlib inline\n",
      "from scipy import stats\n",
      "import numpy as np\n",
      "np.random.seed(12345678)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
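     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "As a quick aside, here is a minimal sketch (using only numpy, with a throwaway generator named rng_demo that isn't used elsewhere in the lab) showing that the same seed always reproduces the same draws:"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Two generators seeded with the same value produce identical sequences.\n",
       "rng_demo = np.random.RandomState(12345678)   # throwaway generator, just for illustration\n",
       "first = rng_demo.rand(3)\n",
       "rng_demo = np.random.RandomState(12345678)   # re-seed with the same value\n",
       "second = rng_demo.rand(3)\n",
       "(first, second, (first == second).all())     # the two draws match exactly"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },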
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we'll generate two samples of 400 points from the normal distribution. The first has mean zero, the second has a mean (loc param) of 0.15. Both have unit variance."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "a = stats.norm.rvs(size=400)\n",
      "b = stats.norm.rvs(loc=0.15, size=400)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Lets look at histograms of both distributions on the same set of bins:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bins = linspace(-3,3,10)\n",
      "hist(a,bins,rwidth=0.5)\n",
      "hist(b,bins,rwidth=0.5,align=u'right')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
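     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "If the two histograms are hard to tell apart, here is a purely cosmetic variation on the same plot (same data, just with labels and a legend):"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Same histograms as above, with labels and a legend to tell the samples apart.\n",
       "hist(a, bins, rwidth=0.5, label='a (mean 0)')\n",
       "hist(b, bins, rwidth=0.5, align=u'right', label='b (mean 0.15)')\n",
       "legend()"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },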
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Two-Sample T-test"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Lets start by analyzing our data with a two-sample T-test. This test will evaluate the null hypothesis that the two samples have the same mean."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "stats.ttest_ind(a, b)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The test returns the value of the t-statistic, and the p-value of the t-test. Review the formula for the t-statistic, and make sure you agree with the value it generated. Was the test significant at the 0.05 significance level? "
     ]
    },
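     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "For reference, the statistic that ttest_ind computes (in its default pooled, equal-variance form) is\n",
       "\n",
       "$$ t = \\frac{\\bar{a} - \\bar{b}}{s_p \\sqrt{\\tfrac{1}{n_a} + \\tfrac{1}{n_b}}}, \\qquad s_p^2 = \\frac{(n_a - 1) s_a^2 + (n_b - 1) s_b^2}{n_a + n_b - 2} $$\n",
       "\n",
       "where $\\bar{a}, \\bar{b}$ are the sample means, $s_a^2, s_b^2$ are the sample variances, and $n_a = n_b = 400$ here."
      ]
     },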
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: Enter the t-statistic value, and the p-value in the lab responses form."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Try running the cells above again *without* initializing the random number generator again (dont eval that cell). What happened this time? This is a good reminder that unusual events can happen by chance. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now lets try transforming the data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ea=exp(a)\n",
      "eb=exp(b)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Which will produce skewed (non-normal) distributions for the datasets. Now lets try the t-test again. This time **run the entire sequence of cells from the beginning** in order to initialize the random seed again."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: repeat the t-test on these new vectors. Enter the results on the lab responses form.\n",
      "Increase the bin range to (-3,20). Then plot a histogram of both distributions. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "stats.ttest_ind(ea, eb)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "bins2 = linspace(-3,20,10)\n",
      "hist(ea,bins2,rwidth=0.5)\n",
      "hist(eb,bins2,rwidth=0.5,align=u'right')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: Describe the distributions now."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "The K-S Test"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The Kolmogorov-Smirnov test is a very versatile, non-parametric test that compares samples of data from (potentially) different distributions. You can apply it directly to datasets such as ea,eb above. Go ahead and try it:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "stats.ks_2samp(ea,eb)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
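     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "To make the statistic concrete, here is a minimal sketch (not scipy's implementation; the helper name ks_stat_sketch is ours, just for illustration) of what the K-S statistic measures: the largest vertical gap between the two empirical CDFs. It should agree with the first value returned by ks_2samp above."
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Sketch of the two-sample K-S statistic: the maximum gap between empirical CDFs.\n",
       "def ks_stat_sketch(x, y):\n",
       "    grid = np.sort(np.concatenate([x, y]))   # evaluate both ECDFs on a common grid\n",
       "    cdf_x = np.searchsorted(np.sort(x), grid, side='right') / float(len(x))\n",
       "    cdf_y = np.searchsorted(np.sort(y), grid, side='right') / float(len(y))\n",
       "    return np.abs(cdf_x - cdf_y).max()\n",
       "\n",
       "ks_stat_sketch(ea, eb)"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },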
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: Was the ks-test result significant at 0.05? Compare it with the test on (a,b) and the test on (ea,ab). Which result is it closer to? What does this tell you about the reliability of these tests? "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Analyzing Discrete Data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Often, you'll want to analyze discrete data, especially count data. Bag-of-Words text data is an important case. Next load the dataset for this exercise. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "nips = pd.read_csv(\"nips10cols.txt\",sep=\"\\t\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "and take a look at the first few rows:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "nips.head(10)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The rows represent documents from a conference (NIPS). The columns represent certain key words (they've been obfuscated). From the frequencies of these words, you can determine something about the topic of the paper. But you cant simply compare counts. The chances of two documents having exactly the same word counts is close to zero, even if the topics are very similar. \n",
      "\n",
      "Instead we can use a statistical test to ask whether the same random process might have generated two different rows in the table. The appropriate test is the Chi-squared test. \n",
      "\n",
      "Lets compare the counts for documents 1479 and 1480 for term1 and term2: (note we have to number carefully since column labels are one-based)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "nmat = nips.as_matrix()\n",
      "m=nmat[1479:1481,[0,1]]\n",
      "m"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "stats.chi2_contingency(m)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The test returns (first) the statistic value, and then the p-value."
     ]
    },
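     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "For reference, the statistic is computed from the observed counts $O_{ij}$ and the expected counts $E_{ij}$ under the null hypothesis of independence (row total times column total, divided by the grand total):\n",
       "\n",
       "$$ \\chi^2 = \\sum_{i,j} \\frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \\qquad E_{ij} = \\frac{(\\sum_j O_{ij})(\\sum_i O_{ij})}{\\sum_{i,j} O_{ij}} $$\n",
       "\n",
       "Note that by default scipy also applies a continuity correction for 2x2 tables, so its value can differ slightly from this formula."
      ]
     },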
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: Was the chi-squared test significant at 0.05 ? "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: Repeat the test for term4 and term6 (careful about numbering). What was the p-value? "
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Natural Language Analysis of Content"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here we're going to use a constituency parser to extract some \"facts\" from natural language. The text is from the simplified wikipedia site: http://simple.wikipedia.org. It has been filtered to find sentences about cats. Download <a href=\"https://bcourses.berkeley.edu/files/61440818/download?download_frd=1\">this file (cat.txt)</a>  into your lab5 directory to get it. "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Stanford Parser Setup"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You may have done most of this setup for HW3 already. In case you havent, here it is:\n",
      "    \n",
      "Download the Stanford parser from <a href=\"https://bcourses.berkeley.edu/files/59900819/download?download_frd=1\">here</a>. If your network is not working, download from a browser on your host machine and then use drag-and-drop.\n",
      "\n",
      "Either way, you can put the parser in your \"Downloads\" directory. Unpack it with\n",
      "\n",
      "<pre>tar xvzf stanfordparser.tar.gz</pre>\n",
      "\n",
      "and then move it to the /opt directory with\n",
      "\n",
      "<pre>sudo mv StanfordParser /opt</pre>\n",
      "\n",
      "It will be helpful to have links to the parser scripts from your bin directory. If you havent already, create a directory ~/bin. Then\n",
      "<pre>\n",
      "cd ~/bin\n",
      "ln -s /opt/StanfordParser/lexparser.sh lexparser.sh\n",
      "ln -s /opt/StanfordParser/lexparser-gui.sh lexparser-gui.sh\n",
      "ln -s /opt/StanfordParser/dependencyviewer/dependencyviewer.sh dependencyviewer.sh\n",
      "</pre>\n",
      "\n",
      "These files will be in your path the next time you login. You can logout from the start button at the top right of the VM window. Then log back in again.    "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Running the Parser"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From a terminal window, type\n",
      "\n",
      "<pre>lexparser-gui.sh</pre>\n",
      "\n",
      "This brings up a GUI interface to the Stanford parser. To use it, click on \"Load Parser\" which brings up a file selection dialog. Navigate to\n",
      "\n",
      "<pre>/opt/StanfordParser/stanford-parser-3.4.1-models.jar</pre>\n",
      "\n",
      "and open it.\n",
      "\n",
      "Then you will see a list of parsers to use. Select\n",
      "\n",
      "<pre>englishPCFG.ser.gz</pre>\n",
      "\n",
      "You're now ready to parse some text!\n",
      "\n",
      "Click on the \"Load File\" button, and browse to the lab5 directory and load the cat.txt file. Click on \"Parse\" to parse the current sentence (highlighted in yellow).\n"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Parsing to XML"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We'll parse the cat sentence file to XML. To do this, we'll make a customized version of the parser script. Copy the file:\n",
      "\n",
      "<pre>/opt/StanfordParser/lexparser.sh</pre>\n",
      "\n",
      "and save it as:\n",
      "\n",
      "<pre>/opt/StanfordParser/parsetoxml.sh</pre>\n",
      "\n",
      "Edit it so that its outputFormat is:\n",
      "\n",
      "<pre>-outputFormat \"xmlTree\"</pre>\n",
      "\n",
      "and add a new option:\n",
      "\n",
      "<pre>-outputFormatOptions \"xml\"</pre>\n",
      "\n",
      "and create an alias to parsetoxml.sh it in your ~/bin directory. Now run from your lab5 directory\n",
      "\n",
      "<pre>parsetoxml.sh cat.txt > cat.xml</pre>\n",
      "\n",
      "you're ready now to analyze the cat data. We'll use Python's builtin ElementTree parser."
     ]
    },
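     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "(A quick reference before we move on: after these edits, the parser invocation inside parsetoxml.sh should look roughly like the line below. This is only a sketch: keep whatever java memory and classpath options the original lexparser.sh uses, shown here as \"...\", and change only the output options.)\n",
       "\n",
       "<pre>java ... edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat \"xmlTree\" -outputFormatOptions \"xml\" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz $*</pre>"
      ]
     },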
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from lxml import etree\n",
      "parser = etree.XMLParser(recover=True)\n",
      "tree = etree.parse('/home/datascience/labs/lab5/cat.xml',parser) # fix this path if you put the file somewhere else"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can examine the root of this tree:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "root=tree.getroot()\n",
      "root.tag"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(root)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "root[0].tag"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "i.e. we have found the first sentence. The xmlTree representation is a little tricky however, as POS tags are stored as attributes of nodes rather than node tags. To get to the actual root node, we need to dig a little deeper (and we'll use the second sentence which is a bit more conventional):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "root[1][0][0].attrib['value']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
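     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "If the nested indexing is hard to follow, it can help to look at the raw XML of a subtree. This is an optional aside; pretty_print is an lxml-specific option:"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Serialize one sentence's subtree to see the XML that findall() will navigate.\n",
       "print(etree.tostring(root[1][0][0], pretty_print=True))"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },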
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "going down one level gets us to the actual sentence node:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "s=root[6][0][0][0]\n",
      "s.attrib['value']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "and to get its children we can do:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "s[:]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is not too helpful, because the node types are hidden in the value attribs of these nodes. To see them, we can use a python anonymous function and map it over the list."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "map(lambda (x): x.attrib['value'], s[:])\n",
      "# of if you prefer list comprehensions: [x.attrib['value'] for x in s[:]]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now lets see if we can find sentences starting with noun phrases containing a given noun. The final function supports a flexible syntax (similar to xpath) for locating elements of given type or attributes. A slash \"/\" is like a directory specifier, and defines a child node. A double slash \"//\" specifies any descendent, child, grandchild, great-grandchild etc. The \"node[@value='NP']\" specifies a node with the given attribute value."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "agent = s.findall(\"./node[@value='NP']//node[@value='NN']//leaf[@value='cat']\")\n",
      "agent"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "finds all the nodes starting with an 'NP' child of s, and having a 'NN' node above a leaf with 'cat' value.\n",
      "\n",
      "We can similarly look for a verb in a verb phrase under the root node:\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "verb = s.findall(\"./node[@value='VP']//node[@value='VBZ']//leaf[@value='is']\")\n",
      "verb"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Putting these together, we can discover sentences containing a given pair of (agent,action) pairs:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def printnode(node):\n",
      "    for i in node.findall(\".//leaf\"):\n",
      "        print(\" \" + i.attrib['value']),\n",
      "    print('')\n",
      "\n",
      "def testnode(node, agent, action):\n",
      "    aa = node.findall(\"./node[@value='NP']//node[@value='NN']//leaf[@value='\"+agent+\"']\")\n",
      "    bb = node.findall(\"./node[@value='VP']//leaf[@value='\"+action+\"']\")\n",
      "    if (len(aa) > 0 and len(bb) > 0):\n",
      "        printnode(node)    \n",
      "\n",
      "def agentact(node, agent, action):\n",
      "    testnode(node, agent, action)\n",
      "    snodes = node.findall(\".//node[@value='S']\")\n",
      "    for snode in snodes:\n",
      "        testnode(snode, agent, action)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "title = 'cat'\n",
      "agentact(s, title, 'is')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Next we can apply the agentact function to all the sentences in the Wikipedia entry"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "[agentact(nn[0][0][0], title, 'is') for nn in root]\n",
      "[] # hide the return bvalue"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: Create a new testnode2 function that captures some more facts about cats. Look for other patterns that might have interesting facts about cats: e.g. where the cat is the object of a sentence. Put it in the cell below, and then evaluate it.\n",
      "\n",
      "> Cut and paste the resulting output table into the lab responses form."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def agentact2(node, agent, action):\n",
      "    testnode2(node, agent, action)\n",
      "    snodes = node.findall(\".//node[@value='S']\")\n",
      "    for snode in snodes:\n",
      "        testnode2(snode, agent, action)\n",
      "        \n",
      "map(lambda (nn): agentact2(nn[0][0][0], title, 'is'), root)\n",
      "[]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Lab Responses"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Fill out the Lab 5 responses <a href=\"https://bcourses.berkeley.edu/courses/1377158/quizzes/2045072\">Here</a>."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}