{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "8f2868af", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import nltk\n", "from nltk.corpus import names\n", "from pylab import *\n", "import random as pyrandom" ] }, { "cell_type": "markdown", "id": "1de0c279", "metadata": {}, "source": [ "# A Simple Classifier" ] }, { "cell_type": "code", "execution_count": 2, "id": "cfb7ddeb", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'first_letter': 'P', 'last_letter': 'a', 'length': 5}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def gender_features(w):\n", " return dict(last_letter=w[-1],\n", " first_letter=w[0],\n", " length=len(w))\n", "gender_features(\"Petra\")" ] }, { "cell_type": "code", "execution_count": 4, "id": "ffd17f1f", "metadata": { "collapsed": false }, "outputs": [], "source": [ "male = [(name,'male') for name in names.words('male.txt')]\n", "female = [(name,'female') for name in names.words('female.txt')]\n", "nlist = male+female\n", "pyrandom.shuffle(nlist)\n", "featuresets = [(gender_features(n),g) for n,g in nlist]\n", "training_set = featuresets[500:]\n", "test_set = featuresets[:500]" ] }, { "cell_type": "code", "execution_count": 5, "id": "4505a570", "metadata": { "collapsed": false }, "outputs": [], "source": [ "classifier = nltk.NaiveBayesClassifier.train(training_set)" ] }, { "cell_type": "markdown", "id": "4fccae98", "metadata": {}, "source": [ "# Accuracy and Error Rate" ] }, { "cell_type": "code", "execution_count": 14, "id": "eda216dd", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.788" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.classify.accuracy(classifier,test_set)" ] }, { "cell_type": "code", "execution_count": 15, "id": "6fc7782c", "metadata": { "collapsed": false }, "outputs": [], "source": [ "truth = [s[1] for s in test_set]\n", 
"predicted = classifier.batch_classify([s[0] for s in test_set])" ] }, { "cell_type": "code", "execution_count": 17, "id": "c327704c", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.788" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.metrics.accuracy(truth,predicted)" ] }, { "cell_type": "code", "execution_count": 20, "id": "2a1e3214", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.78800000000000003" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum(array(truth)==array(predicted))*1.0/len(truth)" ] }, { "cell_type": "code", "execution_count": 28, "id": "e4abdbf5", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.21199999999999999" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum(array(truth)!=array(predicted))*1.0/len(truth)" ] }, { "cell_type": "markdown", "id": "88935e14", "metadata": {}, "source": [ "# Confusion Matrix" ] }, { "cell_type": "code", "execution_count": 34, "id": "3f701142", "metadata": { "collapsed": false }, "outputs": [], "source": [ "cm = nltk.ConfusionMatrix(truth,predicted)" ] }, { "cell_type": "code", "execution_count": 36, "id": "074583fb", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " | f |\n", " | e |\n", " | m m |\n", " | a a |\n", " | l l |\n", " | e e |\n", "-------+---------+\n", "female |<261> 48 |\n", " male | 58<133>|\n", "-------+---------+\n", "(row = reference; col = test)\n", "\n" ] } ], "source": [ "print cm.pp()" ] }, { "cell_type": "markdown", "id": "77cb502a", "metadata": {}, "source": [ "Usually, we use the confusion matrix in multi-class problems to see which classes are particularly frequently confused.\n", "\n", "However, in a binary classification problem, it also gives us the four categories of error. 
If we think of the above problem as a \"female detector\", then (rows of the matrix are the true labels, columns are the predictions):\n", "\n", "- true positives = 261 (female names classified as female)\n", "- true negatives = 133 (male names classified as male)\n", "- false positives = 58 (male names classified as female)\n", "- false negatives = 48 (female names classified as male)" ] }, { "cell_type": "markdown", "id": "6c0cb6cb", "metadata": {}, "source": [ "# Precision and Recall" ] }, { "cell_type": "markdown", "id": "c5e2756b", "metadata": {}, "source": [ "Precision and recall describe errors in terms of \"retrieved items\".\n", "\n", "Think of the classification problem as the problem of \"retrieving all the records that are female\".\n", "Then we define:\n", "\n", "- precision = fraction of retrieved records that are actually female = TP / (TP + FP) = 261 / (261 + 58) ≈ 0.818\n", "- recall = fraction of all female records that were actually retrieved = TP / (TP + FN) = 261 / (261 + 48) ≈ 0.845\n" ] }, { "cell_type": "code", "execution_count": 25, "id": "5b800b1b", "metadata": { "collapsed": false }, "outputs": [], "source": [ "# indices of the test names predicted (pdocs) and truly labeled (tdocs) female;\n", "# flatnonzero replaces pylab's find, which newer matplotlib releases no longer provide\n", "pdocs = set(flatnonzero(array(predicted)==\"female\"))\n", "tdocs = set(flatnonzero(array(truth)==\"female\"))" ] }, { "cell_type": "code", "execution_count": 26, "id": "8086d2d9", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.8181818181818182" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.metrics.precision(tdocs,pdocs)" ] }, { "cell_type": "code", "execution_count": 27, "id": "b357303b", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.8446601941747572" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.metrics.recall(tdocs,pdocs)" ] }, { "cell_type": "code", "execution_count": 30, "id": "195fbd65", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.8312101910828026" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.metrics.f_measure(tdocs,pdocs)" ] }, { "cell_type": "code", "execution_count": 32, "id": "5a7c41cb", "metadata": { "collapsed": false 
}, "outputs": [ { "data": { "text/plain": [ "0.8207547169811322" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nltk.metrics.f_measure(tdocs,pdocs,alpha=0.9)" ] }, { "cell_type": "markdown", "id": "05c42996", "metadata": {}, "source": [ "# ROC Curve" ] }, { "cell_type": "code", "execution_count": 47, "id": "318382da", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "66" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# collect every distinct (feature name, value) pair seen in the training set\n", "features = [t[0].items() for t in training_set]\n", "features = list(sorted(set([x for l in features for x in l])))\n", "len(features)" ] }, { "cell_type": "code", "execution_count": 49, "id": "9912c045", "metadata": { "collapsed": false }, "outputs": [], "source": [ "# one-hot encode the training set: one column per (feature name, value) pair\n", "xs = zeros((len(training_set),len(features)))\n", "ys = zeros(len(training_set))\n", "for i,(f,c) in enumerate(training_set):\n", "    for p in f.items():\n", "        xs[i,features.index(p)] = 1.0\n", "    ys[i] = (c==\"female\")" ] }, { "cell_type": "code", "execution_count": 50, "id": "d053507a", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, penalty='l2', tol=0.0001)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "lr = LogisticRegression()\n", "lr.fit(xs,ys)" ] }, { "cell_type": "code", "execution_count": 58, "id": "190718e7", "metadata": { "collapsed": false }, "outputs": [], "source": [ "# encode the test set with the same columns as the training set\n", "test_xs = zeros((len(test_set),len(features)))\n", "test_ys = zeros(len(test_set))\n", "for i,(f,c) in enumerate(test_set):\n", "    for p in f.items():\n", "        test_xs[i,features.index(p)] = 1.0\n", "    test_ys[i] = (c==\"female\")" ] }, { "cell_type": "code", "execution_count": 62, "id": "92cfe851", "metadata": { "collapsed": false }, "outputs": [ { "data": { 
"text/plain": [ "0.218" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pred = lr.predict(test_xs)\n", "sum(pred!=test_ys)*1.0/len(test_ys)" ] }, { "cell_type": "code", "execution_count": 69, "id": "91a73260", "metadata": { "collapsed": false }, "outputs": [], "source": [ "def tpfp(truth,pred):\n", " tp = sum((truth>.5)*(pred>.5))*1.0/len(truth)\n", " fp = sum((truth<.5)*(pred>.5))*1.0/len(truth)\n", " return tp,fp" ] }, { "cell_type": "code", "execution_count": 70, "id": "1af8a628", "metadata": { "collapsed": false }, "outputs": [], "source": [ "truth = test_ys\n", "prob = lr.predict_proba(test_xs)[:,1]" ] }, { "cell_type": "code", "execution_count": 71, "id": "a0f99be6", "metadata": { "collapsed": false }, "outputs": [], "source": [ "roc = [tpfp(truth,prob>threshold) for threshold in linspace(0,1.0,100)]\n", " " ] }, { "cell_type": "code", "execution_count": 74, "id": "4b59e3d9", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYcAAAEKCAYAAAD5MJl4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHqVJREFUeJzt3X10VPWdx/HP4IwK2IIKVZhJiZCYjCx5wEQEQQdrjaBG\nRHRTXdaHiKmaKuJaaz27BvtwzC66teZY0x6161OIBUughaBBZklRMghUPMuDAYkO8QmwSBAkMNz9\n4x4C4SaQhztzZ5L365x7MpP53TsfrnG+c3+/e3/XZRiGIQAAjtHH6QAAgPhDcQAAWFAcAAAWFAcA\ngAXFAQBgQXEAAFhEpThUV1crPT1dqampKi0ttbw+Z84cZWdnKzs7W6NGjZLb7dbu3bujEQUA0AUu\nu69ziEQiSktLU01Njbxer3Jzc1VRUSG/399m+7/85S/6zW9+o5qaGjtjAAC6wfYjh1AopJSUFCUn\nJ8vj8aigoEBVVVXttn/ttdf0ox/9yO4YAIBucNu9wcbGRiUlJbU89/l8qqura7Ptvn37tHTpUj37\n7LOW11wul93RAKBXsKNDyPYjh858qC9atEjjx4/XwIED23zdMIy4Xx577DHHM5CTjOQk55HFLrYX\nB6/Xq3A43PI8HA7L5/O12Xbu3Ll0KQFAHLK9OOTk5Ki+vl4NDQ1qbm5WZWWl8vPzLe2+/vprrVix\nQtddd53dEQAA3WT7mIPb7VZZWZny8vIUiURUWFgov9+v8vJySVJRUZEkacGCBcrLy1Pfvn3tjhBT\ngUDA6QgdQk77JEJGiZx2S5ScdrH9VFa7uFwuW/vPAKA3sOuzkyukAQAWFAcAgAXFAQBgQXEAAFhQ\nHAAAFhQHAIAFxQEAYEFxAABYUBwAABYUBwCABcUBAGBBcQAAWFAcAAAWFAcAgAXFAQBgQXEAAFhQ\nHAAAFhQHAIAFxQEAYEFxAABYuJ0OAAA9lWFIjY3S4cMnbrdrlzRjhtS//9HfPf+8lJIS3XwnQnEA\n0OPt2SOtXy/9/e/m41jYu1eaN0/6+mvptNNO3LapSfL5pCefPPq7c8+Nbr6TcRmGYTgboW0ul0tx\nGg1AN4XD0kcfmR+gxy9NTe3/vrm58+/17bfSl19K//RPUlaWdPbZ9v972uJ2S5MnS2PGSC5XbN5T\nsu+zMyrFobq6WjNnzlQkEtGdd96phx9+2NImGAzqgQce0MGDBzVo0CAFg8HWwSgOQI918cXmB/25\n50pnnCF95zvmz+OX439/6qmd/6D1eKTkZPPDujeI2+IQiUSUlpammpoaeb1e5ebmqqKiQn6/v6XN\n7t27dckll2jp0qXy+XzauXOnBg0a1DoYxQFIGN9+K33yibl05Nv9tGnSihVSTk70s/U2dn122l5L\nQ6GQUlJSlJycLEkqKChQVVVVq+Lw2muv6YYbbpDP55MkS2E4oqSkpOVxIBBQIBCwOy6ADjh40Pzg\nb2gwl23bWv/cuVNKSpK+/33p9NNPvr3LLjPbo/uCwaCl58UOtheHxsZGJR3zX93n86murq5Vm/r6\neh08eFATJ05UU1OT7r//fk2fPt2yrWOLA4DYOXBAqquTli83l9WrpcGDze6Z884zf1555dHnQ4dK\np5zicOhe6vgvzrNnz7Zlu7YXB1cHOgQPHjyotWvXatmyZdq3b5/Gjh2riy++WKmpqXbHAdCOw4fN\nb/6vvmp28Rzpifj2W/PMnvR06fLLpZ/9TBo/3uzzR+9he3Hwer0Kh8Mtz8PhcEv30RFJSUkaNGiQ\n+vbtq759++rSSy/V+++/T3EAouDLL6V168wP/I8+OtoV9PHH0plnSlOnSg8+aA72SubAbVaWNGCA\no7HhMNuLQ05Ojurr69XQ0KChQ4eqsrJSFRUVrdpcd911Ki4uViQS0YEDB1RXV6dZs2bZHQXokZqb\npY0bj37TP5ZhmAVg3TrznP5166R9+8wP+6ws83TOa64xu4KGD
Wt90RVwLNuLg9vtVllZmfLy8hSJ\nRFRYWCi/36/y8nJJUlFRkdLT03XVVVcpIyNDffr00YwZM3TBBRfYHQXocbZtk266ybywql+/ttsk\nJUnZ2VJhoflz2LDYnmePnoGL4IA4tHu3OSC8dWvrs4Lq66WSEun++/nAR9vi9joHu1Ac0Js0NUm1\ntUfPDtq82bwG4Pzzj54ddN555lw7sbrCF4mJ4gDEgdpa6eabpe3bu7ed006Txo6VJk40l4suOvl8\nPEBb4vYiOCDeffONuRzLMMyB2/bm9mnrd19/La1dK734ojRpkjP/FiBaOHJAr7J9uzlI21Z/ff/+\nJ5/b5/jf5+RIQ4bE/t8BtIduJaCTDMM8jXPMGOk//sPpNEB00K0EdNDhw9L770uVleaRw5//7HQi\nIP5RHNDjHD4s/d//HT3zZ8UKc16giROlP/3p6JXAANpHtxISnmFImzYdLQbBoPTd7x4982fiRHNi\nOKA3YMwBkPTWW+a8QF9/3boYfP/7TicDnMGYA3q8/fvNpS2ffir9/OfShg3SnDnSdddxxTBgJ4oD\nHGEY0q5d5gBxY6P58/gbyeze3f78QaefLj3wgDmGwMVigP3oVkJUHDhgThG9ffvR5UgROPK4b1/J\n5zMXr9ecIO7YqSLOPVfq08fpfwmQWBhzQFwyDPNU0YceMi8qO++8owXg2ELg9TJdNBANjDkg7rz3\nnlkUdu6UysulK65wOhGArqI4oFu++kp67TXphRfMovDzn0t33mneTQxA4qJbCZ0WiUg1NWZBWLpU\nmjxZuv12837D3GQecBZjDnDEn/8s3XefOVh8++3Sj35k3ocYQHygOCDmVqyQpk2TFiyQxo1zOg2A\nttj12cmJgjgpw5AqKqQbb5RefZXCAPQGDBvihOrqpJkzpYMHzS4lCgPQO3DkgDZt3y5Nny5NnSr9\n+MdSKERhAHoTigNa2bdPmj1bysw0r1jevFm69VauVAZ6m6j8L19dXa309HSlpqaqtLTU8nowGNSA\nAQOUnZ2t7Oxs/fKXv4xGDHTSjh3S6NHmZHZr10q//KV5K0wAvY/tYw6RSETFxcWqqamR1+tVbm6u\n8vPz5ff7W7W77LLLtHDhQrvfHl20f7+Uny/dcIP0q185nQaA02w/cgiFQkpJSVFycrI8Ho8KCgpU\nVVVlacdpqvGhoUEqKZH8fun8882jBQCw/cihsbFRSUlJLc99Pp/q6upatXG5XHrnnXeUmZkpr9er\nOXPm6IILLrBsq6SkpOVxIBBQIBCwO26vsWOHtHHj0emwt20zxxPq66WbbzbPRMrOdjolgM4KBoMK\nBoO2b9f24uDqwB1XRo8erXA4rH79+mnJkiWaMmWKPvzwQ0u7Y4sDum7ZMumf/1lKTz86HfaECeZA\n87hx3A8BSGTHf3GePXu2Ldu1vTh4vV6Fw+GW5+FwWD6fr1Wb73znOy2PJ02apHvuuUdfffWVzjrr\nLLvj9Hp/+Yt0xx3S/PnSZZc5nQZAorB9zCEnJ0f19fVqaGhQc3OzKisrlZ+f36rNF1980TLmEAqF\nZBgGhSEKXn9dKiw0CwSFAUBn2H7k4Ha7VVZWpry8PEUiERUWFsrv96u8vFySVFRUpHnz5ul3v/ud\n3G63+vXrp7lz59odo9dbtky6/37prbekjAyn0wBINEy81wPt3SuNGiU9+6w0aZLTaQDEErOywqK5\n2byA7amnzCua//hHpxMBiDVmZYUk88Y7zz8vXXihNHCgdMstZmF46imnkwFIZMzKmsD+93+lBx6Q\n+vWT5syRxowxHwNAd1EcEtSzz0qlpdJ//qd0001SBy4vAYAOozgkqM2bpVmzzIvbAMBujDkkKMbq\nAUQTxSEBzZsnzZ0rZWU5nQRAT0W3UgJpbpYeekhatEhassQ8QwkAooHikEDy881J8taskc480+k0\nAHoyLoJLIN/7nvTBB9I55
zidBEC8suuzkyOHBPDFF9Irr0jffCOdeqrTaQD0BgxIx6mDB6WqKmnK\nFCktzTxiePNNupMAxAZHDnHi8GHps8+kjz6SFi6UXn5ZSkmRbr/dfHzMLTAAIOooDg56/XVzXqRt\n26RPPjHnRkpOlgIBc2qMtDSnEwLorRiQdtC//qs0ZIh0223SsGHMiwSg+xiQ7iFGjpT8fqdTAEBr\nDEg76MABpxMAQNsoDg44eFD66U+ld9+Vxo1zOg0AWNGtFGONjVJBgdS/v7R2rTRokNOJAMCKI4cY\nqqmRcnKkvDxp8WIKA4D4xZFDDHz4ofS730mVldKrr0qXX+50IgA4MYpDFOzcaV6nsHy5uezaJf3L\nv0jvvScNHep0OgA4Oa5zsFFdnXTXXVJDg3TJJdLEieaSlSW5KcMAYsCuz86ojDlUV1crPT1dqamp\nKi0tbbfd6tWr5Xa79cYbb0QjRkwZhnTffdKPf2weOSxebN57ISeHwgAg8dheHCKRiIqLi1VdXa0N\nGzaooqJCGzdubLPdww8/rKuuuirhjhDaUlVlXrdQVCR5PE6nAYDusb04hEIhpaSkKDk5WR6PRwUF\nBaqqqrK0e+aZZzRt2jQNHjzY7ggxE4lImzaZt+x8+GHpV7+S+nD+F4AewPYOj8bGRiUlJbU89/l8\nqqurs7SpqqrS22+/rdWrV8vlcrW5rZKSkpbHgUBAgUDA7rhdcuiQ9O//LpWVSYMHS9nZ0r33SpMn\nO50MQG8TDAYVDAZt367txaG9D/pjzZw5U0888UTLwEl73UrHFod48emn5kVs/fqZp6gOGeJ0IgC9\n2fFfnGfPnm3Ldm0vDl6vV+FwuOV5OByWz+dr1WbNmjUqKCiQJO3cuVNLliyRx+NRfn6+3XFstWyZ\nNH26dPfd0qOP0oUEoOey/VTWQ4cOKS0tTcuWLdPQoUN10UUXqaKiQv52ph69/fbbde2112rq1Kmt\ng8XRqayHD5vjCc8+a96u8wc/cDoRALQtbqfsdrvdKisrU15eniKRiAoLC+X3+1VeXi5JKioqsvst\no+7RR82L2tas4SI2AL0DF8GdxOrV0jXXSOvXS+ec43QaADixuL4IrqeIRKQ77pD++78pDAB6F44c\nTuCrr6Thw6V//EPqwElYAOA4jhxipE8fCgOA3oficALxeUwFANFHcWjHl1+aF7uNH+90EgCIPYpD\nG/72N+nCC6WLLpJ6wISxANBp7V7nsH//fj333HPasmWLMjIyVFhYKHcPn3vaMKQnn5T+67+kF19k\nriQAvVe7n/a33nqrTj31VI0fP16LFy/Whg0b9PTTT8cyW0zt3i3ddpv02WdSKCQNG+Z0IgBwTrun\nso4aNUoffPCBJHNKjNzcXK1bty52wWJ4KuvatdKNN0pXXy3NmSOdempM3hYAbBf16TOO7ULqyd1J\nhw9LU6eacyfdcovTaQAgPrR75NCnTx/179+/5fn+/fvVt29fcyWXS3v27IlusBgdObz9tjRrlvT3\nv0f9rQAg6qJ+5JCZmRnTbiSn/PGP5lgDAOCoXn0qa1OTtHChdPPNTicBgPjS7pHDjh079NRTT7V5\neOJyuTRr1qyoBouFN96QLrtM+t73nE4CAPGl3eIQiUTU1NQUyywxV1NjTscNAGit3QHp7OxsR8cc\nYjEgnZwsVVdL6elRfRsAiBlmZe2mcFjat09KS3M6CQDEn3aLQ01NTSxzxFxtrTmpHtNxA4BVu8Xh\n7LPPjmWOmKutlSZMcDoFAMSnXtutVFsrXXqp0ykAID71ytuE7tghjRhh3ga0B88MAqAXYkC6G+bO\nla69lsIAAO3plcWBKTMA4MSiUhyqq6uVnp6u1NRUlZaWWl6vqqpSZmamsrOzdeGFF+rtt9+ORow2\nrV9v3gL08stj9pYAkHBsH3OIRCJKS0tTTU2NvF6vcnNzVVFRIb/f39Lmm2++aZnx9YMPPtD
111+v\nLVu2tA4WpTGHBx+UTj/dnKIbAHqauB1zCIVCSklJUXJysjwejwoKClRVVdWqzbFTge/du1eDBg2y\nO0abPv5YeuklqbAwJm8HAAnL9iHZxsZGJSUltTz3+Xyqq6uztFuwYIEeeeQRffbZZ3rzzTfb3FZJ\nSUnL40AgoEAg0OVchiHddZf0wAPS8OFd3gwAxJVgMKhgMGj7dm0vDq4OXnI8ZcoUTZkyRbW1tZo+\nfbo2b95saXNsceiuuXPNU1gfesi2TQKA447/4jx79mxbtmt7t5LX61U4HG55Hg6H5fP52m0/YcIE\nHTp0SLt27bI7SitlZVJJieTxRPVtAKBHsL045OTkqL6+Xg0NDWpublZlZaXy8/Nbtdm6dWvLgMna\ntWslRXe6jg8/lLZulSZNitpbAECPYnu3ktvtVllZmfLy8hSJRFRYWCi/36/y8nJJUlFRkebPn6+X\nXnpJHo9HZ5xxhubOnWt3jFb+53+kW27hqAEAOqrHT58RiZj3bfjrX6WMjO7nAoB4Frenssab2lrp\n7LMpDADQGT2+OMybJ910k9MpACCx9OhupcOHJZ9PCgal88+3JxcAxDO6lTrgnXekQYMoDADQWT26\nOLz+ujRtmtMpACDx9Kg7GmzfLi1ffnQ5cMAckAYAdE6PGHP49FNpyhTpo4+kQECaONFc/H6pg7N5\nAECPYNeYQ8IfOTQ0SFdcId1xh/Szn0l9enRHGQDERkIXhw8/lH74Q+nf/k36yU+cTgMAPUfCdivt\n2mVe2PaLX5hHDQAATmVVY6N01lkUBgCIhoQtDtu2MZEeAERLQo45vPWWdOed0muvOZ0EAHqmhDty\nWLjQnH77jTfMwWgAgP0S6shh/nzp3nulxYulnByn0wBAz5UwZyvV1ko33GB2KWVmOhgMAOJYrzpb\nadMm6cYbzTEGCgMARF/cF4cvvpAmT5ZKS80roQEA0RfXxeGTT6SrrpJuu0269Van0wBA7xHXYw7n\nnGNo1izpoYeYQA8AOqJXTLx3993ST3/qdAoA6H3iulvpjDOcTgAAvVNcFwcAgDOiUhyqq6uVnp6u\n1NRUlZaWWl5/9dVXlZmZqYyMDF1yySVav359NGIAALrI9jGHSCSi4uJi1dTUyOv1Kjc3V/n5+fL7\n/S1thg8frhUrVmjAgAGqrq7WXXfdpVWrVtkdBQDQRbYfOYRCIaWkpCg5OVkej0cFBQWqqqpq1Wbs\n2LEaMGCAJGnMmDHavn273TEAAN1g+5FDY2OjkpKSWp77fD7V1dW12/7555/X5MmT23ztzTdL1NRk\nPg4EAgoEAnZGBYCEFwwGFQwGbd+u7cXB1YkLEpYvX64XXnhBK1eubPP1K68s0YMP2pUMAHqe4784\nz54925bt2l4cvF6vwuFwy/NwOCyfz2dpt379es2YMUPV1dU688wz7Y4BAOgG28cccnJyVF9fr4aG\nBjU3N6uyslL5+fmt2nzyySeaOnWqXnnlFaWkpNgdAQDQTbYfObjdbpWVlSkvL0+RSESFhYXy+/0q\nLy+XJBUVFenxxx/XP/7xD919992SJI/Ho1AoZHcUAEAXxfXcSnPmGIw5AEAn9Kr7OQAAYoviAACw\noDgAACwoDgAAC4oDAMCC4gAAsKA4AAAsKA4AAAuKAwDAguIAALCgOAAALCgOAAALigMAwILiAACw\noDgAACwoDgAAC4oDAMCC4gAAsKA4AAAsKA4AAAuKAwDAguIAALCgOAAALKJSHKqrq5Wenq7U1FSV\nlpZaXt+0aZPGjh2r008/XU8++WQ0IgAAusFt9wYjkYiKi4tVU1Mjr9er3Nxc5efny+/3t7Q5++yz\n9cwzz2jBggV2vz0AwAa2HzmEQiGlpKQoOTlZHo9HBQUFqqqqatVm8ODBysnJkcfjsfvtAQA2sP3I\nobGxUUlJSS3PfT6f6urqurStN98sUVOT+TgQCCgQCNi
QEAB6jmAwqGAwaPt2bS8OLpfLtm1deWWJ\nHnzQts0BQI9z/Bfn2bNn27Jd27uVvF6vwuFwy/NwOCyfz2f32wAAosj24pCTk6P6+no1NDSoublZ\nlZWVys/Pb7OtYRh2vz0AwAa2dyu53W6VlZUpLy9PkUhEhYWF8vv9Ki8vlyQVFRXp888/V25urvbs\n2aM+ffro6aef1oYNG3TGGWfYHQcA0AUuI06/vrtcLs2ZYzDmAACd4HK5bOmV4QppAIAFxQEAYEFx\nAABYUBwAABYUBwCABcUBAGBBcQAAWFAcAAAWFAcAgAXFAQBgQXEAAFhQHAAAFhQHAIAFxQEAYEFx\nAABYUBwAABYUBwCABcUBAGBBcQAAWFAcAAAWFAcAgAXFAQBgQXHopmAw6HSEDiGnfRIho0ROuyVK\nTrtEpThUV1crPT1dqampKi0tbbPNfffdp9TUVGVmZmrdunXRiBETifIHQ077JEJGiZx2S5ScdrG9\nOEQiERUXF6u6ulobNmxQRUWFNm7c2KrN4sWLtWXLFtXX1+v3v/+97r77brtjAAC6wfbiEAqFlJKS\nouTkZHk8HhUUFKiqqqpVm4ULF+rWW2+VJI0ZM0a7d+/WF198YXcUAEBXGTb705/+ZNx5550tz19+\n+WWjuLi4VZtrrrnGWLlyZcvzH/zgB8Z7773Xqo0kFhYWFpYuLHZwy2Yul6tD7czP//bXO/51AEDs\n2N6t5PV6FQ6HW56Hw2H5fL4Tttm+fbu8Xq/dUQAAXWR7ccjJyVF9fb0aGhrU3NysyspK5efnt2qT\nn5+vl156SZK0atUqDRw4UOecc47dUQAAXWR7t5Lb7VZZWZny8vIUiURUWFgov9+v8vJySVJRUZEm\nT56sxYsXKyUlRf3799eLL75odwwAQHfYMnLRSUuWLDHS0tKMlJQU44knnmizzU9+8hMjJSXFyMjI\nMNauXdupdeMh57Bhw4xRo0YZWVlZRm5urqM5N27caFx88cXGaaedZsyZM6dT68ZLzljtz5NlfOWV\nV4yMjAxj1KhRxrhx44z333+/w+vGS854+ttcsGCBkZGRYWRlZRmjR482li1b1uF14yVnPO3PI0Kh\nkHHKKacY8+bN6/S6R8S8OBw6dMgYMWKEsW3bNqO5udnIzMw0NmzY0KrNX//6V2PSpEmGYRjGqlWr\njDFjxnR43XjIaRiGkZycbOzatSsq2Tqb88svvzRWr15tPProo60+dONtf7aX0zBisz87kvGdd94x\ndu/ebRiG+T9bvP5ttpfTMOLrb3Pv3r0tj9evX2+MGDGiw+vGQ07DiK/9eaTdxIkTjauvvrqlOHRl\nf8Z8+oyuXgfx+eefd2hdp3Mee72GEYMzrjqSc/DgwcrJyZHH4+n0uvGQ84ho78+OZBw7dqwGDBgg\nyfxvvn379g6vGw85j4iXv83+/fu3PN67d68GDRrU4XXjIecR8bI/JemZZ57RtGnTNHjw4E6ve6yY\nF4fGxkYlJSW1PPf5fGpsbOxQm08//fSk68ZDTsk8NfeKK65QTk6O/vCHP0QlY0dzRmPdzurue8Vi\nf3Y24/PPP6/Jkyd3aV2nckrx97e5YMEC+f1+TZo0Sb/97W87ta7TOaX42p+NjY2qqqpqmXXiyCUC\nXdmftg9In0xXr4OIte7m/Nvf/qahQ4dqx44d+uEPf6j09HRNmDDBzoiSOp7T7nVj/V4rV67UkCFD\noro/O5Nx+fLleuGFF7Ry5cpOr9td3ckpxWZfdibnlClTNGXKFNXW1mr69OnatGmT7VlOpKs5N2/e\nLCm+9ufMmTP1xBNPyOVyyTCHDTq87vFifuTQ1esgfD5fh9Z1OueR6zWGDh0qyewquf766xUKhRzL\nGY11O6u77zVkyBBJ0d2fHc24fv16zZgxQwsXLtSZZ57ZqXWdzinFZl92JucREyZM0KFDh/TVV1/J\n5/PF3f48PueuXbs
kxdf+XLNmjQoKCnTeeedp/vz5uueee7Rw4cKu/X3aOmLSAQcPHjSGDx9ubNu2\nzThw4MBJB3rffffdlsG0jqwbDzm/+eYbY8+ePYZhmANZ48aNM5YuXepYziMee+yxVgO98bY/28sZ\nq/3ZkYwff/yxMWLECOPdd9/t9LrxkDPe/ja3bNliHD582DAMw1izZo0xfPjwDq8bDznjbX8e67bb\nbjPmz5/fpXUNw4GzlQzDMBYvXmycf/75xogRI4xf//rXhmEYxnPPPWc899xzLW3uvfdeY8SIEUZG\nRoaxZs2aE64bbzm3bt1qZGZmGpmZmcbIkSMdz/nZZ58ZPp/P+O53v2sMHDjQSEpKMpqamtpdN95y\nxnJ/nixjYWGhcdZZZxlZWVmWUxfjaV+2lzPe/jZLS0uNkSNHGllZWcb48eONUCh0wnXjLWe87c9j\nHVsc2lv3RFyGwSRGAIDWuBMcAMCC4gAAsKA4AAAsKA4AAAuKA3ASp5xyirKzs1uWjz/+WMFgUAMG\nDFB2drYuuOACPf74407HBGwV8yukgUTTr18/rVu3rtXvtm3bpksvvVSLFi3Svn37lJWVpWuvvVbZ\n2dkOpQTsxZED0E39+vXThRdeqK1btzodBbANxQE4if3797d0Kd1www2W13ft2qVVq1Zp5MiRDqQD\nooNuJeAk+vbta+lWkqTa2lqNHj1affr00SOPPCK/3+9AOiA6KA5AF02YMEGLFi1yOgYQFXQrAQAs\nKA7ASbQ1F77L5YrpPRyAWGPiPQCABUcOAAALigMAwILiAACwoDgAACwoDgAAC4oDAMDi/wF4J9r0\n90sz1AAAAABJRU5ErkJggg==\n" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tps,fps = zip(*roc)\n", "plot(fps,tps)\n", "xlabel(\"FP\")\n", "ylabel(\"TP\")" ] }, { "cell_type": "markdown", "id": "a71f2470", "metadata": {}, "source": [ "# Cross-Validation" ] }, { "cell_type": "code", "execution_count": 80, "id": "fb577710", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 0.788\n", "500 0.778\n", "1000 0.782\n", "1500 0.778\n", "2000 0.79\n", "2500 0.786\n", "3000 0.77\n", "3500 0.78\n", "4000 0.764\n", "4500 0.802\n", "5000 0.778\n", "5500 0.768\n", "6000 0.722\n", "6500 0.804\n", "7000 0.744\n", "7500 0.783783783784\n", "0.776111486486\n" ] } ], "source": [ "featuresets = [(gender_features(n),g) for n,g in nlist]\n", "accuracies = []\n", "for start in range(0,len(featuresets),500):\n", " training_set = featuresets[:start]+featuresets[start+500:]\n", " test_set = featuresets[start:start+500]\n", " classifier = nltk.NaiveBayesClassifier.train(training_set)\n", " a = nltk.classify.accuracy(classifier,test_set)\n", " accuracies.append(a)\n", " print start,a\n", 
"print mean(accuracies)" ] }, { "cell_type": "code", "execution_count": 76, "id": "e4ebede2", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "7944" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(featuresets)" ] }, { "cell_type": "markdown", "id": "23668149", "metadata": {}, "source": [ "# Resampling Estimates" ] }, { "cell_type": "code", "execution_count": 84, "id": "3cb5912e", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.7841\n" ] } ], "source": [ "featuresets = [(gender_features(n),g) for n,g in nlist]\n", "accuracies = []\n", "gold_set = featuresets[500:]\n", "test_set = featuresets[:500]\n", "for trial in range(20):\n", " training_set = [pyrandom.choice(gold_set) for i in range(len(gold_set))]\n", " classifier = nltk.NaiveBayesClassifier.train(training_set)\n", " a = nltk.classify.accuracy(classifier,test_set)\n", " accuracies.append(a)\n", "print mean(accuracies)" ] }, { "cell_type": "code", "execution_count": 85, "id": "6c66417e", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1" ] }, { "cell_type": "code", "execution_count": null, "id": "42bd1cc3", "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }