{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Test 2 Review**\n",
    "\n",
    "- l10 -> end\n",
    "- sentiment analysis\n",
    "  - lexicons\n",
    "  - wordnet\n",
    "  - how to combine words with different sentiment?\n",
    "- classification\n",
    "  - workflow (collect, label, train, test, etc)\n",
    "  - computing feature vectors\n",
    "  - generalization error\n",
    "  - variants of SimpleMachine\n",
    "  - computing cross-validation accuracy (why?)\n",
    "  - bias vs. variance\n",
    "- demographics\n",
    "  - pitfalls of using name lists\n",
    "  - computing odds ratio\n",
    "  - smoothing\n",
    "  - ngrams\n",
    "  - tokenization\n",
    "  - stop words\n",
    "  - regularization (why?)\n",
    "- logistic regression, linear regression\n",
    "  - no need to do calculus, but do you understand the formula?\n",
    "  - apply classification function given data/parameters\n",
    "  - what does the gradient represent?\n",
    "- feature representation\n",
    "  - tf-idf\n",
    "  - csr_matrix: how does this data structure work? (data, column index, row pointer)\n",
    "- recommendation systems\n",
    "  - content-based\n",
    "    - tf-idf\n",
    "    - cosine similarity\n",
    "  - collaborative filtering\n",
    "    - user-user\n",
    "    - item-item\n",
    "    - measures: jaccard, cosine, pearson. Why choose one over another\n",
    "  - How to compute the recommended score for a specific item?\n",
    "- k-means\n",
    "  - compute cluster assignment, means, and error function\n",
    "  - what effect does k have?\n",
    "  - representing word context vectors\n"
   ]
  },
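   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "A minimal illustration of the `csr_matrix` layout mentioned above (toy values; assumes `scipy` is installed):\n",
     "\n",
     "```python\n",
     "import numpy as np\n",
     "from scipy.sparse import csr_matrix\n",
     "\n",
     "# A small term-document-style matrix with mostly zeros.\n",
     "X = np.array([[0, 2, 0],\n",
     "              [1, 0, 3]])\n",
     "S = csr_matrix(X)\n",
     "\n",
     "# CSR stores three parallel arrays:\n",
     "print(S.data)     # [2 1 3]  non-zero values, row by row\n",
     "print(S.indices)  # [1 0 2]  column index of each value\n",
     "print(S.indptr)   # [0 1 3]  row i's values live in data[indptr[i]:indptr[i+1]]\n",
     "```\n"
    ]
   },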
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Project tips\n",
    "\n",
    "So you've collected data, implemented a baseline, and have an F1 of 78%.  \n",
    "**Now what??**\n",
    "\n",
    "- Error analysis\n",
    "- Check for data biases\n",
    "- Over/under fitting\n",
    "- Parameter tuning\n",
    "\n",
    "## Reminder: train/validation/test splits\n",
    "\n",
    "- Training data\n",
    "  - To fit model\n",
    "  - May use cross-validation loop\n",
    "  \n",
    "- Validation data\n",
    "  - To evaluate model while debugging/tuning\n",
    "  \n",
    "- Testing data\n",
    "  - Evaluate once at the end of the project\n",
    "  - Best estimate of accuracy on some new, as-yet-unseen data\n",
    "  - be sure you are evaluating against **true** labels \n",
    "    - e.g., not the output of some other noisy labeling algorithm\n",
    "  \n",
    "  \n",
    "## Error analysis\n",
    "\n",
    "What did you get wrong and why?\n",
    "\n",
    "- Fit model on all training data\n",
    "- Predict on validation data\n",
    "- Collect and categorize errors\n",
    "  - false positives\n",
    "  - false negatives\n",
    "- Sort by:\n",
    "  - Label probability\n",
    "  \n",
    "A useful diagnostic:\n",
    "- Find the top 10 most wrong predictions\n",
    "  - I.e., probability of incorrect label is near 1\n",
    "- For each, print the features that are \"most responsible\" for decision\n",
    "\n",
    "E.g., for logistic regression\n",
    "\n",
    "$$\n",
    "p(y \\mid x) = \\frac{1}{1 + e^{-x^T \\theta}}\n",
    "$$\n",
    "\n",
    "If true label was $-1$, but classifier predicted $+1$, sort features in descending order of $x_j * \\theta_j$\n",
    "\n",
    "<br><br>\n",
    "\n",
    "Error analysis often helps designing new features.\n",
    "- E.g., \"not good\" classified as positive because $x_{\\mathrm{good}} * \\theta_{\\mathrm{good}} >> 0$\n",
    "- Instead, replace feature \"good\" with \"not_good\"\n",
    "  - similarly for other negation words\n",
    "  \n",
    "May also discover incorrectly labeled data\n",
    "- Common in classification tasks in which labels are not easily defined\n",
    "  - E.g., is the sentence \"it's not bad\" positive or negative?\n",
    "\n",
    "\n",
    "For regression, make a scatter plot\n",
    "    - look at outliers\n",
    "    \n",
    "![scatter](scatter.png)\n",
    "\n",
    "## Inter-annotator agreement\n",
    "\n",
    "- How often do two human annotators give the same label to the same example?\n",
    "\n",
    "- E.g., consider two humans labeling 100 documents:\n",
    "\n",
    "<table>\n",
    "<tr><td> </td> <td> </td> <td colspan=2> **Person 1** </td> </tr>\n",
    "<tr><td> </td> <td> </td> <td> Relevant </td> <td> Not Relevant </td> </tr>\n",
    "<tr><td rowspan=2> **Person 2** </td> <td> Relevant </td> <td> 50 </td> <td> 20 </td> </tr>\n",
    "<tr>                              <td> Not Relevant </td> <td> 10 </td> <td> 20 </td> </tr>\n",
    "</table>\n",
    "\n",
    "\n",
    "- Simple **agreement**: fraction of documents with matching labels. $\\frac{70}{100} = 70\\%$\n",
    "\n",
    "<br><br>\n",
    "\n",
    "- But, how much agreement would we expect by chance?\n",
    "\n",
    "- Person 1 says Relevant $60\\%$ of the time.\n",
    "- Person 2 says Relevant $70\\%$ of the time.\n",
    "- Chance that they both say relevant at the same time? $60\\% \\times 70\\% = 42\\%$.\n",
    "\n",
    "\n",
    "- Person 1 says Not Relevant $40\\%$ of the time.\n",
    "- Person 2 says Not Relevant $30\\%$ of the time.\n",
    "- Chance that they both say not relevant at the same time? $40\\% \\times 30\\% = 12\\%$.\n",
    "\n",
    "\n",
    "- Chance that they agree on any document (both say yes or both say no): $42\\% + 12\\% = 54\\%$\n",
    "\n",
    "** Cohen's Kappa ** $\\kappa$\n",
    "\n",
    "- Percent agreement beyond that expected by chance\n",
    "\n",
    "$ \\kappa = \\frac{P(A) - P(E)}{1 - P(E)}$\n",
    "\n",
    "- $P(A)$ = simple agreement proportion\n",
    "- $P(E)$ = agreement proportion expected by chance\n",
    "\n",
    "\n",
    "E.g., $\\kappa = \\frac{.7 - .54}{1 - .54} = .3478$\n",
    "\n",
    "- $k=0$ if no better than chance, $k=1$ if perfect agreement\n",
    "\n",
    "<br><br>\n",
    "\n",
    "## Data biases\n",
    "\n",
    "- How similar is the testing data to the training data?\n",
    "\n",
    "- If you were to deploy the system to run in real-time, would the data it sees be comparable to the testing data?\n",
    "\n",
    "Assumption that test/train data drawn from same distribution often wrong:\n",
    "\n",
    "<u>Label shift:</u>\n",
    "\n",
    "$p_{\\mathrm{train}}(y) \\ne p_{\\mathrm{test}}(y)$\n",
    "  - e.g., positive examples more likely in testing data\n",
    "  - Why does this matter?\n",
    "\n",
    "In logistic regression:\n",
    "\n",
    "$$\n",
    "p(y \\mid x) = \\frac{1}{1 + e^{-(x^T\\theta + b)}}\n",
    "$$\n",
    "- bias term $b$ adjusts predictions to match overall $p(y)$\n",
    "  \n",
    "<br><br>\n",
    "\n",
    "\n",
    "## More bias\n",
    "\n",
    "<u>Confounders</u>\n",
    "- Are there variables that predict the label that are not in the feature representation?\n",
    "    - e.g., some products have higher ratings than others; gender bias; location bias;\n",
    "    - May add additional features to model these attributes\n",
    "    - Or, may need to train separate classifiers for each gender/location/etc.\n",
    "    \n",
    "<br>\n",
    "\n",
    "<u>Temporal Bias</u>\n",
    "- Do testing instances come later, chronologically, than the training instances?\n",
    "    - E.g., we observe that user X likes Superman II, she probably also likes Superman I \n",
    "- Why does this matter?\n",
    "    - inflates estimate of accuracy in production setting\n",
    "\n",
    "<br>\n",
    "\n",
    "<u> Cross-validation splits </u>\n",
    "- E.g., classifying a user/organization's tweets: does the same user appear in both training/testing\n",
    "    - could just be learning a user-specific classifier; won't generalize to new user\n",
    "    - speech recognition\n",
    "- Again, will inflate estimate of accuracy.\n",
    "\n",
    "\n",
    "\n",
    "## Over/under fitting\n",
    "\n",
    "What is training vs validation accuracy?\n",
    "\n",
    "- If training accuracy is low, we are probably underfitting. Consider:\n",
    "  - adding new features\n",
    "  - adding combinations of existing features (e.g., ngrams, conjunctions/disjunctions)\n",
    "  - adding hidden units/layers\n",
    "  - try non-linear classifier\n",
    "    - SVM, decision trees, neural nets\n",
    "  \n",
    "- If training accuracy is very high (>99%), but validation accuracy is low, we are probably overfitting\n",
    "  - Do the opposite of above\n",
    "  - reduce number of features\n",
    "  - Regularization (L2, L1)\n",
    "  - Early stopping for gradient descent\n",
    "  - look at learning curves\n",
    "  - may need more training data\n",
    "  \n",
    "  \n",
    "  \n",
    "## Parameter tuning\n",
    "\n",
    "Many \"hyperparameters\"\n",
    "- regularization strength\n",
    "- number of gradient descent iterations\n",
    "- ...\n",
    "\n",
    "\n",
    "- Be sure to tune these on the validation set, not the test set.\n",
    "\n",
    "<br>\n",
    "\n",
    "- Grid search\n",
    "  - Exhaustively search over all combinations\n",
    "  - Discretize continous variables\n",
    "  ```python \n",
    "  {'C': [.01, .1, 1, 10, 100, 1000], 'n_hidden': [5, 10, 50], 'regularizer': ['l1', 'l2']},\n",
    "  ```\n",
    "  \n",
    "- Random search\n",
    "  - Define a probability distribution over each parameter\n",
    "  - Each iteration samples from that distribution\n",
    "  - Allows you to express prior knowledge over the likely best settings\n",
    "  - E.g.,\n",
    "  ```python\n",
    "  regularizer={'l1': .3, 'l2': .7}```\n",
    "  \n",
    "See more at http://scikit-learn.org/stable/modules/grid_search.html\n",
    "\n",
    "\n",
    "<br><br>\n",
    "- While building model, may want to avoid evaluating on validation set too much\n",
    "- **double cross-validation** can be used instead\n",
    "\n",
    "- Using only the training set:\n",
    "  - split into $k$ (train, test) splits\n",
    "  - for each split ($D_{tr}, D_{te}$)\n",
    "    - split $D_{tr}$ into $m$ splits\n",
    "    - pick hyperparameters that maximize this nested cross-validation accuracy\n",
    "    - train on all of $D_{tr}$ with those parameters\n",
    "    \n",
    "Evaluates how well your hyperparameter selection algorithm does.\n",
    "\n",
    " \n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}