{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Test 2 Review**\n",
"\n",
"- l10 -> end\n",
"- sentiment analysis\n",
" - lexicons\n",
" - wordnet\n",
" - how to combine words with different sentiment?\n",
"- classification\n",
" - workflow (collect, label, train, test, etc)\n",
" - computing feature vectors\n",
" - generalization error\n",
" - variants of SimpleMachine\n",
" - computing cross-validation accuracy (why?)\n",
" - bias vs. variance\n",
"- demographics\n",
" - pitfalls of using name lists\n",
" - computing odds ratio\n",
" - smoothing\n",
" - ngrams\n",
" - tokenization\n",
" - stop words\n",
" - regularization (why?)\n",
"- logistic regression, linear regression\n",
" - no need to do calculus, but do you understand the formula?\n",
" - apply classification function given data/parameters\n",
" - what does the gradient represent?\n",
"- feature representation\n",
" - tf-idf\n",
" - csr_matrix: how does this data structure work? (data, column index, row pointer)\n",
"- recommendation systems\n",
" - content-based\n",
" - tf-idf\n",
" - cosine similarity\n",
" - collaborative filtering\n",
" - user-user\n",
" - item-item\n",
" - measures: jaccard, cosine, pearson. Why choose one over another\n",
" - How to compute the recommended score for a specific item?\n",
"- k-means\n",
" - compute cluster assignment, means, and error function\n",
" - what effect does k have?\n",
" - representing word context vectors\n"
]
},
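{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Aside: how `csr_matrix` stores data.** A minimal sketch of the three arrays behind `scipy.sparse.csr_matrix` (`data`, `indices` = column indices, `indptr` = row pointers); the matrix values here are made up for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.sparse import csr_matrix\n",
"\n",
"# A small, mostly-zero matrix.\n",
"dense = np.array([[0, 2, 0],\n",
"                  [3, 0, 1],\n",
"                  [0, 0, 4]])\n",
"X = csr_matrix(dense)\n",
"\n",
"print(X.data)     # [2 3 1 4]  nonzero values, stored row by row\n",
"print(X.indices)  # [1 0 2 2]  column index of each stored value\n",
"print(X.indptr)   # [0 1 3 4]  row i's values are data[indptr[i]:indptr[i+1]]\n",
"```\n"
]
},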
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Project tips\n",
"\n",
"So you've collected data, implemented a baseline, and have an F1 of 78%. \n",
"**Now what??**\n",
"\n",
"- Error analysis\n",
"- Check for data biases\n",
"- Over/under fitting\n",
"- Parameter tuning\n",
"\n",
"## Reminder: train/validation/test splits\n",
"\n",
"- Training data\n",
" - To fit model\n",
" - May use cross-validation loop\n",
" \n",
"- Validation data\n",
" - To evaluate model while debugging/tuning\n",
" \n",
"- Testing data\n",
" - Evaluate once at the end of the project\n",
" - Best estimate of accuracy on some new, as-yet-unseen data\n",
" - be sure you are evaluating against **true** labels \n",
" - e.g., not the output of some other noisy labeling algorithm\n",
" \n",
" \n",
"## Error analysis\n",
"\n",
"What did you get wrong and why?\n",
"\n",
"- Fit model on all training data\n",
"- Predict on validation data\n",
"- Collect and categorize errors\n",
" - false positives\n",
" - false negatives\n",
"- Sort by:\n",
" - Label probability\n",
" \n",
"A useful diagnostic:\n",
"- Find the top 10 most wrong predictions\n",
" - I.e., probability of incorrect label is near 1\n",
"- For each, print the features that are \"most responsible\" for decision\n",
"\n",
"E.g., for logistic regression\n",
"\n",
"$$\n",
"p(y \\mid x) = \\frac{1}{1 + e^{-x^T \\theta}}\n",
"$$\n",
"\n",
"If true label was $-1$, but classifier predicted $+1$, sort features in descending order of $x_j * \\theta_j$\n",
"\n",
"
\n",
"\n",
"Error analysis often helps designing new features.\n",
"- E.g., \"not good\" classified as positive because $x_{\\mathrm{good}} * \\theta_{\\mathrm{good}} >> 0$\n",
"- Instead, replace feature \"good\" with \"not_good\"\n",
" - similarly for other negation words\n",
" \n",
"May also discover incorrectly labeled data\n",
"- Common in classification tasks in which labels are not easily defined\n",
" - E.g., is the sentence \"it's not bad\" positive or negative?\n",
"\n",
"\n",
"For regression, make a scatter plot\n",
" - look at outliers\n",
" \n",
"\n",
"\n",
"## Inter-annotator agreement\n",
"\n",
"- How often do two human annotators give the same label to the same example?\n",
"\n",
"- E.g., consider two humans labeling 100 documents:\n",
"\n",
"
| **Person 1** | |||
| Relevant | Not Relevant | ||
| **Person 2** | Relevant | 50 | 20 |
| Not Relevant | 10 | 20 | |
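"\n",
"Raw agreement is the diagonal over the total; a quick check with the counts from the table:\n",
"\n",
"```python\n",
"# Counts from the table above: rows = Person 2, columns = Person 1.\n",
"counts = [[50, 20],   # Person 2: Relevant\n",
"          [10, 20]]   # Person 2: Not Relevant\n",
"\n",
"total = sum(sum(row) for row in counts)\n",
"agree = counts[0][0] + counts[1][1]  # examples where the two annotators match\n",
"print(agree / total)                 # 0.7 -> 70% raw agreement\n",
"```\n"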