{ "cells": [ { "cell_type": "markdown", "metadata": { "ein.tags": "worksheet-0", "slideshow": { "slide_type": "-" } }, "source": [ "## A Hacker's Guide to Having Your NYTimes Article Comments Noticed\n", "\n", "
\n", "Gavril Bilev\n", "allattention at gmail.com\n", "June 14, 2018\n", "
\n", "\n", "### The problem: predicting the number of upvotes ('recommendations') that a comment posted to an article on the [New York Times](www.nytimes.com) will receive. \n", "\n", "![Imgur](https://i.imgur.com/Vm9tAsj.png?1)\n", "\n", "\n", "### **TL&DR**\n", "If you want upvotes:\n", "* **Time is of the essence, the sooner you comment, the better!**\n", "* **Commenting on comments doesn't bring in the upvotes**\n", "* **There is an optimal article length for getting upvoted comments - at about 800 words.**\n", "* **Print page - most of the important big-headline articles appear in the first 30 pages of the paper, where they attract more readers and more comments.**\n", "* **Getting the coveted 'NYTimes Pick' helps a lot. So does being a trusted user or a NYTimes reporter.**\n", "* **Hot button issues bring about reactions, but not always upvotes** \n", "* **Effort pays off:**\n", " * say more\n", " * use a rich vocabulary \n", " * refer to people, places and organizations, but sparingly \n", " * spell-check\n", "* **Don't be too negative or too positive, be slightly positive.**\n", "\n", "\n", "\n", "\n", "\n", "Readers of the [Gray Lady](www.nytimes.com) are able to post comments on articles and react to the comments of others by either upvoting ('Recommend' button) or replying. For a comment-author, recommendations are desirable because they bring about more visibility - recommended comments float up to the top where they are seen by more readers and can receive even more upvotes. Once a comment 'snowballs' it can be seen by potentially millions of readers. Presumably we write comments because we want others to see them.\n", "\n", "I got curious about what makes some comments rise to the top while most others are completely ignored. [Aashita Kesarwani](https://www.kaggle.com/aashita https://www.kaggle.com/aashita) posted a cool dataset on [Kaggle](www.kaggle.com) of more than 2 million comments geared toward addressing this and similar questions. Be sure to check out the dataset [here](https://www.kaggle.com/aashita/nyt-comments) as well as her wonderful exploratory data analysis of it [here](https://www.kaggle.com/aashita/exploratory-data-analysis-of-comments-on-nyt/data). The data come from two time periods: Jan-May 2017 and Jan-April 2018 and contain features on both the comments (with the full raw text body of each comment) and the more than 9 thousand NYTimes articles the comments were responding to.\n", "\n", "I like this as a prediction task because it is challenging. Presumably what makes people upvote comments or posts, not just at the Times but also in other social media settings (Reddit, Twitter, FB, etc) is the _meaning_ of the comment - they find it helpful, or funny or simply agree with it and want others to see it too. Since we can't easily quantify 'meaning,' we'll have to rely on some feature engineering in order to get any traction with our predictions. The task is also challenging because we don't know much about the readers who are responding to comments and do not have the full text of the articles. Additionally, some comments are pretty short, so they are not easily amenable to a more complicated language model. After extracting what we can from the features already present in the dataset, we will have to wrangle useful information out of noisy and messy raw text - the body of the comments.\n", "\n", "The main point to consider here is that we need to set our expectations low, because this is no handwritten digit recognition exercise with 99%+ accuracy. 
Even a human scorer would probably not do well at predicting upvotes. Consider these two comments, both made in response to the same article:

> A. '_Everyone should have walked out. Spicer could have talked to himself._'

> B. '_If people are "alarmed" and "appalled" that Trump did this, followed by his cronies justification of it, then they haven't been paying attention. I'm just surprised he took this long to do it. The story here, the one that people should actually be alarmed and appalled by is that the rest of the press stayed. Brietbart, One America, not so surprising but there is absolutely no excuse for rest to have shown so little respect for the one of the most important and defining features of America... The First Amendment, a free press. To echo Mr Kahn's question to President Trump...have you even read the constitution? Here, let me give you a head start... Amendment I: Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances._'

Without any contextual knowledge, I would have guessed that the two comments would be roughly on par, or that the first would receive fewer upvotes: they are broadly similar in their reaction to the article, and the second is longer and contains more information. The reality is quite different. Comment A received over 10,000 upvotes and is the most upvoted comment in the entire dataset, while comment B got... zero upvotes.

If you are interested in the takeaways (including one very plausible explanation for the gap between these two comments - it is not simply brevity!), skip right to the last section. In the sections that follow we will go through the traditional steps of classification, with an emphasis on engineering features from the numerical and text data.

The original target variable here is a count of recommendations. For the artificial purpose of using classification algorithms and metrics, I will discretize it into four meaningful and roughly equal-sized categories. Since the largest of them covers about 29% of the comments, that is our prediction baseline: if we always picked that category, our classification would be accurate 29% of the time.

Categories:
1. None (~29%)
2. One or two (~29%)
3. Three to eight (~23%)
4. More than eight, up to tens of thousands (~18%)

Naturally, you could also choose to keep the original count target (though cropping the right tail of the distribution might be a good idea, given how skewed it is) and rely on regression and metrics such as mean squared error. In that case, a negative binomial model would probably be most appropriate for the first simple fit, due to the count nature of the data and the fact that zero has meaning.
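To make the binning concrete, here is a minimal sketch of the discretization step, assuming the combined comments dataframe is called `df` (a placeholder name) and the raw upvote count sits in the dataset's `recommendations` column:

```python
import pandas as pd

# `df` is assumed to be the combined comments dataframe (built below);
# the bin edges reproduce the four categories listed above.
bins = [-1, 0, 2, 8, df['recommendations'].max()]
labels = ['none', 'one_or_two', 'three_to_eight', 'more_than_eight']
df['recClass'] = pd.cut(df['recommendations'], bins=bins, labels=labels)

# The share of the largest class is the accuracy baseline (~29% here).
print(df['recClass'].value_counts(normalize=True))
```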
### Table of contents:
1. [Data Loading & Preparation](#data)
    1. [Load data files & combine](#dataloading)
    2. [Discretize **recommendations** (target)](#cut)
    3. [Turn categorical to dummies](#todummies)
2. [Initial prediction](#initpredict)
    1. [Simple Multinomial Logistic](#logit)
    2. [A bag of classifiers](#classifiers)
3. [Feature Engineering](#featureengineering)
    1. [Features based on original variables](#originalvars)
        1. [Reply upvotes](#reply)
        2. [Byline](#byline)
        3. [Time](#time)
    2. [Features based on raw text data](#textdata)
        1. [A NYTimes vocabulary & IDF profile](#vocab)
        2. [Basic stats, sentiment analysis, spelling errors & part of speech](#kitchensink)
        3. [Text token features](#texttokens)
4. [Training & Re-evaluation](#training)
    1. [Simple Multinomial Logistic](#finallogit)
    2. [A bag of classifiers](#finalclassifiers)
5. [Takeaways](#tldr)

### Data loading <a name="data"></a>

First, let's prepare the data. We'll load almost all features present in the .csv files, in their original shapes, into a large dataframe, drop features that cannot be useful (too many missing cases) and convert the categorical features to dummies. This will enable us to make a first-cut prediction without feature engineering - almost as if we knew nothing about the dataset and were predicting blind. 'Almost' because we will deliberately exclude one feature that is surely related to the target: the number of replies a comment has received (**replyCount**). We would expect this to be an effect (a consequence) rather than a cause of the number of upvotes - upvoted comments are highly visible and therefore much more likely to receive replies.

We will initially ignore the raw text features (the body of the comments and the keywords of the article) but keep them for later. Finally, to speed up our work, we will take a random sample of about 10% of the cases (sometimes called a 'dev set').

Load the libraries and the already-downloaded data first. There are two separate sets of files, the first containing comments and comment features and the second containing features of the articles. We read in all of the data, merge them into one Pandas DataFrame where each row is a comment, and eliminate all duplicates; a sketch of this step follows the imports below.

```python
import glob
import pandas as pd
import numpy as np
import calendar
import warnings
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# set a few options:
sns.set(style='whitegrid')
warnings.filterwarnings("ignore")
pd.options.display.max_colwidth = 100
%matplotlib inline
# Kernel to predict upvotes ('recommendations' as target); will cut into
# intervals to turn this into a classification exercise
```

#### 1. Load data files & combine <a name="dataloading"></a>
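A minimal sketch of the load-and-combine step, assuming the monthly files follow the Kaggle dataset's `Comments*.csv` / `Articles*.csv` naming and sit in the working directory, and that `commentID` and `articleID` are the dedup and join keys (names taken from the dataset, but double-check against your download):

```python
import glob
import pandas as pd

# Read every monthly comments file and every monthly articles file,
# stacking each group into one dataframe.
comments = pd.concat((pd.read_csv(f) for f in sorted(glob.glob('Comments*.csv'))),
                     ignore_index=True)
articles = pd.concat((pd.read_csv(f) for f in sorted(glob.glob('Articles*.csv'))),
                     ignore_index=True)

# One row per comment: drop duplicate comments, then attach article
# features via articleID (columns present in both frames get pandas'
# default _x/_y suffixes; in practice you would keep one copy).
comments = comments.drop_duplicates(subset='commentID')
df = comments.merge(articles, on='articleID', how='left')

# A ~10% random sample ('dev set') to keep iteration fast.
dev = df.sample(frac=0.1, random_state=42)
```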
Feature importances (Weight / Gain / Cover), sorted by Gain:

| Feature | Weight | Gain | Cover |
|---|---|---|---|
| Keyword:"" | 497.0 | 682.196571 | 151838.109084 |
| depth | 3042.0 | 412.020744 | 136746.635392 |
| editorsSelection_1 | 716.0 | 341.794156 | 316714.079275 |
| author_By THE LEARNING NETWORK | 430.0 | 112.460529 | 224883.973088 |
| trusted | 1154.0 | 66.849172 | 84256.032967 |
| typeOfMaterial_News | 2553.0 | 57.146109 | 38005.193426 |
| author_By DEB AMLEN | 626.0 | 56.199554 | 185494.032612 |
| author_By GAIL COLLINS | 500.0 | 44.423127 | 112118.262682 |
| Keyword:"syria" | 661.0 | 34.882420 | 134829.897287 |
| sectionName_Family | 428.0 | 32.341839 | 212454.863198 |
| author_By DAVID BROOKS | 683.0 | 31.381635 | 224897.240937 |
| approveHour_3AM | 603.0 | 27.642708 | 108925.055124 |
| approveDay_Sunday | 1440.0 | 25.434009 | 39614.574964 |
| approveHour_6AM | 302.0 | 25.242134 | 234501.792347 |
| hoursAfterArticle | 28067.0 | 24.095876 | 29920.784922 |
| author_By ROSS DOUTHAT | 581.0 | 24.012026 | 212154.869081 |
| Keyword:"global warming" | 633.0 | 23.019341 | 177123.471271 |
| Keyword:"school shootings and armed attacks" | 505.0 | 22.711046 | 119801.236389 |
| Keyword:"gun control" | 615.0 | 21.969879 | 80282.095568 |
| approveHour_5AM | 434.0 | 21.439328 | 155686.510859 |
| approveHour_4AM | 474.0 | 21.305978 | 152005.843738 |
| approveDay_Saturday | 2021.0 | 19.984632 | 25385.068124 |
| sectionName_Sunday Review | 967.0 | 19.306374 | 59144.668091 |
| Keyword:"russia" | 896.0 | 18.883552 | 24850.877874 |
| sectionName_Live | 273.0 | 18.882208 | 185213.265113 |
| approveHour_3PM | 1613.0 | 18.551806 | 50622.438727 |
| sectionName_Politics | 1636.0 | 17.781703 | 36545.528892 |
| Keyword:"united states international relations" | 1127.0 | 17.766371 | 27295.655350 |
| Keyword:"comey, james b" | 879.0 | 17.759434 | 142679.586790 |
| sectionName_Baseball | 288.0 | 17.516210 | 363076.870441 |
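The Weight / Gain / Cover columns in the listing above (and in the one below, which covers the engineered comment features) match XGBoost's built-in importance types. Assuming a fitted `xgboost.XGBClassifier` called `model` (a hypothetical name; the source never pins down the exact model behind these tables), a listing like this could be pulled out along these lines:

```python
import pandas as pd

# `model` is assumed to be an already-fitted xgboost.XGBClassifier.
booster = model.get_booster()

# get_score() returns {feature_name: score} for one importance type at a time.
imp = pd.DataFrame({
    'Weight': pd.Series(booster.get_score(importance_type='weight')),
    'Gain':   pd.Series(booster.get_score(importance_type='gain')),
    'Cover':  pd.Series(booster.get_score(importance_type='cover')),
})

# Sort by gain, as in the listing above.
print(imp.sort_values('Gain', ascending=False).head(30))
```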
\n", " | Weight | \n", "Gain | \n", "Cover | \n", "
---|---|---|---|
hoursAfterArticle | \n", "28067.0 | \n", "24.095876 | \n", "29920.784922 | \n", "
articleWordCount | \n", "20748.0 | \n", "11.753552 | \n", "27152.405073 | \n", "
printPage | \n", "12070.0 | \n", "15.045092 | \n", "30600.598564 | \n", "
reply | \n", "9793.0 | \n", "9.457217 | \n", "20839.824937 | \n", "
commentPolarity | \n", "8195.0 | \n", "3.459197 | \n", "11807.220214 | \n", "
meanWordLength | \n", "8133.0 | \n", "3.358576 | \n", "13323.139771 | \n", "
NNP_Percent | \n", "8077.0 | \n", "4.153441 | \n", "10649.774066 | \n", "
NN_Percent | \n", "7708.0 | \n", "3.348625 | \n", "14371.441559 | \n", "
DT_Percent | \n", "7424.0 | \n", "3.384502 | \n", "9474.644469 | \n", "
RB_Percent | \n", "7282.0 | \n", "3.402155 | \n", "11412.144736 | \n", "
commentObjectivity | \n", "7256.0 | \n", "3.363130 | \n", "11736.138772 | \n", "
IN_Percent | \n", "7185.0 | \n", "3.382322 | \n", "13826.117528 | \n", "
Idf3Percent | \n", "7053.0 | \n", "3.340505 | \n", "9648.121062 | \n", "
JJ_Percent | \n", "7042.0 | \n", "3.338519 | \n", "10733.782276 | \n", "
MeanIdf | \n", "6835.0 | \n", "3.544444 | \n", "9056.971679 | \n", "
Idf4Percent | \n", "6774.0 | \n", "3.270182 | \n", "7689.664120 | \n", "
maxSentLength | \n", "6728.0 | \n", "3.975159 | \n", "16023.769561 | \n", "
PRP_Percent | \n", "6636.0 | \n", "4.143711 | \n", "25704.084771 | \n", "
NNS_Percent | \n", "6531.0 | \n", "3.475850 | \n", "11027.535167 | \n", "
CC_Percent | \n", "6466.0 | \n", "3.423282 | \n", "8732.754822 | \n", "
Idf5Percent | \n", "6464.0 | \n", "3.300831 | \n", "9638.631578 | \n", "
recognizedWordCount | \n", "6333.0 | \n", "8.573698 | \n", "13550.883542 | \n", "
VB_Percent | \n", "6332.0 | \n", "3.639284 | \n", "11144.892127 | \n", "
Idf2Percent | \n", "6153.0 | \n", "3.292262 | \n", "12727.000114 | \n", "
VBZ_Percent | \n", "6100.0 | \n", "3.444711 | \n", "14305.933591 | \n", "
VBP_Percent | \n", "5917.0 | \n", "3.415990 | \n", "14272.319094 | \n", "
TO_Percent | \n", "5890.0 | \n", "3.372315 | \n", "13012.840049 | \n", "
VBN_Percent | \n", "5560.0 | \n", "3.330652 | \n", "11091.542413 | \n", "
VBG_Percent | \n", "5473.0 | \n", "3.428436 | \n", "10130.894194 | \n", "
MinIdf | \n", "5468.0 | \n", "3.210395 | \n", "16043.595959 | \n", "
Idf6Percent | \n", "5438.0 | \n", "3.421114 | \n", "6735.352357 | \n", "
MaxIdf | \n", "5348.0 | \n", "3.505037 | \n", "10320.106724 | \n", "
VBD_Percent | \n", "5342.0 | \n", "3.930106 | \n", "10834.886340 | \n", "
MD_Percent | \n", "5189.0 | \n", "4.563003 | \n", "25085.857509 | \n", "
PRP$_Percent | \n", "5157.0 | \n", "6.184868 | \n", "20498.521030 | \n", "
Idf7Percent | \n", "5072.0 | \n", "3.621082 | \n", "13139.742682 | \n", "
maxWordLength | \n", "4701.0 | \n", "3.500347 | \n", "20085.722387 | \n", "
commentSpellErrorsPercent | \n", "4466.0 | \n", "4.373889 | \n", "23221.690001 | \n", "
CD_Percent | \n", "4242.0 | \n", "3.661409 | \n", "10338.213885 | \n", "
Idf1Percent | \n", "4220.0 | \n", "3.425274 | \n", "8179.586358 | \n", "