{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analyze articles on Hacker News using NLP!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook demonstrates the [`news-analyze`](https://github.com/jayantj/news-analyze) library, which uses topic modeling and clustering to extract topics and themes from a corpus of news articles. The key features are - \n", "\n", "1. Extracting high-quality, human-interpretable topics from a collection of articles\n", "2. Visualizing trends in topics over time\n", "3. Automatically ranking topics by \"interesting-ness\"\n", "4. Clustering topics into groups of related topics\n", "5. Auto-tagging new, unseen articles with topics\n", "\n", "The goal of the library is to provide a way to qualitatively explore topics and trends in a news corpus to gain insight into it.\n", "\n", "The notebook demonstrates these features using a model trained on a year's worth of Hacker News data. The trained model is included in the repo and is directly usable. The library doesn't yet provide a documented API for training new models on your own data; this is a work in progress.\n", "\n", "This library was one of the things I worked on while I was part of the [Recurse Center](https://www.recurse.com/), a programmers' retreat for people from a variety of backgrounds and experience levels looking to get better at programming. You should check them out!\n", "\n", "A significant motivation behind this initial alpha release and demo is to get feedback about the following - \n", "\n", "1. Specific applications and areas where this could be useful\n", "2. Other datasets on which the library could be used\n", "3. New features that could be helpful\n", "4. Problems with existing features\n", "5. Improvements to the API and usage docs\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Data and preprocessing\n", "\n", "The data used for training the model is a collection of posts on [Hacker News](http://news.ycombinator.com/), available [here](https://www.kaggle.com/hacker-news/hacker-news-posts/data). The raw data contains 293119 posts from September 2015 to September 2016. A post here refers to an article submitted to Hacker News, not to the comments on it. The article text is not included; only the URL is, along with some metadata (time of post, number of points and comments received).\n", "\n", "First, articles that received fewer than 50 points were filtered out in order to focus on links that received a fair amount of attention on HN, leaving 20148 posts. Next, to extract the full text of these articles, the content from the URLs was scraped and parsed using [newspaper](https://github.com/codelucas/newspaper), a Python library for extracting the full text of news articles from HTML. Content from some URLs could not be extracted correctly in this process (mostly 404s), resulting in 15016 parsed articles.\n", "\n", "Topic models were trained on these articles using [Gensim](http://github.com/RaRe-Technologies/gensim), a Python library that has both native implementations of various topic modeling algorithms and wrappers for external topic modeling frameworks. The final model in the repository was trained using a wrapper to [Mallet](https://github.com/mimno/Mallet). [spaCy](https://github.com/explosion/spaCy) was used for tokenization and lemmatization. Tokens that were extremely frequent or extremely rare were filtered out. For more specific details, please have a look at [this file](https://github.com/jayantj/news-analyze/blob/master/models/utils.py)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Demonstration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The insights and use cases presented in this section are based on the dataset described above. 
I don't yet know how well these techniques generalize to new datasets, so your mileage may vary. Also, the repository does not contain the original text scraped from the HN posts, since it comes from a variety of websites, some of which might have terms and conditions that do not permit their data to be publicly released. As a result, the notebook might not be runnable on your local machine. I'm currently looking into how to work around this issue." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import required packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/jayant/Projects/recurse/hn_analyze\n" ] } ], "source": [ "%cd .." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pickle" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from plotly.offline import init_notebook_mode, iplot\n", "init_notebook_mode(connected=True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib\n", "matplotlib.rcParams['figure.figsize'] = [12, 8]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load trained model" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Use a context manager so the file handle is closed after loading\n", "with open('data/models/hn_ldam_mallet_100t_5a', 'rb') as f:\n", "    model = pickle.load(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.1 Print all topics, ordered by \"interesting-ness\" scores" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the list of all topics that were extracted 
from the corpus, printed in human-readable form. Note that in the underlying model, each topic is a vector of scores over all words in the corpus. Here, only the top 10 words for each topic are displayed, for ease of reading and in order to get a sense of what each topic is about.\n", "\n", "The topics are ordered in decreasing order of \"interesting-ness\", which is described in a [later section](#topic-interestingness) in the notebook." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #99 Topic #29 Topic #38 Topic #43 Topic #56 Topic #70 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " network earth container quantum bitcoin car \n", " model space docker theory transaction vehicle \n", " learning star run physics blockchain drive \n", " neural planet service particle network tesla \n", " learn orbit image universe wright road \n", " machine moon application physicist block bike \n", " deep year deploy wave ethereum driver \n", " training mars cluster field trust model \n", " layer galaxy machine hole currency electric \n", " image telescope host state exchange wheel \n", "\n", "\n", " Topic #71 Topic #11 Topic #94 Topic #2 Topic #12 Topic #13 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " animal flight cell key food stack \n", " human fly gene certificate eat instruction \n", " specie air dna security fat register \n", " dog space human encryption sugar address \n", " bird aircraft genome encrypt diet code \n", " cat launch genetic password meat call \n", " year plane protein secure fruit memory \n", " tree drone mouse secret egg byte \n", " live pilot cancer public farmer program \n", " find rocket bacteria tls grow function \n", "\n", "\n", " Topic #75 Topic #67 Topic #30 Topic #31 Topic #92 Topic #62 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " 
component al stock git police drug \n", " react state tax github crime patient \n", " function attack market repository officer health \n", " var group company commit drug medical \n", " element government fund branch prison disease \n", " return country share change criminal doctor \n", " state islamic investor code arrest cancer \n", " render terrorist financial merge year treatment \n", " dom saudi bank request call death \n", " import iran price project law year \n", "\n", "\n", " Topic #44 Topic #52 Topic #97 Topic #77 Topic #16 Topic #39 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " government pi phone memory node company \n", " agency board network cpu system startup \n", " security usb internet core read founder \n", " nsa chip radio intel state investor \n", " surveillance power signal performance write tech \n", " fbi hardware mobile cache cluster valley \n", " snowden card device processor distribute start \n", " intelligence km channel chip latency silicon \n", " document device service op message business \n", " information controller fi gpu operation money \n", "\n", "\n", " Topic #93 Topic #86 Topic #27 Topic #50 Topic #91 Topic #10 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " energy service file security code light \n", " power datum command attack compiler laser \n", " solar aws run vulnerability rust electron \n", " cost cloud install exploit compile field \n", " battery instance script hacker function energy \n", " year server build password optimization high \n", " gas application directory attacker library fusion \n", " plant run default hack call charge \n", " fuel storage package find memory produce \n", " oil system set system performance ray \n", "\n", "\n", " Topic #48 Topic #36 Topic #21 Topic #84 Topic #87 Topic #0 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " database ship sleep device city upgrade \n", " query sea day phone san fix 
\n", " datum water hour camera area close \n", " table year exercise battery street add \n", " index ocean mental laptop housing al \n", " row find people vr york doc \n", " column island health screen francisco rebuild \n", " sql river depression smartphone home david \n", " data land feel home building update \n", " select site stress hardware people michael \n", "\n", "\n", " Topic #20 Topic #22 Topic #25 Topic #80 Topic #85 Topic #33 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " int number uber brain facebook music \n", " return point amazon study ad video \n", " function matrix driver cognitive twitter sound \n", " const algorithm service memory user audio \n", " void function trip neuron post play \n", " struct vector airbnb participant site song \n", " null prime ride effect people stream \n", " type graph lyft al news record \n", " template line taxi ability content note \n", " char curve city intelligence medium listen \n", "\n", "\n", " Topic #58 Topic #34 Topic #63 Topic #73 Topic #28 Topic #57 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " thread src image law war student \n", " process llvm color court military school \n", " lock tool pixel case weapon college \n", " call gnu map legal soviet learn \n", " event clang frame rule nuclear university \n", " queue module draw state russian teach \n", " task include red lawyer force education \n", " run patch light government missile class \n", " wait solution render judge bomb high \n", " function problem blue order american teacher \n", "\n", "\n", " Topic #74 Topic #1 Topic #41 Topic #65 Topic #14 Topic #96 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " type team people game windows license \n", " function people economic player linux software \n", " haskell job money play system copyright \n", " language company dao move kernel patent \n", " monad interview contract win microsoft free \n", " return hire 
social world os oracle \n", " define engineer income chess run include \n", " list employee rich computer boot source \n", " lambda manager wealth level user term \n", " promise day inequality sport driver copy \n", "\n", "\n", " Topic #46 Topic #40 Topic #26 Topic #7 Topic #49 Topic #82 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " percent function server water app web \n", " year string network air android page \n", " job return connection temperature google browser \n", " worker variable packet flow user site \n", " rate code ip surface apps content \n", " income match client heat swift website \n", " high expression tcp bridge mobile user \n", " low list address material ios javascript \n", " increase def protocol oxygen add html \n", " growth call send chemical developer chrome \n", "\n", "\n", " Topic #23 Topic #79 Topic #15 Topic #8 Topic #19 Topic #24 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " support bank request python film book \n", " release money server library show write \n", " version card client code art century \n", " feature account http language movie world \n", " change credit response java artist history \n", " add pay application ruby netflix great \n", " fix payment url javascript star man \n", " update cash api framework world modern \n", " include transaction service read le year \n", " issue number header write disney life \n", "\n", "\n", " Topic #64 Topic #6 Topic #60 Topic #5 Topic #32 Topic #45 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " type problem email word project datum \n", " object machine message book open memory \n", " class theory send language source byte \n", " method number tor text developer file \n", " string mathematical address read build bit \n", " function computer account english community size \n", " code mathematic mail document tool hash \n", " public proof domain character development key \n", " call 
mathematician contact letter team set \n", " return question user write software buffer \n", "\n", "\n", " Topic #42 Topic #54 Topic #81 Topic #61 Topic #66 Topic #76 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " company design language text google china \n", " year build program mode computer country \n", " employee wall code window technology chinese \n", " million building programming screen machine world \n", " business part write line human united \n", " executive small programmer editor system india \n", " billion room software button world states \n", " accord shape system click ai government \n", " firm material computer display year north \n", " ceo create design key robot american \n", "\n", "\n", " Topic #9 Topic #17 Topic #90 Topic #3 Topic #53 Topic #47 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " apple datum api group child product \n", " font result bot public woman customer \n", " iphone number direct political age business \n", " mac average slack state man service \n", " design model total member study revenue \n", " phone analysis sun policy group share \n", " size sample avg campaign parent growth \n", " device show sat president male platform \n", " software distribution sms party adult result \n", " ios measure anonymous government sex software \n", "\n", "\n", " Topic #37 Topic #69 Topic #35 Topic #98 Topic #83 Topic #78 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " life story image datum yahoo package \n", " family continue uk user restaurant full \n", " year read london information coffee debian \n", " friend advertisement caption data bar text \n", " day main mr privacy house subject \n", " people times copyright access food link \n", " live newsletter british service drink send \n", " home sign japan internet mayer mbox \n", " man york year company chef mozilla \n", " house subscribe people provide club date \n", "\n", "\n", " Topic #4 
Topic #68 Topic #18 Topic #89 Topic #51 Topic #95 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " people day thing university price test \n", " thing drive people research sell code \n", " feel ms lot science company error \n", " fact bob start paper market bug \n", " human year year researcher buy problem \n", " point august big study business fix \n", " world store problem scientist product check \n", " question july back publish pay fail \n", " person hour happen journal cost issue \n", " bad april talk scientific sale run \n", "\n", "\n", " Topic #55 Topic #88 Topic #72 Topic #59 \n", " ---------- ---------- ---------- ---------- \n", " day country thing system \n", " back european find problem \n", " hand europe post change \n", " run de give require \n", " head french write design \n", " sit france start approach \n", " begin germany read large \n", " walk german point level \n", " hour world article process \n", " man paris ne provide \n", "\n", "\n" ] } ], "source": [ "model.print_topics_table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A topic here is NOT exactly the same as the commonly used interpretation of the word `topic`; it is simply a list of \"related words\". It is intended to represent a broad theme of interest, and doesn't carry a specific label." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Find articles for a specific topic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This prints the articles (along with a snippet of their content) that contain a specific topic, in decreasing order of the article's topic score, which is a measure of how central the topic is to the article. Only the top 5 articles are shown here for ease of reading." 
] }, { "cell_type": "code", "execution_count": 56, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #99 \n", " ---------- \n", " network \n", " model \n", " learning \n", " neural \n", " learn \n", " machine \n", " deep \n", " training \n", " layer \n", " image \n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11052034 - http://www.wildml.com/deep-learning-glossary/\n", "Deep Learning Glossary\n", "Topic score: 0.83\n", "\n", "Article text:\n", " This glossary is work in progress and I am planning to continuously update it. If you find a mistake or think an important term is missing, please let me know in the comments or via email.\n", "\n", "Deep Learning terminology can be quite overwhelming to newcomers. This glossary tries to define commonly used terms and link to original references and additional resources to help readers dive deeper into a specific topic.\n", "\n", "The boundary between what is Deep Learning vs. “general” Machine Learning termino (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #10384279 - http://blog.christianperone.com/2015/08/convolutional-neural-networks-and-feature-extraction-with-python/\n", "Convolutional neural networks and feature extraction with Python\n", "Topic score: 0.80\n", "\n", "Article text:\n", " Convolutional neural networks (or ConvNets) are biologically-inspired variants of MLPs, they have different kinds of layers and each different layer works different than the usual MLP layers. If you are interested in learning more about ConvNets, a good course is the CS231n – Convolutional Neural Newtorks for Visual Recognition. The architecture of the CNNs are shown in the images below:\n", "\n", "As you can see, the ConvNets works with 3D volumes and transformations of these 3D volumes. 
I won’t repe (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11840175 - https://github.com/rasbt/python-machine-learning-book/blob/master/faq/difference-deep-and-normal-learning.md\n", "What is the difference between deep learning and usual machine learning?\n", "Topic score: 0.74\n", "\n", "Article text:\n", " What is the difference between deep learning and usual machine learning?\n", "\n", "That's an interesting question, and I try to answer this in a very general way.\n", "\n", "In essence, deep learning offers a set of techniques and algorithms that help us to parameterize deep neural network structures -- artificial neural networks with many hidden layers and parameters. One of the key ideas behind deep learning is to extract high level features from the given dataset. Thereby, deep learning aims to overcome the cha (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #12196388 - https://github.com/karandesai-96/digit-classifier\n", "MNIST Handwritten Digit Classifier beginner neural network project\n", "Topic score: 0.73\n", "\n", "Article text:\n", " MNIST Handwritten Digit Classifier\n", "\n", "An implementation of multilayer neural network using numpy library. The implementation is a modified version of Michael Nielsen's implementation in Neural Networks and Deep Learning book.\n", "\n", "Brief Background:\n", "\n", "If you are familiar with basics of Neural Networks, feel free to skip this section. For total beginners who landed up here before reading anything about Neural Networks:\n", "\n", "Neural networks are made up of building blocks known as Sigmoid Neurons . 
These are n (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11701665 - http://blog.keras.io/building-autoencoders-in-keras.html\n", "Building autoencoders in Keras\n", "Topic score: 0.72\n", "\n", "Article text:\n", " Sat 14 May 2016 In Tutorials.\n", "\n", "In this tutorial, we will answer some common questions about autoencoders, and we will cover code examples of the following models:\n", "\n", "a simple autoencoder based on a fully-connected layer\n", "\n", "a sparse autoencoder\n", "\n", "a deep fully-connected autoencoder\n", "\n", "a deep convolutional autoencoder\n", "\n", "an image denoising model\n", "\n", "a sequence-to-sequence autoencoder\n", "\n", "a variational autoencoder\n", "\n", "Note: all code examples have been updated to the Keras 2.0 API on March 14, 2017. You will need Kera (...)(trimmed)\n", "\n", "\n" ] } ], "source": [ "model.show_topic_articles(99, top_n=5)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #44 \n", " ---------- \n", " government \n", " agency \n", " security \n", " nsa \n", " surveillance \n", " fbi \n", " snowden \n", " intelligence \n", " document \n", " information \n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #10304864 - https://edwardsnowden.com/\n", "Edwardsnowden.com\n", "Topic score: 0.78\n", "\n", "Article text:\n", " Who Is Edward Snowden?\n", "\n", "Edward Snowden is a 31 year old US citizen, former Intelligence Community officer and whistleblower. The documents he revealed provided a vital public window into the NSA and its international intelligence partners’ secret mass surveillance programs and capabilities. 
These revelations generated unprecedented attention around the world on privacy intrusions and digital security, leading to a global debate on the issue.\n", "\n", "Snowden worked in various roles within the US Intel (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11748746 - http://www.theguardian.com/us-news/2016/may/22/snowden-whistleblower-protections-john-crane\n", "Snowden calls for whistleblower shield after claims by new Pentagon source\n", "Topic score: 0.69\n", "\n", "Article text:\n", " Accusations that Pentagon retaliated against a whistleblower undermine argument that there were options for Snowden other than leaking to the media\n", "\n", "Edward Snowden has called for a complete overhaul of US whistleblower protections after a new source from deep inside the Pentagon came forward with a startling account of how the system became a “trap” for those seeking to expose wrongdoing.\n", "\n", "\n", "\n", "The account of John Crane, a former senior Pentagon investigator, appears to undermine Barack Obama, (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #10615250 - https://www.washingtonpost.com/news/the-switch/wp/2015/11/20/why-its-so-hard-to-keep-up-with-how-the-u-s-government-is-spying-on-its-own-people/\n", "Why its so hard to keep up with how the U.S. gov't is spying on its own people\n", "Topic score: 0.68\n", "\n", "Article text:\n", " Since 2013, Americans have gained immense insight about how the government conducts digital spying programs, largely thanks to the revelations made by former security contractor Edward Snowden. 
But a new report shows it's really hard to keep track of all the ways the United States is snooping on its own people.\n", "\n", "After Snowden revealed the National Security Agency was collecting data en masse about American e-mails, the government said it had ended that particular program in 2011.\n", "\n", "But it turns o (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11837578 - https://news.vice.com/article/edward-snowden-leaks-tried-to-tell-nsa-about-surveillance-concerns-exclusive\n", "Snowden Tried to Tell NSA About Surveillance Concerns, Documents Reveal\n", "Topic score: 0.68\n", "\n", "Article text:\n", " On the morning of May 29, 2014, an overcast Thursday in Washington, D.C., the general counsel of the Office of the Director of National Intelligence, Robert Litt, wrote an email to high-level officials at the National Security Agency and the White House.\n", "\n", "The topic: what to do about Edward Snowden.\n", "\n", "Snowden’s leaks had first come to light the previous June, when the Guardian’s Glenn Greenwald and the Washington Post’s Barton Gellman published stories based on highly classified documents pr (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11400686 - http://thehill.com/policy/national-security/274840-report-clinton-could-be-interviewed-by-fbi-within-days\n", "Report: FBI moves to interview Clinton over emails\n", "Topic score: 0.67\n", "\n", "Article text:\n", " Hillary Clinton Hillary Rodham ClintonAssange meets U.S. 
congressman, vows to prove Russia did not leak him documents High-ranking FBI official leaves Russia probe OPINION | Steve Bannon is Trump's indispensable man — don't sacrifice him to the critics MORE and her top aides might be questioned by FBI officials about her private email server within the next few days, according to a new report from Al Jazeera America.\n", "\n", "The news outlet reported that the FBI has concluded its examination of Clint (...)(trimmed)\n", "\n", "\n" ] } ], "source": [ "model.show_topic_articles(44, top_n=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3 Find topics for a given article" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3.1 Article from the corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This displays the topics that were extracted from a specific article in the corpus." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---------------------------------------------------------------------\n", "Article #10577102 - http://www.nytimes.com/2015/11/17/us/after-paris-attacks-cia-director-rekindles-debate-over-surveillance.html\n", "After Paris Attacks, C.I.A. Director Rekindles Debate Over Surveillance\n", "Article text:\n", " “As far as I know, there’s no evidence the French lacked some kind of surveillance authority that would have made a difference,” said Jameel Jaffer, deputy legal director of the American Civil Liberties Union. 
“When we’ve invested new powers in the government in response to events like the Paris attacks, they have often been abused.”\n", "\n", "The debate over the proper limits on government dates to the origins of the United States, with periodic overreaching in the name of security being cur (...)(trimmed)\n", "\n", "\n", " Topic #44 Topic #67 Topic #69 \n", " Score (0.32) Score (0.19) Score (0.15) \n", " ---------- ---------- ---------- \n", " government al story \n", " agency state continue \n", " security attack read \n", " nsa group advertisement \n", " surveillance government main \n", " fbi country times \n", " snowden islamic newsletter \n", " intelligence terrorist sign \n", " document saudi york \n", " information iran subscribe \n", "\n", "\n" ] } ], "source": [ "model.show_article_topics(10577102)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last topic looks strange here. As it turns out, it is an unintended artifact of the data collection process: the `newspaper` library used to extract text from articles also picks up text from some of the advertisements and subscribe buttons on NYTimes articles. As a result, these words co-occur with each other extremely frequently and with other words much less frequently, and hence form a very natural topic for topic modeling algorithms." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.3.2 Finding topics for a new, unseen article" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "url = \"https://www.ligo.caltech.edu/news/ligo20170927\"" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Article: https://www.ligo.caltech.edu/news/ligo20170927\n", "Article text:\n", " News Release • September 27, 2017\n", "\n", "The LIGO Scientific Collaboration and the Virgo collaboration report the first joint detection of gravitational waves with both the LIGO and Virgo detectors. This is the fourth announced detection of a binary black hole system and the first significant gravitational-wave signal recorded by the Virgo detector, and highlights the scientific potential of a three-detector network of gravitational-wave detectors.\n", "\n", "The three-detector observation was made on August 14 (...)(trimmed)\n", "\n", "Most relevant topics:\n", "\n", " Topic #29 Topic #10 Topic #89 \n", " Score (0.31) Score (0.15) Score (0.13) \n", " ---------- ---------- ---------- \n", " earth light university \n", " space laser research \n", " star electron science \n", " planet field paper \n", " orbit energy researcher \n", " moon high study \n", " year fusion scientist \n", " mars charge publish \n", " galaxy produce journal \n", " telescope ray scientific \n", "\n", "\n" ] } ], "source": [ "model.show_article_topics_from_url(url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.4 Plotting topic trends" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The popularity of topics can be plotted over time. 
Some cherrypicking for interesting results - " ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #11 \n", " ---------- \n", " flight \n", " fly \n", " air \n", " space \n", " aircraft \n", " launch \n", " plane \n", " drone \n", " pilot \n", " rocket \n", "\n", "\n" ] }, { "data": { "application/vnd.plotly.v1+json": { "data": [ { "name": "Topic #11", "type": "scatter", "x": [ "2015-09-08", "2015-09-10", "2015-09-12", "2015-09-13", "2015-09-15", "2015-09-18", "2015-09-19", "2015-09-20", "2015-09-21", "2015-09-25", "2015-09-29", "2015-10-01", "2015-10-04", "2015-10-06", "2015-10-08", "2015-10-12", "2015-10-13", "2015-10-14", "2015-10-15", "2015-10-16", "2015-10-17", "2015-10-18", "2015-10-19", "2015-10-23", "2015-10-28", "2015-11-02", "2015-11-03", "2015-11-04", "2015-11-06", "2015-11-07", "2015-11-09", "2015-11-10", "2015-11-11", "2015-11-16", "2015-11-17", "2015-11-18", "2015-11-20", "2015-11-22", "2015-11-23", "2015-11-24", "2015-11-27", "2015-11-28", "2015-11-29", "2015-11-30", "2015-12-01", "2015-12-05", "2015-12-06", "2015-12-07", "2015-12-08", "2015-12-09", "2015-12-11", "2015-12-14", "2015-12-15", "2015-12-16", "2015-12-18", "2015-12-19", "2015-12-21", "2015-12-22", "2015-12-23", "2015-12-24", "2015-12-25", "2015-12-27", "2015-12-28", "2015-12-29", "2015-12-30", "2015-12-31", "2016-01-02", "2016-01-06", "2016-01-07", "2016-01-08", "2016-01-09", "2016-01-10", "2016-01-11", "2016-01-13", "2016-01-14", "2016-01-17", "2016-01-19", "2016-01-20", "2016-01-21", "2016-01-22", "2016-01-25", "2016-01-29", "2016-01-30", "2016-02-02", "2016-02-03", "2016-02-04", "2016-02-05", "2016-02-06", "2016-02-07", "2016-02-09", "2016-02-10", "2016-02-16", "2016-02-18", "2016-02-20", "2016-02-21", "2016-02-22", "2016-02-23", "2016-02-24", "2016-02-28", "2016-03-01", "2016-03-05", "2016-03-09", "2016-03-10", "2016-03-15", "2016-03-16", "2016-03-17", "2016-03-18", 
"2016-03-19", "2016-03-22", "2016-03-23", "2016-03-24", "2016-03-28", "2016-03-30", "2016-03-31", "2016-04-03", "2016-04-04", "2016-04-07", "2016-04-09", "2016-04-10", "2016-04-11", "2016-04-12", "2016-04-15", "2016-04-16", "2016-04-17", "2016-04-19", "2016-04-20", "2016-04-21", "2016-04-23", "2016-04-24", "2016-04-26", "2016-04-27", "2016-04-28", "2016-04-30", "2016-05-02", "2016-05-03", "2016-05-04", "2016-05-06", "2016-05-07", "2016-05-09", "2016-05-12", "2016-05-14", "2016-05-16", "2016-05-18", "2016-05-21", "2016-05-22", "2016-05-23", "2016-05-24", "2016-05-25", "2016-05-26", "2016-05-27", "2016-05-28", "2016-05-29", "2016-06-01", "2016-06-04", "2016-06-05", "2016-06-07", "2016-06-09", "2016-06-11", "2016-06-13", "2016-06-14", "2016-06-15", "2016-06-16", "2016-06-21", "2016-06-22", "2016-06-23", "2016-06-28", "2016-07-01", "2016-07-04", "2016-07-05", "2016-07-06", "2016-07-08", "2016-07-09", "2016-07-11", "2016-07-12", "2016-07-14", "2016-07-15", "2016-07-17", "2016-07-20", "2016-07-21", "2016-07-24", "2016-07-27", "2016-07-28", "2016-07-29", "2016-08-03", "2016-08-05", "2016-08-08", "2016-08-10", "2016-08-11", "2016-08-14", "2016-08-16", "2016-08-17", "2016-08-19", "2016-08-20", "2016-08-26", "2016-08-27", "2016-08-28", "2016-08-31", "2016-09-01", "2016-09-02", "2016-09-03", "2016-09-04", "2016-09-05", "2016-09-07", "2016-09-08", "2016-09-10", "2016-09-11", "2016-09-12", "2016-09-13", "2016-09-14", "2016-09-16", "2016-09-19", "2016-09-23", "2016-09-24" ], "y": [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 0.4155195349305556, 0.41112592599877684, 0.4188322610466961, 0.44998044929347736, 0.4566437271749537, 0.4266808075133145, 0.43254373011269287, 0.43954317628892625, 0.4406442211332916, 0.43297457168703574, 0.4073729662899589, 0.4081494456592589, 0.40545715466378396, 0.4005908565798869, 0.39621786113005475, 0.4011727635771447, 0.41844575402710266, 0.4068411146765687, 0.39248913144572134, 0.4093308842601232, 
0.41154488490081576, 0.40602462272644435, 0.4103103547117749, 0.3941503951085633, 0.414086433829605, 0.4054348610089017, 0.4075481698847668, 0.40938818328450005, 0.4082296250209577, 0.4055028736511356, 0.3940798839059056, 0.3910957461497691, 0.38735643671733627, 0.3579072007122704, 0.34533699001364604, 0.3316628034804495, 0.3113003077867487, 0.3064308028936031, 0.3108874271694513, 0.3030199951783517, 0.28814984808736305, 0.30324031879453206, 0.310365715267107, 0.31689061350471653, 0.3137379794168224, 0.3084845423194735, 0.2838370030389907, 0.29229557160120506, 0.28668451260599975, 0.27957425625839377, 0.2729854798124053, 0.27536599300824777, 0.27993222892263764, 0.2910734959582949, 0.2755511325834325, 0.2874277638507835, 0.28504239716449675, 0.2801295380979484, 0.2811175086803794, 0.2780737394209238, 0.29425008832219735, 0.2997245890783128, 0.301433269797032, 0.31718662208670817, 0.3133367015889849, 0.3135302729519111, 0.3228288423611609, 0.3234167175715088, 0.3252369267999813, 0.32639067282958695, 0.3481508526441025, 0.34471971977794624, 0.3484980246573928, 0.3645215697104986, 0.3725450421460678, 0.37986513022396606, 0.38612994735842104, 0.3918210999972124, 0.40083792687931763, 0.4127365908020415, 0.4227873384406296, 0.4205061727088515, 0.41985225872677445, 0.4183415371593026, 0.42128844009093663, 0.4080843962979948, 0.40886970442770126, 0.4130414563588876, 0.47633169199254455, 0.4853701484105786, 0.46988932410279705, 0.46322018236788515, 0.46846430761869395, 0.4656403020473648, 0.48106937891509194, 0.4977306908314242, 0.4885483243282324, 0.4936922887511916, 0.4875000272194861, 0.4949613974537431, 0.4709463609478391, 0.4845513489362474, 0.5015561226479796, 0.4899375532776272, 0.482213429153503, 0.49496794419180046, 0.5006530540893646, 0.5055786204266093, 0.5040213384810223, 0.48469644292186775, 0.47131512307462664, 0.4750966209668324, 0.46348070484772214, 0.450779551917585, 0.46171667084131013, 0.46227780285184594, 0.46064219208580043, 0.47974948413544777, 
0.43091838485349393, 0.42263096078415685, 0.41706580744132254, 0.435449896774101, 0.43022126440969755, 0.4339018910121289, 0.42682218897320573, 0.4109147729824561, 0.4173939379603488, 0.4134879747272908, 0.4236780624728152, 0.4184507688415668, 0.4190400307809104, 0.4284221392121709, 0.407918468307608, 0.4028806627268628, 0.42273317179720815, 0.42904199433062234, 0.41717755624769687, 0.4127317253907294, 0.4185074962312819, 0.4276436769397821, 0.44348427623513553, 0.436254332989119, 0.43824207191856523, 0.4382287256223181, 0.4201428645513333, 0.4232368688791582, 0.4241428620405861, 0.42703580278436715, 0.4133974406481083, 0.4342652961013348, 0.4346622106641874, 0.4234988682996905, 0.41998930792042616, 0.41608204068208077, 0.41708181759681895, 0.4208330932307557, 0.42247358637252114, 0.42651951123533743, 0.4170450344161461, 0.43344003545149795, 0.44057083048226114, 0.41240675844318914, 0.40866385015920176, 0.3977159368082053, 0.377900406945736, 0.37223304984671773, 0.3829839910310275, 0.37273070783733253, 0.39249342336635074, 0.4224754372424727, 0.43914160846317957, 0.43848165917772647, 0.4496247619649457, 0.46974915425322916, 0.47042015831376816, 0.4660605602433588, 0.4731779165183238, 0.46995717949433563, 0.478512834079402, 0.471098871123429, 0.483636035936437, 0.4745942587425275, 0.4886044236452234, 0.4800993586562648, null, null, null, null, null, null, null, null, null, null, null, null, null, null ] } ], "layout": {} }, "text/html": [ "
" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "iplot(model.topic_trend_plot(11))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The topic contains the words `flight, fly, air, space, aircraft, launch` and sees a huge surge in popularity around March to May 2016. This was when SpaceX first managed to land the first stage of its Falcon 9 rocket on a drone ship at sea after launch. And of course, things related to Elon Musk have a tendency to be wildly popular on Hacker News :)\n", "\n", "A quick look at the top articles for this topic supports this hypothesis - " ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #11 \n", " ---------- \n", " flight \n", " fly \n", " air \n", " space \n", " aircraft \n", " launch \n", " plane \n", " drone \n", " pilot \n", " rocket \n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11460935 - http://techcrunch.com/2016/04/08/spacex-just-landed-a-rocket-on-a-drone-ship-for-the-first-time/\n", "SpaceX just landed a rocket on a drone ship for the first time\n", "Topic score: 0.82\n", "\n", "Article text:\n", " At 4:43 pm EST, SpaceX successfully launched their next resupply mission to the International Space Station (ISS). In addition to a seamless launch, SpaceX landed the first stage of their Falcon 9 rocket on an autonomous drone ship for the very first time.\n", "\n", "Landing from the chase plane pic.twitter.com/2Q5qCaPq9P — SpaceX (@SpaceX) April 8, 2016\n", "\n", "This was SpaceX’s fifth landing attempt on a drone ship — all previous attempts ended in explosions. 
Although in December of last year, Elon Musk (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11459183 - http://mobile.reuters.com/article/idUSKCN0X5228\n", "SpaceX makes breakthrough by landing rocket at sea\n", "Topic score: 0.79\n", "\n", "Article text:\n", " CAPE CANAVERAL, Fla. (Reuters) - A SpaceX Falcon 9 rocket blasted off from Florida on a NASA cargo run to the International Space Station on Friday, and its reusable main-stage booster landed on an ocean platform minutes later in a dramatic spaceflight first.\n", "\n", "The successful autonomous touchdown of the booster at sea marked another milestone for billionaire entrepreneur Elon Musk and his privately owned Space Exploration Technologies in the quest to develop a cheap, reusable rocket, expanding hi (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11642855 - http://phys.org/news/2016-05-spacex-successfully-rockets-stage-space.html\n", "SpaceX lands rocket at sea second time after satellite launch\n", "Topic score: 0.76\n", "\n", "Article text:\n", " This photo provided by SpaceX shows the first stage of the company's Falcon rocket after it landed on a platform in the Atlantic Ocean just off the Florida coast on Friday, May 6, 2016, after launching a Japanese communications satellite. 
(SpaceX via AP) For the second month in a row, the aerospace upstart SpaceX landed a rocket on an ocean platform early Friday, this time following the successful launch of a Japanese communications satellite.\n", "\n", "A live webcast showed the first-stage booster touch (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11791272 - http://www.theverge.com/2016/5/27/11787532/spacex-falcon-9-rocket-landing-success-sea-drone-ship\n", "SpaceX successfully lands a Falcon 9 rocket at sea for the third time\n", "Topic score: 0.74\n", "\n", "Article text:\n", " SpaceX just successfully landed the first stage of its Falcon 9 rocket on a drone ship in the Atlantic Ocean. It was the third time in a row the company has landed a rocket booster at sea, and the fourth time overall.\n", "\n", "The landing occurred a few minutes before the second stage of the Falcon 9 delivered the THAICOM-8 satellite to space, where it will make its way to geostationary transfer orbit (GTO). GTO is a high-elliptical orbit that is popular for satellites, sitting more than 20,000 miles ab (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11817878 - https://www.washingtonpost.com/graphics/business/rockets/\n", "The New Space Race\n", "Topic score: 0.71\n", "\n", "Article text:\n", " Launch configurations\n", "\n", "Launch abort system jettisons the crew to safety in the event of a launchpad failure.\n", "\n", "Launch abort system\n", "\n", "Orion crew vehicle\n", "\n", "Cargo fairing\n", "\n", "Exploration upper stage\n", "\n", "The core stage of the rocket is orange because that is the natural color of the insulation that will cover it.\n", "\n", "Core stage\n", "\n", "Solid rocket boosters\n", "\n", "Advanced boosters\n", "\n", "RS-25 engines\n", "\n", "A\n", "\n", "B\n", "\n", "C\n", "\n", "D\n", "\n", "A. 
An initial mission will take an unmanned crew vehicle around the moon and back to demonstrate the capabilities of (...)(trimmed)\n", "\n", "\n" ] } ], "source": [ "model.show_topic_articles(11, top_n=5)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #35 \n", " ---------- \n", " image \n", " uk \n", " london \n", " caption \n", " mr \n", " copyright \n", " british \n", " japan \n", " year \n", " people \n", "\n", "\n" ] }, { "data": { "application/vnd.plotly.v1+json": { "data": [ { "name": "Topic #35", "type": "scatter", "x": [ "2015-09-06", "2015-09-07", "2015-09-09", "2015-09-10", "2015-09-11", "2015-09-13", "2015-09-14", "2015-09-15", "2015-09-16", "2015-09-18", "2015-09-19", "2015-09-20", "2015-09-22", "2015-09-24", "2015-09-25", "2015-09-29", "2015-09-30", "2015-10-01", "2015-10-02", "2015-10-03", "2015-10-05", "2015-10-06", "2015-10-07", "2015-10-10", "2015-10-11", "2015-10-12", "2015-10-13", "2015-10-14", "2015-10-15", "2015-10-16", "2015-10-17", "2015-10-19", "2015-10-20", "2015-10-22", "2015-10-24", "2015-10-26", "2015-10-27", "2015-10-29", "2015-10-31", "2015-11-02", "2015-11-03", "2015-11-04", "2015-11-05", "2015-11-06", "2015-11-08", "2015-11-09", "2015-11-10", "2015-11-11", "2015-11-15", "2015-11-16", "2015-11-17", "2015-11-19", "2015-11-22", "2015-11-23", "2015-11-24", "2015-11-25", "2015-11-26", "2015-11-27", "2015-11-30", "2015-12-02", "2015-12-03", "2015-12-07", "2015-12-08", "2015-12-09", "2015-12-10", "2015-12-11", "2015-12-12", "2015-12-13", "2015-12-16", "2015-12-17", "2015-12-18", "2015-12-20", "2015-12-21", "2015-12-22", "2015-12-23", "2015-12-25", "2015-12-28", "2016-01-03", "2016-01-04", "2016-01-06", "2016-01-07", "2016-01-08", "2016-01-09", "2016-01-11", "2016-01-12", "2016-01-13", "2016-01-14", "2016-01-15", "2016-01-16", "2016-01-17", "2016-01-18", "2016-01-19", "2016-01-20", "2016-01-21", "2016-01-22", "2016-01-26", "2016-01-27", 
"2016-01-31", "2016-02-03", "2016-02-04", "2016-02-05", "2016-02-08", "2016-02-09", "2016-02-10", "2016-02-11", "2016-02-12", "2016-02-13", "2016-02-15", "2016-02-16", "2016-02-19", "2016-02-20", "2016-02-21", "2016-02-22", "2016-02-23", "2016-02-24", "2016-02-25", "2016-02-28", "2016-02-29", "2016-03-05", "2016-03-07", "2016-03-08", "2016-03-11", "2016-03-14", "2016-03-16", "2016-03-17", "2016-03-18", "2016-03-20", "2016-03-21", "2016-03-22", "2016-03-24", "2016-03-25", "2016-03-26", "2016-03-28", "2016-03-30", "2016-04-01", "2016-04-03", "2016-04-04", "2016-04-06", "2016-04-07", "2016-04-10", "2016-04-11", "2016-04-12", "2016-04-14", "2016-04-15", "2016-04-17", "2016-04-18", "2016-04-19", "2016-04-20", "2016-04-21", "2016-04-23", "2016-04-24", "2016-04-25", "2016-04-26", "2016-04-28", "2016-04-29", "2016-04-30", "2016-05-01", "2016-05-02", "2016-05-03", "2016-05-05", "2016-05-06", "2016-05-09", "2016-05-10", "2016-05-12", "2016-05-13", "2016-05-16", "2016-05-17", "2016-05-20", "2016-05-22", "2016-05-24", "2016-05-27", "2016-05-28", "2016-05-29", "2016-06-01", "2016-06-02", "2016-06-04", "2016-06-05", "2016-06-06", "2016-06-07", "2016-06-10", "2016-06-11", "2016-06-12", "2016-06-15", "2016-06-16", "2016-06-18", "2016-06-20", "2016-06-22", "2016-06-24", "2016-06-25", "2016-06-26", "2016-06-27", "2016-06-28", "2016-06-30", "2016-07-02", "2016-07-04", "2016-07-05", "2016-07-07", "2016-07-09", "2016-07-10", "2016-07-12", "2016-07-13", "2016-07-14", "2016-07-15", "2016-07-16", "2016-07-19", "2016-07-20", "2016-07-21", "2016-07-22", "2016-07-24", "2016-07-26", "2016-07-27", "2016-07-28", "2016-07-29", "2016-07-31", "2016-08-04", "2016-08-07", "2016-08-08", "2016-08-11", "2016-08-12", "2016-08-16", "2016-08-17", "2016-08-18", "2016-08-21", "2016-08-22", "2016-08-23", "2016-08-26", "2016-08-27", "2016-08-28", "2016-08-29", "2016-08-30", "2016-09-02", "2016-09-03", "2016-09-04", "2016-09-05", "2016-09-08", "2016-09-09", "2016-09-11", "2016-09-12", "2016-09-14", 
"2016-09-16", "2016-09-17", "2016-09-19", "2016-09-20", "2016-09-21", "2016-09-22", "2016-09-24", "2016-09-25" ], "y": [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 0.2853019140404511, 0.2804232869698913, 0.2704674251069335, 0.2648002103943619, 0.2801450571979483, 0.2795588513419771, 0.2811303507795396, 0.2870614115582347, 0.29962595231310973, 0.29183743944950413, 0.286449032653452, 0.28678013870295777, 0.29346751901984774, 0.29266168003869036, 0.2910955137068623, 0.28778367911995256, 0.2933275741089841, 0.2750105253438252, 0.2784475113151837, 0.2800489343053638, 0.2851130972413695, 0.2916456899997694, 0.28706634272498743, 0.2811123569134547, 0.28401111176922617, 0.2852212269114162, 0.27553476335525834, 0.27532900887845974, 0.2675927528806756, 0.26623325347342985, 0.270629103452273, 0.2750262048228304, 0.26932753443882557, 0.27119992670346255, 0.2569197983592321, 0.25630070524515725, 0.25526508985144036, 0.24893004372323876, 0.23119154222646277, 0.2379003437097596, 0.23740925365899096, 0.2434937908484, 0.23972818884116462, 0.24017266617331398, 0.237027832131622, 0.2367955961213795, 0.23231786005344518, 0.23285660392210233, 0.2275184352492466, 0.22515848338561825, 0.2185141779571902, 0.21742155452678213, 0.2164160970424917, 0.22214821424282016, 0.21130536588374485, 0.20307372457728248, 0.20032167646581714, 0.20536111388313252, 0.20194114236155256, 0.2129331361567181, 0.20815789962986866, 0.20294611008549115, 0.2072135695996612, 0.2049937581912776, 0.2047097753592843, 0.20963833556554284, 0.20920969825051794, 0.22139602767567473, 0.2230728594008385, 0.22834637208532924, 0.2538021098105671, 0.2388647857114267, 0.23554741098375784, 0.24682423188527552, 0.2463881685555581, 0.24737542322156658, 0.2498985808829356, 0.26426096105621055, 0.26402156437302343, 0.27089131778400627, 0.27446708619614585, 0.27754595535395016, 0.276631339755124, 0.27478924306919567, 0.2749719825295823, 0.2796310359482665, 0.27811157042287693, 
0.27408683583696264, 0.27599454264465456, 0.27177956351279897, 0.27331421819022955, 0.27839576030960683, 0.27465534439579425, 0.27979816566029403, 0.28087464511686655, 0.27408047856605094, 0.2738075672846124, 0.2610304081790033, 0.2630494152239556, 0.26060562733415166, 0.239725317610639, 0.2366676501814703, 0.23780917627852446, 0.2283280878517134, 0.23846921056135015, 0.24284980670286335, 0.23949412214481874, 0.22822708147887577, 0.2277640600275854, 0.22375893110951656, 0.2185276492293635, 0.2195056396811914, 0.22459531314882614, 0.2319096392517922, 0.231726907145082, 0.22998341272098824, 0.24274565932611217, 0.2452035766359505, 0.24279213195848734, 0.23343926551072686, 0.23947372479198717, 0.2486906816784224, 0.2504219919556784, 0.24944220981380083, 0.25509686879974763, 0.2675102969684069, 0.2696059742591407, 0.27922396240458874, 0.2939976221918055, 0.28842591290248115, 0.2853523497193242, 0.2961882375370419, 0.299442182223474, 0.29841132034928675, 0.2886357089808254, 0.2906955533781291, 0.28918063420235857, 0.28592723717244056, 0.29070789559993226, 0.28740783643268464, 0.28716065174799993, 0.28179375754490876, 0.275022835014446, 0.27106008823462796, 0.30030258055697295, 0.3084581962880374, 0.29736628622841715, 0.2998222008413632, 0.303120475599069, 0.3098588402320563, 0.3127481319579616, 0.3064482724550034, 0.3083431660561261, 0.30577307571055834, 0.298346850280084, 0.29558495301429943, 0.29553068129769017, 0.29444484988439884, 0.3736581188256796, 0.39347133367009074, 0.4029854168891123, 0.4359955520318513, 0.4373358499099661, 0.4381447468231284, 0.43877222689951245, 0.4358263706001031, 0.448971571315265, 0.46689630810367316, 0.4661068065148736, 0.4676946257892545, 0.48179234059486925, 0.48412401176245784, 0.48390436867439374, 0.4772946594543769, 0.4506503171297584, 0.44764903192915934, 0.4604696898902643, 0.45739235308768594, 0.45271516842712534, 0.44381520973456895, 0.4340131096398162, 0.4345558688448701, 0.4324479444714206, 0.4327727176950265, 
0.44081488348540254, 0.4334784228872627, 0.43477956750179564, 0.4325307903746559, 0.3363456419923044, 0.3137784519943611, 0.307481000294826, 0.27274304189726845, 0.2742959202908889, 0.2734044077630127, 0.2734873515002867, 0.27028736505488404, 0.2660631568931221, 0.24873293996939386, 0.2510827597501836, 0.2532321077864667, 0.25691541099677634, 0.25269565577915765, 0.26114782821250937, 0.2764406096479513, 0.285319338862859, 0.2919834395356233, 0.2889611506712207, 0.2850512778691256, 0.28605369470051106, 0.29321253906031247, 0.3108385064146771, 0.30887815445906025, 0.31789225495222445, 0.3252726024581239, 0.33641255069994863, 0.33547012903041296, 0.33749086080879515, 0.3317607158152652, null, null, null, null, null, null, null, null, null, null, null, null, null, null ] } ], "layout": {} }, "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "iplot(model.topic_trend_plot(35))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This topic looks a little stranger. The words `uk london british people` seem fairly coherent, but the presence of words like `image copyright caption` is odd. It turns out to be another artifact of the data collection process: a number of the articles containing the words `uk london british people` are from the BBC, and the text parser picks up image captions from the BBC site, which very frequently contain the words `image caption copyright`.\n", "\n", "As for the popularity trend, the topic seems fairly dormant most of the time, with a massive spike around June 2016. 
No prizes for guessing what this is due to - " ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #35 \n", " ---------- \n", " image \n", " uk \n", " london \n", " caption \n", " mr \n", " copyright \n", " british \n", " japan \n", " year \n", " people \n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11970960 - http://www.bbc.com/news/uk-politics-eu-referendum-36620401\n", "Petition for London independence signed by thousands after Brexit vote\n", "Topic score: 0.67\n", "\n", "Article text:\n", " Image copyright Reuters Image caption The overwhelming majority of Londoners voted to remain in the EU\n", "\n", "A petition calling for Sadiq Khan to declare London an independent state after the UK voted to quit the EU has been signed by thousands of people.\n", "\n", "The petition's organiser James O'Malley, said the capital was \"a world city\" which should \"remain at the heart of Europe\".\n", "\n", "Nearly 60% of people in the capital backed the Remain campaign, in stark contrast to most of the country.\n", "\n", "The LSE's directo (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11966167 - http://www.bbc.co.uk/news/uk-politics-36615028\n", "UK votes to leave EU\n", "Topic score: 0.66\n", "\n", "Article text:\n", " Media playback is unsupported on your device Media caption EU vote: David Cameron says the UK \"needs fresh leadership\"\n", "\n", "Prime Minister David Cameron is to step down by October after the UK voted to leave the European Union.\n", "\n", "Speaking outside 10 Downing Street, he said \"fresh leadership\" was needed.\n", "\n", "The PM had urged the country to vote Remain but was defeated by 52% to 48% despite London, Scotland and Northern Ireland backing staying in.\n", "\n", "UKIP leader Nigel Farage hailed it as the UK's \"independe 
(...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11967959 - http://www.mirror.co.uk/news/uk-news/young-voters-wanted-brexit-least-8271517\n", "Young voters wanted Brexit the least and will have to live with it the longest\n", "Topic score: 0.58\n", "\n", "Article text:\n", " Get politics updates directly to your inbox + Subscribe Thank you for subscribing! Could not subscribe, try again later Invalid Email\n", "\n", "Younger voters will be the losers from today's historic vote to leave the EU after polls repeatedly showed they back Remain.\n", "\n", "Brexiters were led to victory in the referendum overnight by triumphing in Tory shires and Old Labour heartlands in Wales and the north of England.\n", "\n", "But the Kingdom is no longer United after London, Scotland and Northern Ireland all backed (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11975945 - https://www.theguardian.com/uk-news/2016/jun/25/sturgeon-seeks-urgent-brussels-talks-to-protect-scotlands-eu-membership\n", "Sturgeon seeks Brussels talks to protect Scotland's EU membership\n", "Topic score: 0.52\n", "\n", "Article text:\n", " First minister to set up panel to advise her on Scotland’s relationship with EU, as Labour considers endorsing independence\n", "\n", "Nicola Sturgeon is to lobby EU member states directly for support in ensuring that Scotland can remain part of the bloc, after Scots voted emphatically against Brexit on Thursday.\n", "\n", "\n", "\n", "The first minister has disclosed that she is to invite all EU diplomats based in Scotland to a summit at her official residence in Edinburgh within the next two weeks in a bid to sidestep th (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11967478 - 
http://www.theguardian.com/politics/2016/jun/24/david-cameron-resigns-after-uk-votes-to-leave-european-union\n", "David Cameron announces resignation\n", "Topic score: 0.51\n", "\n", "Article text:\n", " David Cameron has resigned, bringing an abrupt end to his six-year premiership, after the British public took the momentous decision to reject his entreaties and turn their back on the European Union.\n", "\n", "Just a year after he clinched a surprise majority in the general election, a visibly emotional Cameron, standing outside Number 10 on Friday morning alongside his wife, Samantha, said: “The will of the British people is an instruction that must be delivered.”\n", "\n", "The prime minister campaigned har (...)(trimmed)\n", "\n", "\n" ] } ], "source": [ "model.show_topic_articles(35, top_n=5)" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #44 \n", " ---------- \n", " government \n", " agency \n", " security \n", " nsa \n", " surveillance \n", " fbi \n", " snowden \n", " intelligence \n", " document \n", " information \n", "\n", "\n" ] }, { "data": { "application/vnd.plotly.v1+json": { "data": [ { "name": "Topic #44", "type": "scatter", "x": [ "2015-09-07", "2015-09-08", "2015-09-11", "2015-09-14", "2015-09-15", "2015-09-17", "2015-09-21", "2015-09-22", "2015-09-23", "2015-09-24", "2015-09-25", "2015-09-26", "2015-09-27", "2015-09-28", "2015-09-29", "2015-09-30", "2015-10-01", "2015-10-04", "2015-10-05", "2015-10-06", "2015-10-07", "2015-10-08", "2015-10-09", "2015-10-10", "2015-10-12", "2015-10-13", "2015-10-15", "2015-10-16", "2015-10-17", "2015-10-18", "2015-10-19", "2015-10-20", "2015-10-21", "2015-10-22", "2015-10-24", "2015-10-25", "2015-10-26", "2015-10-27", "2015-10-28", "2015-10-29", "2015-10-30", "2015-10-31", "2015-11-02", "2015-11-03", "2015-11-04", "2015-11-05", "2015-11-06", "2015-11-08", "2015-11-09", "2015-11-10", "2015-11-11", "2015-11-12", 
"2015-11-13", "2015-11-14", "2015-11-16", "2015-11-17", "2015-11-18", "2015-11-20", "2015-11-21", "2015-11-22", "2015-11-23", "2015-11-25", "2015-11-27", "2015-11-29", "2015-11-30", "2015-12-03", "2015-12-04", "2015-12-05", "2015-12-06", "2015-12-09", "2015-12-10", "2015-12-11", "2015-12-12", "2015-12-13", "2015-12-14", "2015-12-15", "2015-12-16", "2015-12-17", "2015-12-19", "2015-12-20", "2015-12-21", "2015-12-22", "2015-12-23", "2015-12-25", "2015-12-27", "2015-12-28", "2015-12-30", "2015-12-31", "2016-01-02", "2016-01-03", "2016-01-05", "2016-01-06", "2016-01-08", "2016-01-09", "2016-01-10", "2016-01-11", "2016-01-13", "2016-01-14", "2016-01-15", "2016-01-16", "2016-01-17", "2016-01-19", "2016-01-20", "2016-01-21", "2016-01-22", "2016-01-23", "2016-01-24", "2016-01-27", "2016-01-28", "2016-01-29", "2016-01-30", "2016-01-31", "2016-02-02", "2016-02-03", "2016-02-04", "2016-02-05", "2016-02-08", "2016-02-09", "2016-02-10", "2016-02-11", "2016-02-12", "2016-02-16", "2016-02-17", "2016-02-18", "2016-02-19", "2016-02-20", "2016-02-21", "2016-02-22", "2016-02-23", "2016-02-24", "2016-02-25", "2016-02-26", "2016-02-27", "2016-03-01", "2016-03-02", "2016-03-03", "2016-03-04", "2016-03-06", "2016-03-07", "2016-03-08", "2016-03-09", "2016-03-10", "2016-03-11", "2016-03-12", "2016-03-13", "2016-03-14", "2016-03-15", "2016-03-16", "2016-03-17", "2016-03-18", "2016-03-19", "2016-03-20", "2016-03-21", "2016-03-22", "2016-03-23", "2016-03-24", "2016-03-25", "2016-03-27", "2016-03-28", "2016-03-29", "2016-03-30", "2016-03-31", "2016-04-01", "2016-04-02", "2016-04-03", "2016-04-04", "2016-04-05", "2016-04-06", "2016-04-07", "2016-04-08", "2016-04-09", "2016-04-10", "2016-04-11", "2016-04-13", "2016-04-14", "2016-04-15", "2016-04-16", "2016-04-17", "2016-04-18", "2016-04-19", "2016-04-20", "2016-04-21", "2016-04-22", "2016-04-25", "2016-04-26", "2016-04-27", "2016-04-29", "2016-04-30", "2016-05-01", "2016-05-02", "2016-05-03", "2016-05-04", "2016-05-05", "2016-05-06", 
"2016-05-07", "2016-05-09", "2016-05-10", "2016-05-11", "2016-05-12", "2016-05-14", "2016-05-15", "2016-05-16", "2016-05-17", "2016-05-18", "2016-05-20", "2016-05-22", "2016-05-23", "2016-05-24", "2016-05-25", "2016-05-26", "2016-05-27", "2016-05-30", "2016-05-31", "2016-06-02", "2016-06-03", "2016-06-04", "2016-06-05", "2016-06-06", "2016-06-07", "2016-06-09", "2016-06-10", "2016-06-12", "2016-06-13", "2016-06-14", "2016-06-15", "2016-06-18", "2016-06-21", "2016-06-22", "2016-06-23", "2016-06-24", "2016-06-25", "2016-06-26", "2016-06-27", "2016-06-28", "2016-06-29", "2016-07-01", "2016-07-03", "2016-07-04", "2016-07-05", "2016-07-09", "2016-07-10", "2016-07-12", "2016-07-13", "2016-07-14", "2016-07-15", "2016-07-16", "2016-07-17", "2016-07-18", "2016-07-20", "2016-07-21", "2016-07-24", "2016-07-25", "2016-07-26", "2016-07-27", "2016-07-28", "2016-07-29", "2016-07-30", "2016-08-02", "2016-08-03", "2016-08-06", "2016-08-09", "2016-08-10", "2016-08-11", "2016-08-13", "2016-08-14", "2016-08-15", "2016-08-17", "2016-08-18", "2016-08-19", "2016-08-20", "2016-08-22", "2016-08-24", "2016-08-25", "2016-08-26", "2016-08-29", "2016-08-30", "2016-08-31", "2016-09-01", "2016-09-03", "2016-09-04", "2016-09-05", "2016-09-06", "2016-09-07", "2016-09-09", "2016-09-10", "2016-09-11", "2016-09-13", "2016-09-14", "2016-09-15", "2016-09-16", "2016-09-17", "2016-09-18", "2016-09-19", "2016-09-22", "2016-09-23", "2016-09-24", "2016-09-25" ], "y": [ null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 0.5774005573438802, 0.5858407748516034, 0.5914591101331902, 0.6228058886591801, 0.6166911134121652, 0.6146919841618721, 0.6166594889475492, 0.6261426330746109, 0.6251079666490464, 0.6137184225024716, 0.6499140805488464, 0.6330846941482503, 0.6341478926182074, 0.6361612962031814, 0.6249784929483672, 0.6190375993831291, 0.5823618287280019, 0.6037779125986883, 0.606558534921883, 0.599477697619442, 0.6414180781692433, 0.6469070372337846, 0.6452773389667356, 
0.656256270581353, 0.6647478943391534, 0.6791149597054035, 0.7226769332294544, 0.7024663401385184, 0.7029730671320492, 0.6848853211909458, 0.6841775595889824, 0.693024085464017, 0.6658501584616868, 0.650226297689131, 0.6562770901310127, 0.6670273116522221, 0.6635471378020869, 0.6419115152767849, 0.6549001567670093, 0.6481388126522674, 0.6050178641177871, 0.6412977173061913, 0.6387391487311608, 0.6260772261547808, 0.6254846787783214, 0.6063389621141653, 0.5956894725082289, 0.5623259758495428, 0.5664674942577103, 0.5763919679071935, 0.5408925808525892, 0.5218327278755068, 0.5181732553256582, 0.5155007940764859, 0.506635355943286, 0.5008220428599821, 0.46246569243101054, 0.4367668049376999, 0.4195485293289533, 0.4256049774295373, 0.4391487813690017, 0.42888214155532534, 0.45213523608810335, 0.44201477821312296, 0.44453212693798805, 0.43808669028838526, 0.4378507695894754, 0.4382326174359112, 0.4225235268098833, 0.4363404586087987, 0.43571639012931124, 0.39799839659423714, 0.42638585741988194, 0.4459205737095201, 0.4717244387432499, 0.46805310500357444, 0.46583785188568955, 0.4817145438190748, 0.49126692255176607, 0.49522870037951827, 0.495122321233189, 0.5066220495782962, 0.5064089521754916, 0.5002822193806372, 0.5106668912792284, 0.5178864400091802, 0.533435294560704, 0.538883793951505, 0.5341116536226893, 0.5304750798143499, 0.5277946852700318, 0.5201704489545634, 0.5103334699823835, 0.5116236295310028, 0.5341840569048846, 0.5448555567957007, 0.5736491940424912, 0.5734523295521432, 0.5910486167631119, 0.5940158993249648, 0.6089741189565037, 0.639387441766569, 0.6643934367684874, 0.6527297132314421, 0.6746515343654536, 0.6906661065904357, 0.6861878111546865, 0.6802508959181854, 0.6646155566940611, 0.6461559816778822, 0.6636088255977843, 0.6926129417501266, 0.7009188556739637, 0.7172898209959102, 0.7265140885921199, 0.7201035503490342, 0.6832226542451036, 0.7033707966859536, 0.7217713376360009, 0.7807831691539318, 0.7746728839032431, 0.7833549148394073, 
0.7743589528872857, 0.7807246871240997, 0.7552246774323875, 0.7396273364285382, 0.7182418813636847, 0.728968810710104, 0.7159645124184385, 0.7203943827086723, 0.7182152373447134, 0.6986415344994762, 0.659888370143512, 0.6523031847162435, 0.6158698377018278, 0.6202287025564348, 0.635412643718824, 0.6304139123144521, 0.6514250365751553, 0.66653753483283, 0.6417732538177161, 0.6196157967499406, 0.6162241509847555, 0.6210493144693655, 0.6028491380162248, 0.6064796802723927, 0.6067165713274656, 0.5829110687930493, 0.5541378191597972, 0.5058722572100519, 0.5144878495757179, 0.5259994464471985, 0.5640284226564009, 0.5899005888116273, 0.6197719863849626, 0.6282015346898812, 0.6334477566666912, 0.6145528287205794, 0.6133591004797153, 0.606765017664435, 0.6031585642107671, 0.5897227182604884, 0.579730235602986, 0.5914779170257385, 0.6260293436875772, 0.6133382430318266, 0.6007912398353452, 0.6024144050269921, 0.5858926011268163, 0.5707754933700859, 0.5754573631921112, 0.5507730846061247, 0.5450910008046769, 0.5401398995670791, 0.5417227682323434, 0.5291187481719493, 0.5625754103114317, 0.5616919539414239, 0.5768317056872507, 0.5790151903846478, 0.5811854983190036, 0.5713950840670565, 0.5405545246991574, 0.4999738316087552, 0.4793620070550901, 0.482715207197333, 0.48824120482114103, 0.4862031263700191, 0.4773371489961485, 0.49275987204893446, 0.494095106540764, 0.5245911365059662, 0.5323445951661714, 0.5317197861648304, 0.5008782182202058, 0.5067760998495153, 0.5037298030972327, 0.5214988080302437, 0.5527473986811672, 0.5456328623842891, 0.5482066938948846, 0.5604115088935326, 0.5548654230442902, 0.5563819839657093, 0.5629669902202111, 0.5545577907835734, 0.5311403958431943, 0.5333332162564947, 0.5266671315291109, 0.5332089373363815, 0.5308489842626289, 0.5210460112296772, 0.511656186762066, 0.5393736314156744, 0.5376793174591558, 0.535824619982232, 0.5204007644527118, 0.5317328254829039, 0.5437147925603308, 0.5273244392278256, 0.5172055890033014, 0.4927221559295521, 
0.48921788353607604, 0.47495414432306143, 0.4603322711494409, 0.44694740625426216, 0.4555157974097223, 0.43350109057988956, 0.4082015640682351, 0.40587160990449594, 0.4185381780409108, 0.43208521802032757, 0.44143986782457173, 0.42715176351528117, 0.4133366193239071, 0.42202789992398426, 0.4380129310107928, 0.43430337772629274, 0.4267242735158106, 0.42062221560350627, 0.40799772917644606, 0.4160266493035366, 0.4228614730288562, 0.4000542546417396, 0.39263190040823653, 0.38856460414752697, 0.39442044213678606, 0.38317845599860106, 0.37308552998044464, 0.3711473290699994, 0.3919820244771572, 0.4310812515113105, 0.4249532655753547, 0.4290189329443604, 0.4497252165213166, 0.45474438219236396, 0.4454813735842052, 0.4559129940159999, 0.4573955430688794, 0.48161528008211346, 0.4605536015965738, 0.441700470441767, 0.444826874535879, 0.4496172055055466, 0.4612762989823603, 0.4620614391953663, 0.4370973002884544, 0.4674149170811472, null, null, null, null, null, null, null, null, null, null, null, null, null, null ] } ], "layout": {} }, "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "iplot(model.topic_trend_plot(44))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This topic has a more interesting trend. Privacy and government surveillance have long been popular topics on Hacker News, as is clear from the relatively high popularity values compared to the other topics plotted so far. The significant increase in popularity around February 2016 corresponds to the aftermath of the San Bernardino shooting, when there was a large amount of debate on privacy and surveillance, centered on whether Apple, under pressure from the FBI, should unlock an iPhone used by one of the shooters.\n", "\n", "There are also numerous other spikes in this graph, and it would be interesting to look at them in more detail to see if they can be traced to specific events."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.5 Topic Intersection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Topics can be combined to find articles that are relevant to both. Here, combining two separate topics, characterized by the words `game player play move win` and `google computer technology machine human`, gives us articles related to AlphaGo's success against the human Go champion Lee Sedol." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #65 Topic #66 \n", " ---------- ---------- \n", " game google \n", " player computer \n", " play technology \n", " move machine \n", " win human \n", " world system \n", " chess world \n", " computer ai \n", " level year \n", " sport robot \n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11250871 - http://googleasiapacific.blogspot.com/2016/03/alphagos-ultimate-challenge.html\n", "AlphaGos ultimate challenge: a five-game match against Lee Sedol\n", "Topic score: 0.35\n", "\n", "Article text:\n", " Game 3 - March 12, 2016\n", "\n", "“It’s arguable that in the first two games Lee Sedol was playing differently than his true style, trying to find a weakness in the computer. Today Lee was definitely playing his own game, from his strong opening to the complicated moves in the final kō. AlphaGo was ready for everything, including the kō fights, and was able to take the win. 
I’d like to congratulate the people who actually made this accomplishment possible, because it’s a work of art.”\n", "\n", "“Lee (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11258168 - http://www.shanghaidaily.com/national/AlphaGo-cant-beat-me-says-Chinese-Go-grandmaster-Ke-Jie/shdaily.shtml\n", "AlphaGo Can't Beat Me, Says Chinese Go Grandmaster Ke Jie\n", "Topic score: 0.33\n", "\n", "Article text:\n", " Home » Nation\n", "\n", "ALPHAGO, the computer created by DeepMind, the Artificial Intelligence (AI) arm of Google, defeated world champion Lee Sedol of South Korea Wednesday in Game One of human vs. machine Go-chess showdown. The result is out of the expectations of many, including China's Go grandmaster Ke Jie, but Ke put it clear \"AlphaGo is not in my match now\".\n", "\n", "Ke admitted Thursday he had underestimated AlphaGo's capability before the opening match, but he still believes he will be the winner shoul (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11300892 - https://googleblog.blogspot.com/2016/03/what-we-learned-in-seoul-with-alphago.html\n", "What we learned in Seoul with AlphaGo\n", "Topic score: 0.31\n", "\n", "Article text:\n", " Go may be one of the oldest games in existence, but the attention to our five-game tournament exceeded even our wildest imaginations. Searches for Go rules and Go boards spiked in the U.S. In China, tens of millions watched live streams of the matches, and the “Man vs. Machine Go Showdown” hashtag saw 200 million pageviews on Sina Weibo. Sales of Go boards even surged in Korea.\n", "\n", "\n", "\n", "Our public test of AlphaGo, however, was about more than winning at Go. 
We founded DeepMind in 2010 to create ge (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #10981682 - https://googleblog.blogspot.com/2016/01/alphago-machine-learning-game-go.html\n", "Google AI beats a pro at the game of Go\n", "Topic score: 0.31\n", "\n", "Article text:\n", " The game of Go originated in China more than 2,500 years ago. Confucius wrote about the game, and it is considered one of the four essential arts required of any true Chinese scholar. Played by more than 40 million people worldwide, the rules of the game are simple: Players take turns to place black or white stones on a board, trying to capture the opponent's stones or surround empty space to make points of territory. The game is played primarily through intuition and feel, and because of its be (...)(trimmed)\n", "\n", "\n", "---------------------------------------------------------------------\n", "Article #11129076 - http://venturebeat.com/2016/02/18/civilization-25-years-66-versions-33m-copies-sold-1-billion-hours-played/\n", "Civilization: 25 years, 33M copies sold, 1B hours played, and 66 versions\n", "Topic score: 0.30\n", "\n", "Article text:\n", " LAS VEGAS — Civilization is one of the gods of strategy games, where you oversee the creation of a whole society in competition with other civilizations. It debuted in 1991, and now at 25, it has become one of the cultural touchstones of the game industry, something that everyone recognizes or has played in the past.\n", "\n", "Image Credit: MicroProse\n", "\n", "Few game franchises live to see a 25th anniversary, but Civ, as most gamers and industry folk call it, is thriving. 
It has 33 million copies in sales to (...)(trimmed)\n", "\n", "\n" ] } ], "source": [ "model.show_topic_articles([65, 66], top_n=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.6 Similar topics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Topics that are similar to a specific topic can be found using - " ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #44 \n", " ---------- \n", " government \n", " agency \n", " security \n", " nsa \n", " surveillance \n", " fbi \n", " snowden \n", " intelligence \n", " document \n", " information \n", "\n", "\n", "Topics similar to topic #44\n", "---------------------------\n", "\n", " Topic #73 Topic #3 Topic #98 Topic #50 Topic #67 \n", " Score (0.23) Score (0.18) Score (0.17) Score (0.16) Score (0.15) \n", " ---------- ---------- ---------- ---------- ---------- \n", " law group datum security al \n", " court public user attack state \n", " case political information vulnerability attack \n", " legal state data exploit group \n", " rule member privacy hacker government \n", " state policy access password country \n", " lawyer campaign service attacker islamic \n", " government president internet hack terrorist \n", " judge party company find saudi \n", " order government provide system iran \n", "\n", "\n" ] } ], "source": [ "model.show_similar_topics(44, top_n=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.7 Topic Interesting-ness" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Certain topics occur more frequently in articles than others, but with lower scores. The hypothesis is that these topics are common and generic, whereas interesting topics occur less frequently in articles, but with higher scores. 
Common and generic topics would have low scores frequently, indicating they are rarely the main focus of an article, whereas the opposite is true for interesting topics.\n", "\n", "Plotting the distribution of scores over all articles for two topics - " ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [], "source": [ "topics_of_interest = [43, 95]" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #43 Topic #95 \n", " ---------- ---------- \n", " quantum test \n", " theory code \n", " physics error \n", " particle bug \n", " universe problem \n", " physicist fix \n", " wave check \n", " field fail \n", " hole issue \n", " state run \n", "\n", "\n" ] } ], "source": [ "model.print_topics_table(topics_of_interest)" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "scrolled": false }, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "data": [ { "name": "Topic #43", "opacity": 0.4, "type": "histogram", "x": [ 0.06948632049134575, 0.31167100694444483, 0.060850034554250275, 0.09542032622333745, 0.4993894048847603, 0.569162897933848, 0.12157474020783376, 0.06637630662020898, 0.19948907681465805, 0.13363095238095238, 0.5702663098878687, 0.20195473251028792, 0.25850403476101824, 0.19943429068755433, 0.08423423423423422, 0.39612865497076033, 0.08712892281594573, 0.12495782116172575, 0.31040706605222745, 0.24605034722222227, 0.3801343570057582, 0.3534059604307537, 0.4332746478873237, 0.17744336569579275, 0.2281761841522799, 0.39455427420830996, 0.1119098905043976, 0.6180136153659131, 0.15029173008625069, 0.39576220425277, 0.29932002314814815, 0.1934227330779057, 0.5613240064882404, 0.6685131430414452, 0.0787257900101935, 0.37950819672131164, 0.4545557950191568, 0.11160683760683754, 0.06502545199227659, 0.060674966124661354, 0.366643990929705, 0.16380183602405798, 0.055410742496050575, 0.17109988776655413, 0.05819079404811572, 
0.27080364212193175, 0.12013245487612367, 0.06289062500000006, 0.22283699943598428, 0.09971804511278202, 0.234869976359338, 0.08264221158957978, 0.06277270421106038, 0.4604604233588676, 0.4150180940892645, 0.08566561844863728, 0.5437212692005825, 0.05939722513699438, 0.058055555555555596, 0.18319547835676853, 0.41004650024473743, 0.05170472951085199, 0.1001954215522057, 0.06864221785636317, 0.0702738283307003, 0.06524803221673073, 0.498730014167173, 0.06420641846034517, 0.45425305688463485, 0.15285380116959066, 0.5980067567567562, 0.34303649486632126, 0.09182841068917014, 0.12856301531213188, 0.39872571284380603, 0.08564327485380113, 0.30936781609195463, 0.5465068087625818, 0.14343320848938834, 0.16230158730158747, 0.6402176986146444, 0.05879855465221307, 0.6088744588744591, 0.231052141527002, 0.2926054705609646, 0.36659130060292905, 0.06731763866595329, 0.14330985915492947, 0.19958540630182392, 0.10906210392902418, 0.06380718954248367, 0.545571095571096, 0.05688564858367298, 0.08137052341597802, 0.5466161616161617, 0.07068881050362537, 0.0555477637525324, 0.20523205176060713, 0.10411605937921722, 0.3029229578010062, 0.06768933112216685, 0.1981405430070728, 0.16289682539682546, 0.1669607843137256, 0.25825643983083446, 0.2904425612052729, 0.06690624999999985, 0.057386888273314896, 0.17754660700969407, 0.6097605455122099, 0.08873015873015883, 0.2740668402777774, 0.08949394939493961, 0.09478783419180777, 0.24798396334478762, 0.06952026468155494, 0.068665096191598, 0.09576118326118334, 0.12551789077212785, 0.1495686680469292, 0.12372021703231911, 0.2998028299698444, 0.0811589253187615, 0.23498168498168517, 0.2733823150296464, 0.24679166666666633, 0.43042178155492655, 0.4558439929771761, 0.05054808171400106, 0.0845386210471747, 0.13163629867809948, 0.16682793575566093, 0.515702160493828, 0.1478020950243172, 0.18704315886134082, 0.5429870129870135, 0.4166972336848543, 0.6156274980015997, 0.05583440583440595, 0.11647012578616381, 0.6658943089430905, 0.23077740282464682, 
0.14771281634862612, 0.05202922077922082, 0.34985863726321687, 0.226222921034242, 0.25043926178738435, 0.3072287581699348, 0.06775775775775779, 0.2441114616193483, 0.06282132321955337, 0.08815235690235686, 0.1988970588235296, 0.2725241065109696, 0.05337224383916993, 0.05206718346253232, 0.11986054880791702, 0.4325287356321836, 0.12725442834138484, 0.050130499627143874, 0.5687625724766293, 0.577336182336183, 0.12191861084478524, 0.5974701079031786, 0.5845441595441594, 0.06409465020576127, 0.5966198725607328, 0.16262681823737932, 0.3064363143631434, 0.05692932434767584, 0.09579542321477803, 0.20549045138888875, 0.45788478943725686, 0.08523643980924342, 0.05998877665544324, 0.2537946113697276, 0.48084415584415646, 0.30023084025854163, 0.06799457994579942, 0.07787878787878784, 0.06565627405696685, 0.4908545588778141, 0.21458832933653102, 0.5955672325518726, 0.22982104700854683, 0.0659912854030501, 0.06935483870967749, 0.0831048908705584, 0.16230158730158747, 0.05884920634920644, 0.32886473429951746, 0.09072016460905356, 0.2723827160493826, 0.1833922139096706, 0.11857759284943764, 0.05030982905982913, 0.18381642512077298, 0.283699752128462, 0.07912728418399488, 0.5416164226241745, 0.064478887613216, 0.06648827726809381, 0.2769685784592087, 0.570488250057039, 0.5152039007092198, 0.14268251841929, 0.050616142945163374, 0.43797117516629785, 0.0981311274509803, 0.08842056932966029, 0.19380495603517187, 0.5126521397722807, 0.10575396825396824, 0.05690598290598289, 0.08054444444444449, 0.15007685551163838, 0.36674844618907393, 0.07509629629629633, 0.06925791490219818, 0.4642897869004342, 0.3743775548123372, 0.14453999453999475, 0.05489928525016238, 0.05552268244575934, 0.26090746171391294, 0.5617750115260479, 0.13388765705838873, 0.06290865131924737, 0.08087885985748221, 0.10993540051679596, 0.2204732510288065, 0.1780142323109489, 0.15049062049062062, 0.21081649831649846, 0.32086016179626864, 0.1012644444444444, 0.16934710991314747, 0.06160638351093103, 0.4697454206768084, 
0.055755850727387694, 0.23931261770244797, 0.09771748492678708, 0.4003539253539247, 0.10525565388397252, 0.05723507097405782, 0.3849762213575444, 0.30495854063018196, 0.07932630906768841, 0.13090614886731425, 0.11189183073093474, 0.22821049692909015, 0.12416666666666676, 0.31613036303630404, 0.5872781635802474, 0.1272598870056496, 0.12182789025935431, 0.2034722222222221, 0.07869802317655078, 0.05207792207792216, 0.2527862466124654, 0.051695176452458046, 0.08495451591942839, 0.14044685990338182, 0.297964493419039, 0.15835271961892108, 0.05516185476815401, 0.06711907347086515, 0.4495742549461557, 0.45868454661558133, 0.10237606837606833, 0.11628148148148153, 0.5484021632251714, 0.05287795992714026, 0.08405856595511765, 0.17778799019607813, 0.6546432616081527, 0.34117199391171965, 0.3424334490740741, 0.056567796610169516, 0.5858475263584749, 0.591936728395062, 0.1363396900034831, 0.08406565656565658, 0.11820987654321004, 0.0978174603174603, 0.3844355555555557, 0.22501120071684577, 0.28669848262359227, 0.05980392156862741, 0.556349206349207, 0.11256768464370817, 0.0675856568178666, 0.5905527591349747, 0.45313588850174186, 0.1887747937510969, 0.06498708010335913, 0.2465422565422566, 0.07725542153996093, 0.10866491491491496, 0.5589163237311378, 0.11199807480748077, 0.12394060727394048, 0.07590345199568511, 0.05962323014158866, 0.3491078669910784, 0.1901287147454097, 0.060007880220646195, 0.4548994974874377, 0.26987727372342735, 0.4880253849354974, 0.09858674463937632, 0.3315702947845803, 0.0857649572649572, 0.1944163860830526, 0.10378492527615339, 0.1413225172074727, 0.246654719235365, 0.18975797579757975, 0.1586280577659887, 0.12721454173067095, 0.275932684509327 ], "xbins": { "end": 1, "size": 0.1, "start": 0 } }, { "name": "Topic #95", "opacity": 0.4, "type": "histogram", "x": [ 0.07079160705770145, 0.07278379462494985, 0.0704984423676012, 0.07451144611948649, 0.06984633569739938, 0.2201065891472869, 0.169893899204244, 0.35216950527169516, 0.07882672882672871, 
0.10600448933782239, 0.17468633435062736, 0.05153846153846162, 0.09561554898093375, 0.11461780929866042, 0.07755768974256376, 0.07007458847736632, 0.0548162787293222, 0.12179987004548422, 0.057270069112174304, 0.14093613298337695, 0.11691919191919181, 0.10617777777777783, 0.07223289230331487, 0.08089870754433415, 0.09084182435710826, 0.11530284797432813, 0.1257621567145375, 0.07649523072956949, 0.06186239620403325, 0.08093093093093097, 0.06441233947515354, 0.051391941391941295, 0.18273289665211068, 0.05663794311335291, 0.16571815718157207, 0.0639803094233474, 0.06901870463428247, 0.0716925064599483, 0.053257080610021754, 0.08510848126232742, 0.05225951519140067, 0.06505668934240366, 0.056473269278147394, 0.27211387986035873, 0.08922379826635143, 0.11205065359477126, 0.08329629629629635, 0.06965996168582368, 0.1005339695523744, 0.1670123986647591, 0.07007904698285475, 0.07974006116207955, 0.11768899650255588, 0.13369243768483322, 0.053915032679738524, 0.0776752767527674, 0.05374760536398464, 0.15515112813963397, 0.05977417567193122, 0.09068857589984348, 0.08527621722846446, 0.10520202020202027, 0.10437420178799496, 0.1707386363636366, 0.06242209327315715, 0.16191666666666676, 0.059766391833529595, 0.09724485177642005, 0.11921348314606754, 0.052207697893972435, 0.08021140609636176, 0.155206302794022, 0.05744152046783631, 0.08466666666666664, 0.0796481019673039, 0.05280797101449285, 0.26957937837334855, 0.05006561679790031, 0.10246606691919204, 0.06740780267994788, 0.18250394944707732, 0.11412410394265227, 0.17088148148148155, 0.17853957636566345, 0.06321285929417932, 0.05203373015873006, 0.43535745047372926, 0.09264170566822665, 0.0771495129182549, 0.1870164609053498, 0.1248444444444446, 0.1035337552742616, 0.11920077972709558, 0.1934475597092419, 0.1464818680893836, 0.1064226519337016, 0.08176445304104882, 0.05263911998520979, 0.2936366806136681, 0.07366666666666671, 0.09981226931398755, 0.14768864013266975, 0.09349097892104326, 0.11288968824940046, 
0.08810583580613252, 0.05040600893219649, 0.061161176195721104, 0.06733973659317474, 0.05429936305732485, 0.09152967459113141, 0.0645729136020398, 0.14403422409751515, 0.09588989441930618, 0.09242063492063501, 0.11825396825396835, 0.05652784948086954, 0.08774471417384488, 0.13191214470284246, 0.22278951486697948, 0.08188010899182561, 0.06912923561859721, 0.14525775775775754, 0.0631806735914385, 0.13235449735449736, 0.053663993491399345, 0.07014950730547063, 0.2369502314814817, 0.21005933117583603, 0.13809052520453938, 0.0612112563216664, 0.30280764635603347, 0.14942618675013036, 0.08110624315443592, 0.058609834949804374, 0.06459619341563778, 0.07053778395920271, 0.09164814814814823, 0.06605734767025087, 0.08915915915915919, 0.05229534836481821, 0.25198830409356704, 0.09819207252793587, 0.05966117216117217, 0.058664459161147986, 0.15267166042446953, 0.05071202531645577, 0.24567260796311097, 0.05757507507507512, 0.07798069118715024, 0.08168640261331164, 0.07501138433515489, 0.06315453384418905, 0.06892486011191037, 0.06476990049751241, 0.23219544846050932, 0.07668178382464108, 0.09733775366686769, 0.08675710594315243, 0.08890358612580843, 0.06589414858645622, 0.06532956685499046, 0.07346978557504857, 0.12375886524822692, 0.06197037037037042, 0.0559521429047524, 0.06331830327321318, 0.3523873873873877, 0.08025173611111118, 0.1472576832151302, 0.15361762328213419, 0.058281168492436065, 0.08372900516795864, 0.07622143420015758, 0.259646697388633, 0.13167466027178254, 0.1255291005291004, 0.12534246575342453, 0.14966072943172173, 0.3494186046511634, 0.10163511187607566, 0.1596567771960443, 0.07981568016614757, 0.13011071569790433, 0.07846286701208968, 0.0942169131588927, 0.1616862745098038, 0.05179586563307496, 0.1327639421030225, 0.174612403100775, 0.06210996955859979, 0.07412024212578233, 0.06563088512241055, 0.36646849801831366, 0.05212856534695621, 0.08956944444444435, 0.06292826045807307, 0.09352409638554207, 0.07975341651812247, 0.23025889967637547, 
0.09935075608152519, 0.0564758527524485, 0.05688927269238165, 0.060675705467372276, 0.0731816774992263, 0.10934343434343431, 0.05554505356017646, 0.06388591163510786, 0.0933210784313725, 0.2927334943639287, 0.13805245766212318, 0.07788259958071264, 0.13604368932038868, 0.06691645133505583, 0.11294524189261004, 0.05636132315521629, 0.17738640702558245, 0.12370509607351714, 0.05306400150150149, 0.08266856600189926, 0.23875448028673857, 0.12529650436953807, 0.17393395440430193, 0.11430738119312428, 0.05072365445499779, 0.22758871436881914, 0.0627973281199087, 0.13195356187697552, 0.07006281900274836, 0.14989743589743584, 0.10816924519456143, 0.060828924162257526, 0.10032133676092538, 0.06437331371106211, 0.09013653483992469, 0.056873745666849065, 0.08646723646723647, 0.134100529100529, 0.057694377185902625, 0.06211617600506496, 0.0760962894817541, 0.19304085831863596, 0.1188091679123069, 0.22911567074634012, 0.06440000000000014, 0.057420091324200874, 0.06702027345591705, 0.12543859649122815, 0.05977366255144032, 0.05838192419825071, 0.08728313030638594, 0.07821041065787666, 0.10309410834321868, 0.10841780958060031, 0.18539491298527427, 0.11459116120951866, 0.13685873513459704, 0.261845076784101, 0.0880555555555556, 0.11634659350708733, 0.13350877192982424, 0.08330226778502646, 0.05229625918503681, 0.10423497267759563, 0.10251291989664076, 0.0675727866904338, 0.27014853395061744, 0.10173256078164045, 0.1135104669887279, 0.12736993932159552, 0.08463283828382842, 0.06819665442676454, 0.08241454488826598, 0.06818118369625903, 0.05846468184471681, 0.08099747474747472, 0.07787639710716637, 0.28338099229239605, 0.334918091168091, 0.08559626436781602, 0.08321111111111114, 0.06402195608782442, 0.08677884615384594, 0.1978472222222223, 0.19961477768495367, 0.19204067910573205, 0.10621262895065373, 0.0675976800976801, 0.17154696132596714, 0.19501194743130237, 0.17905826558265597, 0.06951178451178458, 0.14102393617021264, 0.06232269503546111, 0.05823244552058115, 
0.07943003781739605, 0.05115400326797382, 0.13698597000483798, 0.08472403934680416, 0.16652454780361764, 0.09349991413360816, 0.22804317055909393, 0.05273482245131732, 0.06257695690413369, 0.08725095785440616, 0.05654081801622796, 0.14028581931807738, 0.15322061191626402, 0.07897822445561145, 0.1019534050179212, 0.12425712368051736, 0.12721062129414795, 0.33191000918273666, 0.0831272084805653, 0.056848958333333297, 0.252043422733078, 0.057473941006875216, 0.06613394216133939, 0.11038274778634687, 0.11572287311432738, 0.1543286774883601, 0.13806954436450838, 0.22777777777777736, 0.10250951886465913, 0.05989387312243584, 0.050028935185185204, 0.21366843033509703, 0.05288697134334714, 0.08628245067497399, 0.061706349206349265, 0.20618801207417004, 0.05628566066066064, 0.11120545073375272, 0.22235108820160362, 0.07921538955381309, 0.07669242089771898, 0.060384894698620274, 0.05743595013015489, 0.07342684268426845, 0.08649202242345831, 0.062234169653524456, 0.07001977705861201, 0.11176084099868608, 0.05578078078078071, 0.29619514472455655, 0.06787866050161134, 0.1015451388888889, 0.17782959124928005, 0.11308373590982304, 0.08818166325835042, 0.0655965010236368, 0.1409881255301102, 0.15480693459416878, 0.20906672746618038, 0.055771144278606855, 0.09401084010840108, 0.17154528478057898, 0.27622100122100135, 0.07824074074074076, 0.05230907085745798, 0.13029350104821794, 0.080998427672956, 0.12537918871252213, 0.0918063036106124, 0.09453636117869683, 0.06631754705525207, 0.07116731159019715, 0.05215450907971687, 0.08379316816816819, 0.1146942800788955, 0.08438596491228051, 0.05554771979313592, 0.05376888888888892, 0.07998826291079808, 0.16135828625235396, 0.06576522435897438, 0.055773863294188535, 0.05700377797151984, 0.1273966075852869, 0.12684192615279916, 0.10447845804988666, 0.08699044956906586, 0.08493428912783749, 0.10271983640081796, 0.09170817192399926, 0.08228319783197809, 0.13785878587858783, 0.06459781529294946, 0.0636445473251029, 0.05355902777777785, 
0.0790047114252061, 0.24728086178437994, 0.08082600343746842, 0.06162280701754383, 0.054028224365844923, 0.20191666666666674, 0.08354519774011294, 0.06254719900959452, 0.0865422396856581, 0.17139303482587048, 0.1084267563527653, 0.10791851851851854, 0.19566562223829026, 0.06152306967984933, 0.0571350762527233, 0.08930012414649284, 0.06544816191656351, 0.05071192473938469, 0.2710957501280085, 0.1359885620915033, 0.09779342723004689, 0.13978811369509034, 0.07774138767588558, 0.3530904125341386, 0.0717453505007153, 0.10806825712392297, 0.19014336917562713, 0.19528478057889812, 0.09595845845845846, 0.09076937762504773, 0.062159137159137144, 0.07044278915787298, 0.07400431253170983, 0.0911404515882129, 0.07717755443886103, 0.2048931623931622, 0.05201714573877962, 0.10240007649646204, 0.050736731318598766, 0.1268421052631578, 0.10300387596899235, 0.08084677419354835, 0.06166161616161609, 0.060426447574334866, 0.05792181069958844, 0.05515503875968995, 0.0513888888888889, 0.053187998124707016, 0.055511716881874694, 0.07464756707594356, 0.12037820061075873, 0.14645864756158875, 0.05450450450450453, 0.23249158249158264, 0.14743719440229297, 0.07199842022116903, 0.39887654320987664, 0.0774246257642842, 0.07724358974358975, 0.0609101060859855, 0.05514277980782238, 0.05038610038610044, 0.05299957752429237, 0.06193568336425472, 0.0976548269581057, 0.13551587301587306, 0.15508156966490289, 0.09589585895858949, 0.05719752010074588, 0.057414330218068514, 0.053235294117647144, 0.18474910394265232, 0.2736111111111112, 0.09242853290183392, 0.0845111111111111, 0.19853128991060046, 0.11629232601667516, 0.11998920298390274, 0.09531321444901683, 0.05128129890453833, 0.06965230536659116, 0.10099223468507344, 0.07542153047989626, 0.09593643862202818, 0.22590090090090068, 0.06806791955159905, 0.1844824228259812, 0.33359929078014183, 0.17750405515004058, 0.08814053779807206, 0.34235880398671076, 0.2189233038348083, 0.19515723270440274, 0.05612315101479186, 0.16067303863002796, 
0.22014154959557258, 0.0673224177256436, 0.10310550583129927, 0.1445291902071563, 0.19865034526051467, 0.08892914653784233, 0.07188532926519194, 0.05300984528832626, 0.11207364341085274, 0.22449960598896768, 0.11273849607182912, 0.05598338293650798, 0.05096339113680147, 0.12679837892603854, 0.0769873532068654, 0.10208333333333337, 0.08263462849352438, 0.08803978651140225, 0.05166552589550537, 0.12370792846078027, 0.11763532763532797, 0.10106553338260664, 0.13484848484848463, 0.11748538011695896, 0.13110144927536227, 0.05142801251956176, 0.0577892325315006, 0.0834828454516841, 0.1053929539295393, 0.06831481481481493, 0.09732876712328775, 0.1086143790849673, 0.08277277277277284, 0.09025525302121046, 0.15910138248847927, 0.08282987940522182, 0.17234920634920606, 0.05251709212532579, 0.11156330749353996, 0.09082021936099188, 0.10205026455026459, 0.09921111111111114, 0.06161111111111107, 0.07152076318742986, 0.13429223307167015, 0.2400391074304119, 0.07060606060606064, 0.06363272494396419, 0.15133333333333324, 0.07798611111111112, 0.09982817869415811, 0.15329141575156469, 0.0701370271075365, 0.2020190619513419, 0.18117933723196886, 0.07903225806451619, 0.25966408268733865, 0.08287671232876707, 0.09588686049132294, 0.060517035590277816, 0.14301877219685433, 0.11894122383252827, 0.11066504460665025, 0.055407227615965504, 0.08961255845023378, 0.057507507507507535, 0.29354201917653694, 0.08051232166018157, 0.06571675302245267, 0.09566308243727593, 0.056081211419753085, 0.0673473356670556, 0.1672300670375126, 0.526631968275757, 0.06432127882599585, 0.08013853548204687, 0.13593505477308296, 0.22547789725209108, 0.0589612216194496, 0.054183535762483075, 0.06685733070348453, 0.17017973856209148, 0.08275573795409642, 0.061057068741893636, 0.05130555555555557, 0.0732590145884943, 0.38043101581113237, 0.08978237791932055, 0.062291350531107736, 0.08246460746460738, 0.06373239436619706, 0.08447940947940952, 0.09178887008836244, 0.11382428940568469, 0.0769357336430507, 
0.12079037800687287, 0.15159033078880402, 0.20224529641462552, 0.5249689633767848, 0.09494855967078177, 0.30184240362811754, 0.07776776776776781, 0.06487377391863645, 0.10856185252894575, 0.15006742179072294, 0.08555555555555551, 0.058644636015325546, 0.07974129908803282, 0.12340609840609826, 0.05868495077355837, 0.06270718232044202, 0.26370162297128624, 0.07482131254061077, 0.09452926208651395, 0.05131075392269424, 0.07393224149004489, 0.08466248506571089, 0.09049844236760118, 0.09471391359022725, 0.1150529576338929, 0.06763843050971775, 0.22681623931623934, 0.07046434494195684, 0.05065258077226166, 0.06620956399437398, 0.06829787234042561, 0.10035812672176316, 0.11130122517955214, 0.3204797414864531, 0.08789551140544509, 0.14075923166087098, 0.2749498596068993, 0.11949602122015919, 0.09325581395348843, 0.15875000000000045, 0.36795003422313505, 0.051130410807830144, 0.1119024547803618, 0.0957885304659496, 0.08269603768356884, 0.05097465886939573, 0.08457974579745792, 0.050734549138804465, 0.05515916243040432, 0.31234177215189834, 0.0632957481794691, 0.051929371231696726, 0.09148526077097534, 0.14953977646285344, 0.0506127847964582, 0.21585073194144436, 0.07249677002583975, 0.08091147111765666, 0.06252136752136758, 0.2920592750029521, 0.2686107044086548, 0.05070850202429149, 0.055079803560466566, 0.11422930283224401, 0.16500000000000026, 0.0901577503429355, 0.16834965664752902, 0.05689935064935068, 0.12008091572922844, 0.06754611754611761, 0.13143575674439853, 0.19776234567901255, 0.09301824212271959, 0.09883936861652731, 0.05370963912630581, 0.08740374037403742, 0.09129565105174856, 0.2025436700882405, 0.10966553287981855, 0.09304123711340208, 0.15727424749163876, 0.08673510466988733, 0.15002639218791244, 0.11991680767061459, 0.15714036920933477, 0.055669191919191896, 0.060603754439370855, 0.0553829252981795, 0.24125712250712247, 0.0643844149960357, 0.12240457087284219, 0.08292803479636224, 0.1318363273453096, 0.2261055369751019, 0.1288844621513946, 
0.0808075827306596, 0.09049019607843137, 0.06745571658615145, 0.4085263157894737, 0.17571301247771845, 0.08263342082239729, 0.057457264957264886, 0.2528538531129205, 0.08574791498520298, 0.07208037825059113, 0.07746013224426278, 0.14357666845589548, 0.09365904365904346, 0.35845731135236575, 0.11386411889596608, 0.11595528455284546, 0.05228233584357247, 0.0968357487922705, 0.15047061524334257, 0.13700891781152078, 0.2708376600899963, 0.12881506401270915, 0.05176433522369256, 0.17267884322678836, 0.08728898426323332, 0.1434939091915835, 0.06167407407407421, 0.09582649151614667, 0.2133053221288517, 0.0919685990338165, 0.06340544083906921, 0.17020417853751188, 0.05214097496706192, 0.09890721940214325, 0.20112603713947047, 0.1246901500326158, 0.06270498661311898, 0.05038535645472058, 0.07941782586295726, 0.08862029968900187, 0.2543325041459373, 0.10336796720384506, 0.07235690235690222, 0.06281305114638441, 0.06487068965517231, 0.2579663376360418, 0.1289462636439966, 0.05670132871172739, 0.33485491861288047, 0.1559345059345061, 0.052930616205725374, 0.2605622311229788, 0.0691308243727599, 0.1453457882661423, 0.20938438438438417, 0.09514522120905104, 0.10109800065552287, 0.10268065268065274, 0.24484064095791483, 0.088904680689786, 0.05086666666666668, 0.1674471992653811, 0.2208468176914778, 0.08123343527013246, 0.07117078715544181, 0.11251286008230457, 0.06632983023443814, 0.08305733618233635, 0.05383540868454665, 0.078047464940669, 0.15043724279835394, 0.06793606455617622, 0.07586458333333328, 0.1095712560386475, 0.05798709474346873, 0.18836336336336315, 0.07296120348376883, 0.13878890138732652, 0.24223806291016084, 0.07412062900889715, 0.06177312042581509, 0.05580065359477123, 0.23164654905601992, 0.12830520393811531, 0.0900017806267807, 0.10032497678737223, 0.09931111111111122, 0.06154011154011156, 0.18886039886039926, 0.07504947695787384, 0.17952310717797415, 0.06546639231824414, 0.11064814814814819, 0.12384708737864081, 0.09651626764886422, 0.14151312116136253, 
0.0664360587002097, 0.12518199233716473, 0.14853239894373343, 0.17834794081613423, 0.2753509827517048, 0.058154602323503155, 0.20981280193236743, 0.0835882727852136, 0.051845991561181436, 0.14223744292237428, 0.08618330194601383, 0.0631062951496388, 0.07204976595220497, 0.14624363700193077, 0.2573500491642084, 0.1377558479532165, 0.07937727306464779, 0.081345808224971, 0.09456692913385825, 0.06349533852972648, 0.09822884811416928, 0.11073833573833598, 0.09439600891213788, 0.15163447251114406, 0.1353281853281854, 0.23187523071243965, 0.05468455743879473, 0.27969483568075026, 0.08616063389159015, 0.06747773536895676, 0.14151969429747221, 0.12082611207394585, 0.1008497421472106, 0.06994331065759657, 0.05060835629017451, 0.08407992973210365, 0.11999050332383669, 0.06749833666001337, 0.1464255765199162, 0.1485269360269361, 0.125114547537228, 0.11499081726354458, 0.06025496283752004, 0.1814213564213564, 0.0840694731999082, 0.219320682068207, 0.07562134502923981, 0.0833764832793959, 0.09215830875122902, 0.27124756335282657, 0.1865125868055554, 0.05992534036012296, 0.09172443085287117, 0.056869891760547873, 0.060118814466640434, 0.0739940937615355, 0.05713824289405688, 0.0796028880866426, 0.3190624999999998, 0.0928722993827162, 0.05818363273453112, 0.08250259605399793, 0.2122732426303853, 0.08973127262600954, 0.2153903903903901, 0.05818798449612406, 0.07855282738095247, 0.07949640287769785, 0.0535102988105599, 0.08965608465608466, 0.07958152958152957, 0.07780813600485738, 0.1603896103896104, 0.050457665903890156, 0.10139833711262294, 0.05380278802788042, 0.2478724784328989, 0.09794444444444456, 0.057883040935672515, 0.3166666666666669, 0.10657726692209454, 0.419090413943355, 0.1071303587051619, 0.05750617283950614, 0.1813325932973884, 0.0634336917562724, 0.14819938515590686, 0.06651197202044654, 0.09568007662835257, 0.11898148148148135, 0.05979026577009295, 0.11944893196912561, 0.06035509736540654, 0.08056867891513554, 0.0638672719412019, 0.06112872915198493, 
0.05360302049622438, 0.13176662821991542, 0.05667675356921174, 0.20599462365591376, 0.13374756335282634, 0.063971807628524, 0.09576843198338508, 0.057165948275861994, 0.061889596602972365, 0.06783887468030687, 0.08306590752242923, 0.14380652454780354, 0.10327657378740956, 0.16013191894464834, 0.2075591985428053, 0.4096405228758172, 0.08125681570338067, 0.05120300751879706, 0.09509493670886078, 0.10906084656084652, 0.07456969417550326, 0.08518807418963414, 0.07107046070460704, 0.15743801652892575, 0.05443121693121694, 0.07389219576719584, 0.08462088698140198, 0.1389801297648014, 0.12326278659612, 0.05277777777777759, 0.06266601878846766, 0.06385217816935887, 0.06953627180899911, 0.15664942985945374, 0.059470304975922955, 0.11665079365079363, 0.11014588859416441, 0.27153195279447767, 0.11324074074074049, 0.09403917595406959, 0.07825385583884154, 0.09103703703703683, 0.06038421599169253, 0.061077593430534594, 0.1281981981981982, 0.060962301587301516, 0.07055380301437081, 0.1378072763028516, 0.07774928774928787, 0.06713520749665325, 0.12223632261703332, 0.22235277052509422, 0.06130685089234303, 0.1217768109937158, 0.054084967320261385, 0.07507062146892657, 0.09874456306840651, 0.060844444444444576, 0.0694539249146758, 0.09332070039367778, 0.07329098838683504, 0.07552864282968096, 0.1317283950617285, 0.09170565302144265, 0.11811806914546655, 0.07854030501089326, 0.15580446037435278, 0.05701825013419217, 0.1444248629293948, 0.0678092031425365, 0.08556389410047949, 0.05188774045363191, 0.09613329300262159, 0.0931980906921241, 0.05218104785773966, 0.22274701411509248, 0.07781954887218066, 0.16827200577200602, 0.07447775976949403, 0.05447530864197535, 0.14332694763729265, 0.08920792079207927, 0.15660947712418286, 0.05713392200147162, 0.26063218390804604, 0.06019458544839261, 0.05814176245210733, 0.08585776330075989, 0.08853968253968246, 0.11312292358803974, 0.05664468425259793, 0.05162551440329218, 0.06696401341872517, 0.1510513501549358, 0.07443226311667969, 
0.059985734664764624, 0.08962429008300556, 0.16630291005291012, 0.06322421624016832, 0.05935374149659863, 0.11070906432748548, 0.2804945054945056, 0.05396206431735024, 0.0734052424031547, 0.37415964351448217, 0.05453088578088577, 0.0720310633213859, 0.12516224188790556, 0.13061027423715724, 0.07092592592592595, 0.27780837753297977, 0.053060483452640306, 0.08491269046824591, 0.1302996085516412, 0.11413888888888891, 0.05380921895006404, 0.0831684188827045, 0.060380658436213895, 0.059426646325959814, 0.08315772669220946, 0.1467239967239969, 0.060230909922267956, 0.06885734314756406, 0.053252461322081585, 0.21228095937347044, 0.06196631033365722, 0.09725623582766442, 0.13518518518518507, 0.06286214581343524, 0.18932127882599584, 0.1011979166666667, 0.17817243472981165, 0.055420353982300836, 0.05265723775849141, 0.0991268737443982, 0.2845449172576835, 0.07339611283109732, 0.13770299145299159, 0.06326699834162508, 0.21216802168021676, 0.18000977517106567, 0.0676645091693635, 0.050327932098765384, 0.22432707355242548, 0.1675281293952179, 0.0570433145009416, 0.5295811287477952, 0.08398297491039419, 0.27521847690387025, 0.06321248196248198, 0.05575359599749847, 0.07382565492321579, 0.051904565456545676, 0.06354688021354682, 0.050030609121518244, 0.12568379921851508, 0.056243032329988854, 0.05572111846946289, 0.08817576223417878, 0.20323315118397084, 0.16794532627865968, 0.25788134105669647, 0.15220510795155354, 0.09893904320987654, 0.22949669966996705, 0.06182418050234134, 0.0633560231023102, 0.056021134128478714, 0.08086795001688621, 0.06470266040688569, 0.07257720979765708, 0.14188368055555564, 0.27175648702594885, 0.06028151401470093, 0.07676181980374663, 0.07891203703703699, 0.25795180722891553, 0.07732133541992693, 0.09887012876268428, 0.06851851851851858, 0.12534000513215307, 0.12566006600660073, 0.061432831136220965, 0.06647783740807006, 0.06870962532299738, 0.1567803864025435, 0.05608530290681884, 0.07717925107427866, 0.10715907478359571, 0.12247441520467839, 
0.20595833333333316, 0.06972796439850715, 0.09018199233716478, 0.055517241379310314, 0.15288194444444433, 0.10735094850948491, 0.1766905615292711, 0.10131086142322103, 0.12052319309600856, 0.08420000000000005, 0.05080738177623987, 0.09414735591206178, 0.09945394112060776, 0.13978940650925337, 0.0659119839970905, 0.22666417352281212, 0.07688778330569357, 0.05655270655270654, 0.35108585858585867 ], "xbins": { "end": 1, "size": 0.1, "start": 0 } } ], "layout": { "barmode": "overlay", "title": "Topics #43, #95", "xaxis": { "title": "Topic Score" }, "yaxis": { "title": "Number of Articles" } } }, "text/html": [ "" ], "text/vnd.plotly.v1+html": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "iplot(model.plot_topic_article_distribution(topics_of_interest))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, the histogram for topic #95 (`test, code, error, bug, problem`), a rather generic topic (at least for Hacker News content), is heavily concentrated at the low end of the score range: the topic occurs in many articles with a low score, and almost never with a high score. The histogram for topic #43 (`quantum, theory, physics, particle, universe`) is much flatter, indicating that it is the main theme of an article much more often." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computing the median of scores across all articles seems like a decent mathematical way of capturing this intuition of \"interesting-ness\". 
Sorting topics by the computed median scores in decreasing order, we get -" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Topic #99 Topic #29 Topic #38 Topic #43 Topic #56 Topic #70 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " network earth container quantum bitcoin car \n", " model space docker theory transaction vehicle \n", " learning star run physics blockchain drive \n", " neural planet service particle network tesla \n", " learn orbit image universe wright road \n", " machine moon application physicist block bike \n", " deep year deploy wave ethereum driver \n", " training mars cluster field trust model \n", " layer galaxy machine hole currency electric \n", " image telescope host state exchange wheel \n", "\n", "\n", " Topic #71 Topic #11 Topic #94 Topic #2 Topic #12 Topic #13 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " animal flight cell key food stack \n", " human fly gene certificate eat instruction \n", " specie air dna security fat register \n", " dog space human encryption sugar address \n", " bird aircraft genome encrypt diet code \n", " cat launch genetic password meat call \n", " year plane protein secure fruit memory \n", " tree drone mouse secret egg byte \n", " live pilot cancer public farmer program \n", " find rocket bacteria tls grow function \n", "\n", "\n", " Topic #75 Topic #67 Topic #30 Topic #31 Topic #92 Topic #62 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " component al stock git police drug \n", " react state tax github crime patient \n", " function attack market repository officer health \n", " var group company commit drug medical \n", " element government fund branch prison disease \n", " return country share change criminal doctor \n", " state islamic investor code arrest cancer \n", " render terrorist financial merge year treatment \n", 
" dom saudi bank request call death \n", " import iran price project law year \n", "\n", "\n", " Topic #44 Topic #52 Topic #97 Topic #77 Topic #16 Topic #39 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " government pi phone memory node company \n", " agency board network cpu system startup \n", " security usb internet core read founder \n", " nsa chip radio intel state investor \n", " surveillance power signal performance write tech \n", " fbi hardware mobile cache cluster valley \n", " snowden card device processor distribute start \n", " intelligence km channel chip latency silicon \n", " document device service op message business \n", " information controller fi gpu operation money \n", "\n", "\n", " Topic #93 Topic #86 Topic #27 Topic #50 Topic #91 Topic #10 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " energy service file security code light \n", " power datum command attack compiler laser \n", " solar aws run vulnerability rust electron \n", " cost cloud install exploit compile field \n", " battery instance script hacker function energy \n", " year server build password optimization high \n", " gas application directory attacker library fusion \n", " plant run default hack call charge \n", " fuel storage package find memory produce \n", " oil system set system performance ray \n", "\n", "\n", " Topic #48 Topic #36 Topic #21 Topic #84 Topic #87 Topic #0 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " database ship sleep device city upgrade \n", " query sea day phone san fix \n", " datum water hour camera area close \n", " table year exercise battery street add \n", " index ocean mental laptop housing al \n", " row find people vr york doc \n", " column island health screen francisco rebuild \n", " sql river depression smartphone home david \n", " data land feel home building update \n", " select site stress hardware people michael \n", "\n", "\n", " Topic #20 
Topic #22 Topic #25 Topic #80 Topic #85 Topic #33 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " int number uber brain facebook music \n", " return point amazon study ad video \n", " function matrix driver cognitive twitter sound \n", " const algorithm service memory user audio \n", " void function trip neuron post play \n", " struct vector airbnb participant site song \n", " null prime ride effect people stream \n", " type graph lyft al news record \n", " template line taxi ability content note \n", " char curve city intelligence medium listen \n", "\n", "\n", " Topic #58 Topic #34 Topic #63 Topic #73 Topic #28 Topic #57 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " thread src image law war student \n", " process llvm color court military school \n", " lock tool pixel case weapon college \n", " call gnu map legal soviet learn \n", " event clang frame rule nuclear university \n", " queue module draw state russian teach \n", " task include red lawyer force education \n", " run patch light government missile class \n", " wait solution render judge bomb high \n", " function problem blue order american teacher \n", "\n", "\n", " Topic #74 Topic #1 Topic #41 Topic #65 Topic #14 Topic #96 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " type team people game windows license \n", " function people economic player linux software \n", " haskell job money play system copyright \n", " language company dao move kernel patent \n", " monad interview contract win microsoft free \n", " return hire social world os oracle \n", " define engineer income chess run include \n", " list employee rich computer boot source \n", " lambda manager wealth level user term \n", " promise day inequality sport driver copy \n", "\n", "\n", " Topic #46 Topic #40 Topic #26 Topic #7 Topic #49 Topic #82 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " percent function server 
water app web \n", " year string network air android page \n", " job return connection temperature google browser \n", " worker variable packet flow user site \n", " rate code ip surface apps content \n", " income match client heat swift website \n", " high expression tcp bridge mobile user \n", " low list address material ios javascript \n", " increase def protocol oxygen add html \n", " growth call send chemical developer chrome \n", "\n", "\n", " Topic #23 Topic #79 Topic #15 Topic #8 Topic #19 Topic #24 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " support bank request python film book \n", " release money server library show write \n", " version card client code art century \n", " feature account http language movie world \n", " change credit response java artist history \n", " add pay application ruby netflix great \n", " fix payment url javascript star man \n", " update cash api framework world modern \n", " include transaction service read le year \n", " issue number header write disney life \n", "\n", "\n", " Topic #64 Topic #6 Topic #60 Topic #5 Topic #32 Topic #45 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " type problem email word project datum \n", " object machine message book open memory \n", " class theory send language source byte \n", " method number tor text developer file \n", " string mathematical address read build bit \n", " function computer account english community size \n", " code mathematic mail document tool hash \n", " public proof domain character development key \n", " call mathematician contact letter team set \n", " return question user write software buffer \n", "\n", "\n", " Topic #42 Topic #54 Topic #81 Topic #61 Topic #66 Topic #76 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " company design language text google china \n", " year build program mode computer country \n", " employee wall code window technology chinese \n", " 
million building programming screen machine world \n", " business part write line human united \n", " executive small programmer editor system india \n", " billion room software button world states \n", " accord shape system click ai government \n", " firm material computer display year north \n", " ceo create design key robot american \n", "\n", "\n", " Topic #9 Topic #17 Topic #90 Topic #3 Topic #53 Topic #47 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " apple datum api group child product \n", " font result bot public woman customer \n", " iphone number direct political age business \n", " mac average slack state man service \n", " design model total member study revenue \n", " phone analysis sun policy group share \n", " size sample avg campaign parent growth \n", " device show sat president male platform \n", " software distribution sms party adult result \n", " ios measure anonymous government sex software \n", "\n", "\n", " Topic #37 Topic #69 Topic #35 Topic #98 Topic #83 Topic #78 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " life story image datum yahoo package \n", " family continue uk user restaurant full \n", " year read london information coffee debian \n", " friend advertisement caption data bar text \n", " day main mr privacy house subject \n", " people times copyright access food link \n", " live newsletter british service drink send \n", " home sign japan internet mayer mbox \n", " man york year company chef mozilla \n", " house subscribe people provide club date \n", "\n", "\n", " Topic #4 Topic #68 Topic #18 Topic #89 Topic #51 Topic #95 \n", " ---------- ---------- ---------- ---------- ---------- ---------- \n", " people day thing university price test \n", " thing drive people research sell code \n", " feel ms lot science company error \n", " fact bob start paper market bug \n", " human year year researcher buy problem \n", " point august big study business fix \n", " world 
store problem scientist product check \n", " question july back publish pay fail \n", " person hour happen journal cost issue \n", " bad april talk scientific sale run \n", "\n", "\n", " Topic #55 Topic #88 Topic #72 Topic #59 \n", " ---------- ---------- ---------- ---------- \n", " day country thing system \n", " back european find problem \n", " hand europe post change \n", " run de give require \n", " head french write design \n", " sit france start approach \n", " begin germany read large \n", " walk german point level \n", " hour world article process \n", " man paris ne provide \n", "\n", "\n" ] } ], "source": [ "model.print_topics_table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This seems to give reasonably good results. Specific, focused topics are at the top, whereas common generic topics are at the bottom. It is possible that this metric of interesting-ness could be flawed for certain kinds of data, where either the notion of interesting-ness is different in the first place (as it is a subjective notion), or where the topic-article distribution is significantly different." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.8 Topic Clusters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model has a notion of similarity between topics based on a few metrics. The two basic ideas are -\n", "1. Two topics are similar if they have similar words\n", "2. Two topics are similar if they co-occur frequently in articles\n", "\n", "The first captures the notion of lexical similarity, whereas the second captures the notion of relatedness.\n", "\n", "Plotting the topic similarity matrix for the `word_doc_sim` metric which combines both (1) and (2) - " ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "