{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Building and Exploring a Map of Reddit with Python\n", "\n", "The goal of this notebook is to build and analyse a map of the 10,000 most popular subreddits on [Reddit](https://www.reddit.com). To do this we need a means to measure the similarity of two subreddits. In a great [article on FiveThirtyEight](https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/) Trevor Martin did an analysis of subreddits by considering the overlaps of users commenting on two different subreddits. Their interest was in using vector algebra on representative vectors to look at, for example, what happens if you remove ``r/politics`` from ``r/The_Donald``. Our interest is a little broader -- we want to map out and visualize the space of subreddits, and attempt to cluster subreddits into their natural groups. With that done we can then explore some of the clusters and find interesting stories to tell.\n", "\n", "The first step in all of this is acquiring the relevant data on subreddits. Reddit user ``u/Stuck_in_the_Matrix`` provided a vast amount of reddit comment data on Google's BigQuery service, and the FiveThirtyEight authors posted their code to extract the relevant commenter overlap information via BigQuery queries. I have placed the (slightly modified) code for BigQuery [on github](https://github.com/lmcinnes/subreddit_mapping/BigQuery_queries.sql). The result is a file with over 15 million counts of pairwise commenter overlap between subreddits [available here](https://github.com/lmcinnes/subreddit_mapping/subreddit-overlap.archive.bz2). This will be the starting point for our analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting set up\n", "\n", "To build a map of subreddits we'll need to get relevant data, and massage it into a form that we can use creating a map. We begin by loading all the relevant Python modules we will require." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import scipy.sparse as ss\n", "import numpy as np\n", "from sklearn.decomposition import TruncatedSVD\n", "from sklearn.preprocessing import normalize\n", "from sklearn.base import BaseEstimator\n", "from sklearn.utils import check_array\n", "from os.path import isfile\n", "import subprocess" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we need to read in the overlap data, which we can do easily with [pandas](http://pandas.pydata.org/). The result is a dataframe, each row providing a from subreddit, a to subreddit, and the number of unique commenters that the two subreddits have in common. A call to ``head`` shows us the first 5 entries of the table." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "raw_data = pd.read_csv('subreddit-overlap')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | t1_subreddit | \n", "t2_subreddit | \n", "NumOverlaps | \n", "
---|---|---|---|
0 | \n", "roblox | \n", "spaceengineers | \n", "20 | \n", "
1 | \n", "madlads | \n", "Guitar | \n", "29 | \n", "
2 | \n", "Chargers | \n", "BigBrother | \n", "29 | \n", "
3 | \n", "NetflixBestOf | \n", "celebnsfw | \n", "35 | \n", "
4 | \n", "JoeRogan | \n", "Glitch_in_the_Matrix | \n", "28 | \n", "
\n", " | x | \n", "y | \n", "subreddit | \n", "
---|---|---|---|
0 | \n", "-2.469311 | \n", "2.295230 | \n", "AskReddit | \n", "
1 | \n", "-2.801981 | \n", "2.136050 | \n", "pics | \n", "
2 | \n", "-2.734101 | \n", "2.063090 | \n", "funny | \n", "
3 | \n", "-3.564055 | \n", "2.174888 | \n", "todayilearned | \n", "
4 | \n", "-5.986312 | \n", "2.277558 | \n", "worldnews | \n", "
\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"
\\n\"+\n",
" \"