{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Interactive Data Visualization with Bokeh\n", "\n", "[Bokeh](http://bokeh.pydata.org/en/latest/) is an interactive Python library for visualizations that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.\n", "\n", " - To get started using Bokeh to make your visualizations, see the [User Guide](http://bokeh.pydata.org/en/latest/docs/user_guide.html#userguide).\n", " - To see examples of how you might use Bokeh with your own data, check out the [Gallery](http://bokeh.pydata.org/en/latest/docs/gallery.html#gallery).\n", " - A complete API reference of Bokeh is at [Reference Guide](http://bokeh.pydata.org/en/latest/docs/reference.html#refguide).\n", "\n", "The following notebook is intended to illustrate some of Bokeh's interactive utilities and is based on a [post](https://demo.bokehplots.com/apps/gapminder) by software engineer and Bokeh developer [Sarah Bird](https://twitter.com/birdsarah).\n", "\n", "\n", "## Recreating Gapminder's \"The Health and Wealth of Nations\" \n", "\n", "Gapminder started as a spin-off from Professor Hans Rosling’s teaching at the Karolinska Institute in Stockholm. Having encountered broad ignorance about the rapid health improvement in Asia, he wanted to measure that lack of awareness among students and professors. He presented the surprising results from his so-called “Chimpanzee Test” in [his first TED-talk](https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen) in 2006.\n", "\n", "[![The Best Stats You've Never Seen](http://img.youtube.com/vi/hVimVzgtD6w/0.jpg)](http://www.youtube.com/watch?v=hVimVzgtD6w \"The best stats you've ever seen | Hans Rosling\")\n", "\n", "Rosling's interactive [\"Health and Wealth of Nations\" visualization](http://www.gapminder.org/world) has since become an iconic illustration of how our assumptions about ‘first world’ and ‘third world’ countries can betray us. Mike Bostock has [recreated the visualization using D3.js](https://bost.ocks.org/mike/nations/), and in this lab, we will see that it is also possible to use Bokeh to recreate the interactive visualization in Python.\n", "\n", "\n", "### About Bokeh Widgets\n", "Widgets are interactive controls that can be added to Bokeh applications to provide a front end user interface to a visualization. They can drive new computations, update plots, and connect to other programmatic functionality. When used with the [Bokeh server](http://bokeh.pydata.org/en/latest/docs/user_guide/server.html), widgets can run arbitrary Python code, enabling complex applications. Widgets can also be used without the Bokeh server in standalone HTML documents through the browser’s Javascript runtime.\n", "\n", "To use widgets, you must add them to your document and define their functionality. Widgets can be added directly to the document root or nested inside a layout. There are two ways to program a widget’s functionality:\n", "\n", " - Use the CustomJS callback (see [CustomJS for Widgets](http://bokeh.pydata.org/en/0.12.0/docs/user_guide/interaction.html#userguide-interaction-actions-widget-callbacks). 
This will work in standalone HTML documents.\n", " - Use `bokeh serve` to start the Bokeh server and set up event handlers with `.on_change` (or for some widgets, `.on_click`).\n", " \n", "### Imports" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Science stack \n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Bokeh essentials: notebook display and figure construction \n", "from bokeh.io import show\n", "from bokeh.io import output_notebook\n", "from bokeh.plotting import figure\n", "\n", "# Layouts \n", "from bokeh.layouts import layout\n", "from bokeh.layouts import widgetbox\n", "\n", "# Data models for visualization \n", "from bokeh.models import Text\n", "from bokeh.models import Plot\n", "from bokeh.models import Slider\n", "from bokeh.models import Circle\n", "from bokeh.models import Range1d\n", "from bokeh.models import CustomJS\n", "from bokeh.models import HoverTool\n", "from bokeh.models import LinearAxis\n", "from bokeh.models import ColumnDataSource\n", "from bokeh.models import SingleIntervalTicker\n", "\n", "# Palettes and colors\n", "from bokeh.palettes import brewer\n", "from bokeh.palettes import Spectral6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To display Bokeh plots inline in a Jupyter notebook, use the `output_notebook()` function from `bokeh.io`. When `show()` is called, the plot will be displayed inline in the next notebook output cell. To save your Bokeh plots, you can use the `output_file()` function instead (or in addition); it will write an HTML file to disk that can be opened in a browser. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " Loading BokehJS ...\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(global) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " var force = \"1\";\n", "\n", " if (typeof (window._bokeh_onload_callbacks) === \"undefined\" || force !== \"\") {\n", " window._bokeh_onload_callbacks = [];\n", " window._bokeh_is_loading = undefined;\n", " }\n", "\n", "\n", " \n", " if (typeof (window._bokeh_timeout) === \"undefined\" || force !== \"\") {\n", " window._bokeh_timeout = Date.now() + 5000;\n", " window._bokeh_failed_load = false;\n", " }\n", "\n", " var NB_LOAD_WARNING = {'data': {'text/html':\n", " \"
\\n\"+\n", " \"

\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"

\\n\"+\n", " \"\\n\"+\n", " \"\\n\"+\n", " \"from bokeh.resources import INLINE\\n\"+\n", " \"output_notebook(resources=INLINE)\\n\"+\n", " \"\\n\"+\n", " \"
\"}};\n", "\n", " function display_loaded() {\n", " if (window.Bokeh !== undefined) {\n", " Bokeh.$(\"#757332f7-0cbe-48e2-bd30-7245a4c65a33\").text(\"BokehJS successfully loaded.\");\n", " } else if (Date.now() < window._bokeh_timeout) {\n", " setTimeout(display_loaded, 100)\n", " }\n", " }\n", "\n", " function run_callbacks() {\n", " window._bokeh_onload_callbacks.forEach(function(callback) { callback() });\n", " delete window._bokeh_onload_callbacks\n", " console.info(\"Bokeh: all callbacks have finished\");\n", " }\n", "\n", " function load_libs(js_urls, callback) {\n", " window._bokeh_onload_callbacks.push(callback);\n", " if (window._bokeh_is_loading > 0) {\n", " console.log(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", " return null;\n", " }\n", " if (js_urls == null || js_urls.length === 0) {\n", " run_callbacks();\n", " return null;\n", " }\n", " console.log(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", " window._bokeh_is_loading = js_urls.length;\n", " for (var i = 0; i < js_urls.length; i++) {\n", " var url = js_urls[i];\n", " var s = document.createElement('script');\n", " s.src = url;\n", " s.async = false;\n", " s.onreadystatechange = s.onload = function() {\n", " window._bokeh_is_loading--;\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: all BokehJS libraries loaded\");\n", " run_callbacks()\n", " }\n", " };\n", " s.onerror = function() {\n", " console.warn(\"failed to load library \" + url);\n", " };\n", " console.log(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", " document.getElementsByTagName(\"head\")[0].appendChild(s);\n", " }\n", " };var element = document.getElementById(\"757332f7-0cbe-48e2-bd30-7245a4c65a33\");\n", " if (element == null) {\n", " console.log(\"Bokeh: ERROR: autoload.js configured with elementid '757332f7-0cbe-48e2-bd30-7245a4c65a33' but no matching script tag was found. 
\")\n", " return false;\n", " }\n", "\n", " var js_urls = ['https://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.js', 'https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.js'];\n", "\n", " var inline_js = [\n", " function(Bokeh) {\n", " Bokeh.set_log_level(\"info\");\n", " },\n", " \n", " function(Bokeh) {\n", " \n", " Bokeh.$(\"#757332f7-0cbe-48e2-bd30-7245a4c65a33\").text(\"BokehJS is loading...\");\n", " },\n", " function(Bokeh) {\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-0.12.3.min.css\");\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.12.3.min.css\");\n", " }\n", " ];\n", "\n", " function run_inline_js() {\n", " \n", " if ((window.Bokeh !== undefined) || (force === \"1\")) {\n", " for (var i = 0; i < inline_js.length; i++) {\n", " inline_js[i](window.Bokeh);\n", " }if (force === \"1\") {\n", " display_loaded();\n", " }} else if (Date.now() < window._bokeh_timeout) {\n", " setTimeout(run_inline_js, 100);\n", " } else if (!window._bokeh_failed_load) {\n", " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", " window._bokeh_failed_load = true;\n", " } else if (!force) {\n", " var cell = $(\"#757332f7-0cbe-48e2-bd30-7245a4c65a33\").parents('.cell').data().cell;\n", " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", " }\n", "\n", " }\n", "\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", " run_inline_js();\n", " } else {\n", " load_libs(js_urls, function() {\n", " console.log(\"Bokeh: BokehJS plotting callback run at\", now());\n", " run_inline_js();\n", " });\n", " }\n", "}(this));" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Load Bokeh for visualization\n", "output_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get the data\n", "\n", "Some of Bokeh examples rely on sample data that is not included in the Bokeh GitHub repository or released packages, due to their size. Once Bokeh is installed, the sample data can be obtained by executing the command in the next cell. The location that the sample data is stored can be configured. By default, data is downloaded and stored to a directory $HOME/.bokeh/data. (The directory is created if it does not already exist.) 
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating /Users/benjamin/.bokeh directory\n", "Creating /Users/benjamin/.bokeh/data directory\n", "Using data directory: /Users/benjamin/.bokeh/data\n", "Downloading: CGM.csv (1589982 bytes)\n", " 1589982 [100.00%]\n", "Downloading: US_Counties.zip (3182088 bytes)\n", " 3182088 [100.00%]\n", "Unpacking: US_Counties.csv\n", "Downloading: us_cities.json (713565 bytes)\n", " 713565 [100.00%]\n", "Downloading: unemployment09.csv (253301 bytes)\n", " 253301 [100.00%]\n", "Downloading: AAPL.csv (166698 bytes)\n", " 166698 [100.00%]\n", "Downloading: FB.csv (9706 bytes)\n", " 9706 [100.00%]\n", "Downloading: GOOG.csv (113894 bytes)\n", " 113894 [100.00%]\n", "Downloading: IBM.csv (165625 bytes)\n", " 165625 [100.00%]\n", "Downloading: MSFT.csv (161614 bytes)\n", " 161614 [100.00%]\n", "Downloading: WPP2012_SA_DB03_POPULATION_QUINQUENNIAL.zip (5148539 bytes)\n", " 5148539 [100.00%]\n", "Unpacking: WPP2012_SA_DB03_POPULATION_QUINQUENNIAL.csv\n", "Downloading: gapminder_fertility.csv (64346 bytes)\n", " 64346 [100.00%]\n", "Downloading: gapminder_population.csv (94509 bytes)\n", " 94509 [100.00%]\n", "Downloading: gapminder_life_expectancy.csv (73243 bytes)\n", " 73243 [100.00%]\n", "Downloading: gapminder_regions.csv (7781 bytes)\n", " 7781 [100.00%]\n", "Downloading: world_cities.zip (646858 bytes)\n", " 646858 [100.00%]\n", "Unpacking: world_cities.csv\n", "Downloading: airports.json (6373 bytes)\n", " 6373 [100.00%]\n", "Downloading: movies.db.zip (5067833 bytes)\n", " 5067833 [100.00%]\n", "Unpacking: movies.db\n" ] } ], "source": [ "import bokeh.sampledata\n", "bokeh.sampledata.download()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare the data \n", " \n", "In order to create an interactive plot in Bokeh, we need to animate snapshots of the data over time from 1964 to 2013. In order to do this, we can think of each year as a separate static plot. We can then use a JavaScript `Callback` to change the data source that is driving the plot. \n", "\n", "#### JavaScript Callbacks\n", "\n", "Bokeh exposes various [callbacks](http://bokeh.pydata.org/en/latest/docs/user_guide/interaction/callbacks.html#userguide-interaction-callbacks), which can be specified from Python, that trigger actions inside the browser’s JavaScript runtime. This kind of JavaScript callback can be used to add interesting interactions to Bokeh documents without the need to use a Bokeh server (but can also be used in conjuction with a Bokeh server). Custom callbacks can be set using a [`CustomJS` object](http://bokeh.pydata.org/en/latest/docs/user_guide/interaction/callbacks.html#customjs-for-widgets) and passing it as the callback argument to a `Widget` object.\n", "\n", "As the data we will be using today is not too big, we can pass all the datasets to the JavaScript at once and switch between them on the client side using a slider widget. \n", "\n", "This means that we need to put all of the datasets together build a single data source for each year. 
First we will load each of the datasets with the `process_data()` function and do a bit of clean up:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def process_data():\n", "    \n", "    # Import the Gapminder data sets\n", "    from bokeh.sampledata.gapminder import fertility, life_expectancy, population, regions\n", "    \n", "    # The columns are currently string values for each year, \n", "    # make them ints for data processing and visualization.\n", "    columns = list(fertility.columns)\n", "    years = list(range(int(columns[0]), int(columns[-1])))\n", "    rename_dict = dict(zip(columns, years))\n", "    \n", "    # Apply the integer year column names to the data sets. \n", "    fertility = fertility.rename(columns=rename_dict)\n", "    life_expectancy = life_expectancy.rename(columns=rename_dict)\n", "    population = population.rename(columns=rename_dict)\n", "    regions = regions.rename(columns=rename_dict)\n", "\n", "    # Turn population into bubble sizes: the square root keeps the bubble \n", "    # area (not its radius) proportional to population. Use min_size and \n", "    # scale_factor to tweak.\n", "    scale_factor = 200\n", "    population_size = np.sqrt(population / np.pi) / scale_factor\n", "    min_size = 3\n", "    population_size = population_size.where(population_size >= min_size).fillna(min_size)\n", "\n", "    # Use pandas categories to categorize & color the regions\n", "    regions.Group = regions.Group.astype('category')\n", "    regions_list = list(regions.Group.cat.categories)\n", "\n", "    def get_color(r):\n", "        return Spectral6[regions_list.index(r.Group)]\n", "    regions['region_color'] = regions.apply(get_color, axis=1)\n", "\n", "    return fertility, life_expectancy, population_size, regions, years, regions_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we will add each of our sources to the `sources` dictionary, where each key is the year (prefixed with an underscore) and each value is a `ColumnDataSource` built from a dataframe of that year's values.\n", "\n", "_Note that we need the prefixing because JavaScript identifiers cannot begin with a number._" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Process the data and fetch the data frames and lists \n", "fertility_df, life_expectancy_df, population_df_size, regions_df, years, regions = process_data()\n", "\n", "# Create a data source dictionary whose keys are prefixed years\n", "# and whose values are ColumnDataSource objects that merge the \n", "# various per-year values from each data frame. \n", "sources = {}\n", "\n", "# Quick helper variables \n", "region_color = regions_df['region_color']\n", "region_color.name = 'region_color'\n", "\n", "# Create a source for each year. \n", "for year in years:\n", "    # Extract the fertility for each country for this year.\n", "    fertility = fertility_df[year]\n", "    fertility.name = 'fertility'\n", "    \n", "    # Extract life expectancy for each country for this year. \n", "    life = life_expectancy_df[year]\n", "    life.name = 'life' \n", "    \n", "    # Extract the normalized population size for each country for this year. 
\n", " population = population_df_size[year]\n", " population.name = 'population' \n", " \n", " # Create a dataframe from our extraction and add to our sources \n", " new_df = pd.concat([fertility, life, population, region_color], axis=1)\n", " sources['_' + str(year)] = ColumnDataSource(new_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see what's in the `sources` dictionary by running the cell below.\n", "\n", "Later we will be able to pass this `sources` dictionary to the JavaScript Callback. In so doing, we will find that in our JavaScript we have objects named by year that refer to a corresponding `ColumnDataSource`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'_1964': ColumnDataSource(id='128797c9-c1ff-40ef-8cca-924aa30eaa9e', ...),\n", " '_1965': ColumnDataSource(id='fcaf9623-f5ab-4366-ba57-1dc23b920003', ...),\n", " '_1966': ColumnDataSource(id='dbafbed0-9bac-4de2-8032-4536df3773a5', ...),\n", " '_1967': ColumnDataSource(id='638521cd-3270-450b-aa32-b538846f2a07', ...),\n", " '_1968': ColumnDataSource(id='2339feb4-91cb-4c36-9564-88a973a1cccb', ...),\n", " '_1969': ColumnDataSource(id='4b6a7ad2-99b3-44fa-b50a-2adcf18b80a8', ...),\n", " '_1970': ColumnDataSource(id='c7f48b7f-9985-4035-a73b-b144827ab913', ...),\n", " '_1971': ColumnDataSource(id='4704bc4d-1f2c-4fbe-b11e-b613017e557d', ...),\n", " '_1972': ColumnDataSource(id='8e4ac357-c1e5-4078-b17e-ff2668c2c2ba', ...),\n", " '_1973': ColumnDataSource(id='432483b6-c175-453a-8f7c-0495bdea8797', ...),\n", " '_1974': ColumnDataSource(id='59d7244c-f030-4347-aa76-b73a1c166359', ...),\n", " '_1975': ColumnDataSource(id='7eb64100-122d-4806-a464-90291155f3c8', ...),\n", " '_1976': ColumnDataSource(id='3e0e0d94-8a5b-40e8-a097-eea8df436477', ...),\n", " '_1977': ColumnDataSource(id='819fdb97-2d8a-49c5-9379-02188d9b7e35', ...),\n", " '_1978': ColumnDataSource(id='f07b28e1-4d00-4d8e-a76f-f50a526c791d', ...),\n", " '_1979': ColumnDataSource(id='f9ba4f95-2257-4015-ba92-9158ec431fd6', ...),\n", " '_1980': ColumnDataSource(id='c91bbe51-f2f9-40e4-8a92-14d7bc32963a', ...),\n", " '_1981': ColumnDataSource(id='798d83cb-2b3b-47c6-8d98-a361032537f8', ...),\n", " '_1982': ColumnDataSource(id='c0e69480-474e-485e-b334-65ec1797d277', ...),\n", " '_1983': ColumnDataSource(id='ef414232-e7c7-4d2f-a532-ce3f89ec85c9', ...),\n", " '_1984': ColumnDataSource(id='d730c8eb-d0b5-4ef3-84bb-be25a579f996', ...),\n", " '_1985': ColumnDataSource(id='a49d0414-8d41-46ce-b7de-2b6a034e7925', ...),\n", " '_1986': ColumnDataSource(id='44fb667e-a0b1-412a-b5d7-aad9a67b1138', ...),\n", " '_1987': ColumnDataSource(id='f405735d-0d2e-48b8-9e61-1cae920f8565', ...),\n", " '_1988': ColumnDataSource(id='1821bf62-fe61-4006-be2a-9ac91c089cd7', ...),\n", " '_1989': ColumnDataSource(id='b131dc0d-f790-47e5-9e96-559120d409c7', ...),\n", " '_1990': ColumnDataSource(id='bfe113ac-abd4-4fbd-bb59-f1e7805a278f', ...),\n", " '_1991': ColumnDataSource(id='9cbe0f63-cfd8-4831-9553-0e1816413a6c', ...),\n", " '_1992': ColumnDataSource(id='01086edd-7aff-4b4a-bd9b-b6c6bee653a1', ...),\n", " '_1993': ColumnDataSource(id='08ee3746-9b18-4971-817d-fc39d2f00ac7', ...),\n", " '_1994': ColumnDataSource(id='9a38ce2e-abd0-490d-9bd7-e3290afaa45d', ...),\n", " '_1995': ColumnDataSource(id='d7566b0a-c44b-44df-a6f1-338af7427256', ...),\n", " '_1996': ColumnDataSource(id='49f6af2d-d294-45e1-9818-2a289501531b', ...),\n", " '_1997': ColumnDataSource(id='40537e00-289f-4963-aceb-9db53f086319', ...),\n", " '_1998': 
ColumnDataSource(id='29900086-9343-4e72-b970-3f6998f00fb7', ...),\n", " '_1999': ColumnDataSource(id='8466426f-f5c3-45a9-be15-7d4afc5bc8c3', ...),\n", " '_2000': ColumnDataSource(id='c68b2c66-a620-4330-a28c-06bf22966e59', ...),\n", " '_2001': ColumnDataSource(id='c1ddaaf2-0d47-47a6-be89-8451320b07a3', ...),\n", " '_2002': ColumnDataSource(id='b22a9525-f422-462d-98b3-e3257b9f1361', ...),\n", " '_2003': ColumnDataSource(id='f6c2e16b-6881-497a-a441-95c75cc0085d', ...),\n", " '_2004': ColumnDataSource(id='a3f3a02f-3b8a-4ca4-8818-a221fd85808d', ...),\n", " '_2005': ColumnDataSource(id='2988e9ca-5daa-4f58-afe9-a5b8533a7b72', ...),\n", " '_2006': ColumnDataSource(id='d0c61b85-a640-43a6-ae00-3fdb912c2bef', ...),\n", " '_2007': ColumnDataSource(id='1fdb0348-6cee-4983-a954-1b15514954ae', ...),\n", " '_2008': ColumnDataSource(id='74df0458-c72f-4b82-a328-88eb34a186d0', ...),\n", " '_2009': ColumnDataSource(id='f102827a-cb23-489b-a2ec-1a44509e9d77', ...),\n", " '_2010': ColumnDataSource(id='f3f99dde-4221-4158-b9f9-329a12c83a2a', ...),\n", " '_2011': ColumnDataSource(id='872d54a0-ab97-41cb-bbda-ea0df8e6b485', ...),\n", " '_2012': ColumnDataSource(id='a0fe0918-569c-446f-9029-7df5ee7ddf8a', ...)}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also create a corresponding `dictionary_of_sources` object, where the keys are the integer years and the values are the *names* of the corresponding ColumnDataSources from above: " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dictionary_of_sources = dict(zip(years, ['_%s' % x for x in years]))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'{1964: _1964, 1965: _1965, 1966: _1966, 1967: _1967, 1968: _1968, 1969: _1969, 1970: _1970, 1971: _1971, 1972: _1972, 1973: _1973, 1974: _1974, 1975: _1975, 1976: _1976, 1977: _1977, 1978: _1978, 1979: _1979, 1980: _1980, 1981: _1981, 1982: _1982, 1983: _1983, 1984: _1984, 1985: _1985, 1986: _1986, 1987: _1987, 1988: _1988, 1989: _1989, 1990: _1990, 1991: _1991, 1992: _1992, 1993: _1993, 1994: _1994, 1995: _1995, 1996: _1996, 1997: _1997, 1998: _1998, 1999: _1999, 2000: _2000, 2001: _2001, 2002: _2002, 2003: _2003, 2004: _2004, 2005: _2005, 2006: _2006, 2007: _2007, 2008: _2008, 2009: _2009, 2010: _2010, 2011: _2011, 2012: _2012}'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "js_source_array = str(dictionary_of_sources).replace(\"'\", \"\")\n", "js_source_array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a string defining a JavaScript object that lets us look up, by year, each of our `ColumnDataSources`. Stripping the quotes means that, when this string is substituted into the JavaScript callback below, each year maps to a variable reference rather than to a string literal.\n", "\n", "### Build the plot\n", "\n", "First we need to create a `Plot` object. We'll start with a basic frame, only specifying things like plot height, width, and ranges for the axes."
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "xdr = Range1d(1, 9)\n", "ydr = Range1d(20, 100)\n", "\n", "plot = Plot(\n", "    x_range=xdr,\n", "    y_range=ydr,\n", "    plot_width=800,\n", "    plot_height=400,\n", "    outline_line_color=None,\n", "    toolbar_location=None, \n", "    min_border=20,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To display the plot in the notebook, use the `show()` function:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# show(plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build the axes\n", "\n", "Next we can make some stylistic modifications to the plot axes (e.g. by specifying the text font, size, and color, and by adding labels), to make the plot look more like the one in Hans Rosling's video." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create a dictionary of our common settings. \n", "AXIS_FORMATS = dict(\n", "    minor_tick_in=None,\n", "    minor_tick_out=None,\n", "    major_tick_in=None,\n", "    major_label_text_font_size=\"10pt\",\n", "    major_label_text_font_style=\"normal\",\n", "    axis_label_text_font_size=\"10pt\",\n", "\n", "    axis_line_color='#AAAAAA',\n", "    major_tick_line_color='#AAAAAA',\n", "    major_label_text_color='#666666',\n", "\n", "    major_tick_line_cap=\"round\",\n", "    axis_line_cap=\"round\",\n", "    axis_line_width=1,\n", "    major_tick_line_width=1,\n", ")\n", "\n", "\n", "# Create two axis models for the x and y axes. \n", "xaxis = LinearAxis(\n", "    ticker=SingleIntervalTicker(interval=1), \n", "    axis_label=\"Children per woman (total fertility)\", \n", "    **AXIS_FORMATS\n", ")\n", "\n", "yaxis = LinearAxis(\n", "    ticker=SingleIntervalTicker(interval=20), \n", "    axis_label=\"Life expectancy at birth (years)\", \n", "    **AXIS_FORMATS\n", ") \n", "\n", "# Add the axes to the plot in the specified positions.\n", "plot.add_layout(xaxis, 'below')\n", "plot.add_layout(yaxis, 'left')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Go ahead and experiment with visualizing each step of the building process and changing various settings." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# show(plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add the background year text\n", "\n", "One of the features of Rosling's animation is that the year appears as the text background of the plot. We will add this feature to our plot first so it will be layered below all the other glyphs (these will be incrementally added, layer by layer, on top of each other until we are finished)." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
GlyphRenderer(id='92046da9-54f9-4a43-9ad4-76e4c959a48e', ...)
\n", "\n" ], "text/plain": [ "GlyphRenderer(id='92046da9-54f9-4a43-9ad4-76e4c959a48e', ...)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a data source for each of our years to display. \n", "text_source = ColumnDataSource({'year': ['%s' % years[0]]})\n", "\n", "# Create a text object model and add to the figure. \n", "text = Text(x=2, y=35, text='year', text_font_size='150pt', text_color='#EEEEEE')\n", "plot.add_glyph(text_source, text)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# show(plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add the bubbles and hover\n", "Next we will add the bubbles using Bokeh's [`Circle`](http://bokeh.pydata.org/en/latest/docs/reference/plotting.html#bokeh.plotting.figure.Figure.circle) glyph. We start from the first year of data, which is our source that drives the circles (the other sources will be used later). " ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Select the source for the first year we have. \n", "renderer_source = sources['_%s' % years[0]]\n", "\n", "# Create a circle glyph to generate points for the scatter plot. \n", "circle_glyph = Circle(\n", " x='fertility', y='life', size='population',\n", " fill_color='region_color', fill_alpha=0.8, \n", " line_color='#7c7e71', line_width=0.5, line_alpha=0.5\n", ")\n", "\n", "# Connect the glyph generator to the data source and add to the plot\n", "circle_renderer = plot.add_glyph(renderer_source, circle_glyph)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above, `plot.add_glyph` returns the renderer, which we can then pass to the `HoverTool` so that hover only happens for the bubbles on the page and not other glyph elements:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Add the hover (only against the circle and not other plot elements)\n", "tooltips = \"@index\"\n", "plot.add_tools(HoverTool(tooltips=tooltips, renderers=[circle_renderer]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Test out different parameters for the `Circle` glyph and see how it changes the plot:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# show(plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add the legend\n", "\n", "Next we will manually build a legend for our plot by adding circles and texts to the upper-righthand portion:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Position of the legend \n", "text_x = 7\n", "text_y = 95\n", "\n", "# For each region, add a circle with the color and text. 
\n", "for i, region in enumerate(regions):\n", " plot.add_glyph(Text(x=text_x, y=text_y, text=[region], text_font_size='10pt', text_color='#666666'))\n", " plot.add_glyph(\n", " Circle(x=text_x - 0.1, y=text_y + 2, fill_color=Spectral6[i], size=10, line_color=None, fill_alpha=0.8)\n", " )\n", " \n", " # Move the y coordinate down a bit.\n", " text_y = text_y - 5" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# show(plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add the slider and callback\n", "Next we add the slider widget and the JavaScript callback code, which changes the data of the `renderer_source` (powering the bubbles / circles) and the data of the `text_source` (powering our background text). After we've `set()` the data we need to `trigger()` a change. `slider`, `renderer_source`, `text_source` are all available because we add them as args to `Callback`. \n", "\n", "It is the combination of `sources = %s % (js_source_array)` in the JavaScript and `Callback(args=sources...)` that provides the ability to look-up, by year, the JavaScript version of our Python-made `ColumnDataSource`." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Add the slider\n", "code = \"\"\"\n", " var year = slider.get('value'),\n", " sources = %s,\n", " new_source_data = sources[year].get('data');\n", " renderer_source.set('data', new_source_data);\n", " text_source.set('data', {'year': [String(year)]});\n", "\"\"\" % js_source_array\n", "\n", "callback = CustomJS(args=sources, code=code)\n", "slider = Slider(start=years[0], end=years[-1], value=1, step=1, title=\"Year\", callback=callback)\n", "callback.args[\"renderer_source\"] = renderer_source\n", "callback.args[\"slider\"] = slider\n", "callback.args[\"text_source\"] = text_source" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# show(widgetbox(slider))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Putting all the pieces together\n", "\n", "Last but not least, we put the chart and the slider together in a layout and display it inline in the notebook." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show(layout([[plot], [slider]], sizing_mode='scale_width'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I hope that you'll use Bokeh to produce interactive visualizations for visual analysis:\n", "\n", "![The Visual Analytics Mantra](figures/visual_analytics_mantra.png)\n", "\n", "## Topic Model Visualization\n", "\n", "In this section we'll take a look at visualizing a corpus by exploring clustering and dimensionality reduction techniques. Text analysis is certainly high dimensional visualization and this can be applied to other data sets as well. \n", "\n", "The first step is to load our documents from disk and vectorize them using Gensim. This content is a bit beyond the scope of the workshop for today, however I did want to provide code for reference, and I'm happy to go over it offline. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.5/site-packages/gensim/utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.\n", " warnings.warn(\"Pattern library is not installed, lemmatization won't be available.\")\n" ] } ], "source": [ "import nltk \n", "import string\n", "import pickle\n", "import gensim\n", "import random \n", "\n", "from operator import itemgetter\n", "from collections import defaultdict \n", "from nltk.corpus import wordnet as wn\n", "from gensim.matutils import sparse2full\n", "from nltk.corpus.reader.api import CorpusReader\n", "from nltk.corpus.reader.api import CategorizedCorpusReader\n", "\n", "CORPUS_PATH = \"data/baleen_sample\"\n", "PKL_PATTERN = r'(?!\\.)[a-z_\\s]+/[a-f0-9]+\\.pickle'\n", "CAT_PATTERN = r'([a-z_\\s]+)/.*'" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class PickledCorpus(CategorizedCorpusReader, CorpusReader):\n", " \n", " def __init__(self, root, fileids=PKL_PATTERN, cat_pattern=CAT_PATTERN):\n", " CategorizedCorpusReader.__init__(self, {\"cat_pattern\": cat_pattern})\n", " CorpusReader.__init__(self, root, fileids)\n", " \n", " self.punct = set(string.punctuation) | {'“', '—', '’', '”', '…'}\n", " self.stopwords = set(nltk.corpus.stopwords.words('english'))\n", " self.wordnet = nltk.WordNetLemmatizer() \n", " \n", " def _resolve(self, fileids, categories):\n", " if fileids is not None and categories is not None:\n", " raise ValueError(\"Specify fileids or categories, not both\")\n", "\n", " if categories is not None:\n", " return self.fileids(categories=categories)\n", " return fileids\n", " \n", " def lemmatize(self, token, tag):\n", " token = token.lower()\n", " \n", " if token not in self.stopwords:\n", " if not all(c in self.punct for c in token):\n", " tag = {\n", " 'N': wn.NOUN,\n", " 'V': wn.VERB,\n", " 'R': wn.ADV,\n", " 'J': wn.ADJ\n", " }.get(tag[0], wn.NOUN)\n", " return self.wordnet.lemmatize(token, tag)\n", " \n", " def tokenize(self, doc):\n", " # Expects a preprocessed document, removes stopwords and punctuation\n", " # makes all tokens lowercase and lemmatizes them. 
\n", " return list(filter(None, [\n", " self.lemmatize(token, tag)\n", " for paragraph in doc \n", " for sentence in paragraph \n", " for token, tag in sentence \n", " ]))\n", " \n", " def docs(self, fileids=None, categories=None):\n", " # Resolve the fileids and the categories\n", " fileids = self._resolve(fileids, categories)\n", "\n", " # Create a generator, loading one document into memory at a time.\n", " for path, enc, fileid in self.abspaths(fileids, True, True):\n", " with open(path, 'rb') as f:\n", " yield self.tokenize(pickle.load(f))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `PickledCorpus` is a Python class that reads a continuous stream of pickle files from disk. The files themselves are preprocessed documents from RSS feeds in various topics (and is actually just a small sample of the documents that are in the larger corpus). If you're interestd in the ingestion and curation of this corpus, see [baleen.districtdatalabs.com](http://baleen.districtdatalabs.com). \n", "\n", "Just to get a feel for this data set, I'll load the corpus and print out the number of documents per category:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create the Corpus Reader\n", "corpus = PickledCorpus(CORPUS_PATH)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "books: 71 documents\n", "business: 389 documents\n", "cinema: 100 documents\n", "cooking: 30 documents\n", "data_science: 41 documents\n", "design: 55 documents\n", "do_it_yourself: 122 documents\n", "gaming: 128 documents\n", "news: 1,159 documents\n", "politics: 149 documents\n", "sports: 118 documents\n", "tech: 176 documents\n", "\n", "2,538 documents in the corpus\n" ] } ], "source": [ "# Count the total number of documents\n", "total_docs = 0\n", "\n", "# Count the number of documents per category. 
\n", "for category in corpus.categories():\n", " num_docs = sum(1 for doc in corpus.fileids(categories=[category]))\n", " total_docs += num_docs \n", " \n", " print(\"{}: {:,} documents\".format(category, num_docs))\n", " \n", "print(\"\\n{:,} documents in the corpus\".format(total_docs))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our corpus reader object handles text preprocessing with NLTK (the natural language toolkit), namely by converting each document as follows:\n", "\n", "- tokenizing the document \n", "- making all tokens lower case \n", "- removes stopwords and punctuation \n", "- converts words to their lemma \n", "\n", "Here is an example document:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "car bomb explosion turkish capital ankara leave least 34 people dead 100 injured accord turkey health minister today attack targeted civilian bus stop interior minister efkan ala say health minister mehmet muezzinoglu say 30 victim die scene four die hospital muezzinoglu also say 125 people wound 19 serious condition united state condemn attack take innocent life injured score national security council spokesman ned price say thought prayer go kill injure well love one price statement say horrific act recent many terrorist attack perpetrate turkish people united state stand together turkey nato ally value partner confront scourge terrorism explosion occur city main boulevard ataturk bulvari near city main square kizilay associated press report two day ago u embassy say potential terrorist plot attack turkish government building housing locate bahcelievler area ankara u embassy say american avoid area immediately clear whether u embassy warning relate attack associated press contribute report\n" ] } ], "source": [ "fid = random.choice(corpus.fileids())\n", "doc = next(corpus.docs(fileids=[fid]))\n", "print(\" \".join(doc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to convert these documents into vectors so that we can apply machine learning. We'll use a bag-of-words (bow) model with TF-IDF, implemented by the Gensim library." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "# Create the lexicon from the corpus \n", "lexicon = gensim.corpora.Dictionary(corpus.docs())\n", "\n", "# Create the document vectors \n", "docvecs = [lexicon.doc2bow(doc) for doc in corpus.docs()]\n", "\n", "# Train the TF-IDF model and convert vectors to TF-IDF\n", "tfidf = gensim.models.TfidfModel(docvecs, id2word=lexicon, normalize=True)\n", "tfidfvecs = [tfidf[doc] for doc in docvecs]\n", "\n", "# Save the lexicon and TF-IDF model to disk.\n", "lexicon.save('data/topics/lexicon.dat')\n", "tfidf.save('data/topics/tfidf_model.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Documents are now described by the words that are most important to that document relative to the rest of the corpus. 
The document above has been transformed into the following vector with associated weights: " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "embassy (0.29) muezzinoglu (0.27) turkish (0.26) attack (0.22) ankara (0.20) injured (0.18) minister (0.17) explosion (0.16) bahcelievler (0.15) turkey (0.15) efkan (0.14) bulvari (0.14) kizilay (0.14) ataturk (0.14) terrorist (0.14) scourge (0.13) perpetrate (0.13) ned (0.12) targeted (0.12) mehmet (0.12) associated (0.11) main (0.11) boulevard (0.11) health (0.11) ala (0.11) nato (0.10) horrific (0.10) price (0.10) 125 (0.10) die (0.10) prayer (0.10) innocent (0.09) united (0.09) interior (0.09) area (0.09) condemn (0.08) confront (0.08) press (0.08) civilian (0.08) housing (0.08) wound (0.08) terrorism (0.08) plot (0.08) bus (0.08) warning (0.08) injure (0.07) city (0.07) council (0.07) locate (0.07) 34 (0.07) bomb (0.07) ally (0.07) hospital (0.07) square (0.07) occur (0.06) thought (0.06) say (0.06) score (0.06) victim (0.06) dead (0.06) relate (0.06) condition (0.06) spokesman (0.06) 19 (0.06) u (0.06) avoid (0.06) building (0.06) serious (0.06) people (0.06) scene (0.06) partner (0.06) value (0.05) immediately (0.05) potential (0.05) report (0.05) capital (0.05) car (0.05) contribute (0.05) 100 (0.05) act (0.05) near (0.05) state (0.05) together (0.05) security (0.05) kill (0.05) stand (0.04) love (0.04) stop (0.04) clear (0.04) 30 (0.04) ago (0.04) recent (0.04) statement (0.04) national (0.04) whether (0.04) least (0.04) today (0.04) american (0.04) government (0.04) four (0.03) life (0.03) leave (0.03) accord (0.03) many (0.02) well (0.02) day (0.02) two (0.02) go (0.02) take (0.02) also (0.01) one (0.01)\n" ] } ], "source": [ "# Convert the random document from above into a TF-IDF vector \n", "dv = tfidf[lexicon.doc2bow(doc)]\n", "\n", "# Print the document terms and their weights. \n", "print(\" \".join([\n", "    \"{} ({:0.2f})\".format(lexicon[tid], score)\n", "    for tid, score in sorted(dv, key=itemgetter(1), reverse=True)\n", "]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Topic Visualization with LDA\n", "\n", "We have a lot of documents in our corpus, so let's see if we can cluster them into related topics using the Latent Dirichlet Allocation (LDA) model that comes with Gensim. This model is widely used for \"topic modeling\" -- that is, clustering on documents. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Select the number of topics to train the model on.\n", "NUM_TOPICS = 10 \n", "\n", "# Create the LDA model from the docvecs corpus and save to disk.\n", "model = gensim.models.LdaModel(docvecs, id2word=lexicon, alpha='auto', num_topics=NUM_TOPICS)\n", "model.save('data/topics/lda_model.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each topic is represented as a vector, where each word is a dimension and the probability of that word belonging to the topic is the value. 
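\n", "\n", "For instance, one quick way to inspect a topic is to ask the model for its most probable terms; here is a small illustrative sketch using `get_topic_terms` (the same method used for the topic vectors further below):\n", "\n", "```python\n", "# Print the ten most probable terms for topic 0\n", "for term_id, prob in model.get_topic_terms(0, topn=10):\n", "    print(lexicon[term_id], round(prob, 4))\n", "```\n", "\n", "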
We can use the model to query the topics for a document; our random document from above is assigned the following topics with associated probabilities:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(2, 0.72882756700044149), (8, 0.2632769507616482)]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model[lexicon.doc2bow(doc)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can assign the most probable topic to each document in our corpus by selecting the topic with the maximal probability: " ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "topics = [\n", "    max(model[doc], key=itemgetter(1))[0]\n", "    for doc in docvecs\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Topics themselves can be described by their highest probability words:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic 0:\n", "0.010*\"game\" + 0.007*\"say\" + 0.006*\"team\" + 0.005*\"get\" + 0.005*\"one\" + 0.005*\"season\" + 0.005*\"go\" + 0.005*\"first\" + 0.005*\"make\" + 0.005*\"new\"\n", "\n", "Topic 1:\n", "0.007*\"data\" + 0.006*\"say\" + 0.004*\"one\" + 0.004*\"use\" + 0.004*\"also\" + 0.003*\"make\" + 0.003*\"like\" + 0.003*\"people\" + 0.003*\"new\" + 0.003*\"find\"\n", "\n", "Topic 2:\n", "0.009*\"say\" + 0.006*\"year\" + 0.005*\"one\" + 0.004*\"people\" + 0.004*\"state\" + 0.004*\"two\" + 0.003*\"eng\" + 0.003*\"also\" + 0.003*\"time\" + 0.003*\"get\"\n", "\n", "Topic 3:\n", "0.011*\"say\" + 0.008*\"year\" + 0.004*\"state\" + 0.003*\"take\" + 0.003*\"also\" + 0.003*\"make\" + 0.003*\"time\" + 0.003*\"would\" + 0.003*\"go\" + 0.003*\"new\"\n", "\n", "Topic 4:\n", "0.014*\"trump\" + 0.012*\"say\" + 0.005*\"republican\" + 0.005*\"one\" + 0.005*\"get\" + 0.004*\"go\" + 0.004*\"like\" + 0.004*\"clinton\" + 0.004*\"make\" + 0.004*\"state\"\n", "\n", "Topic 5:\n", "0.006*\"one\" + 0.005*\"make\" + 0.005*\"may\" + 0.004*\"time\" + 0.004*\"say\" + 0.004*\"get\" + 0.004*\"1\" + 0.003*\"like\" + 0.003*\"take\" + 0.003*\"two\"\n", "\n", "Topic 6:\n", "0.011*\"say\" + 0.006*\"trump\" + 0.005*\"new\" + 0.005*\"year\" + 0.004*\"make\" + 0.004*\"get\" + 0.004*\"state\" + 0.003*\"one\" + 0.003*\"would\" + 0.003*\"time\"\n", "\n", "Topic 7:\n", "0.015*\"say\" + 0.007*\"year\" + 0.005*\"mr\" + 0.004*\"state\" + 0.004*\"also\" + 0.004*\"one\" + 0.004*\"go\" + 0.004*\"make\" + 0.004*\"people\" + 0.003*\"would\"\n", "\n", "Topic 8:\n", "0.012*\"say\" + 0.006*\"year\" + 0.005*\"one\" + 0.004*\"make\" + 0.004*\"would\" + 0.004*\"u\" + 0.004*\"get\" + 0.004*\"company\" + 0.004*\"new\" + 0.004*\"time\"\n", "\n", "Topic 9:\n", "0.009*\"say\" + 0.004*\"make\" + 0.004*\"one\" + 0.004*\"year\" + 0.004*\"like\" + 0.004*\"would\" + 0.004*\"new\" + 0.003*\"company\" + 0.003*\"use\" + 0.003*\"people\"\n", "\n" ] } ], "source": [ "for tid, topic in model.print_topics():\n", "    print(\"Topic {}:\\n{}\\n\".format(tid, topic))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can plot each topic by using decomposition methods (TruncatedSVD in this case) to reduce the probability vector for each topic to 2 dimensions, then size the radius of each topic according to how much probability mass the documents in the corpus donate to it. Also try with PCA, explored below!" 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# For each topic, sum up the probability mass donated to it by \n", "# every document in the corpus. \n", "tsize = defaultdict(float)\n", "for doc in docvecs:\n", "    for tid, prob in model[doc]:\n", "        tsize[tid] += prob" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create a numpy array of topic vectors where each vector \n", "# is the topic probability of all terms in the lexicon. \n", "tvecs = np.array([\n", "    sparse2full(model.get_topic_terms(tid, len(lexicon)), len(lexicon)) \n", "    for tid in range(NUM_TOPICS)\n", "])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import the model family \n", "from sklearn.decomposition import TruncatedSVD \n", "\n", "# Instantiate the model, then fit and transform \n", "topic_svd = TruncatedSVD(n_components=2)\n", "svd_tvecs = topic_svd.fit_transform(tvecs)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create the Bokeh columnar data source with our various elements. \n", "# Note the resize/normalization of the topics so the radius of our\n", "# topic circles fits int he graph a bit better. \n", "tsource = ColumnDataSource(\n", " data=dict(\n", " x=svd_tvecs[:, 0],\n", " y=svd_tvecs[:, 1],\n", " w=[model.print_topic(tid, 10) for tid in range(10)],\n", " c=brewer['Spectral'][10],\n", " r=[tsize[idx]/700000.0 for idx in range(10)],\n", " )\n", " )\n", "\n", "# Create the hover tool so that we can visualize the topics. \n", "hover = HoverTool(\n", " tooltips=[\n", " (\"Words\", \"@w\"),\n", " ]\n", " )\n", "\n", "\n", "# Create the figure to draw the graph on. \n", "plt = figure(\n", " title=\"Topic Model Decomposition\", \n", " width=960, height=540, \n", " tools=\"pan,box_zoom,reset,resize,save\"\n", ")\n", "\n", "# Add the hover tool \n", "plt.add_tools(hover)\n", "\n", "# Plot the SVD topic dimensions as a scatter plot \n", "plt.scatter(\n", " 'x', 'y', source=tsource, size=9,\n", " radius='r', line_color='c', fill_color='c',\n", " marker='circle', fill_alpha=0.85,\n", ")\n", "\n", "# Show the plot to render the JavaScript \n", "show(plt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Corpus Visualization with PCA\n", "\n", "The bag of words model means that every token (string representation of a word) is a dimension and a document is represented by a vector that maps the relative weight of that dimension to the document by the TF-IDF metric. In order to visualize documents in this high dimensional space, we must use decomposition methods to reduce the dimensionality to something we can plot. \n", "\n", "One good first attempt is toi use principle component analysis (PCA) to reduce the data set dimensions (the number of vocabulary words in the corpus) to 2 dimensions in order to map the corpus as a scatter plot. \n", "\n", "We'll use the Scikit-Learn PCA transformer to do this work:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# In order to use Scikit-Learn we need to transform Gensim vectors into a numpy Matrix. \n", "docarr = np.array([sparse2full(vec, len(lexicon)) for vec in tfidfvecs])" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import the model family \n", "from sklearn.decomposition import PCA \n", "\n", "# Instantiate the model form, fit and transform \n", "tfidf_pca = PCA(n_components=2)\n", "pca_dvecs = topic_svd.fit_transform(docarr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use Bokeh to create an interactive plot that will allow us to explore documents according to their position in decomposed TF-IDF space, coloring by their topic. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create a map using the ColorBrewer 'Paired' Palette to assign \n", "# Topic IDs to specific colors. \n", "cmap = {\n", " i: brewer['Paired'][10][i]\n", " for i in range(10)\n", "}\n", "\n", "# Create a tokens listing for our hover tool. \n", "tokens = [\n", " \" \".join([\n", " lexicon[tid] for tid, _ in sorted(doc, key=itemgetter(1), reverse=True)\n", " ][:10])\n", " for doc in tfidfvecs\n", "]\n", "\n", "# Create a Bokeh tabular data source to describe the data we've created. \n", "source = ColumnDataSource(\n", " data=dict(\n", " x=pca_dvecs[:, 0],\n", " y=pca_dvecs[:, 1],\n", " w=tokens,\n", " t=topics,\n", " c=[cmap[t] for t in topics],\n", " )\n", " )\n", "\n", "# Create an interactive hover tool so that we can see the document. \n", "hover = HoverTool(\n", " tooltips=[\n", " (\"Words\", \"@w\"),\n", " (\"Topic\", \"@t\"),\n", " ]\n", " )\n", "\n", "# Create the figure to draw the graph on. \n", "plt = figure(\n", " title=\"PCA Decomposition of BoW Space\", \n", " width=960, height=540, \n", " tools=\"pan,box_zoom,reset,resize,save\"\n", ")\n", "\n", "# Add the hover tool to the figure \n", "plt.add_tools(hover)\n", "\n", "# Create the scatter plot with the PCA dimensions as the points. \n", "plt.scatter(\n", " 'x', 'y', source=source, size=9,\n", " marker='circle_x', line_color='c', \n", " fill_color='c', fill_alpha=0.5,\n", ")\n", "\n", "# Show the plot to render the JavaScript \n", "show(plt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another approach is to use the TSNE model for stochastic neighbor embedding. This is a very popular text clustering visualization/projection mechanism." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import the TSNE model family from the manifold package \n", "from sklearn.manifold import TSNE \n", "from sklearn.pipeline import Pipeline\n", "\n", "# Instantiate the model form, it is usually recommended \n", "# To apply PCA (for dense data) or TruncatedSVD (for sparse)\n", "# before TSNE to reduce noise and improve performance. \n", "tsne = Pipeline([\n", " ('svd', TruncatedSVD(n_components=75)),\n", " ('tsne', TSNE(n_components=2)),\n", "])\n", " \n", "# Transform our TF-IDF vectors.\n", "tsne_dvecs = tsne.fit_transform(docarr)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "
\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create a map using the ColorBrewer 'Paired' Palette to assign \n", "# Topic IDs to specific colors. \n", "cmap = {\n", " i: brewer['Paired'][10][i]\n", " for i in range(10)\n", "}\n", "\n", "# Create a tokens listing for our hover tool. \n", "tokens = [\n", " \" \".join([\n", " lexicon[tid] for tid, _ in sorted(doc, key=itemgetter(1), reverse=True)\n", " ][:10])\n", " for doc in tfidfvecs\n", "]\n", "\n", "# Create a Bokeh tabular data source to describe the data we've created. \n", "source = ColumnDataSource(\n", " data=dict(\n", " x=tsne_dvecs[:, 0],\n", " y=tsne_dvecs[:, 1],\n", " w=tokens,\n", " t=topics,\n", " c=[cmap[t] for t in topics],\n", " )\n", " )\n", "\n", "# Create an interactive hover tool so that we can see the document. \n", "hover = HoverTool(\n", " tooltips=[\n", " (\"Words\", \"@w\"),\n", " (\"Topic\", \"@t\"),\n", " ]\n", " )\n", "\n", "# Create the figure to draw the graph on. \n", "plt = figure(\n", " title=\"TSNE Decomposition of BoW Space\", \n", " width=960, height=540, \n", " tools=\"pan,box_zoom,reset,resize,save\"\n", ")\n", "\n", "# Add the hover tool to the figure \n", "plt.add_tools(hover)\n", "\n", "# Create the scatter plot with the PCA dimensions as the points. \n", "plt.scatter(\n", " 'x', 'y', source=source, size=9,\n", " marker='circle_x', line_color='c', \n", " fill_color='c', fill_alpha=0.5,\n", ")\n", "\n", "# Show the plot to render the JavaScript \n", "show(plt)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }