{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$\n", "\\newcommand{\\Xs}{\\mathcal{X}}\n", "\\newcommand{\\Ys}{\\mathcal{Y}}\n", "\\newcommand{\\y}{\\mathbf{y}}\n", "\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n", "\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n", "\\newcommand{\\aligns}{\\mathbf{a}}\n", "\\newcommand{\\align}{a}\n", "\\newcommand{\\source}{\\mathbf{s}}\n", "\\newcommand{\\target}{\\mathbf{t}}\n", "\\newcommand{\\ssource}{s}\n", "\\newcommand{\\starget}{t}\n", "\\newcommand{\\repr}{\\mathbf{f}}\n", "\\newcommand{\\repry}{\\mathbf{g}}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\DeclareMathOperator{\\argmin}{argmin}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "\\newcommand{\\length}[1]{\\text{length}(#1) }\n", "\\newcommand{\\indi}{\\mathbb{I}}\n", "$$" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%capture\n", "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "# %cd .. \n", "import sys\n", "sys.path.append(\"..\")\n", "import statnlpbook.util as util\n", "import statnlpbook.em as em\n", "import matplotlib.pyplot as plt\n", "import mpld3\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$\n", "\\newcommand{\\Xs}{\\mathcal{X}}\n", "\\newcommand{\\Ys}{\\mathcal{Y}}\n", "\\newcommand{\\y}{\\mathbf{y}}\n", "\\newcommand{\\z}{\\mathbf{z}}\n", "\\newcommand{\\balpha}{\\boldsymbol{\\alpha}}\n", "\\newcommand{\\bbeta}{\\boldsymbol{\\beta}}\n", "\\newcommand{\\aligns}{\\mathbf{a}}\n", "\\newcommand{\\align}{a}\n", "\\newcommand{\\source}{\\mathbf{s}}\n", "\\newcommand{\\target}{\\mathbf{t}}\n", "\\newcommand{\\ssource}{s}\n", "\\newcommand{\\starget}{t}\n", "\\newcommand{\\repr}{\\mathbf{f}}\n", "\\newcommand{\\repry}{\\mathbf{g}}\n", "\\newcommand{\\x}{\\mathbf{x}}\n", "\\newcommand{\\X}{\\mathbf{X}}\n", "\\newcommand{\\parents}{\\mathrm{par}}\n", "\\newcommand{\\dom}{\\mathrm{dom}}\n", "\\newcommand{\\prob}{p}\n", "\\newcommand{\\vocab}{V}\n", "\\newcommand{\\params}{\\boldsymbol{\\theta}}\n", "\\newcommand{\\param}{\\theta}\n", "\\DeclareMathOperator{\\perplexity}{PP}\n", "\\DeclareMathOperator{\\argmax}{argmax}\n", "\\DeclareMathOperator{\\argmin}{argmin}\n", "\\newcommand{\\train}{\\mathcal{D}}\n", "\\newcommand{\\counts}[2]{\\#_{#1}(#2) }\n", "\\newcommand{\\length}[1]{\\text{length}(#1) }\n", "\\newcommand{\\indi}{\\mathbb{I}}\n", "\\newcommand{\\duals}{\\boldsymbol{\\lambda}}\n", "\\newcommand{\\lagrang}{\\mathcal{L}}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Expectation Maximisation Algorithm\n", "\n", "[Maximum likelihood estimation](mle.ipynb) is an effective tool to learn model parameters when your data is fully observed. For example, when training a [language model](language_models.ipynb) we fully observe all words in the training set, and when training a [syntactic parser](parsing.ipynb) on a treebank we have access to the full parse tree of each training instance. However, in many scenarios this is not the case. When training [machine translation](word_mt.ipynb) models we often require alignments between the source sentence words and those in the target sentence. However, the standard corpora out there do not provide these alignments as they are too expensive to annotate in scale. Likewise, we might want to classify text documents into different document classes, and do have some training documents, but without annotated document classes.\n", "\n", "Let us consider a model $\\prob_\\params(\\x,\\z)$, and a dataset $\\train = \\x_1 \\ldots \\x_n$ consisting only of $\\x$ data but no information about the latent $\\z$. For example, $\\x$ could be a pair of sentences, and $\\z$ an alignment between the sentences, as in chapter [machine translation](word_mt.ipynb). In unsupervised text classification $\\x$ would be a document, and $\\z=z$ a document label.\n", "\n", "To make this more concrete, let us consider the following scenario. We would like to classify (or cluster) documents about food into two classes $\\mathrm{c}_1$ and $\\mathrm{c}_2$. To keep things simple, each word in the document is either $\\mathrm{nattoo}$, a [healthy but slightly smelly Japanese food](https://en.wikipedia.org/wiki/Natt%C5%8D) made of fermented soybeans, $\\mathrm{pizza}$ or $\\mathrm{fries}$. The dataset we are looking at has a tendency to either mostly talk about healthy food (nattoo), or mostly about unhealty food (fries and pizza), and we would like our model to distinguish between these to cases. The model $\\prob^\\mathrm{food}_\\params()$ itself is a [Naive Bayes](TODO) model and generates the words of a document independently conditioned on the document class $z\\in\\{\\mathrm{c}_1,\\mathrm{c}_2\\}$:\n", "\n", "$$\n", "\\newcommand{\\foodprob}{\\prob^\\mathrm{food}_\\params}\n", "\\foodprob(\\x,z) = \\foodprob(z) \\prod_{x \\in \\x} \\foodprob(x|z) = \\theta_z \\prod_{x \\in \\x} \\theta_{x|z}\n", "$$\n", "\n", "In python we can formulate this as follows. Notice how we implement the model in log-space for numerical stability." ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.04000000000000001" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from math import log, exp\n", "\n", "# Domains and values\n", "z_domain = ['c1','c2']\n", "x_domain = ['nattoo','pizza','fries']\n", "c1, c2 = z_domain\n", "n, p, f = x_domain\n", "\n", "def prob(x, z, theta):\n", " \"\"\"\n", " Calculate probability of p_\\theta(x,z).\n", " Args:\n", " x: list of words of the document, should be in `x_domain`\n", " z: class label, should be in `z_domain`\n", " Returns:\n", " probability p(x,z) given the parameters.\n", " \"\"\"\n", " theta_x, theta_z = theta\n", " bias = log(theta_z[z])\n", " ll = sum([log(theta_x[x_i, z]) for x_i in x])\n", " return exp(bias + ll)\n", "\n", "def create_theta(prob_c1, prob_c2, \n", " n_c1, p_c1, f_c1, \n", " n_c2, p_c2, f_c2):\n", " \"\"\"\n", " Construct a theta parameter vector. \n", " \"\"\"\n", " theta_z = { c1: prob_c1, c2: prob_c2}\n", " theta_x = { (n, c1): n_c1, (p, c1): p_c1, (f, c1): f_c1, \n", " (n, c2): n_c2, (p, c2): p_c2, (f, c2): f_c2}\n", " return theta_x, theta_z\n", " \n", "\n", "theta = create_theta(0.5,0.5, 0.3, 0.5, 0.2, 0.1, 0.4, 0.5)\n", "prob([p,p,f], c2, theta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Marginal Log-Likelihood\n", "As before using our [structured prediction recipe](structured_prediction.ipynb), we would like to find good parameters $\\params^*$ by defining some training objective over $\\train$ and $\\params$. Inspired by the [Maximum likelihood estimation](mle.ipynb) approach, a natural candidate for this objective is the *marginal* log-likelihood of the data. This likelihood arises by marginalising out the latent variable $\\z$ for each training instance. Assuming again that the sample is generated IID, we get:\n", "\n", "$$\n", " M_\\params(\\train) = \\log(\\prob_\\params(\\train)) = \\sum_{\\x \\in \\train} \\log(\\sum_\\z \\prob_\\params(\\x,\\z))\n", "$$\n", "\n", "Unfortunately this objective has two problems when compared to the standard log-likelihood. First, there is no closed-form solution to it. In the case of the log-likelihood we could find an optimal $\\params^*$ simply by counting, but no such solution exists for the marginal log-likelihood. Second, the objective is non-concave, meaning that there can be several local optima. This means that any iterative solution one can apply to maximising it (such as [SGD](sgd.ipynb)) is not guaranteed to find a globally optimal $\\params^*$. \n", "\n", "Let us visualise this objective for our running example. We will do so by choosing two points in the parameter space $\\params_1$ and $\\params_2$, and then visualise the behaviour of the loss on the line between these points. \n", "\n", "First let us define the marginal log likelihood:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-2.8645501413202457" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def marginal_ll(data, theta):\n", " \"\"\"\n", " Calculate the marginal log-likelihood of the given `data` using parameter `theta`.\n", " Args:\n", " data: list of documents, where each document is a list of words. \n", " theta: parameters to use. \n", " \"\"\"\n", " return sum([log(prob(x,c1,theta) + prob(x,c2,theta)) for x in data]) / len(data)\n", "\n", "marginal_ll([[p,p,f],[n,n]], theta)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us plot the marginal log-likelihood on the line between a $\\params_1$ and $\\params_2$." ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "theta1 = create_theta(0.3, 0.7, 0.0, 0.3, 0.7, 1.0, 0.0, 0.0)\n", "theta2 = create_theta(0.3, 0.7, 1.0, 0.0, 0.0, 0.0, 0.3, 0.7)\n", "\n", "em.plot_1D(lambda theta: marginal_ll(dataset, theta), theta1, theta2, ylim=[-8.5,-5.5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see the non-concavity of the objective, as there are two local optima. These essentially stem from the symmetry of the model: whether you call one cluster $\\mathrm{c}_1$ or $\\mathrm{c}_1$ will make no difference in the probability you assign to the data. \n", "\n", "How can we at least find one of these local optima? A classic approach to this problem relies on a deriving a lower bound to the marginal log-likelihood. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Lower Bound on the Marginal Log-Likelihood\n", "Let us assume an arbitrary distribution $q(\\z|\\x)$. With this distribution, and using Jenssen's inequality, we can define a lower bound as follows: \n", "$$\n", "M_\\params(\\train) \n", " = \\sum_{\\x \\in \\train} \\log(\\sum_\\z \\prob_\\params(\\x,\\z)) \\\\ \n", " = \\sum_{\\x \\in \\train} \\log(\\sum_\\z q(\\z|\\x) \\frac{\\prob_\\params(\\x,\\z)}{q(\\z|\\x)})\n", " \\geq \\sum_{\\x \\in \\train} \\sum_\\z q(\\z|\\x) \\log(\\frac{\\prob_\\params(\\x,\\z)}{q(\\z|\\x)}) =: B_{q,\\params}(\\train).\n", "$$\n", "\n", "What $q$ to choose? The one that gets $B$ maximally close to $M$. But given that $M$ is non-concave and $B_{q_\\params}(\\train)$ is concave, the bound cannot be tight everywhere. We need to choose a point $\\params$ at which the bound is as close as possibly. Let $\\params'$ be such a point. \n", "\n", "We want to find a $q$ that maximises $B_{q,\\params'}(\\train)$. Since $\\prob_\\params(\\x,\\z) = \\prob_\\params(\\z|\\x) \\prob_\\params(\\x)$ we can maximise \n", "\n", "$$\n", "\\sum_{\\x \\in \\train} \\sum_\\z q(z) \\log(\\frac{\\prob_{\\params'}(\\z|\\x)}{q(\\z)}) + \\log(\\prob_{\\params'}(\\x)).\n", "$$ \n", "\n", "The second term in the sum is constant with respect to $q$, and the first one is the negative KL divergence between $q$ and $\\prob_\\params(\\z|\\x)$. The distribution that minimises KL divergence (and hence maximises the negative KL divergence) is the distribution itself. Hence the closest lower bound can be determined by setting $q(\\z|\\x)=\\prob_{\\params'}(\\z|\\x)$, the conditional distribution over $\\z$ based on the given parameters $\\params'$. \n", "\n", "Let us plot this bound for given $\\params'$." ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "current_theta = add_theta(0.4, theta1, 0.6, theta2)\n", "\n", "def calculate_class_distributions(data, theta):\n", " result = []\n", " for x in data:\n", " norm = prob(x,c1,theta) + prob(x,c2,theta)\n", " # E Step\n", " q = {\n", " c1: prob(x,c1,theta) / norm,\n", " c2: prob(x,c2,theta) / norm\n", " }\n", " result.append(q)\n", " return result\n", "\n", "current_q = calculate_class_distributions(dataset, current_theta)\n", "\n", "def marginal_ll_bound(data, theta, q_data = current_q):\n", " loss = 0.0\n", " for x,q in zip(data,q_data):\n", " loss += q[c1] * log(prob(x,c1,theta) / q[c1]) + q[c2] * log(prob(x,c2,theta) / q[c2])\n", " return loss / len(data)\n", "\n", "em.plot_1D(lambda theta: marginal_ll(dataset, theta), theta1, theta2, \n", " loss2=lambda theta:marginal_ll_bound(dataset,theta), ylim=[-8.5,-5.5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This lower bound seems to be concave, with a single maximum. In fact, it is easy to see that $B_{q_\\params}(\\train)$ is a weighted version of the (joint) log-likelihood as defined in the [MLE chapter](mle.ipynb). As such it has a single maximum, and we can find the optimum easily, but more on that later.\n", "\n", "As can be seen in the figure, the bound is not just close at $\\params'$, it coincides with the original objective. This can be easily shown to always hold:\n", "\n", "$$\n", "\\sum_\\z \\prob_\\params(\\z|\\x) \\log\\left(\\frac{\\prob_\\params(\\x,\\z)}{\\prob_\\params(\\z|\\x)}\\right) = \\sum_\\z \\prob_\\params(\\z|\\x) \\log\\left(\\frac{\\prob_\\params(\\z|\\x) \\prob_\\params(\\x)}{\\prob_\\params(\\z|\\x)} \\right) = \\log(\\prob_\\params(\\x)) \\sum_\\z \\prob_\\params(\\z|\\x) = \\log(\\prob_\\params(\\x)) )\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Maximising the Marginal Log-likelihood\n", "The fact that the lower bound coincides with the objective at the chosen point $\\params'$, and that the lower bound itself is easy to optimise given that it is a weighted log-likelihood with close-form solution, suggests a simple algorithm to find a (local) optimum of the objective. Simply choose an initial $\\params'$, determine the lower bound, find the optimum of this lower bound, call it $\\params'$ and repeat until convergence. This algorithm is the Expectation Maximisation (EM) algorithm, and it is named that way for reasons we explain below. It is relatively easy to show that this algorithm increases the marginal likelihood in each step, see for example Mike Collin's note. It is a little more involved to show that under mild conditions this algorithm converges to a local optimum, and we point the reader to [Wu](http://web.stanford.edu/class/ee378b/papers/wu-em.pdf)'s in-depth analysis.\n", "\n", "Before we describe the algorithm in more detail we should point out that when choosing $\\params$ to optimise the bound $B_{q,\\params}(\\train)$ we can simplify the bound to:\n", "$$\n", "\\sum_{\\x \\in \\train} \\sum_\\z q(\\z|\\x) \\log(\\frac{\\prob_\\params(\\x,\\z)}{q(\\z|\\x)}) \\propto \n", "\\sum_{\\x \\in \\train} \\sum_\\z q(\\z|\\x) \\log(\\prob_\\params(\\x,\\z)) = \\sum_{\\x \\in \\train} E_{q(\\z|x)}\\left[\\prob_\\params(\\x,\\z)\\right].\n", "$$\n", "That is, we can drop terms that do not involve $\\params$, and then get an *expected* version of the standard log-likelihood. It is this expectation that gives the algorithm its name.\n", "\n", "Let us formalise the EM algorithm:\n", "\n", "* **Input**:\n", " * Initial parameter $\\params_1$\n", "* **Initialisation**\n", " * $i\\leftarrow 1$\n", "* **while** not converged:\n", " * Expectation-Step:\n", " * Calculate the lower bound. This means calcuating the **expected** log-likelihood, and generally involves calculating a representation of the conditional probabilities $\\prob_{\\params_i}(\\z|\\x)$ that is convenient for optimising the expected log-likelihood in the M-Step.\n", " * Maximisation-Step:\n", " * Maximise the lower bound. This means finding the parameters that **maximise** the current expected log-likelihood. Often there is a closed form solution to this problem, and it involves weighted counting. \n", " * $i\\leftarrow i + 1$\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us implement the EM algorithm for our example task and model. \n", "\n", "#### E-Step \n", "Calculating the conditional probabilities $q(z|x)=\\prob_{\\params_i}(z|\\x)$ that make up the expectation is easy: we can calculate $\\prob_{\\params_i}(\\x,z)$ for each class $z$, and then normalise these values to sum up to one:\n", "$$\n", " q(z|x) = \\frac{\\prob_{\\params_i}(\\x,z)}{\\sum_{z'} \\prob_{\\params_i}(\\x,z')} \n", "$$\n", "\n", "#### M-Step\n", "The M-Step requires us to maximise the expected log-likelihood, using the distributions calculated in the E-Step. The solution is similar to the closed form solution of the [MLE](mle.ipynb) objective (and can be derived accordingly by the reader). It only differs by using weighted counts instead of hard counts:\n", "\n", "$$\n", "\\theta_{x|z} = \\frac{ \\sum_{\\x \\in \\train} q(z|\\x) \\sum_{x_i \\in \\x }\\indi(x_i = x)}{\\sum_{\\x \\in \\train} q(z|\\x) |\\x|}\n", "$$\n", "\n", "The $\\theta_z$ parameters can be calculated accordingly.\n", "\n", "In Python we can implement this algorithm as follows. Notice that we are re-using the `calculate_class_distributions` function from above." ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import defaultdict\n", "\n", "def e_step(data,theta):\n", " return calculate_class_distributions(data, theta)\n", "\n", "def m_step(x_data,q_data):\n", " counts = defaultdict(float)\n", " norm = defaultdict(float)\n", " class_counts = defaultdict(float)\n", " for x,q in zip(x_data, q_data):\n", " for z in z_domain:\n", " class_counts[z] += q[z]\n", " for x_i in x:\n", " norm[z] += q[z]\n", " counts[x_i, z] += q[z]\n", " theta_c = dict([(z,class_counts[z] / len(x_data)) for z in z_domain])\n", " theta_x = dict([((x,z),counts[x,z] / norm[z]) for z in z_domain for x in x_domain])\n", " return theta_x, theta_c\n", "\n", "def em_algorithm(data, init_theta, iterations = 10):\n", " current_theta = init_theta\n", " current_q = None\n", " result = []\n", " for _ in range(0, iterations):\n", " current_q = e_step(data, current_theta)\n", " current_theta = m_step(data, current_q)\n", " current_marg_ll = marginal_ll(data, current_theta)\n", " current_bound = marginal_ll_bound(data, current_theta, current_q)\n", " result.append((current_q, current_theta, current_marg_ll, current_bound))\n", " return result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us run the EM algorithm on some example data and observe the result." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "([{'c1': 0.001366599986274166, 'c2': 0.9986334000137259},\n", " {'c1': 0.998991046664316, 'c2': 0.001008953335683917},\n", " {'c1': 0.0010386235810250118, 'c2': 0.998961376418975}],\n", " ({('fries', 'c1'): 0.1252610897436444,\n", " ('fries', 'c2'): 0.2856617554537877,\n", " ('nattoo', 'c1'): 0.7487542865225054,\n", " ('nattoo', 'c2'): 0.1432040502542796,\n", " ('pizza', 'c1'): 0.12598462373385083,\n", " ('pizza', 'c2'): 0.5711341942919325},\n", " {'c1': 0.33379875674387166, 'c2': 0.6662012432561283}),\n", " -7.056983528691977,\n", " -7.0569835287024105)" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = [[p,p,p,p,p,n],[n,n,n,n,n,n,f,p],[f,f,f,f,p,p,p,n]]\n", "iterations = em_algorithm(data, current_theta, 5)\n", "iterations[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the model learnt to assign the 'unhealthy' documents 1 and 3, consisting primarily of pizza and fries, to class `c1` and the healthy 'natto' document to cluster `c2`. \n", "\n", "To see how quickly the algorithm converges we can observe the behaviour of the lower bound and the marginal log-likelihood. " ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fig = plt.figure()\n", "plt.plot(range(0, len(iterations)), [iteration[2] for iteration in iterations], label='marg_ll')\n", "plt.plot(range(0, len(iterations)), [iteration[3] for iteration in iterations], label='bound')\n", "plt.legend()\n", "mpld3.display(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the bound indeed is a lower bound of the of the marginal log-likelihood, and both objectives are converging very quickly to a local optimum. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background Material\n", "* Michael Collin's [lecture notes on EM](http://www.cs.columbia.edu/~mcollins/em.pdf)\n", "* K. Nigal, A. McCallum, S. Thrun, T. Mitchell, [Text Classification from Labeled and Unlabeled Documents using EM](http://www.kamalnigam.com/papers/emcat-mlj99.pdf), Machine Learning, 2000\n", "* Jurafsky & Martin, [Hidden Markov Models](https://web.stanford.edu/~jurafsky/slp3/8.pdf)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 1 }