{ "cells": [ { "cell_type": "markdown", "source": [ "## Stata and R in a jupyter notebook" ], "metadata": {} }, { "cell_type": "markdown", "source": [ "The jupyter notebook is designed to be a 'language agnostic' web-application front-end for any one of many possible language kernels. We've mostly been using python, but there are in fact several dozen [other language kernels](https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages) that can be made to work with it, including Julia, R, Matlab, C, Go, Fortran and Stata. \n", "\n", "The ecosystem of libraries and packages for scientific computing with python is huge and constantly growing, but many statistics and econometrics applications are available as built-in or user-written modules in Stata that have not yet been ported to python, or are simply easier to use in Stata. On the other hand, python libraries such as pandas and visualization libraries such as seaborn or matplotlib offer features that are not available in Stata. \n", "\nFortunately you don't have to choose between Stata and python: you can use them together to get the best of both worlds. " ], "metadata": {} }, { "cell_type": "markdown", "source": [ "## Jupyter running an R kernel \n", "\n", "R is a powerful open source software environment for statistical computing. R has [R markdown](https://rmarkdown.rstudio.com/), which allows you to create R-markdown [notebooks](https://bookdown.org/yihui/rmarkdown/notebook.html) similar in concept to jupyter notebooks. But you can also run R inside a jupyter notebook (indeed the name 'Jupyter' is from **Ju**lia, **Pyt**hon and **R**).\n", "\n", "See my notebook with [notes on Regression Discontinuity Design](RDD_R.ipynb) for an example of a jupyter notebook running R. To install an R kernel see the [IRkernel](https://irkernel.github.io/) project. 
\n", "\n", "## Jupyter with a Stata Kernel\n", "\n", "Kyle Barron has created a [stata_kernel](https://kylebarron.github.io/stata_kernel/) that offers several useful features including code autocompletion, inline graphics, and generally fast responses. \n", "\n", "For this to work you must have a working licensed copy of Stata version 14 or greater on your machine.\n", "\n", "## Python and Stata combined in the same notebook\n", "\n", "Sometimes it may be useful to combine python and Stata in the same notebook. Ties de Kok has written a nice python library called [ipystata](https://github.com/TiesdeKok/ipystata) that lets you execute Stata code in any code cell of an ipython notebook that begins with the ```%%stata``` magic command. \n", "\n", "This workflow allows you to pass data between python and Stata sessions and to display Stata plots inline. Compared to the [stata_kernel](https://kylebarron.github.io/stata_kernel/) option, response times are not quite as fast. \n", "\nThe remainder of this notebook illustrates the use of ipystata." ], "metadata": {} }, { "cell_type": "markdown", "source": [ "### A sample ipystata session\n", "\nFor more details see the [example notebook](http://nbviewer.jupyter.org/github/TiesdeKok/ipystata/blob/master/ipystata/Example.ipynb) and documentation on the ipystata repository." ], "metadata": {} }, { "cell_type": "code", "source": [ "%matplotlib inline\n", "import seaborn as sns\n", "import pandas as pd\n", "import statsmodels.formula.api as smf\n", "import ipystata" ], "outputs": [], "execution_count": 1, "metadata": {} }, { "cell_type": "markdown", "source": [ "The following opens a Stata session where we load a dataset and summarize the data. The ```-o``` flag following the ```%%stata``` magic instructs it to return the dataset in Stata's memory to python as a pandas dataframe. 
" ], "metadata": {} }, { "cell_type": "code", "source": [ "%%stata -o life_df\n", "sysuse lifeexp.dta\n", "summarize" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\n", "(Life expectancy, 1998)\n", "\n", "    Variable |        Obs        Mean    Std. Dev.       Min        Max\n", "-------------+---------------------------------------------------------\n", "      region |         68         1.5    .7431277          1          3\n", "     country |          0\n", "   popgrowth |         68    .9720588    .9311918        -.5          3\n", "        lexp |         68    72.27941    4.715315         54         79\n", "       gnppc |         63    8674.857    10634.68        370      39980\n", "-------------+---------------------------------------------------------\n", "   safewater |         40        76.1    17.89112         28        100\n", "\n" ] } ], "execution_count": 2, "metadata": {} }, { "cell_type": "markdown", "source": [ "Let's confirm the data was returned as a pandas dataframe:" ], "metadata": {} }, { "cell_type": "code", "source": [ "life_df.head(3)" ], "outputs": [ { "output_type": "execute_result", "execution_count": 3, "data": { "text/html": [ "
|   | region | country | popgrowth | lexp | gnppc | safewater |
|---|---|---|---|---|---|---|
| 0 | Eur & C.Asia | Albania | 1.2 | 72 | 810.0 | 76.0 |
| 1 | Eur & C.Asia | Armenia | 1.1 | 74 | 460.0 | NaN |
| 2 | Eur & C.Asia | Austria | 0.4 | 79 | 26830.0 | NaN |