{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Example notebook for the %%stata cell magic by the IPyStata package. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Author:** Ties de Kok \n", "**Homepage:** https://github.com/TiesdeKok/ipystata \n", "**PyPi:** https://pypi.python.org/pypi/ipystata " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Note: this example notebook uses the `Stata Batch Mode` method." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See Github for an example notebook using the Windows-only `Stata automation` method." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import ipystata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure ipystata" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "require(['notebook/js/codecell'], function(codecell) {\n", "\t\t\t codecell.CodeCell.options_default.highlight_modes['magic_stata'] = {'reg':[/^%%stata/]} ;\n", "\t\t\t Jupyter.notebook.events.one('kernel_ready.Kernel', function(){\n", "\t\t\t Jupyter.notebook.get_cells().map(function(cell){\n", "\t\t\t if (cell.cell_type == 'code'){ cell.auto_highlight(); } }) ;\n", "\t\t\t });\n", "\t\t\t});" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from ipystata.config import config_stata\n", "config_stata('/home/user/stata15/stata-se')\n", "#config_stata(\"D:\\Software\\stata15\\StataSE-64.exe\", force_batch=True) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** for this change to take effect you need to `Kernel` --> `Restart` the notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check whether IPyStata is working" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Hello, I am printed by Stata.\n" ] } ], "source": [ "%%stata\n", "\n", "display \"Hello, I am printed by Stata.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Some examples based on the Stata 13 manual" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the dataset \"auto.dta\" in Stata return it back to Python as a Pandas dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code cell below runs the Stata command **`sysuse auto.dta`** to load the dataset and returns it back to Python via the **`-o car_df`** argument." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "(1978 Automobile Data)\n" ] } ], "source": [ "%%stata -o car_df\n", "sysuse auto.dta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**`car_df`** is a regular Pandas dataframe on which Python / Pandas actions can be performed. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
makepricempgrep78headroomtrunkweightlengthturndisplacementgear_ratioforeign
0AMC Concord4099223.02.5112930186401213.58Domestic
1AMC Pacer4749173.03.0113350173402582.53Domestic
2AMC Spirit379922NaN3.0122640168351213.08Domestic
3Buick Century4816203.04.5163250196401962.93Domestic
4Buick Electra7827154.04.0204080222433502.41Domestic
\n", "
" ], "text/plain": [ " make price mpg rep78 headroom trunk weight length turn \\\n", "0 AMC Concord 4099 22 3.0 2.5 11 2930 186 40 \n", "1 AMC Pacer 4749 17 3.0 3.0 11 3350 173 40 \n", "2 AMC Spirit 3799 22 NaN 3.0 12 2640 168 35 \n", "3 Buick Century 4816 20 3.0 4.5 16 3250 196 40 \n", "4 Buick Electra 7827 15 4.0 4.0 20 4080 222 43 \n", "\n", " displacement gear_ratio foreign \n", "0 121 3.58 Domestic \n", "1 258 2.53 Domestic \n", "2 121 3.08 Domestic \n", "3 196 2.93 Domestic \n", "4 350 2.41 Domestic " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "car_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic descriptive statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The argument **`-d or --data`** is used to define which dataframe should be set as dataset in Stata. \n", "In the example below the Stata function **`tabulate`** is used to generate some descriptive statistics for the dataframe **`car_df`**." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " | headroom\n", " foreign | 1.5 2 2.5 3 3.5 4 4.5 5 | Total\n", "-----------+----------------------------------------------------------------------------------------+----------\n", " Domestic | 3 10 4 7 13 10 4 1 | 52 \n", " Foreign | 1 3 10 6 2 0 0 0 | 22 \n", "-----------+----------------------------------------------------------------------------------------+----------\n", " Total | 4 13 14 13 15 10 4 1 | 74 \n" ] } ], "source": [ "%%stata -d car_df\n", "tabulate foreign headroom" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These descriptive statistics can be replicated in Pandas using the **`crosstab`** fuction, see the code below." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
headroom1.52.02.53.03.54.04.55.0All
foreign
Domestic3104713104152
Foreign13106200022
All413141315104174
\n", "
" ], "text/plain": [ "headroom 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 All\n", "foreign \n", "Domestic 3 10 4 7 13 10 4 1 52\n", "Foreign 1 3 10 6 2 0 0 0 22\n", "All 4 13 14 13 15 10 4 1 74" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(car_df['foreign'], car_df['headroom'], margins=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stata graphs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:** due to a limitation of Stata it currently returns the graph as a PDF. \n", "This is a temporary workaround that I hope to find a more suitable fix for in the future." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "(highschool and beyond (200 cases))\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%stata -gr\n", "use https://stats.idre.ucla.edu/stat/data/hsb2.dta, clear\n", "graph twoway scatter read math" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use Python lists as Stata macros" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In many situations it is convenient to define values or variable names in a Python list or equivalently in a Stata macro. \n", "The **`-i or --input`** argument makes a Python list available for use in Stata as a local macro. \n", "For example, **`-i main_var`** converts the Python list **`['mpg', 'rep78']`** into the following Stata macro: **``main_var'`**." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "main_var = ['mpg', 'rep78']\n", "control_var = ['gear_ratio', 'trunk', 'weight', 'displacement']" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "mpg rep78\n", "gear_ratio trunk weight displacement\n", "\n", "Linear regression Number of obs = 69\n", " F(6, 62) = 8.60\n", " Prob > F = 0.0000\n", " R-squared = 0.4124\n", " Root MSE = 2338.1\n", "\n", "------------------------------------------------------------------------------\n", " | Robust\n", " price | Coef. Std. Err. t P>|t| [95% Conf. Interval]\n", "-------------+----------------------------------------------------------------\n", " mpg | -76.95578 84.95038 -0.91 0.369 -246.7692 92.8576\n", " rep78 | 899.0818 299.7541 3.00 0.004 299.882 1498.282\n", " gear_ratio | 1479.744 917.5363 1.61 0.112 -354.3846 3313.873\n", " trunk | -110.3163 80.16622 -1.38 0.174 -270.5663 49.93365\n", " weight | 1.139509 1.187361 0.96 0.341 -1.233991 3.51301\n", "displacement | 17.82274 8.523647 2.09 0.041 .7842094 34.86126\n", " _cons | -5163.323 4965.389 -1.04 0.302 -15088.99 4762.348\n", "------------------------------------------------------------------------------\n" ] } ], "source": [ "%%stata -d car_df -i main_var -i control_var\n", "\n", "display \"`main_var'\"\n", "display \"`control_var'\"\n", "\n", "regress price `main_var' `control_var', vce(robust)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modify dataset in Stata and return it to Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is possible create new variables or modify the existing dataset in Stata and have it returned as a Pandas dataframe. \n", "In the example below the output **`-o car_df`** will overwrite the data **`-d car_df`**, effectively modifying the dataframe in place. \n", "Note, the argument **`-np or --noprint`** can be used to supress any output below the code cell." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "%%stata -d car_df -o car_df -np\n", "\n", "generate weight_squared = weight^2\n", "generate log_weight = log(weight)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
makepricempgrep78headroomtrunkweightlengthturndisplacementgear_ratioforeignweight_squaredlog_weight
0AMC Concord4099223.02.5112930186401213.58Domestic8584900.07.982758
1AMC Pacer4749173.03.0113350173402582.53Domestic11222500.08.116715
2AMC Spirit379922NaN3.0122640168351213.08Domestic6969600.07.878534
\n", "
" ], "text/plain": [ " make price mpg rep78 headroom trunk weight length turn \\\n", "0 AMC Concord 4099 22 3.0 2.5 11 2930 186 40 \n", "1 AMC Pacer 4749 17 3.0 3.0 11 3350 173 40 \n", "2 AMC Spirit 3799 22 NaN 3.0 12 2640 168 35 \n", "\n", " displacement gear_ratio foreign weight_squared log_weight \n", "0 121 3.58 Domestic 8584900.0 7.982758 \n", "1 258 2.53 Domestic 11222500.0 8.116715 \n", "2 121 3.08 Domestic 6969600.0 7.878534 " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "car_df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set a custom working directory for this Stata code cell" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using a directory defined in a variable (this is useful if you need it for many cells)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "directory = '~/sandbox'" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "%%stata -cwd directory -np\n", "display \"`c(pwd)'\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### It is also possible to provide the directory as an argument" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "%%stata -cwd '~/sandbox' -np\n", "display \"`c(pwd)'\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An example case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create the variable **`large`** in Python and use it as the dependent variable for a binary choice estimation by Stata." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "car_df['large'] = [1 if x > 3 and y > 200 else 0 for x, y in zip(car_df['headroom'], car_df['length'])]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
headroomlengthlarge
02.51860
13.01730
23.01680
34.51960
44.02221
54.02181
63.01700
\n", "
" ], "text/plain": [ " headroom length large\n", "0 2.5 186 0\n", "1 3.0 173 0\n", "2 3.0 168 0\n", "3 4.5 196 0\n", "4 4.0 222 1\n", "5 4.0 218 1\n", "6 3.0 170 0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "car_df[['headroom', 'length', 'large']].head(7)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Iteration 0: log pseudolikelihood = -39.60355 \n", "Iteration 1: log pseudolikelihood = -19.307161 \n", "Iteration 2: log pseudolikelihood = -13.526857 \n", "Iteration 3: log pseudolikelihood = -10.999644 \n", "Iteration 4: log pseudolikelihood = -10.726345 \n", "Iteration 5: log pseudolikelihood = -10.723111 \n", "Iteration 6: log pseudolikelihood = -10.723109 \n", "Iteration 7: log pseudolikelihood = -10.723109 \n", "\n", "Logistic regression Number of obs = 69\n", " Wald chi2(6) = 12.90\n", " Prob > chi2 = 0.0446\n", "Log pseudolikelihood = -10.723109 Pseudo R2 = 0.7292\n", "\n", " (Std. Err. adjusted for 69 clusters in make)\n", "------------------------------------------------------------------------------\n", " | Robust\n", " large | Coef. Std. Err. z P>|z| [95% Conf. Interval]\n", "-------------+----------------------------------------------------------------\n", " mpg | -.5846335 .2941083 -1.99 0.047 -1.161075 -.0081918\n", " rep78 | -1.298127 1.264918 -1.03 0.305 -3.777322 1.181067\n", " gear_ratio | -1.331913 3.389448 -0.39 0.694 -7.975109 5.311283\n", " trunk | 1.210178 .4830082 2.51 0.012 .2634991 2.156856\n", " weight | -.0007284 .0022358 -0.33 0.745 -.0051105 .0036536\n", "displacement | .001631 .0160425 0.10 0.919 -.0298119 .0330738\n", " _cons | -.2977676 16.7841 -0.02 0.986 -33.19401 32.59847\n", "------------------------------------------------------------------------------\n", "Note: 8 failures and 0 successes completely determined.\n" ] } ], "source": [ "%%stata -d car_df -i main_var -i control_var\n", "\n", "logit large `main_var' `control_var', vce(cluster make)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 1 }