## Example notebook for the %%stata cell magic by the IPyStata package. 

**Author:**   Ties de Kok <tdekok@uw.edu>  
**Homepage:**    https://github.com/TiesdeKok/ipystata  
**PyPI:** https://pypi.python.org/pypi/ipystata  

## Note: this example notebook uses the `Stata Batch Mode` method.

See GitHub for an example notebook using the Windows-only `Stata automation` method.

## Import packages

In [1]:
import pandas as pd

In [2]:
import ipystata

## Configure ipystata

In [1]:
from ipystata.config import config_stata
config_stata('/home/user/stata15/stata-se')
#config_stata(r"D:\Software\stata15\StataSE-64.exe", force_batch=True)  # Windows example; raw string keeps the backslashes literal

**Note:** for this change to take effect you need to restart the notebook kernel (`Kernel` --> `Restart`).
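Before calling `config_stata` it can help to confirm that the path actually points to an executable. The snippet below is a minimal sketch using only the Python standard library. The helper name `looks_like_executable` is hypothetical and not part of IPyStata, and the `config_stata` call is left commented out so the snippet runs without the package.

```python
import os

def looks_like_executable(path):
    """Return True if `path` is an existing file that the current user may execute."""
    expanded = os.path.expanduser(path)  # resolve a leading "~" if present
    return os.path.isfile(expanded) and os.access(expanded, os.X_OK)

# Path taken from the configuration cell above; adjust it to your own install.
stata_path = '/home/user/stata15/stata-se'

if looks_like_executable(stata_path):
    print("Found Stata executable:", stata_path)
    # from ipystata.config import config_stata
    # config_stata(stata_path)
else:
    print("No executable found at:", stata_path)
```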

## Check whether IPyStata is working

In [4]:
%%stata

display "Hello, I am printed by Stata."


Hello, I am printed by Stata.


# Some examples based on the Stata 13 manual

## Load the dataset "auto.dta" in Stata and return it to Python as a Pandas dataframe

The code cell below runs the Stata command **`sysuse auto.dta`** to load the dataset and returns it to Python via the **`-o car_df`** argument.

In [5]:
%%stata -o car_df
sysuse auto.dta


(1978 Automobile Data)


**`car_df`** is a regular Pandas dataframe on which Python / Pandas actions can be performed. 

In [6]:
car_df.head()

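| | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | Domestic |
| 1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | Domestic |
| 2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic |
| 3 | Buick Century | 4816 | 20 | 3.0 | 4.5 | 16 | 3250 | 196 | 40 | 196 | 2.93 | Domestic |
| 4 | Buick Electra | 7827 | 15 | 4.0 | 4.0 | 20 | 4080 | 222 | 43 | 350 | 2.41 | Domestic |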
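As an illustration of the kind of Pandas actions that work on such a dataframe, the sketch below rebuilds the first five rows shown above by hand (a subset of the columns, stored in a hypothetical `sample_df` so the snippet runs without Stata) and computes a few simple summaries.

```python
import pandas as pd

# First five rows of car_df as displayed above, restricted to a few columns.
sample_df = pd.DataFrame({
    'make': ['AMC Concord', 'AMC Pacer', 'AMC Spirit', 'Buick Century', 'Buick Electra'],
    'price': [4099, 4749, 3799, 4816, 7827],
    'mpg': [22, 17, 22, 20, 15],
    'rep78': [3.0, 3.0, None, 3.0, 4.0],  # None becomes NaN, as in the Stata export
})

mean_price = sample_df['price'].mean()            # average price of the five cars
missing_rep78 = sample_df['rep78'].isna().sum()   # count of missing repair records
cheap_cars = sample_df.loc[sample_df['price'] < 4500, 'make'].tolist()

print(mean_price, missing_rep78, cheap_cars)
```

The same calls work unchanged on the full `car_df` returned by the `%%stata -o car_df` cell.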


## Basic descriptive statistics

The argument **`-d`** (or **`--data`**) defines which dataframe is loaded as the dataset in Stata.  
In the example below the Stata command **`tabulate`** generates a two-way frequency table for the dataframe **`car_df`**.

In [7]:
%%stata -d car_df
tabulate foreign headroom


           |                                        headroom
   foreign |       1.5          2        2.5          3        3.5          4        4.5          5 |     Total
-----------+----------------------------------------------------------------------------------------+----------
  Domestic |         3         10          4          7         13         10          4          1 |        52 
   Foreign |         1          3         10          6          2          0          0          0 |        22 
-----------+----------------------------------------------------------------------------------------+----------
     Total |         4         13         14         13         15         10          4          1 |        74 


These descriptive statistics can be replicated in Pandas using the **`crosstab`** function, as the code below shows.

In [8]:
pd.crosstab(car_df['foreign'], car_df['headroom'], margins=True)

| foreign \ headroom | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 | 4.5 | 5.0 | All |
|---|---|---|---|---|---|---|---|---|---|
| Domestic | 3 | 10 | 4 | 7 | 13 | 10 | 4 | 1 | 52 |
| Foreign | 1 | 3 | 10 | 6 | 2 | 0 | 0 | 0 | 22 |
| All | 4 | 13 | 14 | 13 | 15 | 10 | 4 | 1 | 74 |


## Stata graphs

**Note:** due to a limitation of Stata, IPyStata currently returns the graph as a PDF.  
This is a temporary workaround; I hope to find a more suitable fix in the future.

In [9]:
%%stata -gr
use https://stats.idre.ucla.edu/stat/data/hsb2.dta, clear
graph twoway scatter read math


(highschool and beyond (200 cases))


## Use Python lists as Stata macros

In many situations it is convenient to define values or variable names in a Python list or, equivalently, in a Stata macro.  
The **`-i`** (or **`--input`**) argument makes a Python list available in Stata as a local macro.  
For example, **`-i main_var`** converts the Python list **`['mpg', 'rep78']`** into the Stata macro **`` `main_var' ``**, which expands to `mpg rep78`.
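Conceptually, the list ends up in Stata as a space-separated string. The sketch below only illustrates that format in plain Python. It is not IPyStata's actual implementation, and the helper `as_macro_contents` is a hypothetical name introduced here.

```python
main_var = ['mpg', 'rep78']
control_var = ['gear_ratio', 'trunk', 'weight', 'displacement']

def as_macro_contents(values):
    """Join list items with spaces, mimicking the contents of a Stata local macro."""
    return ' '.join(str(v) for v in values)

# What a Stata command looks like once both macros are expanded.
expanded_command = "regress price {} {}, vce(robust)".format(
    as_macro_contents(main_var), as_macro_contents(control_var)
)
print(expanded_command)
```

Keeping the variable names in Python lists means the same lists can be reused for several `%%stata` cells and for Pandas column selection such as `car_df[main_var]`.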

In [10]:
main_var = ['mpg', 'rep78']
control_var = ['gear_ratio', 'trunk', 'weight', 'displacement']

In [11]:
%%stata -d car_df -i main_var -i control_var

display "`main_var'"
display "`control_var'"

regress price `main_var' `control_var', vce(robust)


mpg rep78
gear_ratio trunk weight displacement

Linear regression                               Number of obs     =         69
                                                F(6, 62)          =       8.60
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4124
                                                Root MSE          =     2338.1

------------------------------------------------------------------------------
             |               Robust
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -76.95578   84.95038    -0.91   0.369    -246.7692     92.8576
       rep78 |   899.0818   299.7541     3.00   0.004      299.882    1498.282
  gear_ratio |   1479.744   917.5363     1.61   0.112    -354.3846    3313.873
       trunk |  -110.3163   80.16622    -1.38

## Modify dataset in Stata and return it to Python

It is possible to create new variables or modify the existing dataset in Stata and have it returned as a Pandas dataframe.  
In the example below the output **`-o car_df`** overwrites the input **`-d car_df`**, effectively modifying the dataframe in place.  
Note that the argument **`-np`** (or **`--noprint`**) can be used to suppress any output below the code cell.

In [12]:
%%stata -d car_df -o car_df -np

generate weight_squared = weight^2
generate log_weight = log(weight)

In [13]:
car_df.head(3)

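| | make | price | mpg | rep78 | headroom | trunk | weight | length | turn | displacement | gear_ratio | foreign | weight_squared | log_weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMC Concord | 4099 | 22 | 3.0 | 2.5 | 11 | 2930 | 186 | 40 | 121 | 3.58 | Domestic | 8584900.0 | 7.982758 |
| 1 | AMC Pacer | 4749 | 17 | 3.0 | 3.0 | 11 | 3350 | 173 | 40 | 258 | 2.53 | Domestic | 11222500.0 | 8.116715 |
| 2 | AMC Spirit | 3799 | 22 | NaN | 3.0 | 12 | 2640 | 168 | 35 | 121 | 3.08 | Domestic | 6969600.0 | 7.878534 |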
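For comparison, the same two variables can be created directly in Pandas. The sketch below uses a hand-built `sample_df` (a hypothetical stand-in for `car_df` containing the three weights shown above) and assumes NumPy is available. Stata's `log()` is the natural logarithm, which corresponds to `np.log`.

```python
import numpy as np
import pandas as pd

# Weights of the first three cars, copied from the output above.
sample_df = pd.DataFrame({
    'make': ['AMC Concord', 'AMC Pacer', 'AMC Spirit'],
    'weight': [2930, 3350, 2640],
})

# Equivalent of: generate weight_squared = weight^2
sample_df['weight_squared'] = sample_df['weight'] ** 2

# Equivalent of: generate log_weight = log(weight)
sample_df['log_weight'] = np.log(sample_df['weight'])

print(sample_df)
```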


## Set a custom working directory for this Stata code cell

### Using a directory defined in a variable (this is useful if you need it for many cells)

In [14]:
directory = '~/sandbox'

In [15]:
%%stata -cwd directory -np
display "`c(pwd)'"

### It is also possible to provide the directory as an argument

In [16]:
%%stata -cwd '~/sandbox' -np
display "`c(pwd)'"

## An example case

Create the variable **`large`** in Python and use it as the dependent variable for a binary choice estimation by Stata.

In [17]:
car_df['large'] = [1 if x > 3 and y > 200 else 0 for x, y in zip(car_df['headroom'], car_df['length'])]

In [18]:
car_df[['headroom', 'length', 'large']].head(7)

| | headroom | length | large |
|---|---|---|---|
| 0 | 2.5 | 186 | 0 |
| 1 | 3.0 | 173 | 0 |
| 2 | 3.0 | 168 | 0 |
| 3 | 4.5 | 196 | 0 |
| 4 | 4.0 | 222 | 1 |
| 5 | 4.0 | 218 | 1 |
| 6 | 3.0 | 170 | 0 |
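The list comprehension used to build **`large`** can also be written with vectorized boolean operations, which is the more common Pandas idiom. The sketch below uses a hand-built `sample_df` (a hypothetical stand-in holding the seven rows shown above) so it runs on its own.

```python
import pandas as pd

# headroom and length of the first seven cars, copied from the output above.
sample_df = pd.DataFrame({
    'headroom': [2.5, 3.0, 3.0, 4.5, 4.0, 4.0, 3.0],
    'length': [186, 173, 168, 196, 222, 218, 170],
})

# A car is "large" when headroom exceeds 3 AND length exceeds 200.
is_large = (sample_df['headroom'] > 3) & (sample_df['length'] > 200)

# Convert the boolean mask to a 0/1 indicator suitable for logit in Stata.
sample_df['large'] = is_large.astype(int)

print(sample_df)
```

Note the parentheses around each comparison: `&` binds more tightly than `>` in Python, so they are required.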


In [19]:
%%stata -d car_df -i main_var -i control_var

logit large `main_var' `control_var', vce(cluster make)


Iteration 0:   log pseudolikelihood =  -39.60355  
Iteration 1:   log pseudolikelihood = -19.307161  
Iteration 2:   log pseudolikelihood = -13.526857  
Iteration 3:   log pseudolikelihood = -10.999644  
Iteration 4:   log pseudolikelihood = -10.726345  
Iteration 5:   log pseudolikelihood = -10.723111  
Iteration 6:   log pseudolikelihood = -10.723109  
Iteration 7:   log pseudolikelihood = -10.723109  

Logistic regression                             Number of obs     =         69
                                                Wald chi2(6)      =      12.90
                                                Prob > chi2       =     0.0446
Log pseudolikelihood = -10.723109               Pseudo R2         =     0.7292

                                  (Std. Err. adjusted for 69 clusters in make)
------------------------------------------------------------------------------
             |               Robust
       large |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
