{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This markdown contains the example code for the blog post [How to read and write Stata files in R](https://www.marsja.se/how-to-read-and-write-stata-dta-files-in-r-with-haven/). The code is, of course, more thouroughly explained in that blog post. \n",
    "\n",
    "Note, the difference between this example, and the code in the blog post is as that \"RScript\" is removed frome the file.path() function. This is done, because when running the Jupyter notebook will get the path of the .ipynp file. That is, if the file is in \"RScripts\" already we only need tell R where the data files are (i.e., in \"Data\"). Remember to change the file path so it corresponds to where the .dta files are. If we run the code, as a script, in RStudio, for instance, we need the \"RScript\" to be the second argument. E.g., <code>dtafile &lt;- file.path(getwd(), \"RScript\", \"Data\", \"FifthDayData.dta\")</code>.\n",
    "\n",
    "## Install Haven if missing\n",
    "\n",
    "First, we need the packages. This first code chunk will install them if we don't have them installed. Note, that we could just install tidyverse and get all the above packages.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "list.of.packages <- c(\"haven\", \"readr\", \"readxl\", \"dplyr\")\n",
    "new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,\"Package\"])]\n",
    "if(length(new.packages)) install.packages(new.packages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Path\n",
    "Note, if you have cloned the Git repository or downloaded the files from the git repository and have the example data in the same folder as in the repo you can do as follows to get a running script:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.path <- file.path(getwd(), \"..\", \"SimData\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Load Haven:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(haven)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load a .dta file:\n",
    "\n",
    "Remember to change to the path where the .dta file is."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<caption>A tibble: 6 × 7</caption>\n",
       "<thead>\n",
       "\t<tr><th scope=col>index</th><th scope=col>ID</th><th scope=col>Name</th><th scope=col>Day</th><th scope=col>Age</th><th scope=col>Response</th><th scope=col>Gender</th></tr>\n",
       "\t<tr><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;chr&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
       "</thead>\n",
       "<tbody>\n",
       "\t<tr><td>0</td><td>1</td><td>John  </td><td>Fifth</td><td>23</td><td>0.4537330</td><td>0</td></tr>\n",
       "\t<tr><td>1</td><td>2</td><td>Billie</td><td>Fifth</td><td>22</td><td>0.2573597</td><td>0</td></tr>\n",
       "\t<tr><td>2</td><td>3</td><td>Robert</td><td>Fifth</td><td>20</td><td>0.4433932</td><td>0</td></tr>\n",
       "\t<tr><td>3</td><td>4</td><td>Don   </td><td>Fifth</td><td>27</td><td>0.4235921</td><td>0</td></tr>\n",
       "\t<tr><td>4</td><td>5</td><td>Joseph</td><td>Fifth</td><td>21</td><td>0.5713554</td><td>0</td></tr>\n",
       "\t<tr><td>5</td><td>6</td><td>James </td><td>Fifth</td><td>25</td><td>0.5577922</td><td>0</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "A tibble: 6 × 7\n",
       "\\begin{tabular}{r|lllllll}\n",
       " index & ID & Name & Day & Age & Response & Gender\\\\\n",
       " <dbl> & <dbl> & <chr> & <chr> & <dbl> & <dbl> & <dbl>\\\\\n",
       "\\hline\n",
       "\t 0 & 1 & John   & Fifth & 23 & 0.4537330 & 0\\\\\n",
       "\t 1 & 2 & Billie & Fifth & 22 & 0.2573597 & 0\\\\\n",
       "\t 2 & 3 & Robert & Fifth & 20 & 0.4433932 & 0\\\\\n",
       "\t 3 & 4 & Don    & Fifth & 27 & 0.4235921 & 0\\\\\n",
       "\t 4 & 5 & Joseph & Fifth & 21 & 0.5713554 & 0\\\\\n",
       "\t 5 & 6 & James  & Fifth & 25 & 0.5577922 & 0\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/markdown": [
       "\n",
       "A tibble: 6 × 7\n",
       "\n",
       "| index &lt;dbl&gt; | ID &lt;dbl&gt; | Name &lt;chr&gt; | Day &lt;chr&gt; | Age &lt;dbl&gt; | Response &lt;dbl&gt; | Gender &lt;dbl&gt; |\n",
       "|---|---|---|---|---|---|---|\n",
       "| 0 | 1 | John   | Fifth | 23 | 0.4537330 | 0 |\n",
       "| 1 | 2 | Billie | Fifth | 22 | 0.2573597 | 0 |\n",
       "| 2 | 3 | Robert | Fifth | 20 | 0.4433932 | 0 |\n",
       "| 3 | 4 | Don    | Fifth | 27 | 0.4235921 | 0 |\n",
       "| 4 | 5 | Joseph | Fifth | 21 | 0.5713554 | 0 |\n",
       "| 5 | 6 | James  | Fifth | 25 | 0.5577922 | 0 |\n",
       "\n"
      ],
      "text/plain": [
       "  index ID Name   Day   Age Response  Gender\n",
       "1 0     1  John   Fifth 23  0.4537330 0     \n",
       "2 1     2  Billie Fifth 22  0.2573597 0     \n",
       "3 2     3  Robert Fifth 20  0.4433932 0     \n",
       "4 3     4  Don    Fifth 27  0.4235921 0     \n",
       "5 4     5  Joseph Fifth 21  0.5713554 0     \n",
       "6 5     6  James  Fifth 25  0.5577922 0     "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "dtafile <- file.path(data.path, \"FifthDayData.dta\")\n",
    "\n",
    "## Load the .dta file\n",
    "\n",
    "fifthD.df <- read_dta(dtafile)\n",
    "\n",
    "\n",
    "## Get the first five rows\n",
    "head(fifthD.df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load a .dta file from URL:\n",
    "\n",
    "Now, here we will read the Stata file from a URL:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<caption>A tibble: 6 × 12</caption>\n",
       "<thead>\n",
       "\t<tr><th scope=col>year</th><th scope=col>q</th><th scope=col>y</th><th scope=col>pchick</th><th scope=col>pbeef</th><th scope=col>pcor</th><th scope=col>pf</th><th scope=col>cpi</th><th scope=col>qproda</th><th scope=col>pop</th><th scope=col>meatex</th><th scope=col>time</th></tr>\n",
       "\t<tr><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
       "</thead>\n",
       "<tbody>\n",
       "\t<tr><td>1950</td><td>14.3</td><td>7863</td><td>69.5</td><td>31.2</td><td>59.8</td><td>NA</td><td>24.1</td><td>2628500</td><td>151.684</td><td>NA</td><td>41</td></tr>\n",
       "\t<tr><td>1951</td><td>15.1</td><td>7953</td><td>72.9</td><td>36.5</td><td>72.1</td><td>NA</td><td>26.0</td><td>2843000</td><td>154.287</td><td>NA</td><td>42</td></tr>\n",
       "\t<tr><td>1952</td><td>15.3</td><td>8071</td><td>73.1</td><td>36.2</td><td>71.3</td><td>NA</td><td>26.5</td><td>2851200</td><td>156.954</td><td>NA</td><td>43</td></tr>\n",
       "\t<tr><td>1953</td><td>15.2</td><td>8319</td><td>71.3</td><td>28.5</td><td>62.7</td><td>NA</td><td>26.7</td><td>2953900</td><td>159.565</td><td>NA</td><td>44</td></tr>\n",
       "\t<tr><td>1954</td><td>15.8</td><td>8276</td><td>64.4</td><td>27.4</td><td>63.4</td><td>NA</td><td>26.9</td><td>3099700</td><td>162.391</td><td>NA</td><td>45</td></tr>\n",
       "\t<tr><td>1955</td><td>14.7</td><td>8675</td><td>67.0</td><td>27.1</td><td>56.1</td><td>NA</td><td>26.8</td><td>2958100</td><td>165.275</td><td>NA</td><td>46</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "A tibble: 6 × 12\n",
       "\\begin{tabular}{r|llllllllllll}\n",
       " year & q & y & pchick & pbeef & pcor & pf & cpi & qproda & pop & meatex & time\\\\\n",
       " <dbl> & <dbl> & <dbl> & <dbl> & <dbl> & <dbl> & <dbl> & <dbl> & <dbl> & <dbl> & <dbl> & <dbl>\\\\\n",
       "\\hline\n",
       "\t 1950 & 14.3 & 7863 & 69.5 & 31.2 & 59.8 & NA & 24.1 & 2628500 & 151.684 & NA & 41\\\\\n",
       "\t 1951 & 15.1 & 7953 & 72.9 & 36.5 & 72.1 & NA & 26.0 & 2843000 & 154.287 & NA & 42\\\\\n",
       "\t 1952 & 15.3 & 8071 & 73.1 & 36.2 & 71.3 & NA & 26.5 & 2851200 & 156.954 & NA & 43\\\\\n",
       "\t 1953 & 15.2 & 8319 & 71.3 & 28.5 & 62.7 & NA & 26.7 & 2953900 & 159.565 & NA & 44\\\\\n",
       "\t 1954 & 15.8 & 8276 & 64.4 & 27.4 & 63.4 & NA & 26.9 & 3099700 & 162.391 & NA & 45\\\\\n",
       "\t 1955 & 14.7 & 8675 & 67.0 & 27.1 & 56.1 & NA & 26.8 & 2958100 & 165.275 & NA & 46\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/markdown": [
       "\n",
       "A tibble: 6 × 12\n",
       "\n",
       "| year &lt;dbl&gt; | q &lt;dbl&gt; | y &lt;dbl&gt; | pchick &lt;dbl&gt; | pbeef &lt;dbl&gt; | pcor &lt;dbl&gt; | pf &lt;dbl&gt; | cpi &lt;dbl&gt; | qproda &lt;dbl&gt; | pop &lt;dbl&gt; | meatex &lt;dbl&gt; | time &lt;dbl&gt; |\n",
       "|---|---|---|---|---|---|---|---|---|---|---|---|\n",
       "| 1950 | 14.3 | 7863 | 69.5 | 31.2 | 59.8 | NA | 24.1 | 2628500 | 151.684 | NA | 41 |\n",
       "| 1951 | 15.1 | 7953 | 72.9 | 36.5 | 72.1 | NA | 26.0 | 2843000 | 154.287 | NA | 42 |\n",
       "| 1952 | 15.3 | 8071 | 73.1 | 36.2 | 71.3 | NA | 26.5 | 2851200 | 156.954 | NA | 43 |\n",
       "| 1953 | 15.2 | 8319 | 71.3 | 28.5 | 62.7 | NA | 26.7 | 2953900 | 159.565 | NA | 44 |\n",
       "| 1954 | 15.8 | 8276 | 64.4 | 27.4 | 63.4 | NA | 26.9 | 3099700 | 162.391 | NA | 45 |\n",
       "| 1955 | 14.7 | 8675 | 67.0 | 27.1 | 56.1 | NA | 26.8 | 2958100 | 165.275 | NA | 46 |\n",
       "\n"
      ],
      "text/plain": [
       "  year q    y    pchick pbeef pcor pf cpi  qproda  pop     meatex time\n",
       "1 1950 14.3 7863 69.5   31.2  59.8 NA 24.1 2628500 151.684 NA     41  \n",
       "2 1951 15.1 7953 72.9   36.5  72.1 NA 26.0 2843000 154.287 NA     42  \n",
       "3 1952 15.3 8071 73.1   36.2  71.3 NA 26.5 2851200 156.954 NA     43  \n",
       "4 1953 15.2 8319 71.3   28.5  62.7 NA 26.7 2953900 159.565 NA     44  \n",
       "5 1954 15.8 8276 64.4   27.4  63.4 NA 26.9 3099700 162.391 NA     45  \n",
       "6 1955 14.7 8675 67.0   27.1  56.1 NA 26.8 2958100 165.275 NA     46  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "url = \"http://www.principlesofeconometrics.com/stata/broiler.dta\"\n",
    "\n",
    "data.df <- read_dta(url)\n",
    "\n",
    "head(data.df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read a specific column\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<caption>A tibble: 6 × 1</caption>\n",
       "<thead>\n",
       "\t<tr><th scope=col>pbeef</th></tr>\n",
       "\t<tr><th scope=col>&lt;dbl&gt;</th></tr>\n",
       "</thead>\n",
       "<tbody>\n",
       "\t<tr><td>31.2</td></tr>\n",
       "\t<tr><td>36.5</td></tr>\n",
       "\t<tr><td>36.2</td></tr>\n",
       "\t<tr><td>28.5</td></tr>\n",
       "\t<tr><td>27.4</td></tr>\n",
       "\t<tr><td>27.1</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "A tibble: 6 × 1\n",
       "\\begin{tabular}{r|l}\n",
       " pbeef\\\\\n",
       " <dbl>\\\\\n",
       "\\hline\n",
       "\t 31.2\\\\\n",
       "\t 36.5\\\\\n",
       "\t 36.2\\\\\n",
       "\t 28.5\\\\\n",
       "\t 27.4\\\\\n",
       "\t 27.1\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/markdown": [
       "\n",
       "A tibble: 6 × 1\n",
       "\n",
       "| pbeef &lt;dbl&gt; |\n",
       "|---|\n",
       "| 31.2 |\n",
       "| 36.5 |\n",
       "| 36.2 |\n",
       "| 28.5 |\n",
       "| 27.4 |\n",
       "| 27.1 |\n",
       "\n"
      ],
      "text/plain": [
       "  pbeef\n",
       "1 31.2 \n",
       "2 36.5 \n",
       "3 36.2 \n",
       "4 28.5 \n",
       "5 27.4 \n",
       "6 27.1 "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "data.df <- read_dta(url, col_select=\"pbeef\")\n",
    "\n",
    "head(data.df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read multiple columns\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table>\n",
       "<caption>A tibble: 6 × 4</caption>\n",
       "<thead>\n",
       "\t<tr><th scope=col>year</th><th scope=col>q</th><th scope=col>pbeef</th><th scope=col>pop</th></tr>\n",
       "\t<tr><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th><th scope=col>&lt;dbl&gt;</th></tr>\n",
       "</thead>\n",
       "<tbody>\n",
       "\t<tr><td>1950</td><td>14.3</td><td>31.2</td><td>151.684</td></tr>\n",
       "\t<tr><td>1951</td><td>15.1</td><td>36.5</td><td>154.287</td></tr>\n",
       "\t<tr><td>1952</td><td>15.3</td><td>36.2</td><td>156.954</td></tr>\n",
       "\t<tr><td>1953</td><td>15.2</td><td>28.5</td><td>159.565</td></tr>\n",
       "\t<tr><td>1954</td><td>15.8</td><td>27.4</td><td>162.391</td></tr>\n",
       "\t<tr><td>1955</td><td>14.7</td><td>27.1</td><td>165.275</td></tr>\n",
       "</tbody>\n",
       "</table>\n"
      ],
      "text/latex": [
       "A tibble: 6 × 4\n",
       "\\begin{tabular}{r|llll}\n",
       " year & q & pbeef & pop\\\\\n",
       " <dbl> & <dbl> & <dbl> & <dbl>\\\\\n",
       "\\hline\n",
       "\t 1950 & 14.3 & 31.2 & 151.684\\\\\n",
       "\t 1951 & 15.1 & 36.5 & 154.287\\\\\n",
       "\t 1952 & 15.3 & 36.2 & 156.954\\\\\n",
       "\t 1953 & 15.2 & 28.5 & 159.565\\\\\n",
       "\t 1954 & 15.8 & 27.4 & 162.391\\\\\n",
       "\t 1955 & 14.7 & 27.1 & 165.275\\\\\n",
       "\\end{tabular}\n"
      ],
      "text/markdown": [
       "\n",
       "A tibble: 6 × 4\n",
       "\n",
       "| year &lt;dbl&gt; | q &lt;dbl&gt; | pbeef &lt;dbl&gt; | pop &lt;dbl&gt; |\n",
       "|---|---|---|---|\n",
       "| 1950 | 14.3 | 31.2 | 151.684 |\n",
       "| 1951 | 15.1 | 36.5 | 154.287 |\n",
       "| 1952 | 15.3 | 36.2 | 156.954 |\n",
       "| 1953 | 15.2 | 28.5 | 159.565 |\n",
       "| 1954 | 15.8 | 27.4 | 162.391 |\n",
       "| 1955 | 14.7 | 27.1 | 165.275 |\n",
       "\n"
      ],
      "text/plain": [
       "  year q    pbeef pop    \n",
       "1 1950 14.3 31.2  151.684\n",
       "2 1951 15.1 36.5  154.287\n",
       "3 1952 15.3 36.2  156.954\n",
       "4 1953 15.2 28.5  159.565\n",
       "5 1954 15.8 27.4  162.391\n",
       "6 1955 14.7 27.1  165.275"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "cols <- c(\"year\", \"pbeef\", \"q\", \"pop\")\n",
    "\n",
    "data.df <- read_dta(url, col_select=all_of(cols))\n",
    "\n",
    "head(data.df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read multiple columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(haven);library(dplyr)\n",
    "\n",
    "## Dta file:\n",
    "dtafile  <-  file.path(data.path, \"FifthDayData.dta\")\n",
    "\n",
    "dta.df <- read_dta(dtafile)\n",
    "\n",
    "newdta.df <- select(dta.df, -c(index, Day))\n",
    "\n",
    "write_dta(newdta.df, file.path(data.path, \"NewFifthDayData.dta\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read a CSV File and Write a .dta file\n",
    "\n",
    "Here we use readr and read_csv to read a CSV file and then write it to a .dta file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Parsed with column specification:\n",
      "cols(\n",
      "  ID = \u001b[32mcol_double()\u001b[39m,\n",
      "  Name = \u001b[31mcol_character()\u001b[39m,\n",
      "  Day = \u001b[31mcol_character()\u001b[39m,\n",
      "  Age = \u001b[32mcol_double()\u001b[39m,\n",
      "  Response = \u001b[32mcol_double()\u001b[39m,\n",
      "  Gender = \u001b[32mcol_double()\u001b[39m\n",
      ")\n"
     ]
    }
   ],
   "source": [
    "library(readr)\n",
    "\n",
    "csvfile <- file.path(data.path, \"FirstDayData.csv\") \n",
    "\n",
    "data.df <- read_csv(csvfile)\n",
    "\n",
    "## Saving it as a dta\n",
    "\n",
    "write_dta(data.df, file.path(data.path, \"FirstDayData.dta\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read a Excel File and Write a .dta file\n",
    "\n",
    "Here we use readxl and read_excel to read a CSV file and then write it to a .dta file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(readxl)\n",
    "\n",
    "csvfile <- file.path(data.path, \"example_concat.xlsx\") \n",
    "\n",
    "data.df <- read_excel(csvfile)\n",
    "\n",
    "## Saving it as a dta\n",
    "\n",
    "write_dta(data.df, file.path(data.path,  \"play_data2.dta\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}