{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This markdown contains the example code for the blog post [How to read and write Stata files in R](https://www.marsja.se/how-to-read-and-write-stata-dta-files-in-r-with-haven/). The code is, of course, more thouroughly explained in that blog post. \n", "\n", "Note, the difference between this example, and the code in the blog post is as that \"RScript\" is removed frome the file.path() function. This is done, because when running the Jupyter notebook will get the path of the .ipynp file. That is, if the file is in \"RScripts\" already we only need tell R where the data files are (i.e., in \"Data\"). Remember to change the file path so it corresponds to where the .dta files are. If we run the code, as a script, in RStudio, for instance, we need the \"RScript\" to be the second argument. E.g., dtafile <- file.path(getwd(), \"RScript\", \"Data\", \"FifthDayData.dta\").\n", "\n", "## Install Haven if missing\n", "\n", "First, we need the packages. This first code chunk will install them if we don't have them installed. Note, that we could just install tidyverse and get all the above packages.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "list.of.packages <- c(\"haven\", \"readr\", \"readxl\", \"dplyr\")\n", "new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,\"Package\"])]\n", "if(length(new.packages)) install.packages(new.packages)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Path\n", "Note, if you have cloned the Git repository or downloaded the files from the git repository and have the example data in the same folder as in the repo you can do as follows to get a running script:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "data.path <- file.path(getwd(), \"..\", \"SimData\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Load Haven:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "library(haven)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load a .dta file:\n", "\n", "Remember to change to the path where the .dta file is." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 6 × 7
indexIDNameDayAgeResponseGender
<dbl><dbl><chr><chr><dbl><dbl><dbl>
01John Fifth230.45373300
12BillieFifth220.25735970
23RobertFifth200.44339320
34Don Fifth270.42359210
45JosephFifth210.57135540
56James Fifth250.55779220
\n" ], "text/latex": [ "A tibble: 6 × 7\n", "\\begin{tabular}{r|lllllll}\n", " index & ID & Name & Day & Age & Response & Gender\\\\\n", " & & & & & & \\\\\n", "\\hline\n", "\t 0 & 1 & John & Fifth & 23 & 0.4537330 & 0\\\\\n", "\t 1 & 2 & Billie & Fifth & 22 & 0.2573597 & 0\\\\\n", "\t 2 & 3 & Robert & Fifth & 20 & 0.4433932 & 0\\\\\n", "\t 3 & 4 & Don & Fifth & 27 & 0.4235921 & 0\\\\\n", "\t 4 & 5 & Joseph & Fifth & 21 & 0.5713554 & 0\\\\\n", "\t 5 & 6 & James & Fifth & 25 & 0.5577922 & 0\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 7\n", "\n", "| index <dbl> | ID <dbl> | Name <chr> | Day <chr> | Age <dbl> | Response <dbl> | Gender <dbl> |\n", "|---|---|---|---|---|---|---|\n", "| 0 | 1 | John | Fifth | 23 | 0.4537330 | 0 |\n", "| 1 | 2 | Billie | Fifth | 22 | 0.2573597 | 0 |\n", "| 2 | 3 | Robert | Fifth | 20 | 0.4433932 | 0 |\n", "| 3 | 4 | Don | Fifth | 27 | 0.4235921 | 0 |\n", "| 4 | 5 | Joseph | Fifth | 21 | 0.5713554 | 0 |\n", "| 5 | 6 | James | Fifth | 25 | 0.5577922 | 0 |\n", "\n" ], "text/plain": [ " index ID Name Day Age Response Gender\n", "1 0 1 John Fifth 23 0.4537330 0 \n", "2 1 2 Billie Fifth 22 0.2573597 0 \n", "3 2 3 Robert Fifth 20 0.4433932 0 \n", "4 3 4 Don Fifth 27 0.4235921 0 \n", "5 4 5 Joseph Fifth 21 0.5713554 0 \n", "6 5 6 James Fifth 25 0.5577922 0 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dtafile <- file.path(data.path, \"FifthDayData.dta\")\n", "\n", "## Load the .dta file\n", "\n", "fifthD.df <- read_dta(dtafile)\n", "\n", "\n", "## Get the first five rows\n", "head(fifthD.df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load a .dta file from URL:\n", "\n", "Now, here we will read the Stata file from a URL:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 6 × 12
yearqypchickpbeefpcorpfcpiqprodapopmeatextime
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
195014.3786369.531.259.8NA24.12628500151.684NA41
195115.1795372.936.572.1NA26.02843000154.287NA42
195215.3807173.136.271.3NA26.52851200156.954NA43
195315.2831971.328.562.7NA26.72953900159.565NA44
195415.8827664.427.463.4NA26.93099700162.391NA45
195514.7867567.027.156.1NA26.82958100165.275NA46
\n" ], "text/latex": [ "A tibble: 6 × 12\n", "\\begin{tabular}{r|llllllllllll}\n", " year & q & y & pchick & pbeef & pcor & pf & cpi & qproda & pop & meatex & time\\\\\n", " & & & & & & & & & & & \\\\\n", "\\hline\n", "\t 1950 & 14.3 & 7863 & 69.5 & 31.2 & 59.8 & NA & 24.1 & 2628500 & 151.684 & NA & 41\\\\\n", "\t 1951 & 15.1 & 7953 & 72.9 & 36.5 & 72.1 & NA & 26.0 & 2843000 & 154.287 & NA & 42\\\\\n", "\t 1952 & 15.3 & 8071 & 73.1 & 36.2 & 71.3 & NA & 26.5 & 2851200 & 156.954 & NA & 43\\\\\n", "\t 1953 & 15.2 & 8319 & 71.3 & 28.5 & 62.7 & NA & 26.7 & 2953900 & 159.565 & NA & 44\\\\\n", "\t 1954 & 15.8 & 8276 & 64.4 & 27.4 & 63.4 & NA & 26.9 & 3099700 & 162.391 & NA & 45\\\\\n", "\t 1955 & 14.7 & 8675 & 67.0 & 27.1 & 56.1 & NA & 26.8 & 2958100 & 165.275 & NA & 46\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 12\n", "\n", "| year <dbl> | q <dbl> | y <dbl> | pchick <dbl> | pbeef <dbl> | pcor <dbl> | pf <dbl> | cpi <dbl> | qproda <dbl> | pop <dbl> | meatex <dbl> | time <dbl> |\n", "|---|---|---|---|---|---|---|---|---|---|---|---|\n", "| 1950 | 14.3 | 7863 | 69.5 | 31.2 | 59.8 | NA | 24.1 | 2628500 | 151.684 | NA | 41 |\n", "| 1951 | 15.1 | 7953 | 72.9 | 36.5 | 72.1 | NA | 26.0 | 2843000 | 154.287 | NA | 42 |\n", "| 1952 | 15.3 | 8071 | 73.1 | 36.2 | 71.3 | NA | 26.5 | 2851200 | 156.954 | NA | 43 |\n", "| 1953 | 15.2 | 8319 | 71.3 | 28.5 | 62.7 | NA | 26.7 | 2953900 | 159.565 | NA | 44 |\n", "| 1954 | 15.8 | 8276 | 64.4 | 27.4 | 63.4 | NA | 26.9 | 3099700 | 162.391 | NA | 45 |\n", "| 1955 | 14.7 | 8675 | 67.0 | 27.1 | 56.1 | NA | 26.8 | 2958100 | 165.275 | NA | 46 |\n", "\n" ], "text/plain": [ " year q y pchick pbeef pcor pf cpi qproda pop meatex time\n", "1 1950 14.3 7863 69.5 31.2 59.8 NA 24.1 2628500 151.684 NA 41 \n", "2 1951 15.1 7953 72.9 36.5 72.1 NA 26.0 2843000 154.287 NA 42 \n", "3 1952 15.3 8071 73.1 36.2 71.3 NA 26.5 2851200 156.954 NA 43 \n", "4 1953 15.2 8319 71.3 28.5 62.7 NA 26.7 2953900 159.565 NA 44 \n", "5 1954 15.8 8276 64.4 27.4 63.4 NA 26.9 3099700 162.391 NA 45 \n", "6 1955 14.7 8675 67.0 27.1 56.1 NA 26.8 2958100 165.275 NA 46 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "url = \"http://www.principlesofeconometrics.com/stata/broiler.dta\"\n", "\n", "data.df <- read_dta(url)\n", "\n", "head(data.df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read a specific column\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 6 × 1
pbeef
<dbl>
31.2
36.5
36.2
28.5
27.4
27.1
\n" ], "text/latex": [ "A tibble: 6 × 1\n", "\\begin{tabular}{r|l}\n", " pbeef\\\\\n", " \\\\\n", "\\hline\n", "\t 31.2\\\\\n", "\t 36.5\\\\\n", "\t 36.2\\\\\n", "\t 28.5\\\\\n", "\t 27.4\\\\\n", "\t 27.1\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 1\n", "\n", "| pbeef <dbl> |\n", "|---|\n", "| 31.2 |\n", "| 36.5 |\n", "| 36.2 |\n", "| 28.5 |\n", "| 27.4 |\n", "| 27.1 |\n", "\n" ], "text/plain": [ " pbeef\n", "1 31.2 \n", "2 36.5 \n", "3 36.2 \n", "4 28.5 \n", "5 27.4 \n", "6 27.1 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "data.df <- read_dta(url, col_select=\"pbeef\")\n", "\n", "head(data.df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read multiple columns\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
A tibble: 6 × 4
yearqpbeefpop
<dbl><dbl><dbl><dbl>
195014.331.2151.684
195115.136.5154.287
195215.336.2156.954
195315.228.5159.565
195415.827.4162.391
195514.727.1165.275
\n" ], "text/latex": [ "A tibble: 6 × 4\n", "\\begin{tabular}{r|llll}\n", " year & q & pbeef & pop\\\\\n", " & & & \\\\\n", "\\hline\n", "\t 1950 & 14.3 & 31.2 & 151.684\\\\\n", "\t 1951 & 15.1 & 36.5 & 154.287\\\\\n", "\t 1952 & 15.3 & 36.2 & 156.954\\\\\n", "\t 1953 & 15.2 & 28.5 & 159.565\\\\\n", "\t 1954 & 15.8 & 27.4 & 162.391\\\\\n", "\t 1955 & 14.7 & 27.1 & 165.275\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "A tibble: 6 × 4\n", "\n", "| year <dbl> | q <dbl> | pbeef <dbl> | pop <dbl> |\n", "|---|---|---|---|\n", "| 1950 | 14.3 | 31.2 | 151.684 |\n", "| 1951 | 15.1 | 36.5 | 154.287 |\n", "| 1952 | 15.3 | 36.2 | 156.954 |\n", "| 1953 | 15.2 | 28.5 | 159.565 |\n", "| 1954 | 15.8 | 27.4 | 162.391 |\n", "| 1955 | 14.7 | 27.1 | 165.275 |\n", "\n" ], "text/plain": [ " year q pbeef pop \n", "1 1950 14.3 31.2 151.684\n", "2 1951 15.1 36.5 154.287\n", "3 1952 15.3 36.2 156.954\n", "4 1953 15.2 28.5 159.565\n", "5 1954 15.8 27.4 162.391\n", "6 1955 14.7 27.1 165.275" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cols <- c(\"year\", \"pbeef\", \"q\", \"pop\")\n", "\n", "data.df <- read_dta(url, col_select=all_of(cols))\n", "\n", "head(data.df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read multiple columns" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "library(haven);library(dplyr)\n", "\n", "## Dta file:\n", "dtafile <- file.path(data.path, \"FifthDayData.dta\")\n", "\n", "dta.df <- read_dta(dtafile)\n", "\n", "newdta.df <- select(dta.df, -c(index, Day))\n", "\n", "write_dta(newdta.df, file.path(data.path, \"NewFifthDayData.dta\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read a CSV File and Write a .dta file\n", "\n", "Here we use readr and read_csv to read a CSV file and then write it to a .dta file:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Parsed with column specification:\n", "cols(\n", " ID = \u001b[32mcol_double()\u001b[39m,\n", " Name = \u001b[31mcol_character()\u001b[39m,\n", " Day = \u001b[31mcol_character()\u001b[39m,\n", " Age = \u001b[32mcol_double()\u001b[39m,\n", " Response = \u001b[32mcol_double()\u001b[39m,\n", " Gender = \u001b[32mcol_double()\u001b[39m\n", ")\n" ] } ], "source": [ "library(readr)\n", "\n", "csvfile <- file.path(data.path, \"FirstDayData.csv\") \n", "\n", "data.df <- read_csv(csvfile)\n", "\n", "## Saving it as a dta\n", "\n", "write_dta(data.df, file.path(data.path, \"FirstDayData.dta\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read a Excel File and Write a .dta file\n", "\n", "Here we use readxl and read_excel to read a CSV file and then write it to a .dta file:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "library(readxl)\n", "\n", "csvfile <- file.path(data.path, \"example_concat.xlsx\") \n", "\n", "data.df <- read_excel(csvfile)\n", "\n", "## Saving it as a dta\n", "\n", "write_dta(data.df, file.path(data.path, \"play_data2.dta\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }