{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introduction to DataFrames in Julia" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By Tyler Ransom, Duke University Social Science Research Institute\n", "\n", "tyler.ransom@duke.edu\n", "\n", "https://github.com/tyleransom" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Julia's `DataFrames` package is largely mirrored after the `data.frame()` package in R. The key underlying principle is that data frames allow for storage of mixed data types (e.g. strings and numbers) in the same object. Data frames also allow for a missing data type, which is `NA` in Julia. Other statistical software packages such as SAS, Stata, SPSS, and Matlab offer similar features in their data storage.\n", "\n", "This tutorial serves to familiarize Julia users with the primary syntax and capaibilities of Julia's `DataFrames` package. There will be emphasis on making connections with Stata's syntax and features, but anyone with experience in statistical programming will be able to make connections to their preferred language.\n", "\n", "First, let's call the packages we'll need for this demonstration. We'll be using Julia version 0.4.1 with DataFrames version 0.6.10 and FreqTables version 0.0.1." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "using DataFrames\n", "using FreqTables" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1. Reading in data, summarizing data structure, and browsing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Reading in a delimited text file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's read in some sample data --- the `auto` dataset from Stata (in CSV form: ). In Julia, the `readtable()` function converts delimited text files into data frames. \n", "\n", "There are a number of options for configuring the read-in operation, but for now we'll use a simple comma-separated file with standard configurations.\n", "\n", "Notice that whatever variable name you choose on the left-hand side of the equals sign will be the name of your data frame moving forward." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "auto = readtable(\"auto.csv\");" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Summary of data structure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's look at the variables that are in our data frame. The `showcols` function accomplishes this task. This is very similar to Stata's `describe` command." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "74x12 DataFrames.DataFrame\n", "| Col # | Name | Eltype | Missing |\n", "|-------|--------------|------------|---------|\n", "| 1 | make | UTF8String | 0 |\n", "| 2 | price | Int64 | 0 |\n", "| 3 | mpg | Int64 | 0 |\n", "| 4 | rep78 | Int64 | 5 |\n", "| 5 | headroom | Float64 | 0 |\n", "| 6 | trunk | Int64 | 0 |\n", "| 7 | weight | Int64 | 0 |\n", "| 8 | length | Int64 | 0 |\n", "| 9 | turn | Int64 | 0 |\n", "| 10 | displacement | Int64 | 0 |\n", "| 11 | gear_ratio | Float64 | 0 |\n", "| 12 | foreign | Int64 | 0 |" ] } ], "source": [ "showcols(auto)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The output of `showcols()` shows us that we have 74 observations, 12 variables, the name and format of each of our variables, and the number of missing observations for each.\n", "\n", "We can also get the length and width of our data frame using the `size()` function:" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "74" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_obs = size(auto,1)" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "12" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_vars = size(auto,2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Browsing your data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's look at some of our variables. We do this by either referencing the name with a `\":\"` in front, or with the column number:" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "74-element DataArrays.DataArray{Int64,1}:\n", " 4099\n", " 4749\n", " 3799\n", " 4816\n", " 7827\n", " 5788\n", " 4453\n", " 5189\n", " 10372\n", " 4082\n", " 11385\n", " 14500\n", " 15906\n", " ⋮\n", " 3995\n", " 12990\n", " 3895\n", " 3798\n", " 5899\n", " 3748\n", " 5719\n", " 7140\n", " 5397\n", " 4697\n", " 6850\n", " 11995" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "auto[:price]" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "74-element DataArrays.DataArray{Int64,1}:\n", " 4099\n", " 4749\n", " 3799\n", " 4816\n", " 7827\n", " 5788\n", " 4453\n", " 5189\n", " 10372\n", " 4082\n", " 11385\n", " 14500\n", " 15906\n", " ⋮\n", " 3995\n", " 12990\n", " 3895\n", " 3798\n", " 5899\n", " 3748\n", " 5719\n", " 7140\n", " 5397\n", " 4697\n", " 6850\n", " 11995" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "auto[2]" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
pricempg
1409922
2474917
3379922
4481620
5782715
6578818
7445326
8518920
91037216
10408219
111138514
121450014
131590621
14329929
15570516
16450422
17510422
18366724
19395519
20398430
21401018
22588616
23634217
24438928
25418721
261149712
271359412
281346614
29382922
30537914
" ], "text/plain": [ "74x2 DataFrames.DataFrame\n", "| Row | price | mpg |\n", "|-----|-------|-----|\n", "| 1 | 4099 | 22 |\n", "| 2 | 4749 | 17 |\n", "| 3 | 3799 | 22 |\n", "| 4 | 4816 | 20 |\n", "| 5 | 7827 | 15 |\n", "| 6 | 5788 | 18 |\n", "| 7 | 4453 | 26 |\n", "| 8 | 5189 | 20 |\n", "| 9 | 10372 | 16 |\n", "| 10 | 4082 | 19 |\n", "| 11 | 11385 | 14 |\n", "⋮\n", "| 63 | 3995 | 30 |\n", "| 64 | 12990 | 14 |\n", "| 65 | 3895 | 26 |\n", "| 66 | 3798 | 35 |\n", "| 67 | 5899 | 18 |\n", "| 68 | 3748 | 31 |\n", "| 69 | 5719 | 18 |\n", "| 70 | 7140 | 23 |\n", "| 71 | 5397 | 41 |\n", "| 72 | 4697 | 25 |\n", "| 73 | 6850 | 25 |\n", "| 74 | 11995 | 17 |" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "auto[:,[:price,:mpg]]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can also use the `head()` and `tail()` functions to view the first `k` and last `k` observations for all variables in our data frame:" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
makepricempgrep78headroomtrunkweightlengthturndisplacementgear_ratioforeign
1AMC Concord40992232.5112930186401213.580
2AMC Pacer47491733.0113350173402582.530
3AMC Spirit379922NA3.0122640168351213.080
4Buick Century48162034.5163250196401962.930
" ], "text/plain": [ "4x12 DataFrames.DataFrame\n", "| Row | make | price | mpg | rep78 | headroom | trunk | weight |\n", "|-----|-----------------|-------|-----|-------|----------|-------|--------|\n", "| 1 | \"AMC Concord\" | 4099 | 22 | 3 | 2.5 | 11 | 2930 |\n", "| 2 | \"AMC Pacer\" | 4749 | 17 | 3 | 3.0 | 11 | 3350 |\n", "| 3 | \"AMC Spirit\" | 3799 | 22 | NA | 3.0 | 12 | 2640 |\n", "| 4 | \"Buick Century\" | 4816 | 20 | 3 | 4.5 | 16 | 3250 |\n", "\n", "| Row | length | turn | displacement | gear_ratio | foreign |\n", "|-----|--------|------|--------------|------------|---------|\n", "| 1 | 186 | 40 | 121 | 3.58 | 0 |\n", "| 2 | 173 | 40 | 258 | 2.53 | 0 |\n", "| 3 | 168 | 35 | 121 | 3.08 | 0 |\n", "| 4 | 196 | 40 | 196 | 2.93 | 0 |" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "head(auto,4)" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
makepricempgrep78headroomtrunkweightlengthturndisplacementgear_ratioforeign
1VW Diesel53974153.015204015535903.781
2VW Rabbit46972543.015193015535893.781
3VW Scirocco68502542.016199015636973.781
4Volvo 260119951752.5143170193371632.981
" ], "text/plain": [ "4x12 DataFrames.DataFrame\n", "| Row | make | price | mpg | rep78 | headroom | trunk | weight |\n", "|-----|---------------|-------|-----|-------|----------|-------|--------|\n", "| 1 | \"VW Diesel\" | 5397 | 41 | 5 | 3.0 | 15 | 2040 |\n", "| 2 | \"VW Rabbit\" | 4697 | 25 | 4 | 3.0 | 15 | 1930 |\n", "| 3 | \"VW Scirocco\" | 6850 | 25 | 4 | 2.0 | 16 | 1990 |\n", "| 4 | \"Volvo 260\" | 11995 | 17 | 5 | 2.5 | 14 | 3170 |\n", "\n", "| Row | length | turn | displacement | gear_ratio | foreign |\n", "|-----|--------|------|--------------|------------|---------|\n", "| 1 | 155 | 35 | 90 | 3.78 | 1 |\n", "| 2 | 155 | 35 | 89 | 3.78 | 1 |\n", "| 3 | 156 | 36 | 97 | 3.78 | 1 |\n", "| 4 | 193 | 37 | 163 | 2.98 | 1 |" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tail(auto,4)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can also list observations of certain variables indexed by their row number:" ] }, { "cell_type": "code", "execution_count": 119, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
headroomtrunk
12.511
23.011
34.516
44.020
" ], "text/plain": [ "4x2 DataFrames.DataFrame\n", "| Row | headroom | trunk |\n", "|-----|----------|-------|\n", "| 1 | 2.5 | 11 |\n", "| 2 | 3.0 | 11 |\n", "| 3 | 4.5 | 16 |\n", "| 4 | 4.0 | 20 |" ] }, "execution_count": 119, "metadata": {}, "output_type": "execute_result" } ], "source": [ "auto[[1;2;4;15],[:headroom,:trunk]]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can also list observations that meet some condition. For example, suppose we want to look at the `headroom` and `trunk` space for all cars that achieve less than 20 miles per gallon:" ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
headroomtrunk
13.011
24.020
34.021
43.517
53.513
64.020
73.516
84.020
93.513
104.017
114.017
124.521
133.522
142.518
153.515
163.516
173.523
183.015
193.016
202.016
214.516
224.020
234.514
243.517
255.016
264.020
271.57
282.016
293.517
303.513
" ], "text/plain": [ "35x2 DataFrames.DataFrame\n", "| Row | headroom | trunk |\n", "|-----|----------|-------|\n", "| 1 | 3.0 | 11 |\n", "| 2 | 4.0 | 20 |\n", "| 3 | 4.0 | 21 |\n", "| 4 | 3.5 | 17 |\n", "| 5 | 3.5 | 13 |\n", "| 6 | 4.0 | 20 |\n", "| 7 | 3.5 | 16 |\n", "| 8 | 4.0 | 20 |\n", "| 9 | 3.5 | 13 |\n", "| 10 | 4.0 | 17 |\n", "| 11 | 4.0 | 17 |\n", "⋮\n", "| 24 | 3.5 | 17 |\n", "| 25 | 5.0 | 16 |\n", "| 26 | 4.0 | 20 |\n", "| 27 | 1.5 | 7 |\n", "| 28 | 2.0 | 16 |\n", "| 29 | 3.5 | 17 |\n", "| 30 | 3.5 | 13 |\n", "| 31 | 3.0 | 15 |\n", "| 32 | 3.5 | 14 |\n", "| 33 | 2.5 | 14 |\n", "| 34 | 2.0 | 11 |\n", "| 35 | 2.5 | 14 |" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "auto[(auto[:,:mpg].<20),[:headroom,:trunk]]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 2. Summary statistics" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can look at summary statistics in a few different ways. First, notice that the `showcols()` function reported the number of `NA` or missing values for each variable.\n", "\n", "The `describe()` function also displays missing value frequencies and percentages, in addition to reporting the min/max, mean, median, number of unique observations, and quartiles for each variable in the data frame:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "make\n", "Length 74\n", "Type UTF8String\n", "NAs 0\n", "NA% 0.0%\n", "Unique 74\n", "\n", "price\n", "Min 3291.0\n", "1st Qu. 4220.25\n", "Median 5006.5\n", "Mean 6165.256756756757\n", "3rd Qu. 6332.25\n", "Max 15906.0\n", "NAs 0\n", "NA% 0.0%\n", "\n", "mpg\n", "Min 12.0\n", "1st Qu. 18.0\n", "Median 20.0\n", "Mean 21.2972972972973\n", "3rd Qu. 24.75\n", "Max 41.0\n", "NAs 0\n", "NA% 0.0%\n", "\n", "rep78\n", "Min 1.0\n", "1st Qu. 3.0\n", "Median 3.0\n", "Mean 3.4057971014492754\n", "3rd Qu. 4.0\n", "Max 5.0\n", "NAs 5\n", "NA% 6.76%\n", "\n", "headroom\n", "Min 1.5\n", "1st Qu. 2.5\n", "Median 3.0\n", "Mean 2.9932432432432434\n", "3rd Qu. 3.5\n", "Max 5.0\n", "NAs 0\n", "NA% 0.0%\n", "\n", "trunk\n", "Min 5.0\n", "1st Qu. 10.25\n", "Median 14.0\n", "Mean 13.756756756756756\n", "3rd Qu. 16.75\n", "Max 23.0\n", "NAs 0\n", "NA% 0.0%\n", "\n", "weight\n", "Min 1760.0\n", "1st Qu. 2250.0\n", "Median 3190.0\n", "Mean 3019.4594594594596\n", "3rd Qu. 3600.0\n", "Max 4840.0\n", "NAs 0\n", "NA% 0.0%\n", "\n", "length\n", "Min 142.0\n", "1st Qu. 170.0\n", "Median 192.5\n", "Mean 187.93243243243242\n", "3rd Qu. 203.75\n", "Max 233.0\n", "NAs 0\n", "NA% 0.0%\n", "\n", "turn\n", "Min 31.0\n", "1st Qu. 36.0\n", "Median 40.0\n", "Mean 39.648648648648646\n", "3rd Qu. 43.0\n", "Max 51.0\n", "NAs 0\n", "NA% 0.0%\n", "\n", "displacement\n", "Min 79.0\n", "1st Qu. 119.0\n", "Median 196.0\n", "Mean 197.2972972972973\n", "3rd Qu. 245.25\n", "Max 425.0\n", "NAs 0\n", "NA% 0.0%\n", "\n", "gear_ratio\n", "Min 2.19\n", "1st Qu. 2.73\n", "Median 2.955\n", "Mean 3.0148648648648644\n", "3rd Qu. 3.3525\n", "Max 3.89\n", "NAs 0\n", "NA% 0.0%\n", "\n", "foreign\n", "Min 0.0\n", "1st Qu. 0.0\n", "Median 0.0\n", "Mean 0.2972972972972973\n", "3rd Qu. 1.0\n", "Max 1.0\n", "NAs 0\n", "NA% 0.0%\n", "\n" ] } ], "source": [ "describe(auto)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can also display the mean for each variable using the `colwise()` function. However, this function will return an error if we include string variables." ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "collapsed": false }, "outputs": [ { "ename": "LoadError", "evalue": "LoadError: MethodError: `+` has no method matching +(::UTF8String, ::UTF8String)\nClosest candidates are:\n +(::Any, ::Any, !Matched::Any, !Matched::Any...)\nwhile loading In[122], in expression starting on line 1", "output_type": "error", "traceback": [ "LoadError: MethodError: `+` has no method matching +(::UTF8String, ::UTF8String)\nClosest candidates are:\n +(::Any, ::Any, !Matched::Any, !Matched::Any...)\nwhile loading In[122], in expression starting on line 1", "", " in mapreduce_seq_impl at reduce.jl:228", " in mapreduce_pairwise_impl at reduce.jl:108", " in _mapreduce at reduce.jl:153", " in mapreduce at C:\\Users\\tmr17\\.julia\\v0.4\\DataArrays\\src\\reduce.jl:110", " in mean at C:\\Users\\tmr17\\.julia\\v0.4\\DataArrays\\src\\reduce.jl:137", " in colwise at C:\\Users\\tmr17\\.julia\\v0.4\\DataFrames\\src\\groupeddataframe\\grouping.jl:248" ] } ], "source": [ "colwise(mean,auto)" ] }, { "cell_type": "code", "execution_count": 171, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "11-element Array{Any,1}:\n", " [6165.257]\n", " [21.297] \n", " [NA] \n", " [2.993] \n", " [13.757] \n", " [3019.459]\n", " [187.932] \n", " [39.649] \n", " [197.297] \n", " [3.015] \n", " [0.297] " ] }, "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ "round(colwise(mean,auto[:,2:end]),3)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Notice that this returns `NA` for variables with at least one missing observation." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Tabulations and cross-tabulations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also compute frequencies of categorical variables, using a couple of different functions:\n", "\n", "`countmap()` returns cell counts as a dictionary:" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Dict{Union{DataArrays.NAtype,Int64},Int64} with 2 entries:\n", " 0 => 52\n", " 1 => 22" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countmap(auto[:foreign])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can also use the `by` structure, coupled with the `nrow` function:" ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
foreignx1
1052
2122
" ], "text/plain": [ "2x2 DataFrames.DataFrame\n", "| Row | foreign | x1 |\n", "|-----|---------|----|\n", "| 1 | 0 | 52 |\n", "| 2 | 1 | 22 |" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "by(auto,:foreign,nrow)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For cross-tabulations, we require the `FreqTables` package, which was loaded earlier." ] }, { "cell_type": "code", "execution_count": 126, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "5x2 NamedArrays.NamedArray{Int64,2,Array{Int64,2},Tuple{Dict{Int64,Int64},Dict{Int64,Int64}}}\n", "rep78 \\ foreign 0 1 \n", "1 2 0 \n", "2 8 0 \n", "3 27 3 \n", "4 9 9 \n", "5 2 9 \n" ] }, "execution_count": 126, "metadata": {}, "output_type": "execute_result" } ], "source": [ "freqtable(auto, :rep78, :foreign, subset=!isna(auto[:rep78]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that, at the moment, the Julia's `DataFrames` is substantially lagging other languages in terms of computing cross-tabulations and contingency tables." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 3. Dropping, keeping, renaming, and generating" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Dropping observations and variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we want to delete observations in our dataset according to some rule. This amounts to *keeping* the complement of the rule. For example, if we want to drop all observations of the data frame where a variable is missing, we index the rows we want to keep with `!isna()` and select all columns:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "auto1 = auto[!isna(auto[:,:rep78]), :];" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "69x12 DataFrames.DataFrame\n", "| Col # | Name | Eltype | Missing |\n", "|-------|--------------|------------|---------|\n", "| 1 | make | UTF8String | 0 |\n", "| 2 | price | Int64 | 0 |\n", "| 3 | mpg | Int64 | 0 |\n", "| 4 | rep78 | Int64 | 0 |\n", "| 5 | headroom | Float64 | 0 |\n", "| 6 | trunk | Int64 | 0 |\n", "| 7 | weight | Int64 | 0 |\n", "| 8 | length | Int64 | 0 |\n", "| 9 | turn | Int64 | 0 |\n", "| 10 | displacement | Int64 | 0 |\n", "| 11 | gear_ratio | Float64 | 0 |\n", "| 12 | foreign | Int64 | 0 |" ] } ], "source": [ "showcols(auto1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Notice that we now have 5 fewer observations in the new data frame, and that there are no missing values." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can drop variables in two different ways:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, by using the complement of a `keep` statement:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "auto1 = auto1[setdiff(names(auto1), [:price,:mpg])];" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "69x10 DataFrames.DataFrame\n", "| Col # | Name | Eltype | Missing |\n", "|-------|--------------|------------|---------|\n", "| 1 | make | UTF8String | 0 |\n", "| 2 | rep78 | Int64 | 0 |\n", "| 3 | headroom | Float64 | 0 |\n", "| 4 | trunk | Int64 | 0 |\n", "| 5 | weight | Int64 | 0 |\n", "| 6 | length | Int64 | 0 |\n", "| 7 | turn | Int64 | 0 |\n", "| 8 | displacement | Int64 | 0 |\n", "| 9 | gear_ratio | Float64 | 0 |\n", "| 10 | foreign | Int64 | 0 |" ] } ], "source": [ "showcols(autod1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Second, we can drop in-place using the `delete!()` function, which overwrites the data frame." ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "collapsed": false }, "outputs": [], "source": [ "delete!(auto1,[:weight,:length]);" ] }, { "cell_type": "code", "execution_count": 140, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "69x8 DataFrames.DataFrame\n", "| Col # | Name | Eltype | Missing |\n", "|-------|--------------|------------|---------|\n", "| 1 | make | UTF8String | 0 |\n", "| 2 | rep78 | Int64 | 0 |\n", "| 3 | headroom | Float64 | 0 |\n", "| 4 | trunk | Int64 | 0 |\n", "| 5 | turn | Int64 | 0 |\n", "| 6 | displacement | Int64 | 0 |\n", "| 7 | gear_ratio | Float64 | 0 |\n", "| 8 | foreign | Int64 | 0 |" ] } ], "source": [ "showcols(auto1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Keeping variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can keep variables simply by indexing the variable names or column numbers of interest:" ] }, { "cell_type": "code", "execution_count": 141, "metadata": { "collapsed": true }, "outputs": [], "source": [ "auto2 = auto[:,[:make,:mpg,:displacement,:gear_ratio]];" ] }, { "cell_type": "code", "execution_count": 142, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "74x4 DataFrames.DataFrame\n", "| Col # | Name | Eltype | Missing |\n", "|-------|--------------|------------|---------|\n", "| 1 | make | UTF8String | 0 |\n", "| 2 | mpg | Int64 | 0 |\n", "| 3 | displacement | Int64 | 0 |\n", "| 4 | gear_ratio | Float64 | 0 |" ] } ], "source": [ "showcols(auto2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Renaming variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we want to rename the variables in the \"kept\" data frame from directly above. This is easily accomplished with the `rename!()` function:" ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "collapsed": true }, "outputs": [], "source": [ "rename!(auto2,[:make,:displacement],[:make_name,:CCs]);" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "74x4 DataFrames.DataFrame\n", "| Col # | Name | Eltype | Missing |\n", "|-------|------------|------------|---------|\n", "| 1 | make_name | UTF8String | 0 |\n", "| 2 | mpg | Int64 | 0 |\n", "| 3 | CCs | Int64 | 0 |\n", "| 4 | gear_ratio | Float64 | 0 |" ] } ], "source": [ "showcols(auto2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Generating new variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Cloning variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cloning a variable is easily done as follows:" ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "collapsed": true }, "outputs": [], "source": [ "auto2[:mpg_same] = auto2[:mpg];" ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "74x5 DataFrames.DataFrame\n", "| Col # | Name | Eltype | Missing |\n", "|-------|------------|------------|---------|\n", "| 1 | make_name | UTF8String | 0 |\n", "| 2 | mpg | Int64 | 0 |\n", "| 3 | CCs | Int64 | 0 |\n", "| 4 | gear_ratio | Float64 | 0 |\n", "| 5 | mpg_same | Int64 | 0 |" ] } ], "source": [ "showcols(auto2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Generating new variables using functions of existing variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To generate a new variable using a function of one or more existing variables, the syntax is a bit more involved. For instance, suppose we want to create a new variable called `mpgSquared`, which is equal to `mpg` squared:" ] }, { "cell_type": "code", "execution_count": 148, "metadata": { "collapsed": true }, "outputs": [], "source": [ "auto2[:mpgSquared] = map(temp -> temp.^2, auto[:mpg]);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We use `map()` to accomplish the task, which takes as arguments a function (using arrow notation) and an input (`auto[:mpg]`). Note that the argument of the function (`temp`, here) can be any name. I used `temp` to emphasize that it is a variable of local scope, and thus purley temporary.\n", "\n", "Finally, note that there is a `\".\"` before the caret symbol, indicating that this is a vectorized operation. Failure to include the `\".\"` will result in an error.\n", "\n", "We can verify that the function worked as expected:" ] }, { "cell_type": "code", "execution_count": 149, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
mpgmpgSquared
122484
217289
322484
420400
515225
618324
" ], "text/plain": [ "6x2 DataFrames.DataFrame\n", "| Row | mpg | mpgSquared |\n", "|-----|-----|------------|\n", "| 1 | 22 | 484 |\n", "| 2 | 17 | 289 |\n", "| 3 | 22 | 484 |\n", "| 4 | 20 | 400 |\n", "| 5 | 15 | 225 |\n", "| 6 | 18 | 324 |" ] }, "execution_count": 149, "metadata": {}, "output_type": "execute_result" } ], "source": [ "head(auto2[:,[:mpg,:mpgSquared]])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can use this framework to generate new variables using any mathematical function. For example, a dummy variable that is equal to 1 if `:mpg` is less than 20 and `:gear_ratio` is less than 3, and 0 otherwise:" ] }, { "cell_type": "code", "execution_count": 157, "metadata": { "collapsed": false }, "outputs": [], "source": [ "auto2[:dummy_var] = map((tempx,tempy) -> (tempx.<20) & (tempy.<3), auto2[:mpg], auto2[:gear_ratio]);" ] }, { "cell_type": "code", "execution_count": 158, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "74x7 DataFrames.DataFrame\n", "| Col # | Name | Eltype | Missing |\n", "|-------|------------|------------|---------|\n", "| 1 | make_name | UTF8String | 0 |\n", "| 2 | mpg | Int64 | 0 |\n", "| 3 | CCs | Int64 | 0 |\n", "| 4 | gear_ratio | Float64 | 0 |\n", "| 5 | mpg_same | Int64 | 0 |\n", "| 6 | mpgSquared | Any | 0 |\n", "| 7 | dummy_var | Bool | 0 |" ] } ], "source": [ "showcols(auto2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Note that the type of the new dummy variable is `Bool` instead of `Int64`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 4. Ordering and Sorting" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Ordering columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we want to change the ordering of the variables of our data frame. This is most easily done as follows:" ] }, { "cell_type": "code", "execution_count": 159, "metadata": { "collapsed": true }, "outputs": [], "source": [ "auto2 = auto2[:,[2;3;4;1;5:end]];" ] }, { "cell_type": "code", "execution_count": 160, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "74x7 DataFrames.DataFrame\n", "| Col # | Name | Eltype | Missing |\n", "|-------|------------|------------|---------|\n", "| 1 | mpg | Int64 | 0 |\n", "| 2 | CCs | Int64 | 0 |\n", "| 3 | gear_ratio | Float64 | 0 |\n", "| 4 | make_name | UTF8String | 0 |\n", "| 5 | mpg_same | Int64 | 0 |\n", "| 6 | mpgSquared | Any | 0 |\n", "| 7 | dummy_var | Bool | 0 |" ] } ], "source": [ "showcols(auto2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Sorting observations by various variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also sort the observations in our data frame by any number of columns and any number of methods (ascending or descending). Performance of the `sort!()` function closely mirrors Stata's `gsort` capabilities.\n", "\n", "Below, we will sort ascending by `:mpg` and descending by `:make_name`:" ] }, { "cell_type": "code", "execution_count": 169, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sort!(auto2,cols=[:mpg,:make_name],rev=[false,true]);" ] }, { "cell_type": "code", "execution_count": 170, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
mpgCCsgear_ratiomake_namempg_samempgSquareddummy_var
1121632.98Volvo 260121681false
212973.78VW Scirocco121225false
314893.78VW Rabbit141225false
414903.78VW Diesel141156false
" ], "text/plain": [ "4x7 DataFrames.DataFrame\n", "| Row | mpg | CCs | gear_ratio | make_name | mpg_same | mpgSquared |\n", "|-----|-----|-----|------------|---------------|----------|------------|\n", "| 1 | 12 | 163 | 2.98 | \"Volvo 260\" | 12 | 1681 |\n", "| 2 | 12 | 97 | 3.78 | \"VW Scirocco\" | 12 | 1225 |\n", "| 3 | 14 | 89 | 3.78 | \"VW Rabbit\" | 14 | 1225 |\n", "| 4 | 14 | 90 | 3.78 | \"VW Diesel\" | 14 | 1156 |\n", "\n", "| Row | dummy_var |\n", "|-----|-----------|\n", "| 1 | false |\n", "| 2 | false |\n", "| 3 | false |\n", "| 4 | false |" ] }, "execution_count": 170, "metadata": {}, "output_type": "execute_result" } ], "source": [ "head(auto2,4)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 5. Reshaping and Merging" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Reshaping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Julia's `DataFrames` allows for reshaping of longitudinal datasets in a similar fashion as other statistical software programs.\n", "\n", "Let's start by hand-creating a \"wide\" panel dataset with 3 individuals and 3 time periods:" ] }, { "cell_type": "code", "execution_count": 173, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
idsexinc1980inc1981inc1982
110500055006000
221200022004400
330300020001000
" ], "text/plain": [ "3x5 DataFrames.DataFrame\n", "| Row | id | sex | inc1980 | inc1981 | inc1982 |\n", "|-----|----|-----|---------|---------|---------|\n", "| 1 | 1 | 0 | 5000 | 5500 | 6000 |\n", "| 2 | 2 | 1 | 2000 | 2200 | 4400 |\n", "| 3 | 3 | 0 | 3000 | 2000 | 1000 |" ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reshape1 = DataFrame(id = 1:3, sex = [0;1;0], \n", " inc1980 = [5000;2000;3000],\n", " inc1981 = [5500;2200;2000],\n", " inc1982 = [6000;4400;1000])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can reshape this data frame to \"long\" format by using the `stack()` command provided by the `DataFrames` package:" ] }, { "cell_type": "code", "execution_count": 174, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
variablevalueidsex
1inc1980500010
2inc1980200021
3inc1980300030
4inc1981550010
5inc1981220021
6inc1981200030
7inc1982600010
8inc1982440021
9inc1982100030
" ], "text/plain": [ "9x4 DataFrames.DataFrame\n", "| Row | variable | value | id | sex |\n", "|-----|----------|-------|----|-----|\n", "| 1 | inc1980 | 5000 | 1 | 0 |\n", "| 2 | inc1980 | 2000 | 2 | 1 |\n", "| 3 | inc1980 | 3000 | 3 | 0 |\n", "| 4 | inc1981 | 5500 | 1 | 0 |\n", "| 5 | inc1981 | 2200 | 2 | 1 |\n", "| 6 | inc1981 | 2000 | 3 | 0 |\n", "| 7 | inc1982 | 6000 | 1 | 0 |\n", "| 8 | inc1982 | 4400 | 2 | 1 |\n", "| 9 | inc1982 | 1000 | 3 | 0 |" ] }, "execution_count": 174, "metadata": {}, "output_type": "execute_result" } ], "source": [ "longform1A = stack(reshape1, [:inc1980, :inc1981, :inc1982], [:id, :sex])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Now we have three replications of each `:id` and `:sex` (the time-invariant columns), as well as two new columns, labeled `:variable` and `:value`.\n", "\n", "We can sort the new dataframe so that it is in a more readable format:" ] }, { "cell_type": "code", "execution_count": 175, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
variablevalueidsex
1inc1980500010
2inc1981550010
3inc1982600010
4inc1980200021
5inc1981220021
6inc1982440021
7inc1980300030
8inc1981200030
9inc1982100030
" ], "text/plain": [ "9x4 DataFrames.DataFrame\n", "| Row | variable | value | id | sex |\n", "|-----|----------|-------|----|-----|\n", "| 1 | inc1980 | 5000 | 1 | 0 |\n", "| 2 | inc1981 | 5500 | 1 | 0 |\n", "| 3 | inc1982 | 6000 | 1 | 0 |\n", "| 4 | inc1980 | 2000 | 2 | 1 |\n", "| 5 | inc1981 | 2200 | 2 | 1 |\n", "| 6 | inc1982 | 4400 | 2 | 1 |\n", "| 7 | inc1980 | 3000 | 3 | 0 |\n", "| 8 | inc1981 | 2000 | 3 | 0 |\n", "| 9 | inc1982 | 1000 | 3 | 0 |" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sort!(longform1A, cols = [:id, :variable])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "And we can also reshape back to \"wide\" format using the `unstack()` function:" ] }, { "cell_type": "code", "execution_count": 176, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
idsexinc1980inc1981inc1982
110500055006000
221200022004400
330300020001000
" ], "text/plain": [ "3x5 DataFrames.DataFrame\n", "| Row | id | sex | inc1980 | inc1981 | inc1982 |\n", "|-----|----|-----|---------|---------|---------|\n", "| 1 | 1 | 0 | 5000 | 5500 | 6000 |\n", "| 2 | 2 | 1 | 2000 | 2200 | 4400 |\n", "| 3 | 3 | 0 | 3000 | 2000 | 1000 |" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wideform1A = unstack(longform1A, :variable, :value)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It's worth noting that this method does not work very well when there are multiple time-varying variables per `:id`. We'll discuss this in detail a bit later." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Merging" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "`DataFrames` also has functions that allow the user to merge two data frames together. There are many different types of possible merges, all accessible via the `join()` function.\n", "\n", "The different types of merges depend on if the user wants to keep unmatched observations from either data frame, not on whether or not the identification is duplicated in the merging data frames (i.e. each type of merge can be used for both one-to-one merges and many-to-one merges).\n", "\n", "The basic syntax is `c = join(a, b, on = [:id1, :id2], kind = symbol)`, where `a` and `b` are data frames each with the identifiers `:id1` and `id2`, and `kind` is a symbol that can take any of the following 7 values:\n", "\n", "Let's show how to do each of these merges using a simple set of data frames." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "1. `:inner`: The output contains rows for values of the key that exist in both the first (left) and second (right) arguments to join *(this is the `keep(match)` option in Stata)*\n", "2. `:left`: The output contains rows for values of the key that exist in the first (left) argument to join, whether or not that value exists in the second (right) argument *(this is the `keep(master)` option in Stata)*\n", "3. `:right`: The output contains rows for values of the key that exist in the second (right) argument to join, whether or not that value exists in the first (left) argument *(this is the `keep(using)` option in Stata)*\n", "4. `:outer`: The output contains rows for values of the key that exist in the first (left) or second (right) argument to join *(this is the Stata default)*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "5. `:semi`: Like an inner join, but output is restricted to columns from the first (left) argument to join *(this is the `keep(match master)` option in Stata)*\n", "6. `:anti`: The output contains rows for values of the key that exist in the first (left) but not the second (right) argument to join. As with semi joins, output is restricted to columns from the first (left) argument *(there is no natural stata equivalent for this)*\n", "7. `:cross`: The output is the cartesian product of rows from the first (left) and second (right) arguments to join *(this is equivalent to Stata's `append` command)*. Note also that `:cross` is the only merge type that does not require an identifier in each data frame\n", "\n", "Let's show how to do each of these merges using a simple set of data frames." ] }, { "cell_type": "code", "execution_count": 178, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
IDName
11John
22Jane
33Mark
44Ann
55Vlad
66Maria
" ], "text/plain": [ "6x2 DataFrames.DataFrame\n", "| Row | ID | Name |\n", "|-----|----|---------|\n", "| 1 | 1 | \"John\" |\n", "| 2 | 2 | \"Jane\" |\n", "| 3 | 3 | \"Mark\" |\n", "| 4 | 4 | \"Ann\" |\n", "| 5 | 5 | \"Vlad\" |\n", "| 6 | 6 | \"Maria\" |" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "name = DataFrame(ID = [1, 2, 3, 4, 5, 6], Name = [\"John\", \"Jane\", \"Mark\", \"Ann\", \"Vlad\", \"Maria\"])" ] }, { "cell_type": "code", "execution_count": 179, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
IDJob
11Lawyer
22Doctor
33Mechanic
44Doctor
55Judge
66Pilot
" ], "text/plain": [ "6x2 DataFrames.DataFrame\n", "| Row | ID | Job |\n", "|-----|----|------------|\n", "| 1 | 1 | \"Lawyer\" |\n", "| 2 | 2 | \"Doctor\" |\n", "| 3 | 3 | \"Mechanic\" |\n", "| 4 | 4 | \"Doctor\" |\n", "| 5 | 5 | \"Judge\" |\n", "| 6 | 6 | \"Pilot\" |" ] }, "execution_count": 179, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jobs = DataFrame(ID = [1, 2, 3, 4, 5, 6], Job = [\"Lawyer\", \"Doctor\", \"Mechanic\", \"Doctor\", \"Judge\", \"Pilot\"])" ] }, { "cell_type": "code", "execution_count": 186, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
IDSibling
11Eric
21Ryan
32Jennifer
43Heather
55Carl
65Dmitri
75Andrei
86Pedro
" ], "text/plain": [ "8x2 DataFrames.DataFrame\n", "| Row | ID | Sibling |\n", "|-----|----|------------|\n", "| 1 | 1 | \"Eric\" |\n", "| 2 | 1 | \"Ryan\" |\n", "| 3 | 2 | \"Jennifer\" |\n", "| 4 | 3 | \"Heather\" |\n", "| 5 | 5 | \"Carl\" |\n", "| 6 | 5 | \"Dmitri\" |\n", "| 7 | 5 | \"Andrei\" |\n", "| 8 | 6 | \"Pedro\" |" ] }, "execution_count": 186, "metadata": {}, "output_type": "execute_result" } ], "source": [ "siblings = DataFrame(ID = [1, 1, 2, 3, 5, 5, 5, 6],\n", " Sibling = [\"Eric\", \"Ryan\", \"Jennifer\", \"Heather\", \"Carl\", \"Dmitri\", \"Andrei\", \"Pedro\"])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's do a simple `:inner` merge on the first `name` and `jobs` data frames:" ] }, { "cell_type": "code", "execution_count": 183, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
IDNameJob
11JohnLawyer
22JaneDoctor
33MarkMechanic
44AnnDoctor
55VladJudge
66MariaPilot
" ], "text/plain": [ "6x3 DataFrames.DataFrame\n", "| Row | ID | Name | Job |\n", "|-----|----|---------|------------|\n", "| 1 | 1 | \"John\" | \"Lawyer\" |\n", "| 2 | 2 | \"Jane\" | \"Doctor\" |\n", "| 3 | 3 | \"Mark\" | \"Mechanic\" |\n", "| 4 | 4 | \"Ann\" | \"Doctor\" |\n", "| 5 | 5 | \"Vlad\" | \"Judge\" |\n", "| 6 | 6 | \"Maria\" | \"Pilot\" |" ] }, "execution_count": 183, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mergedNameJobs = join(name,jobs, on = :ID, kind = :inner)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Now let's see what happens when we merge `name` with `siblings`, under a variety of `join` types:" ] }, { "cell_type": "code", "execution_count": 184, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
IDNameSibling
11JohnEric
21JohnRyan
32JaneJennifer
43MarkHeather
55VladCarl
65VladDmitri
75VladAndrei
86MariaPedro
" ], "text/plain": [ "8x3 DataFrames.DataFrame\n", "| Row | ID | Name | Sibling |\n", "|-----|----|---------|------------|\n", "| 1 | 1 | \"John\" | \"Eric\" |\n", "| 2 | 1 | \"John\" | \"Ryan\" |\n", "| 3 | 2 | \"Jane\" | \"Jennifer\" |\n", "| 4 | 3 | \"Mark\" | \"Heather\" |\n", "| 5 | 5 | \"Vlad\" | \"Carl\" |\n", "| 6 | 5 | \"Vlad\" | \"Dmitri\" |\n", "| 7 | 5 | \"Vlad\" | \"Andrei\" |\n", "| 8 | 6 | \"Maria\" | \"Pedro\" |" ] }, "execution_count": 184, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mergedNameSibsInner = join(name,siblings, on = :ID, kind = :inner)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "With the `:inner` join, those who don't have siblings are removed from the merged data frame." ] }, { "cell_type": "code", "execution_count": 185, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
IDNameSibling
11JohnEric
21JohnRyan
32JaneJennifer
43MarkHeather
55VladCarl
65VladDmitri
75VladAndrei
86MariaPedro
94AnnNA
" ], "text/plain": [ "9x3 DataFrames.DataFrame\n", "| Row | ID | Name | Sibling |\n", "|-----|----|---------|------------|\n", "| 1 | 1 | \"John\" | \"Eric\" |\n", "| 2 | 1 | \"John\" | \"Ryan\" |\n", "| 3 | 2 | \"Jane\" | \"Jennifer\" |\n", "| 4 | 3 | \"Mark\" | \"Heather\" |\n", "| 5 | 5 | \"Vlad\" | \"Carl\" |\n", "| 6 | 5 | \"Vlad\" | \"Dmitri\" |\n", "| 7 | 5 | \"Vlad\" | \"Andrei\" |\n", "| 8 | 6 | \"Maria\" | \"Pedro\" |\n", "| 9 | 4 | \"Ann\" | NA |" ] }, "execution_count": 185, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mergedNameSibsOuter = join(name,siblings, on = :ID, kind = :outer)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "When we instead do an `:outer` join, we see that Ann, who doesn't have any siblings, shows as `NA` under `:Sibling`.\n", "\n", "#### Other less-common merge types:" ] }, { "cell_type": "code", "execution_count": 187, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
IDNameSibling
11JohnEric
21JohnRyan
32JaneJennifer
43MarkHeather
55VladCarl
65VladDmitri
75VladAndrei
86MariaPedro
94AnnNA
" ], "text/plain": [ "9x3 DataFrames.DataFrame\n", "| Row | ID | Name | Sibling |\n", "|-----|----|---------|------------|\n", "| 1 | 1 | \"John\" | \"Eric\" |\n", "| 2 | 1 | \"John\" | \"Ryan\" |\n", "| 3 | 2 | \"Jane\" | \"Jennifer\" |\n", "| 4 | 3 | \"Mark\" | \"Heather\" |\n", "| 5 | 5 | \"Vlad\" | \"Carl\" |\n", "| 6 | 5 | \"Vlad\" | \"Dmitri\" |\n", "| 7 | 5 | \"Vlad\" | \"Andrei\" |\n", "| 8 | 6 | \"Maria\" | \"Pedro\" |\n", "| 9 | 4 | \"Ann\" | NA |" ] }, "execution_count": 187, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mergedNameSibsLeft = join(name,siblings, on = :ID, kind = :left)" ] }, { "cell_type": "code", "execution_count": 188, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
NameIDSibling
1John1Eric
2John1Ryan
3Jane2Jennifer
4Mark3Heather
5Vlad5Carl
6Vlad5Dmitri
7Vlad5Andrei
8Maria6Pedro
" ], "text/plain": [ "8x3 DataFrames.DataFrame\n", "| Row | Name | ID | Sibling |\n", "|-----|---------|----|------------|\n", "| 1 | \"John\" | 1 | \"Eric\" |\n", "| 2 | \"John\" | 1 | \"Ryan\" |\n", "| 3 | \"Jane\" | 2 | \"Jennifer\" |\n", "| 4 | \"Mark\" | 3 | \"Heather\" |\n", "| 5 | \"Vlad\" | 5 | \"Carl\" |\n", "| 6 | \"Vlad\" | 5 | \"Dmitri\" |\n", "| 7 | \"Vlad\" | 5 | \"Andrei\" |\n", "| 8 | \"Maria\" | 6 | \"Pedro\" |" ] }, "execution_count": 188, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mergedNameSibsOuter = join(name,siblings, on = :ID, kind = :right)" ] }, { "cell_type": "code", "execution_count": 191, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
IDName
11John
22Jane
33Mark
45Vlad
56Maria
" ], "text/plain": [ "5x2 DataFrames.DataFrame\n", "| Row | ID | Name |\n", "|-----|----|---------|\n", "| 1 | 1 | \"John\" |\n", "| 2 | 2 | \"Jane\" |\n", "| 3 | 3 | \"Mark\" |\n", "| 4 | 5 | \"Vlad\" |\n", "| 5 | 6 | \"Maria\" |" ] }, "execution_count": 191, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mergedNameSibsSemi = join(name,siblings, on = :ID, kind = :semi)" ] }, { "cell_type": "code", "execution_count": 192, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
IDName
14Ann
" ], "text/plain": [ "1x2 DataFrames.DataFrame\n", "| Row | ID | Name |\n", "|-----|----|-------|\n", "| 1 | 4 | \"Ann\" |" ] }, "execution_count": 192, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mergedNameSibsAnti = join(name,siblings, on = :ID, kind = :anti)" ] }, { "cell_type": "code", "execution_count": 193, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
IDNameID_1Sibling
11John1Eric
21John1Ryan
31John2Jennifer
41John3Heather
51John5Carl
61John5Dmitri
71John5Andrei
81John6Pedro
92Jane1Eric
102Jane1Ryan
112Jane2Jennifer
122Jane3Heather
132Jane5Carl
142Jane5Dmitri
152Jane5Andrei
162Jane6Pedro
173Mark1Eric
183Mark1Ryan
193Mark2Jennifer
203Mark3Heather
213Mark5Carl
223Mark5Dmitri
233Mark5Andrei
243Mark6Pedro
254Ann1Eric
264Ann1Ryan
274Ann2Jennifer
284Ann3Heather
294Ann5Carl
304Ann5Dmitri
" ], "text/plain": [ "48x4 DataFrames.DataFrame\n", "| Row | ID | Name | ID_1 | Sibling |\n", "|-----|----|---------|------|------------|\n", "| 1 | 1 | \"John\" | 1 | \"Eric\" |\n", "| 2 | 1 | \"John\" | 1 | \"Ryan\" |\n", "| 3 | 1 | \"John\" | 2 | \"Jennifer\" |\n", "| 4 | 1 | \"John\" | 3 | \"Heather\" |\n", "| 5 | 1 | \"John\" | 5 | \"Carl\" |\n", "| 6 | 1 | \"John\" | 5 | \"Dmitri\" |\n", "| 7 | 1 | \"John\" | 5 | \"Andrei\" |\n", "| 8 | 1 | \"John\" | 6 | \"Pedro\" |\n", "| 9 | 2 | \"Jane\" | 1 | \"Eric\" |\n", "| 10 | 2 | \"Jane\" | 1 | \"Ryan\" |\n", "| 11 | 2 | \"Jane\" | 2 | \"Jennifer\" |\n", "⋮\n", "| 37 | 5 | \"Vlad\" | 5 | \"Carl\" |\n", "| 38 | 5 | \"Vlad\" | 5 | \"Dmitri\" |\n", "| 39 | 5 | \"Vlad\" | 5 | \"Andrei\" |\n", "| 40 | 5 | \"Vlad\" | 6 | \"Pedro\" |\n", "| 41 | 6 | \"Maria\" | 1 | \"Eric\" |\n", "| 42 | 6 | \"Maria\" | 1 | \"Ryan\" |\n", "| 43 | 6 | \"Maria\" | 2 | \"Jennifer\" |\n", "| 44 | 6 | \"Maria\" | 3 | \"Heather\" |\n", "| 45 | 6 | \"Maria\" | 5 | \"Carl\" |\n", "| 46 | 6 | \"Maria\" | 5 | \"Dmitri\" |\n", "| 47 | 6 | \"Maria\" | 5 | \"Andrei\" |\n", "| 48 | 6 | \"Maria\" | 6 | \"Pedro\" |" ] }, "execution_count": 193, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mergedNameSibsCross = join(name,siblings, kind = :cross)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Reshaping with multiple time-varying variables" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "I mentioned previously that the reshaping method outlined previously does not work very well when there are multiple time-varying variables per :id. With the `join()` functions in hand, this is possible, though not ideal compared to other software packages.\n", "\n", "Let's revisit our previous example, except now with two time-varying variables (inc\\* and ue\\*):" ] }, { "cell_type": "code", "execution_count": 194, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
idsexinc1980inc1981inc1982ue1980ue1981ue1982
110500055006000010
221200022004400100
330300020001000001
" ], "text/plain": [ "3x8 DataFrames.DataFrame\n", "| Row | id | sex | inc1980 | inc1981 | inc1982 | ue1980 | ue1981 | ue1982 |\n", "|-----|----|-----|---------|---------|---------|--------|--------|--------|\n", "| 1 | 1 | 0 | 5000 | 5500 | 6000 | 0 | 1 | 0 |\n", "| 2 | 2 | 1 | 2000 | 2200 | 4400 | 1 | 0 | 0 |\n", "| 3 | 3 | 0 | 3000 | 2000 | 1000 | 0 | 0 | 1 |" ] }, "execution_count": 194, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reshape2 = DataFrame(id = 1:3, sex = [0;1;0], inc1980 = [5000;2000;3000], \n", " inc1981 = [5500;2200;2000],inc1982 = [6000;4400;1000],\n", " ue1980 = [0;1;0], ue1981 = [1;0;0], ue1982 = [0;0;1])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "If we try to reshape this using a similar `stack()` call as before, we get:" ] }, { "cell_type": "code", "execution_count": 195, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
variablevalueidsex
1inc1980500010
2inc1980200021
3inc1980300030
4inc1981550010
5inc1981220021
6inc1981200030
7inc1982600010
8inc1982440021
9inc1982100030
10ue1980010
11ue1980121
12ue1980030
13ue1981110
14ue1981021
15ue1981030
16ue1982010
17ue1982021
18ue1982130
" ], "text/plain": [ "18x4 DataFrames.DataFrame\n", "| Row | variable | value | id | sex |\n", "|-----|----------|-------|----|-----|\n", "| 1 | inc1980 | 5000 | 1 | 0 |\n", "| 2 | inc1980 | 2000 | 2 | 1 |\n", "| 3 | inc1980 | 3000 | 3 | 0 |\n", "| 4 | inc1981 | 5500 | 1 | 0 |\n", "| 5 | inc1981 | 2200 | 2 | 1 |\n", "| 6 | inc1981 | 2000 | 3 | 0 |\n", "| 7 | inc1982 | 6000 | 1 | 0 |\n", "| 8 | inc1982 | 4400 | 2 | 1 |\n", "| 9 | inc1982 | 1000 | 3 | 0 |\n", "| 10 | ue1980 | 0 | 1 | 0 |\n", "| 11 | ue1980 | 1 | 2 | 1 |\n", "| 12 | ue1980 | 0 | 3 | 0 |\n", "| 13 | ue1981 | 1 | 1 | 0 |\n", "| 14 | ue1981 | 0 | 2 | 1 |\n", "| 15 | ue1981 | 0 | 3 | 0 |\n", "| 16 | ue1982 | 0 | 1 | 0 |\n", "| 17 | ue1982 | 0 | 2 | 1 |\n", "| 18 | ue1982 | 1 | 3 | 0 |" ] }, "execution_count": 195, "metadata": {}, "output_type": "execute_result" } ], "source": [ "longform2 = stack(reshape2, [:inc1980, :inc1981, :inc1982, :ue1980, :ue1981, :ue1982],\n", " [:id, :sex])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The inc\\* and ue\\* values are stacked, so that we have double the number of observations we would like to have." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "The remedy for this is to do the reshaping separately for each type of variable, and then merge together." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Converting data frames to regular Julia arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conversion from data frames to regular `Julia` arrays may be required for use of libraries outside of the `DataFrames` and `GLM` world.\n", "\n", "To convert, simply type\n", "\n", "`arrayName = convert(Array,dataFrameName)`\n", "\n", "But be aware that any `NA` elements of the data frame will cause an error to be thrown (because `Julia`'s regular arrays do not know the `NA` type)." ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.4.3", "language": "julia", "name": "julia-0.4" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }