Relational data is multiple tables of data that when combined together answer research questions. Relations define the important element, not just the individual tables. Relations are defined between a pair of tables, or potentially complex structures can be built up with more than 2 tables. In many situations, data is stored in a relational format because to do otherwise would introduce redundancy and use unnecessary storage space.
This data structure requires relational verbs to combine data across tables. Mutating joins add new variables to one data frame from matching observations in another, whereas filtering joins filter observations from one data frame based on whether or not they match an observation in the other table.
superheroes
and publishers
Let’s review how these different types of joining operations work with relational data on comic books. Load the rcfss
library. There are two data frames which contain data on comic books.
library(tidyverse)
library(rcfss)
superheroes
## # A tibble: 7 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Magneto bad male Marvel
## 2 Storm good female Marvel
## 3 Mystique bad female Marvel
## 4 Batman good male DC
## 5 Joker bad male DC
## 6 Catwoman bad female DC
## 7 Hellboy good male Dark Horse Comics
publishers
## # A tibble: 3 x 2
## publisher yr_founded
## <chr> <int>
## 1 DC 1934
## 2 Marvel 1939
## 3 Image 1992
Would it make sense to store these two data frames in the same tibble? No! This is because each data frame contains substantively different information:
superheroes
contains data on superheroespublishers
contains data on publishersThe units of analysis are completely different. Just as it made sense to split Minard’s data into two separate data frames, it also makes sense to store them separately here. That said, depending on the type of analysis you seek to perform, it makes sense to join the data frames together temporarily. How should we join them? Well it depends on how you plan to ask your question. Let’s look at the result of several different join operations.
inner_join(x, y)
: Return all rows fromx
where there are matching values iny
, and all columns fromx
andy
. If there are multiple matches betweenx
andy
, all combination of the matches are returned. This is a mutating join.
(ijsp <- inner_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 6 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
We lose Hellboy in the join because, although he appears in x = superheroes
, his publisher Dark Horse Comics does not appear in y = publishers
. The join result has all variables from x = superheroes
plus yr_founded
, from y
.
|
|
||||||||||||||||||||||||||||||||||||||||
|
left_join(x, y)
: Return all rows fromx
, and all columns fromx
andy
. If there are multiple matches betweenx
andy
, all combination of the matches are returned. This is a mutating join.
(ljsp <- left_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 7 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA
We basically get x = superheroes
back, but with the addition of variable yr_founded
, which is unique to y = publishers
. Hellboy, whose publisher does not appear in y = publishers
, has an NA
for yr_founded
.
|
|
||||||||||||||||||||||||||||||||||||||||
|
right_join(x, y)
: Return all rows fromy
, and all columns fromx
andy
. If there are multiple matches betweenx
andy
, all combination of the matches are returned. This is a mutating join.
(rjsp <- right_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 7 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Batman good male DC 1934
## 2 Joker bad male DC 1934
## 3 Catwoman bad female DC 1934
## 4 Magneto bad male Marvel 1939
## 5 Storm good female Marvel 1939
## 6 Mystique bad female Marvel 1939
## 7 <NA> <NA> <NA> Image 1992
We basically get y = publishers
back, but with the addition of variables name
, alignment
, and gender
, which is unique to x = superheroes
. Image, who did not publish any of the characters in superheroes
, has an NA
for the new variables.
We could also accomplish virtually the same thing using left_join()
by reversing the order of the data frames in the function:
left_join(publishers, superheroes)
## Joining, by = "publisher"
## # A tibble: 7 x 5
## publisher yr_founded name alignment gender
## <chr> <int> <chr> <chr> <chr>
## 1 DC 1934 Batman good male
## 2 DC 1934 Joker bad male
## 3 DC 1934 Catwoman bad female
## 4 Marvel 1939 Magneto bad male
## 5 Marvel 1939 Storm good female
## 6 Marvel 1939 Mystique bad female
## 7 Image 1992 <NA> <NA> <NA>
Doing so returns the same basic data frame, with the column orders reversed. right_join()
is not used as commonly as left_join()
, but works well in a piped operation when you perform several functions on x
but then want to join it with y
and only keep rows that appear in y
.
|
|
||||||||||||||||||||||||||||||||||||||||
|
full_join(x, y)
: Return all rows and all columns from bothx
andy
. Where there are not matching values, returnsNA
for the one missing. This is a mutating join.
(fjsp <- full_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 8 x 5
## name alignment gender publisher yr_founded
## <chr> <chr> <chr> <chr> <int>
## 1 Magneto bad male Marvel 1939
## 2 Storm good female Marvel 1939
## 3 Mystique bad female Marvel 1939
## 4 Batman good male DC 1934
## 5 Joker bad male DC 1934
## 6 Catwoman bad female DC 1934
## 7 Hellboy good male Dark Horse Comics NA
## 8 <NA> <NA> <NA> Image 1992
We get all rows of x = superheroes
plus a new row from y = publishers
, containing the publisher “Image”. We get all variables from x = superheroes
AND all variables from y = publishers
. Any row that derives solely from one table or the other carries NA
s in the variables found only in the other table.
|
|
|||||||||||||||||||||||||||||||||||||||||||||
|
semi_join(x, y)
: Return all rows fromx
where there are matching values iny
, keeping just columns fromx
. A semi join differs from an inner join because an inner join will return one row ofx
for each matching row ofy
(potentially duplicating rows inx
), whereas a semi join will never duplicate rows ofx
. This is a filtering join.
(sjsp <- semi_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 6 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Batman good male DC
## 2 Joker bad male DC
## 3 Catwoman bad female DC
## 4 Magneto bad male Marvel
## 5 Storm good female Marvel
## 6 Mystique bad female Marvel
We get a similar result as with inner_join()
but the join result contains only the variables originally found in x = superheroes
. But note the row order has changed.
|
|
||||||||||||||||||||||||||||||||||||||||
|
anti_join(x, y)
: Return all rows fromx
where there are not matching values iny
, keeping just columns fromx
. This is a filtering join.
(ajsp <- anti_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 1 x 4
## name alignment gender publisher
## <chr> <chr> <chr> <chr>
## 1 Hellboy good male Dark Horse Comics
We keep only Hellboy now (and do not get yr_founded
).
|
|
||||||||||||||||||||||||||||||||||||||||
|
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.3 (2017-11-30)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2018-03-19
## Packages -----------------------------------------------------------------
## package * version date source
## backports 1.1.2 2017-12-13 CRAN (R 3.4.3)
## base * 3.4.3 2017-12-07 local
## compiler 3.4.3 2017-12-07 local
## datasets * 3.4.3 2017-12-07 local
## devtools 1.13.5 2018-02-18 CRAN (R 3.4.3)
## digest 0.6.15 2018-01-28 CRAN (R 3.4.3)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.1)
## graphics * 3.4.3 2017-12-07 local
## grDevices * 3.4.3 2017-12-07 local
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## knitr 1.20 2018-02-20 CRAN (R 3.4.3)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.3 2017-12-07 local
## Rcpp 0.12.15 2018-01-20 CRAN (R 3.4.3)
## rmarkdown 1.9 2018-03-01 CRAN (R 3.4.3)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## stats * 3.4.3 2017-12-07 local
## stringi 1.1.7 2018-03-12 CRAN (R 3.4.3)
## stringr 1.3.0 2018-02-19 CRAN (R 3.4.3)
## tools 3.4.3 2017-12-07 local
## utils * 3.4.3 2017-12-07 local
## withr 2.1.1 2017-12-19 CRAN (R 3.4.3)
## yaml 2.1.18 2018-03-08 CRAN (R 3.4.4)
This work is licensed under the CC BY-NC 4.0 Creative Commons License.