Data wrangling: relational data and factors

MACS 30500 University of Chicago

Introduction to relational data

  • Multiple tables of data that, when combined, answer research questions
  • The relations between tables matter, not just the individual tables
  • Relations are defined between a pair of tables
  • Relational verbs
    • Mutating joins
    • Filtering joins

Superheroes

name      alignment  gender  publisher
Magneto   bad        male    Marvel
Storm     good       female  Marvel
Mystique  bad        female  Marvel
Batman    good       male    DC
Joker     bad        male    DC
Catwoman  bad        female  DC
Hellboy   good       male    Dark Horse Comics

Publishers

publisher  yr_founded
DC         1934
Marvel     1939
Image      1992
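
To follow along, the two tables above can be rebuilt by hand. This is a minimal sketch using base R data frames; the object names superheroes and publishers match the join calls in these slides, while shared_key is a helper name introduced here for illustration.

```r
# Sketch: recreate the two example tables as base R data frames.
superheroes <- data.frame(
  name      = c("Magneto", "Storm", "Mystique", "Batman",
                "Joker", "Catwoman", "Hellboy"),
  alignment = c("bad", "good", "bad", "good", "bad", "bad", "good"),
  gender    = c("male", "female", "female", "male",
                "male", "female", "male"),
  publisher = c("Marvel", "Marvel", "Marvel", "DC",
                "DC", "DC", "Dark Horse Comics"),
  stringsAsFactors = FALSE
)

publishers <- data.frame(
  publisher  = c("DC", "Marvel", "Image"),
  yr_founded = c(1934L, 1939L, 1992L),
  stringsAsFactors = FALSE
)

# The column the two tables have in common is the key that relates them.
shared_key <- intersect(names(superheroes), names(publishers))
```

Here shared_key contains only publisher, which is the column every join in this section matches on.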

inner_join(x = superheroes, y = publishers)

name      alignment  gender  publisher  yr_founded
Magneto   bad        male    Marvel     1939
Storm     good       female  Marvel     1939
Mystique  bad        female  Marvel     1939
Batman    good       male    DC         1934
Joker     bad        male    DC         1934
Catwoman  bad        female  DC         1934
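
The call above leaves the key implicit, so dplyr joins on every column name the two tables share. A minimal sketch of the same inner join with the key named explicitly, assuming dplyr is installed; the result name matched is illustrative only.

```r
library(dplyr)

superheroes <- data.frame(
  name      = c("Magneto", "Storm", "Mystique", "Batman",
                "Joker", "Catwoman", "Hellboy"),
  alignment = c("bad", "good", "bad", "good", "bad", "bad", "good"),
  gender    = c("male", "female", "female", "male",
                "male", "female", "male"),
  publisher = c("Marvel", "Marvel", "Marvel", "DC",
                "DC", "DC", "Dark Horse Comics"),
  stringsAsFactors = FALSE
)
publishers <- data.frame(
  publisher  = c("DC", "Marvel", "Image"),
  yr_founded = c(1934L, 1939L, 1992L),
  stringsAsFactors = FALSE
)

# Name the key explicitly instead of relying on the shared column names.
matched <- inner_join(superheroes, publishers, by = "publisher")

# Only rows with a publisher present in BOTH tables survive:
# Hellboy (Dark Horse Comics) and Image are dropped.
```

Stating by = "publisher" also keeps the join stable if either table later gains another column with a matching name.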

left_join(x = superheroes, y = publishers)

name      alignment  gender  publisher          yr_founded
Magneto   bad        male    Marvel             1939
Storm     good       female  Marvel             1939
Mystique  bad        female  Marvel             1939
Batman    good       male    DC                 1934
Joker     bad        male    DC                 1934
Catwoman  bad        female  DC                 1934
Hellboy   good       male    Dark Horse Comics  NA
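
A left join keeps every row of x, so the NA in yr_founded marks a row that found no match in y. One way to list those rows is sketched below, assuming dplyr is installed; joined and unmatched are names introduced here for illustration.

```r
library(dplyr)

superheroes <- data.frame(
  name      = c("Magneto", "Storm", "Mystique", "Batman",
                "Joker", "Catwoman", "Hellboy"),
  alignment = c("bad", "good", "bad", "good", "bad", "bad", "good"),
  gender    = c("male", "female", "female", "male",
                "male", "female", "male"),
  publisher = c("Marvel", "Marvel", "Marvel", "DC",
                "DC", "DC", "Dark Horse Comics"),
  stringsAsFactors = FALSE
)
publishers <- data.frame(
  publisher  = c("DC", "Marvel", "Image"),
  yr_founded = c(1934L, 1939L, 1992L),
  stringsAsFactors = FALSE
)

# All seven superheroes are kept; unmatched ones get NA for y's columns.
joined <- left_join(superheroes, publishers, by = "publisher")

# Rows whose yr_founded is NA had no matching publisher.
# (This check works here because publishers itself contains no NA values.)
unmatched <- joined$name[is.na(joined$yr_founded)]
```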

right_join(x = superheroes, y = publishers)

name      alignment  gender  publisher  yr_founded
Batman    good       male    DC         1934
Joker     bad        male    DC         1934
Catwoman  bad        female  DC         1934
Magneto   bad        male    Marvel     1939
Storm     good       female  Marvel     1939
Mystique  bad        female  Marvel     1939
NA        NA         NA      Image      1992
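
A right join can be read as a left join with the tables swapped. The sketch below, assuming dplyr is installed, checks that both calls return the same set of rows and differ only in ordering; row_key is a hypothetical helper written for this comparison.

```r
library(dplyr)

superheroes <- data.frame(
  name      = c("Magneto", "Storm", "Mystique", "Batman",
                "Joker", "Catwoman", "Hellboy"),
  alignment = c("bad", "good", "bad", "good", "bad", "bad", "good"),
  gender    = c("male", "female", "female", "male",
                "male", "female", "male"),
  publisher = c("Marvel", "Marvel", "Marvel", "DC",
                "DC", "DC", "Dark Horse Comics"),
  stringsAsFactors = FALSE
)
publishers <- data.frame(
  publisher  = c("DC", "Marvel", "Image"),
  yr_founded = c(1934L, 1939L, 1992L),
  stringsAsFactors = FALSE
)

right        <- right_join(superheroes, publishers, by = "publisher")
left_swapped <- left_join(publishers, superheroes, by = "publisher")

# Hypothetical helper: collapse each row to one string and sort, so the
# comparison ignores row order and column order.
row_key <- function(df) sort(paste(df$name, df$publisher, df$yr_founded))

same_rows <- identical(row_key(right), row_key(left_swapped))
```

The columns of x come first in either call, so right starts with name while left_swapped starts with publisher. Row order can also vary between dplyr versions, which is why the comparison sorts first.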

full_join(x = superheroes, y = publishers)

name      alignment  gender  publisher          yr_founded
Magneto   bad        male    Marvel             1939
Storm     good       female  Marvel             1939
Mystique  bad        female  Marvel             1939
Batman    good       male    DC                 1934
Joker     bad        male    DC                 1934
Catwoman  bad        female  DC                 1934
Hellboy   good       male    Dark Horse Comics  NA
NA        NA         NA      Image              1992
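
Because a full join keeps unmatched rows from both sides, the pattern of NA values shows which table each unmatched key came from. A rough sketch, assuming dplyr is installed and that neither input table contains NA values of its own; the result names are illustrative.

```r
library(dplyr)

superheroes <- data.frame(
  name      = c("Magneto", "Storm", "Mystique", "Batman",
                "Joker", "Catwoman", "Hellboy"),
  alignment = c("bad", "good", "bad", "good", "bad", "bad", "good"),
  gender    = c("male", "female", "female", "male",
                "male", "female", "male"),
  publisher = c("Marvel", "Marvel", "Marvel", "DC",
                "DC", "DC", "Dark Horse Comics"),
  stringsAsFactors = FALSE
)
publishers <- data.frame(
  publisher  = c("DC", "Marvel", "Image"),
  yr_founded = c(1934L, 1939L, 1992L),
  stringsAsFactors = FALSE
)

everything <- full_join(superheroes, publishers, by = "publisher")

# Missing yr_founded: the key appeared only in superheroes.
only_in_superheroes <- everything$publisher[is.na(everything$yr_founded)]

# Missing name: the key appeared only in publishers.
only_in_publishers <- everything$publisher[is.na(everything$name)]
```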

semi_join(x = superheroes, y = publishers)

name      alignment  gender  publisher
Magneto   bad        male    Marvel
Storm     good       female  Marvel
Mystique  bad        female  Marvel
Batman    good       male    DC
Joker     bad        male    DC
Catwoman  bad        female  DC
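
A semi join filters x without adding any columns from y. For a single key column it behaves like a filter() with %in%, as this sketch suggests (assuming dplyr is installed; kept and kept_filter are illustrative names).

```r
library(dplyr)

superheroes <- data.frame(
  name      = c("Magneto", "Storm", "Mystique", "Batman",
                "Joker", "Catwoman", "Hellboy"),
  alignment = c("bad", "good", "bad", "good", "bad", "bad", "good"),
  gender    = c("male", "female", "female", "male",
                "male", "female", "male"),
  publisher = c("Marvel", "Marvel", "Marvel", "DC",
                "DC", "DC", "Dark Horse Comics"),
  stringsAsFactors = FALSE
)
publishers <- data.frame(
  publisher  = c("DC", "Marvel", "Image"),
  yr_founded = c(1934L, 1939L, 1992L),
  stringsAsFactors = FALSE
)

# Keep superheroes whose publisher appears in publishers.
kept <- semi_join(superheroes, publishers, by = "publisher")

# The same rows, written as an ordinary filter on the key column.
kept_filter <- filter(superheroes, publisher %in% publishers$publisher)
```

Unlike inner_join(), the result has no yr_founded column and never duplicates rows of x, even when y contains repeated keys.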

anti_join(x = superheroes, y = publishers)

name     alignment  gender  publisher
Hellboy  good       male    Dark Horse Comics
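
An anti join is a convenient way to diagnose mismatched keys before running a mutating join. The sketch below runs it in both directions, assuming dplyr is installed; the result names are hypothetical.

```r
library(dplyr)

superheroes <- data.frame(
  name      = c("Magneto", "Storm", "Mystique", "Batman",
                "Joker", "Catwoman", "Hellboy"),
  alignment = c("bad", "good", "bad", "good", "bad", "bad", "good"),
  gender    = c("male", "female", "female", "male",
                "male", "female", "male"),
  publisher = c("Marvel", "Marvel", "Marvel", "DC",
                "DC", "DC", "Dark Horse Comics"),
  stringsAsFactors = FALSE
)
publishers <- data.frame(
  publisher  = c("DC", "Marvel", "Image"),
  yr_founded = c(1934L, 1939L, 1992L),
  stringsAsFactors = FALSE
)

# Superheroes whose publisher is missing from the publishers table.
no_publisher_info <- anti_join(superheroes, publishers, by = "publisher")

# Publishers with no characters in the superheroes table.
no_characters <- anti_join(publishers, superheroes, by = "publisher")
```

Together the two results account for the rows that an inner join drops and that a full join fills with NA.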