--- title: "Data Transformation With dplyr" subtitle: Biostat 203B author: "Dr. Hua Zhou @ UCLA" date: today format: html: theme: cosmo embed-resources: true number-sections: true toc: true toc-depth: 4 toc-location: left code-fold: false knitr: opts_chunk: fig.align: 'center' fig.width: 6 fig.height: 4 message: FALSE cache: false --- ## Preamble Display machine information for reproducibility. ::: {.panel-tabset} #### R ```{r} sessionInfo() ``` #### Python ```{python} import IPython print(IPython.sys_info()) ``` #### Julia ```{julia} using InteractiveUtils versioninfo() ``` ::: Load tidyverse (R), Pandas (Python), and DataFrames.jl (Julia). ::: {.panel-tabset} #### R ```{r} library(tidyverse) ``` #### Python ```{python} # Load the pandas library import pandas as pd # Load numpy for array manipulation import numpy as np ``` #### Julia ```{julia} using DataFrames, Pipe, StatsBase ``` ::: A typical data science project:

## nycflights13 data - Available from the nycflights13 package. - 336,776 flights that departed from New York City in 2013: ::: {.panel-tabset} #### R ```{r} library("nycflights13") flights ``` #### Python The nycflights13 data is available from the nycflights13 package in Python. ```{python} from nycflights13 import flights flights ``` Note there are some differences of this `flights` data from that in tidyverse. The data types for some variables are different. There are no natural ways in Pandas to hold integer column with missing values; so `dep_time` , `arr_time` are `float64` instead of `int64`. ```{python} flights.info() ``` To be more consistent with `nycflights13` in tidyverse, we cast `time_hour` to `datetime` type. ```{python} flights['time_hour'] = pd.to_datetime(flights['time_hour']) ``` #### Julia Let's use RCall.jl to retrieve the nycflights13 data from R. ```{julia} using RCall R""" library(nycflights13) """ flights = rcopy(R"flights") ``` ::: To display more rows or columns: ::: {.panel-tabset} #### R - By default, tibble prints the first 10 rows and all columns _that fit on screen_. - To change number of rows and columns to display: ```{r} nycflights13::flights |> print(n = 10, width = Inf) ``` Here we see the **pipe operator** `|>` pipes the output from previous command to the (first) argument of the next command. - To change the default print setting globally: - `options(tibble.print_max = n, tibble.print_min = m)`: if more than `m` rows, print only `n` rows. - `options(dplyr.print_min = Inf)`: print all row. - `options(tibble.width = Inf)`: print all columns. #### Python - Pandas by default displays 10 rows and limits the number of columns to the display area. - We can override this behavior by ```{python} #| eval: true pd.set_option("display.max_rows", 500) pd.set_option("display.max_columns", 20) ``` #### Julia By default DataFrames.jl limits the number of rows and columns when displaying a data frame in a Jupyter Notebook to 25 and 100, respectively. You can override this behavior by changing the values of the `ENV["DATAFRAMES_COLUMNS"]` and `ENV["DATAFRAMES_ROWS"]` variables to hold the maximum number of columns and rows of the output. All columns or rows will be printed if those numbers are equal or lower than 0. ::: ## dplyr basics * Pick observations (rows) by their values: `filter()`. * Reorder the rows: `arrange()`. * Pick variables (columns) by their names: `select()`. * Create new variables with functions of existing variables: `mutate()`. * Collapse many values down to a single summary: `summarise()`. ``` verb meaning -------------------------------------------- filter() subset observations (or rows) arrange() re-order the observations distinct() remove duplicate entries slice_*() select rows by position sample_*() sample rows -------------------------------------------- select() select variables (or columns) mutate() add new variables (or columns) relocate() move variables (or columns) to new positions rename() rename variables (or columns) -------------------------------------------- group_by() aggregate summarise() reduce to a single row -------------------------------------------- left_join() merge two data objects collect() force computation and bring data back into R ``` ## Manipulate rows (cases) ### Filter rows with `filter()` - Flights on Jan 1st: ::: {.panel-tabset} #### R ```{r} # same as filter(flights, month == 1 & day == 1) filter(flights, month == 1, day == 1) ``` #### Python ```{python} flights[(flights['month'] == 1) & (flights['day'] == 1)] ``` #### Julia ```{julia} filter(row -> (row.month == 1) & (row.day == 1), flights) ``` ::: - Flights in Nov or Dec: ::: {.panel-tabset} #### R ```{r} filter(flights, month == 11 | month == 12) ``` #### Python ```{python} flights[(flights['month'] == 11) | (flights['month'] == 12)] ``` #### Julia ```{julia} filter(row -> (row.month == 11) | (row.month == 12), flights) ``` ::: ### Remove rows with duplicate values - One row from each month: ::: {.panel-tabset} #### R ```{r} distinct(flights, month, .keep_all = TRUE) ``` - With `.keep_all = FALSE`, all variables/columns are kept: ```{r} distinct(flights, month) ``` #### Python ```{python} flights.drop_duplicates(subset = ['month']) ``` #### Julia ```{julia} unique(flights, :month) ``` ::: ### Sample rows ::: {.panel-tabset} #### R - Randomly select `n` rows: ```{r} sample_n(flights, 10, replace = TRUE) ``` - Randomly select fraction of rows: ```{r} sample_frac(flights, 0.1, replace = TRUE) ``` #### Python Sample `n=10` rows. ```{python} flights.sample(n = 10, axis = 0, replace = True) ``` Sample 10\% rows: ```{python} flights.sample(frac = 0.1, replace = True) ``` #### Julia I'm not sure whether there's a native function in DataFrames.jl for sampling. Sample 10 rows: ```{julia} rowidx = StatsBase.sample(1:nrow(flights), 10, replace = true); flights[rowidx, :] ``` Sample 10\% rows: ```{julia} rowidx = StatsBase.sample( 1:nrow(flights), round(Int, nrow(flights) * 0.1), replace = true); flights[rowidx, :] ``` ::: ### Select rows by position ::: {.panel-tabset} #### R - Select rows by position: ```{r} slice(flights, 1:5) ``` - First rows: ```{r} slice_head(flights, n = 5) ``` - Last rows: ```{r} slice_tail(flights, n = 5) ``` - Top `n` rows with the highest values: ```{r} # deprecated: top_n(flights, 5, wt = time_hour) # This function is quick slice_max(flights, n = 5, order_by = time_hour) |> print(width = Inf) ``` - Bottom `n` rows with lowest values: ```{r} # same as slice_max(flights, n = 5, order_by = desc(time_hour)) slice_min(flights, n = 5, order_by = time_hour) |> print(width = Inf) ``` - `slice_*` verbs apply to groups for grouped tibbles. #### Python - Select rows by position: ```{python} flights.iloc[range(0, 5)] ``` - First rows: ```{python} flights.head(5) ``` - Last rows: ```{python} flights.tail(5) ``` - Top `n` rows with the highest values: ```{python} flights.nlargest(n = 5, columns = 'time_hour') ``` - Bottom `n` rows with lowest values: ```{python} flights.nsmallest(n = 5, columns = 'time_hour') ``` I don't think `nlargest` and `nsmallest` apply to grouped DataFrame. But I may be wrong. #### Julia - Select rows by position: ```{julia} flights[1:5, :] ``` - First rows: ```{julia} first(flights, 5) ``` - Last rows: ```{julia} last(flights, 5) ``` - Top `n` rows with the highest values: ```{julia} last(sort(flights, [:time_hour]), 5) ``` - Bottom `n` rows with lowest values: ```{julia} first(sort(flights, [:time_hour]), 5) ``` ::: ### Arrange rows with `arrange()` ::: {.panel-tabset} #### R - Sort in ascending order: ```{r} arrange(flights, year, month, day) ``` Note input order matters! ```{r} arrange(flights, day, month, year) ``` - Sort in descending order: ```{r} arrange(flights, desc(arr_delay)) |> print(width = Inf) ``` - By default, `arrange` ignores grouping in grouped tibbles. Set `.by_group = TRUE` to arrange within each group. ```{r} # What are the worst delays in each month? flights |> group_by(month) |> arrange(desc(arr_delay), .by_group = TRUE) |> print(width = Inf) ``` #### Python - Sort in ascending order: ```{python} flights.sort_values(by = 'arr_delay') ``` - Sort in descending order: ```{python} flights.sort_values( by = 'arr_delay', ascending = False ) ``` - To sort within groups (`month`) ```{python} flights.sort_values( by = ['month', 'arr_delay'], ascending = [True, False] ) ``` #### Julia Sort in ascending order: ```{julia} sort(flights, [:arr_delay]) ``` Sort in descending order: ```{julia} sort(flights, [:arr_delay], rev = true) ``` To sort within groups (`month`): ```{julia} sort(flights, [:month, order(:arr_delay, rev= true)]) ``` ::: ## Manipulate columns (variables) ### Select columns with `select()` - Select columns by variable names: ::: {.panel-tabset} #### R ```{r} select(flights, year, month, day) ``` #### Python ```{python} flights[['year', 'month', 'day']] ``` #### Julia ```{julia} select(flights, [:year, :month, :day]) ``` ::: - Pull values of _one_ column as a vector: ::: {.panel-tabset} #### R Not displayed because the vector is long. ```{r} #| eval: false pull(flights, year) ``` #### Python ```{python} #| eval: false # Following are same flights.year flights.loc[:, 'year'] ``` #### Julia ```{julia} #| eval: false # Return a vector flights.year # Return a vector flights."year" # Return a vector flights[!, :year] # does not make a copy # Return a vector flights[!, "year"] # does not make a copy # Return a vector flights[:, :year] # make a copy! # Return a vector flights[:, "year"] # make a copy! ``` ::: - Select columns between two variables: ::: {.panel-tabset} #### R ```{r} select(flights, year:day) ``` #### Python ```{python} flights.loc[:, 'year':'day'] ``` #### Julia ```{julia} select(flights, Between(:year, :day)) ``` ::: - Select all columns _except_ those between two variables: ::: {.panel-tabset} #### R ```{r} select(flights, -(year:day)) ``` #### Python ```{python} flights.drop(flights.loc[:, 'year':'day'].columns, axis = 1) ``` #### Julia ```{julia} select(flights, Not(Between(:year, :day))) ``` ::: - Select columns by positions: ::: {.panel-tabset} #### R ```{r} select(flights, seq(1, 10, by = 2)) ``` #### Python ```{python} flights.iloc[:, range(0, 9, 2)] ``` #### Julia ```{julia} select(flights, 1:2:10) ``` ::: - Move variables to the start of data frame: ::: {.panel-tabset} #### R ```{r} select(flights, time_hour, air_time, everything()) ``` #### Python (???) Not sure what's the optimal way to do this. ```{python} # Note time_hour is missing in Python dataframe cols_to_move = ['arr_delay', 'air_time'] flights[cols_to_move + [x for x in flights.columns if x not in cols_to_move]] ``` #### Julia ```{julia} select(flights, :time_hour, :air_time, Not([:time_hour, :air_time])) ``` ::: - Helper functions in `dplyr`. * `everying()`: matches all variables. * `last_col()`: select last variable, possibly with an offset. * `starts_with("abc")`: matches names that begin with “abc”. * `ends_with("xyz")`: matches names that end with “xyz”. * `contains("ijk")`: matches names that contain “ijk”. * `matches("(.)\\1")`: selects variables that match a regular expression. * `num_range("x", 1:3)`: matches x1, x2 and x3. * `all_of()`: matches variables names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown. * `any_of()`: same as `all_of()`, but no error is thrown. ### Add new variables with `mutate()` - A tibble with fewer columns. ::: {.panel-tabset} #### R ```{r} flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time) flights_sml ``` #### Python (???) Is there better way? ```{python} import re cols = ['year', 'month', 'day'] + list(filter(re.compile(".*delay").match, flights.columns)) + ['distance', 'air_time'] flights_sml = flights.loc[:, cols] flights_sml ``` #### Julia ```{julia} flights_sml = select(flights, Between(:year, :day), r".*delay$", :distance, :air_time) flights_sml ``` ::: - Add variables `gain` and `speed`: ::: {.panel-tabset} #### R ```{r} mutate( flights_sml, gain = arr_delay - dep_delay, speed = distance / air_time * 60 ) ``` #### Python ```{python} flights_sml['gain'] = flights_sml['arr_delay'] - flights_sml['dep_delay'] flights_sml['speed'] = flights_sml['distance'] / flights_sml['air_time'] * 60 flights_sml ``` #### Julia Julia analog is `transform`: ```{julia} # Following are equivalent transform(flights_sml, [:arr_delay, :dep_delay] => (-) => :gain) insertcols!(flights_sml, :gain => flights.arr_delay - flights.dep_delay) ``` ::: - Refer to columns that you’ve just created: ::: {.panel-tabset} #### R ```{r} mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) ``` #### Python (???) Not sure how to refer to columns in the same command. #### Julia (???) Not sure how to do this, except using two lines. ```{julia} # Following are equivalent @pipe flights |> transform( _ , [:arr_delay, :dep_delay] => (-) => :gain, [:air_time] => (x -> x / 60) => :hours, ) |> transform( _, [:gain, :hours] => ByRow(/) => :gain_per_hour ) ``` ::: - Only keep the new variables by `transmute()`: ```{r} transmute( flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) ``` - `mutate_all()`: apply funs to all columns. ::: {.panel-tabset} #### R ```{r} #| eval: false mutate_all(data, funs(log(.), log2(.))) ``` #### Python (???) TODO #### Julia ```{julia} #| eval: false mapcols(col -> 2col, df) ``` ::: - `mutate_at()`: apply funs to specific columns. ```{r} #| eval: false mutate_at(data, vars(-Species), funs(log(.))) ``` - `mutate_if()`: apply funs of one type ```{r} #| eval: false mutate_if(data, is.numeric, funs(log(.))) ``` ## Summaries ### Summaries with `summarise()` - Mean of a variable: ::: {.panel-tabset} #### R ```{r} summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) ``` #### Python ```{python} flights.agg({'dep_delay': np.mean}) ``` #### Julia ```{julia} combine(flights, :dep_delay => (x -> mean(skipmissing(x))) => :delay) ``` ::: - Convert a tibble into a grouped tibble: ::: {.panel-tabset} #### R ```{r} by_day <- group_by(flights, year, month, day) |> print(width = Inf) ``` #### Python ```{python} by_day = flights.groupby(['year', 'month', 'day']) by_day ``` #### Julia ```{julia} by_day = groupby(flights, [:year, :month, :day]) by_day ``` ::: - Grouped summaries: ```{r} summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)) ``` ### Pipe - Consider following analysis (find destinations excluding `HNL` that have >20 flights, and calculate the average distances and arrival delay): ```{r} #| message: false by_dest <- group_by(flights, dest) delay <- summarise(by_dest, count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) delay <- filter(delay, count > 20, dest != "HNL") delay ``` ---- - Cleaner code using pipe `|>`: ```{r} delays <- flights |> group_by(dest) |> summarise( count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) |> filter(count > 20, dest != "HNL") delays ``` - ggplot2 accepts pipe too. ```{r} delays |> ggplot(mapping = aes(x = dist, y = delay)) + geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE) + labs(x = "Distance from NYC (miles)", y = "Arrival delay (mins)") ``` ### Other summary functions - Location: `mean(x)`, `median(x)`. ::: {.panel-tabset} #### R ```{r} # Equivalent code using filter # not_cancelled <- flights |> # filter(!is.na(dep_delay), !is.na(arr_delay)) |> # print(width = Inf) not_cancelled <- flights |> drop_na(dep_delay, arr_delay) |> print(width = Inf) ``` ```{r} not_cancelled |> group_by(year, month, day) |> summarise( avg_delay1 = mean(arr_delay), avg_delay2 = mean(arr_delay[arr_delay > 0]), # the average positive delay ) ``` Question: why is the `day` group dropped? #### Python ```{python} not_cancelled = flights.dropna(subset = ['dep_delay', 'arr_delay']) not_cancelled ``` ```{python} flights.groupby(['year', 'month', 'day']).agg( avg_delay1 = ('arr_delay', np.mean), avg_delay2 = ('arr_delay', lambda x: np.mean(x[x > 0])) ) ``` #### Julia ```{julia} not_cancelled = dropmissing(flights, [:dep_delay, :arr_delay]) not_cancelled ``` ```{julia} @pipe not_cancelled |> groupby(_, [:year, :month, :day]) |> combine( _, :arr_delay => (x -> [(mean(x), mean(skipmissing(x[x .>= 0])))]) => [:avg_delay1, :avg_delay2] ) ``` ::: - Spread: `sd(x)`, `IQR(x)`, `mad(x)`. ::: {.panel-tabset} #### R ```{r} # destinations with largest variation in distance not_cancelled |> group_by(dest) |> summarise(distance_sd = sd(distance)) |> arrange(desc(distance_sd)) ``` #### Python ```{python} flights.groupby(['dest']).agg( distance_sd = ('distance', np.std) ).sort_values('distance_sd', ascending = False) ``` #### Julia ```{julia} @pipe flights |> groupby(_, :dest) |> combine(_, :distance => std => :distance_sd) |> sort(_, :distance_sd, rev = true) ``` ::: - Rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`. ::: {.panel-tabset} #### R ```{r} # Earliest and latest flights on each day? not_cancelled |> group_by(year, month, day) |> summarise( first = min(dep_time), last = max(dep_time) ) ``` #### Python ```{python} not_cancelled.groupby(['year', 'month', 'day']).agg( first = ('dep_time', np.min), last = ('dep_time', np.max) ) ``` #### Julia ```{julia} @pipe not_cancelled |> groupby(_, [:year, :month, :day]) |> combine(_, :dep_time => (x -> [extrema(x)]) => [:first, :last]) ``` ::: - Position: `first(x)`, `nth(x, 2)`, `last(x)`. Note unless the variable is sorted, `first` is different from `min` and `last` is different from `max`. ::: {.panel-tabset} #### R ```{r} not_cancelled |> group_by(year, month, day) |> summarise( first_dep = first(dep_time), last_dep = last(dep_time) ) ``` #### Python ```{python} not_cancelled.groupby(['year', 'month', 'day']).agg( first_dep = ('dep_time', lambda x: x.iloc[0]), last_dep = ('dep_time', lambda x: x.iloc[-1]), ) ``` #### Julia ```{julia} @pipe not_cancelled |> groupby(_, [:year, :month, :day]) |> combine( _, :dep_time => first => :first_dep, :dep_time => last => :last_dep ) ``` ::: - Count: `n(x)`, `sum(!is.na(x))`, `n_distinct(x)`. ::: {.panel-tabset} #### R ```{r} # Which destinations have the most carriers? not_cancelled |> group_by(dest) |> summarise(carriers = n_distinct(carrier)) |> arrange(desc(carriers)) ``` Similarly ```{r} # which destination has most flights from NYC? not_cancelled |> count(dest) |> arrange(desc(n)) ``` #### Python ```{python} not_cancelled.groupby('dest').agg( carriers = ('carrier', lambda x: x.nunique(dropna = True)) ).sort_values('carriers', ascending = False) ``` #### Julia ```{julia} @pipe not_cancelled |> groupby(_, :dest) |> combine(_, :carrier => length ∘ unique => :carriers) |> sort(_, :carriers, rev = true) ``` ::: - Example: which aircraft flew most (in distance) in 2013? ::: {.panel-tabset} #### R ```{r} not_cancelled |> count(tailnum, wt = distance) |> arrange(desc(n)) ``` #### Python ```{python} not_cancelled.groupby('tailnum').agg( total_distance = ('distance', sum) ).sort_values('total_distance', ascending = False) ``` #### Julia ```{julia} @pipe not_cancelled |> groupby(_, :tailnum) |> combine(_, :distance => sum ∘ skipmissing => :total_distance) |> sort(_, :total_distance, rev = true) ``` ::: - Example: How many flights left before 5am? (these usually indicate delayed flights from the previous day) ::: {.panel-tabset} #### R ```{r} not_cancelled |> group_by(year, month, day) |> summarise(n_early = sum(dep_time < 500)) |> arrange(desc(n_early)) ``` #### Python ```{python} not_cancelled.groupby(['year', 'month', 'day']).agg( n_early = ('dep_time', lambda x: sum(x < 500)) ).sort_values('n_early', ascending = False) ``` #### Julia ```{julia} @pipe not_cancelled |> groupby(_, [:year, :month, :day]) |> combine(_, :dep_time => (x -> sum(skipmissing(x .< 500))) => :n_early) |> sort(_, :n_early, rev = true) ``` ::: - Example: What proportion of flights are delayed by more than an hour? ::: {.panel-tabset} #### R ```{r} not_cancelled |> group_by(year, month, day) |> summarise(hour_perc = mean(arr_delay > 60)) |> arrange(desc(hour_perc)) ``` #### Python ```{python} not_cancelled.groupby(['year', 'month', 'day']).agg( hour_perc = ('arr_delay', lambda x: np.mean(x > 60)) ).sort_values('hour_perc', ascending = False) ``` #### Julia ```{julia} @pipe not_cancelled |> groupby(_, [:year, :month, :day]) |> combine(_, :arr_delay => (x -> mean(skipmissing(x .> 60))) => :hour_perc) |> sort(_, :hour_perc, rev = true) ``` ::: ## Grouped mutates (and filters) - Recall the `flights_sml` tibble created earlier: ::: {.panel-tabset} #### R ```{r} flights_sml ``` #### Python ```{python} flights_sml ``` #### Julia ```{julia} flights_sml ``` ::: - Find the worst members of each group: ::: {.panel-tabset} #### R ```{r} flights_sml |> group_by(year, month, day) |> filter(rank(desc(arr_delay)) < 10) ``` #### Python ```{python} flights_sml.groupby( ['year', 'month', 'day'] )['arr_delay'].nlargest( n = 10 ) ``` #### Julia ```{julia} @pipe flights_sml |> dropmissing(_, :arr_delay) |> groupby(_, [:year, :month, :day]) |> combine( _, :arr_delay => (x -> x[x .>= partialsort(x, 10, rev = true)]) ) ``` ::: - Find all groups bigger than a threshold: ::: {.panel-tabset} #### R ```{r} popular_dests <- flights |> group_by(dest) |> filter(n() > 365) |> print(width = Inf) ``` #### Python ```{python} popular_dests = flights.groupby('dest').filter(lambda x: len(x) > 365) popular_dests ``` #### Julia ```{julia} popular_dests = @pipe flights |> groupby(_, :dest) |> combine(_) do sdf nrow(sdf) > 365 ? sdf : DataFrame() end popular_dests ``` ::: - Standardise to compute per group metrics: ::: {.panel-tabset} #### R ```{r} popular_dests <- popular_dests |> filter(arr_delay > 0) |> mutate(prop_delay = arr_delay / sum(arr_delay)) |> select(year:day, dest, arr_delay, prop_delay) |> print(width = Inf) ``` #### Python ```{python} popular_dests[popular_dests['arr_delay'] > 0].groupby( 'dest' ).apply( lambda x: x['arr_delay'] / x['arr_delay'].sum() ) ``` #### Julia ```{julia} @pipe popular_dests |> dropmissing(_, :arr_delay) |> subset(_, :arr_delay => x -> x .> 0 ) |> groupby(_, :dest) |> combine(_, :arr_delay => (x -> x ./ sum(x)) => :prop_delay) ``` ::: ## Combine tables nycflights13 package has >1 tables: - We already know a lot about flights: ::: {.panel-tabset} #### R ```{r} flights |> print(width = Inf) ``` #### Python ```{python} flights ``` #### Julia ```{julia} flights ``` ::: - airlines: ::: {.panel-tabset} #### R ```{r} airlines ``` #### Python ```{python} from nycflights13 import airlines airlines ``` #### Julia ```{julia} airlines = rcopy(R"airlines") ``` ::: - airports: ::: {.panel-tabset} #### R ```{r} airports ``` #### Python ```{python} from nycflights13 import airports airports ``` #### Julia ```{julia} airports = rcopy(R"airports") ``` ::: - planes: ::: {.panel-tabset} #### R ```{r} planes ``` #### Python ```{python} from nycflights13 import planes planes ``` #### Julia ```{julia} planes = rcopy(R"planes") ``` ::: - Weather: ::: {.panel-tabset} #### R ```{r} weather |> print(width = Inf) ``` #### Python ```{python} from nycflights13 import weather weather ``` #### Julia ```{julia} weather = rcopy(R"weather") ``` ::: ## Relational data

For the MIMIC-III data, the relation structure can be explored at . ### Keys - A **primary key** uniquely identifies an observation in its own table. - A **foreign key** uniquely identifies an observation in another table. ## Combine variables (columns) ### Demo tables ::: {.panel-tabset} #### R ```{r} (x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 3, "x3" )) ``` ```{r} (y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 4, "y3" )) ``` #### Python ```{python} x = pd.DataFrame({ 'key': [1, 2, 4], 'val_x': ['x1', 'x2', 'x3'] }) x ``` ```{python} y = pd.DataFrame({ 'key': [1, 2, 3], 'val_y': ['y1', 'y2', 'y3'] }) x ``` #### Julia ```{julia} x = DataFrame( key = 1:3, val_x = ["x1", "x2", "x3"] ) y = DataFrame( key = [1, 2, 4], val_y = ["y1", "y2", "y3"] ) ``` ::: ### Inner join - An **inner join** matches pairs of observations whenever their keys are equal:

::: {.panel-tabset} #### R ```{r} inner_join(x, y, by = "key") ``` Same as ```{r} #| eval: false x |> inner_join(y, by = "key") ``` #### Python ```{python} x.join(y.set_index('key'), on = 'key', how = 'inner') ``` #### Julia ```{julia} innerjoin(x, y, on = :key) ``` ::: ### Outer join - An **outer join** keeps observations that appear in at least one of the tables. - Three types of outer joins: left join, right join, and full join. - A **left join** keeps all observations in `x`. ::: {.panel-tabset} #### R ```{r} left_join(x, y, by = "key") ``` #### Python ```{python} x.join(y.set_index('key'), on = 'key', how = 'left') ``` #### Julia ```{julia} leftjoin(x, y, on = :key) ``` ::: - A **right join** keeps all observations in `y`. ::: {.panel-tabset} #### R ```{r} right_join(x, y, by = "key") ``` #### Python ```{python} x.join(y.set_index('key'), on = 'key', how = 'right') ``` #### Julia ```{julia} rightjoin(x, y, on = :key) ``` ::: - A **full join** keeps all observations in `x` or `y`. ::: {.panel-tabset} #### R ```{r} full_join(x, y, by = "key") ``` #### Python ```{python} x.join(y.set_index('key'), on = 'key', how = 'outer') ``` #### Julia ```{julia} outerjoin(x, y, on = :key) ``` :::

### Duplicate keys - One table has duplicate keys.

::: {.panel-tabset} #### R ```{r} x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 2, "x3", 1, "x4" ) x y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2" ) y left_join(x, y, by = "key") ``` #### Python ```{python} x = pd.DataFrame({ 'key': [1, 2, 2, 1], 'val_x': ["x1", "x2", "x3", "x4"] }) x y = pd.DataFrame({ 'key': [1, 2], 'val_y': ["y1", "y2"] }) y x.join(y.set_index('key'), on = 'key', how = 'left') ``` #### Julia ```{julia} x = DataFrame( key = [1, 2, 2, 1], val_x = ["x1", "x2", "x3", "x4"] ) y = DataFrame( key = [1, 2], val_y = ["y1", "y2"] ) leftjoin(x, y, on = :key) ``` ::: - Both tables have duplicate keys. You get all possible combinations, the Cartesian product:

::: {.panel-tabset} #### R ```{r} x <- tribble( ~key, ~val_x, 1, "x1", 2, "x2", 2, "x3", 3, "x4" ) y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 2, "y3", 3, "y4" ) left_join(x, y, by = "key") ``` #### Python ```{python} x = pd.DataFrame({ 'key': [1, 2, 2, 3], 'val_x': ["x1", "x2", "x3", "x4"] }) x y = pd.DataFrame({ 'key': [1, 2, 2, 3], 'val_y': ["y1", "y2", "y3", "y4"] }) y x.join(y.set_index('key'), on = 'key', how = 'left') ``` #### Julia ```{julia} x = DataFrame( key = [1, 2, 2, 3], val_x = ["x1", "x2", "x3", "x4"] ) y = DataFrame( key = [1, 2, 2, 3], val_y = ["y1", "y2", "y3", "y4"] ) leftjoin(x, y, on = :key) ``` ::: - Let's create a narrower table from the flights data: ::: {.panel-tabset} #### R ```{r} flights2 <- flights |> select(year:day, hour, origin, dest, tailnum, carrier) |> print(width = Inf) ``` #### Python ```{python} flights2 = flights[['year', 'month', 'day', 'hour', 'origin', 'dest', 'tailnum', 'carrier']] flights2 ``` #### Julia ```{julia} flights2 = select( flights, Between(:year, :day), :hour, :origin, :dest, :tailnum, :carrier ) ``` ::: - We want to merge with the `weather` table: ::: {.panel-tabset} #### R ```{r} weather ``` #### Python ```{python} weather ``` #### Julia ```{julia} weather ``` ::: ### Defining the key columns ::: {.panel-tabset} #### R - `by = NULL` (default): use all variables that appear in both tables: ```{r} # same as: flights2 |> left_join(weather) left_join(flights2, weather) ``` - `by = "x"`: use the common variable `x`: ```{r} # same as: flights2 |> left_join(weather) left_join(flights2, planes, by = "tailnum") ``` - `by = c("a" = "b")`: match variable `a` in table `x` to the variable `b` in table `y`. ```{r} # same as: flights2 |> left_join(weather) left_join(flights2, airports, by = c("dest" = "faa")) ``` #### Python - Match multiple keys using multi-index: ```{python} keys = ['origin', 'year', 'month', 'day', 'hour'] flights2.join( weather.set_index(keys), on = keys, how = 'left') ``` - Match the common variable `tailnum`: ```{python} flights2.join( planes.set_index('tailnum'), on = 'tailnum', how = 'left', lsuffix = '_x', rsuffix = '_y' ) ``` - Match variable `a` in table `x` to the variable `b` in table `y`. ```{python} flights2.set_index( 'dest' ).join( airports.set_index('faa'), how = 'left' ) ``` #### Julia - Match multiple variables: ```{julia} leftjoin( flights2, weather, on = [:year, :month, :day, :hour, :origin] ) ``` - Match the common variable `tailnum`: ```{julia} leftjoin( flights2, planes, on = :tailnum, makeunique = true, matchmissing = :notequal ) ``` - Match variable `a` in table `x` to the variable `b` in table `y`. ```{julia} leftjoin( flights2, airports, on = :dest => :faa ) ``` ::: ## Combine cases (rows) - Top 10 most popular destinations: ::: {.panel-tabset} #### R ```{r} top_dest <- flights |> count(dest, sort = TRUE) |> head(10) |> print() ``` #### Python ```{python} top_dest = flights.groupby('dest')['dest'].count( ).to_frame( name = 'n' ).reset_index( ).sort_values( 'n', ascending = False ).head(10) top_dest ``` #### Julia ```{julia} top_dest = @pipe flights |> groupby(_, :dest) |> combine(_, nrow) |> sort(_, :nrow, rev = true) |> first(_, 10) ``` ::: - How to filter the cases that fly to these destinations? ### Semi-join - `semi_join(x, y)` keeps the rows in `x` that have a match in `y`.

::: {.panel-tabset} #### R ```{r} semi_join(flights, top_dest) ``` #### Python ```{python} flights.loc[flights['dest'].isin(top_dest['dest'])] ``` #### Julia ```{julia} semijoin(flights, top_dest, on = :dest) ``` ::: ### Anti-join - `anti_join(x, y)` keeps the rows that don’t have a match.

- Useful to see what will not be joined. ::: {.panel-tabset} #### R ```{r} # Planes that are not in planes table flights |> anti_join(planes, by = "tailnum") |> count(tailnum, sort = TRUE) ``` #### Python ```{python} flights.loc[-flights['tailnum'].isin(planes['tailnum'])].groupby('tailnum')['tailnum'].count().sort_values(ascending = False) ``` #### Julia ```{julia} @pipe antijoin( flights, planes, on = :tailnum, matchmissing = :notequal ) |> groupby(_, :tailnum) |> combine(_, nrow) |> sort(_, :nrow, rev = true) ``` ::: ## Set operations - Generate two tables: ::: {.panel-tabset} #### R ```{r} (df1 <- tribble( ~x, ~y, 1, 1, 2, 1 )) ``` ```{r} (df2 <- tribble( ~x, ~y, 1, 1, 1, 2 )) ``` #### Python ```{python} df1 = pd.DataFrame({ 'x': [1, 2], 'y': [1, 1] }) df1 df2 = pd.DataFrame({ 'x': [1, 1], 'y': [1, 2] }) df2 ``` #### Julia ```{julia} df1 = DataFrame( x = [1, 2], y = [1, 1] ) df2 = DataFrame( x = [1, 1], y = [1, 2] ) ``` ::: - `bind_rows(x, y)` stacks table `x` one on top of `y`. ::: {.panel-tabset} #### R ```{r} bind_rows(df1, df2) ``` #### Python ```{python} pd.concat([df1, df2], axis = 0) ``` #### Julia ```{julia} vcat(df1, df2) ``` ::: - `intersect(x, y)` returns rows that appear in both `x` and `y`. ::: {.panel-tabset} #### R ```{r} intersect(df1, df2) ``` #### Python ```{python} pd.merge(df1, df2, how = 'inner', on = ['x', 'y']) ``` #### Julia ```{julia} DataFrame(intersect(eachrow(df1), eachrow(df2))) ``` ::: - `union(x, y)` returns unique observations in `x` and `y`. ::: {.panel-tabset} #### R ```{r} union(df1, df2) ``` #### Python ```{python} pd.merge(df1, df2, how = 'outer', on = ['x', 'y']) ``` #### Julia ```{julia} DataFrame(union(eachrow(df1), eachrow(df2))) ``` ::: - `setdiff(x, y)` returns rows that appear in `x` but not in `y`. ::: {.panel-tabset} #### R ```{r} setdiff(df1, df2) ``` ```{r} setdiff(df2, df1) ``` #### Python (???) Not sure how to do this elegantly. #### Julia ```{julia} DataFrame(setdiff(eachrow(df1), eachrow(df2))) DataFrame(setdiff(eachrow(df2), eachrow(df1))) ``` ::: ## Cheat sheet [Posit dplyr cheat sheet](https://rstudio.github.io/cheatsheets/html/data-transformation.html) is extremely helpful.