
Users of the r2dii.match package reported that their R session crashed when they fed match_name() with big data. A recent post acknowledged the issue and promised examples of how to handle big data. This article shows one approach: feed match_name() with a sequence of small chunks of the loanbook dataset.

Setup

This example uses r2dii.match plus a few optional but convenient packages, including r2dii.data for example datasets.

# Packages
library(dplyr, warn.conflicts = FALSE)
library(fs)
library(vroom)
library(r2dii.data)
library(r2dii.match)

# Example datasets from the r2dii.data package
loanbook <- loanbook_demo
abcd <- abcd_demo

If the entire loanbook is too large, feed match_name() with smaller chunks, so that any call to match_name(this_chunk, abcd) fits in memory. More chunks take longer to run but use less memory; you’ll need to experiment to find the number of chunks that works best for you.
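For a starting point before experimenting, one heuristic (an assumption of this article, not a feature of r2dii.match) is to derive the chunk count from the loanbook’s in-memory size, targeting roughly 50 MB per chunk; n_chunks() is a name made up here:

```r
# Sketch: pick a chunk count from the data's in-memory size.
# `target_mb` is the rough memory budget per chunk -- tune it for your machine.
n_chunks <- function(data, target_mb = 50) {
  size_mb <- as.numeric(object.size(data)) / 1024^2
  max(1L, as.integer(ceiling(size_mb / target_mb)))
}

n_chunks(mtcars)
#> [1] 1
```

A small dataset like mtcars needs a single chunk; a loanbook weighing hundreds of megabytes would yield proportionally more.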

Say you try three chunks. You can take the loanbook dataset and then use mutate() to add the new column chunk, which assigns each row to one of the chunks:

chunks <- 3
chunked <- loanbook %>% mutate(chunk = as.integer(cut(row_number(), chunks)))

The total number of rows in the entire loanbook equals the sum of the rows across chunks.

count(loanbook)
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1   321

count(chunked, chunk)
#> # A tibble: 3 × 2
#>   chunk     n
#>   <int> <int>
#> 1     1   107
#> 2     2   107
#> 3     3   107

For each chunk you need to repeat this process:

  1. Match this chunk against the entire abcd dataset.
  2. If this chunk matched nothing, move to the next chunk.
  3. Else, save the result to a .csv file.

# This "output" directory is temporary; you may use any folder on your computer
out <- path(tempdir(), "output")
if (!dir_exists(out)) dir_create(out)

for (i in unique(chunked$chunk)) {
  # 1. Match this chunk against the entire `abcd` dataset.
  this_chunk <- filter(chunked, chunk == i)
  this_result <- match_name(this_chunk, abcd)
  
  # 2. If this chunk matched nothing, move to the next chunk
  matched_nothing <- nrow(this_result) == 0L
  if (matched_nothing) next
  
  # 3. Else, save the result to a .csv file.
  vroom_write(this_result, path(out, paste0(i, ".csv")))
}
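If you prefer to avoid the explicit for loop, the same three steps can be wrapped in a function and driven with purrr::walk(). This is a sketch; match_chunks() is a name made up for this article, not part of any package:

```r
library(dplyr, warn.conflicts = FALSE)
library(fs)
library(purrr)
library(vroom)
library(r2dii.match)

match_chunks <- function(chunked, abcd, out) {
  walk(unique(chunked$chunk), function(i) {
    # 1. Match this chunk against the entire `abcd` dataset
    this_result <- match_name(filter(chunked, chunk == i), abcd)
    # 2.-3. Write a .csv only if this chunk matched something
    if (nrow(this_result) > 0L) {
      vroom_write(this_result, path(out, paste0(i, ".csv")))
    }
  })
}

match_chunks(chunked, abcd, out)
```

walk() is used instead of map() because the loop is run for its side effect (writing files), not for a return value.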

The result is one .csv file per chunk.

dir_ls(out)
#> /tmp/RtmpCxJBe4/output/1.csv /tmp/RtmpCxJBe4/output/2.csv 
#> /tmp/RtmpCxJBe4/output/3.csv

You can read and combine all files in one step with vroom().

matched <- vroom(dir_ls(out))
#> Rows: 410 Columns: 29
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (20): id_loan, id_direct_loantaker, name_direct_loantaker, id_intermedia...
#> dbl  (5): loan_size_outstanding, loan_size_credit_limit, sector_classificati...
#> lgl  (4): name_project, lei_direct_loantaker, isin_direct_loantaker, borderline
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
matched
#> # A tibble: 410 × 29
#>    id_loan id_direct_l…¹ name_…² id_in…³ name_…⁴ id_ul…⁵ name_…⁶ loan_…⁷ loan_…⁸
#>    <chr>   <chr>         <chr>   <chr>   <chr>   <chr>   <chr>     <dbl> <chr>  
#>  1 L1      C294          Yuamen… NA      NA      UP15    Alpine…  225625 EUR    
#>  2 L3      C292          Yuama … IP5     Yuama … UP288   Univer…  410297 EUR    
#>  3 L3      C292          Yuama … IP5     Yuama … UP288   Univer…  410297 EUR    
#>  4 L5      C305          Yukon … NA      NA      UP104   Garlan…  406585 EUR    
#>  5 L5      C305          Yukon … NA      NA      UP104   Garlan…  406585 EUR    
#>  6 L6      C304          Yukon … NA      NA      UP83    Earthp…  185721 EUR    
#>  7 L6      C304          Yukon … NA      NA      UP83    Earthp…  185721 EUR    
#>  8 L8      C303          Yueyan… NA      NA      UP163   Kraftw…  291513 EUR    
#>  9 L9      C301          Yuedxi… IP10    Yuedxi… UP138   Jai Bh…  407513 EUR    
#> 10 L10     C302          Yuexi … NA      NA      UP32    Bhagwa…  186649 EUR    
#> # … with 400 more rows, 20 more variables: loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_input_type <chr>,
#> #   sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> #   flag_project_finance_loan <chr>, name_project <lgl>,
#> #   lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>, chunk <dbl>,
#> #   id_2dii <chr>, level <chr>, sector <chr>, sector_abcd <chr>, name <chr>, …

The matched result should be similar to that of match_name(loanbook, abcd). Your next steps are documented in the Home and Get started sections of the package website.
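As a quick sanity check (a sketch, assuming the demo datasets from r2dii.data), you can compare the shape of the combined result against a direct call to match_name(). Compare dimensions rather than values, because the .csv round trip may change some column types:

```r
library(dplyr, warn.conflicts = FALSE)
library(r2dii.data)
library(r2dii.match)

direct <- match_name(loanbook_demo, abcd_demo)

# `matched` carries the extra `chunk` column added before matching
nrow(direct) == nrow(matched)
ncol(direct) == ncol(matched) - 1L
```

If the counts disagree, a likely culprit is a chunk whose result was silently skipped or a leftover file in the output directory from a previous run.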

Anecdote

I tested match_name() with datasets whose size on disk (as .csv files) was 20 MB for the loanbook dataset and 100 MB for the abcd dataset. Feeding match_name() with the entire loanbook crashed my R session, but feeding it a sequence of 30 chunks ran successfully in about 25 minutes; the combined result had over 10 million rows:

  sector                      data
1 automotive      [2,644,628 × 15]
2 aviation          [377,200 × 15]
3 cement            [942,526 × 15]
4 oil and gas     [1,551,805 × 15]
5 power           [7,353,772 × 15]
6 shipping        [4,194,067 × 15]
7 steel                  [15 × 15]