
match_name() scores the match between names in a loanbook dataset (columns name_direct_loantaker, name_intermediate_parent* and name_ultimate_parent) and names in an asset-based company dataset (column name_company). The raw names are first transformed internally and assigned aliases. The similarity between the loanbook aliases and the abcd aliases is then scored using stringdist::stringsim().

Usage

match_name(
  loanbook,
  abcd,
  by_sector = TRUE,
  min_score = 0.8,
  method = "jw",
  p = 0.1,
  overwrite = NULL,
  ald = deprecated(),
  ...
)

Arguments

loanbook, abcd

data frames structured like r2dii.data::loanbook_demo and r2dii.data::abcd_demo.

by_sector

Should names only be compared if companies belong to the same sector?

min_score

A number between 0 and 1 that sets the minimum score threshold. A score of 1 is a perfect match.

method

Method for distance calculation. One of c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). See stringdist::stringdist-metrics.

p

Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p = 0, the Jaro distance is returned; match_name() defaults to p = 0.1. Applies only to method = "jw".
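To make the role of p concrete, here is a minimal base-R sketch of Jaro-Winkler similarity. It is not the code that stringdist runs, and the helper names jaro_similarity() and jaro_winkler_similarity() are hypothetical. It assumes the common convention of rewarding a shared prefix of at most four characters.

```r
# Illustrative sketch only; stringdist::stringsim() is what match_name() uses.
jaro_similarity <- function(a, b) {
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  nx <- length(x)
  ny <- length(y)
  if (nx == 0 && ny == 0) return(1)
  if (nx == 0 || ny == 0) return(0)

  # Characters match only if they are equal and close enough in position.
  window <- max(max(nx, ny) %/% 2 - 1, 0)
  x_matched <- logical(nx)
  y_matched <- logical(ny)
  for (i in seq_len(nx)) {
    lo <- max(1, i - window)
    hi <- min(ny, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_matched[j] && x[i] == y[j]) {
        x_matched[i] <- TRUE
        y_matched[j] <- TRUE
        break
      }
    }
  }

  m <- sum(x_matched)
  if (m == 0) return(0)

  # Matched characters that appear in a different order count as half a
  # transposition each.
  transpositions <- sum(x[x_matched] != y[y_matched]) / 2
  (m / nx + m / ny + (m - transpositions) / m) / 3
}

jaro_winkler_similarity <- function(a, b, p = 0.1) {
  stopifnot(p >= 0, p <= 0.25)
  sim <- jaro_similarity(a, b)

  # Length of the common prefix, capped at four characters (assumption).
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  n <- min(length(x), length(y), 4)
  same <- x[seq_len(n)] == y[seq_len(n)]
  prefix <- if (all(same)) n else which(!same)[1] - 1

  # The prefix bonus closes part of the gap between the Jaro score and 1.
  sim + prefix * p * (1 - sim)
}

jaro_winkler_similarity("MARTHA", "MARHTA", p = 0)    # plain Jaro, about 0.944
jaro_winkler_similarity("MARTHA", "MARHTA", p = 0.1)  # about 0.961
```

With p = 0 the prefix term vanishes and the plain Jaro score is returned. Larger values of p give more weight to names that start with the same characters.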

overwrite

A data frame used to overwrite the sector and/or name columns of a particular direct loantaker or ultimate parent. To overwrite only sector, the value in the name column should be NA, and vice versa. This data frame can be used to manually match loanbook companies to abcd.
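As a rough sketch, an overwrite data frame might look like the one below. The column names are assumed to mirror r2dii.data::overwrite_demo, and the ids, names and source label are hypothetical values chosen only for illustration.

```r
# Assumed layout (modelled on r2dii.data::overwrite_demo); values are made up.
overwrite <- data.frame(
  level = c("direct_loantaker", "ultimate_parent"),
  id_2dii = c("DL294", "UP15"),
  # NA in `name` means: overwrite only the sector for this company.
  name = c(NA, "alpine knits india pvt. limited"),
  # NA in `sector` means: overwrite only the name for this company.
  sector = c("power", NA),
  source = "manual",
  stringsAsFactors = FALSE
)

# The data frame would then be passed on, for example:
# match_name(loanbook, abcd, overwrite = overwrite)
```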

ald

[Superseded] ald has been superseded by abcd.

...

Arguments passed on to stringdist::stringsim().

Value

A data frame with the same groups (if any) and columns as loanbook, and the additional columns:

  • id_2dii - an id used internally by match_name() to distinguish companies

  • level - the level of granularity that the loan was matched at (e.g. direct_loantaker or ultimate_parent)

  • sector - the sector of the loanbook company

  • sector_abcd - the sector of the abcd company

  • name - the name of the loanbook company

  • name_abcd - the name of the abcd company

  • score - the score of the match (manually set this to 1 prior to calling prioritize() to validate the match)

  • source - the source of the match (equal to loanbook unless the match is from overwrite)

The returned rows depend on the argument min_score and on the value of the column score for each loan:

  • If any row has score equal to 1, match_name() returns all rows where score equals 1, dropping all other rows.

  • If no row has score equal to 1, match_name() returns all rows where score is equal to or greater than min_score.

  • If there is no match, the output is a 0-row tibble with the expected column names, for type stability.
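The row-selection rules above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: filter_matches() is a hypothetical helper, the scores are made up, and the rule is applied within each id_loan as the description suggests.

```r
# Made-up candidate matches for three loans; not real match_name() output.
candidates <- data.frame(
  id_loan = c("L1", "L1", "L2", "L2", "L3"),
  name_abcd = c("alpha", "alpha power", "beta", "beta energy", "gamma"),
  score = c(1, 0.85, 0.92, 0.81, 0.5),
  stringsAsFactors = FALSE
)

filter_matches <- function(candidates, min_score = 0.8) {
  is_perfect <- candidates$score == 1
  # Flag every row of a loan that has at least one perfect match.
  loan_has_perfect <-
    ave(as.numeric(is_perfect), candidates$id_loan, FUN = max) == 1
  # Loans with a perfect match keep only perfect rows; the others keep
  # every row at or above the threshold.
  keep <- ifelse(loan_has_perfect, is_perfect, candidates$score >= min_score)
  candidates[keep, , drop = FALSE]
}

kept <- filter_matches(candidates)
kept$score
#> [1] 1.00 0.92 0.81
```

Loan L1 keeps only its perfect match, L2 keeps both rows above the threshold, and L3 is dropped. When nothing passes, the result is a 0-row data frame that still has all columns.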

Package options

r2dii.match.sector_classifications: Allows you to use your own sector_classifications instead of the default. This feature is experimental and may be dropped and/or become a new argument to match_name().

Assigning aliases

The transformation process used to compare names between loanbook and abcd datasets applies best practices commonly used in name matching algorithms:

  • Remove special characters.

  • Replace language-specific characters.

  • Abbreviate certain names to reduce their importance in the matching.

  • Spell out numbers to increase their importance.
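The steps listed above can be sketched in base R. The function sketch_alias() is hypothetical and deliberately tiny: the character replacements and abbreviations it applies are an assumed subset chosen for illustration, and the real transformation in the package may differ in both rules and order.

```r
# Hypothetical illustration of alias creation; not the package's own rules.
sketch_alias <- function(x) {
  x <- tolower(x)

  # Replace language-specific characters (tiny assumed subset).
  x <- gsub("\u00fc", "ue", x, fixed = TRUE)
  x <- gsub("\u00e9", "e", x, fixed = TRUE)

  # Spell out digits so that numbers carry more weight in the comparison.
  digits <- c("zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine")
  for (d in 0:9) {
    x <- gsub(as.character(d), digits[d + 1], x, fixed = TRUE)
  }

  # Abbreviate common words so that they carry less weight.
  x <- gsub("\\blimited\\b", "ltd", x, perl = TRUE)
  x <- gsub("\\bcompany\\b", "co", x, perl = TRUE)

  # Remove special characters, including spaces and punctuation.
  gsub("[^a-z0-9]", "", x, perl = TRUE)
}

sketch_alias("Alpine Knits India Pvt. Limited")
#> [1] "alpineknitsindiapvtltd"
sketch_alias("3M Company")
#> [1] "threemco"
```

Two raw names that differ only in case, punctuation or spacing collapse to the same alias, which is what lets the similarity score focus on the distinctive part of each name.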

Handling grouped data

This function ignores but preserves existing groups.

See also

Other main functions: prioritize()

Examples

library(r2dii.data)
library(tibble)

# Small data for examples
loanbook <- head(loanbook_demo, 50)
abcd <- head(abcd_demo, 50)

match_name(loanbook, abcd)
#> # A tibble: 2 × 28
#>   id_loan id_direct_lo…¹ name_…² id_in…³ name_…⁴ id_ul…⁵ name_…⁶ loan_…⁷ loan_…⁸
#>   <chr>   <chr>          <chr>   <chr>   <chr>   <chr>   <chr>     <dbl> <chr>  
#> 1 L14     C296           Yuasfn… NA      NA      UP3     Affini…  187577 EUR    
#> 2 L15     C295           Yuanbs… NA      NA      UP196   Noshir…  192217 EUR    
#> # … with 19 more variables: loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_input_type <chr>,
#> #   sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> #   flag_project_finance_loan <chr>, name_project <lgl>,
#> #   lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>, id_2dii <chr>,
#> #   level <chr>, sector <chr>, sector_abcd <chr>, name <chr>, …

match_name(loanbook, abcd, min_score = 0.9)
#> # A tibble: 1 × 28
#>   id_loan id_direct_lo…¹ name_…² id_in…³ name_…⁴ id_ul…⁵ name_…⁶ loan_…⁷ loan_…⁸
#>   <chr>   <chr>          <chr>   <chr>   <chr>   <chr>   <chr>     <dbl> <chr>  
#> 1 L14     C296           Yuasfn… NA      NA      UP3     Affini…  187577 EUR    
#> # … with 19 more variables: loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_input_type <chr>,
#> #   sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> #   flag_project_finance_loan <chr>, name_project <lgl>,
#> #   lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>, id_2dii <chr>,
#> #   level <chr>, sector <chr>, sector_abcd <chr>, name <chr>, …

# Use your own `sector_classifications`
your_classifications <- tibble(
  sector = "power",
  borderline = FALSE,
  code = "3511",
  code_system = "XYZ"
)

restore <- options(r2dii.match.sector_classifications = your_classifications)

loanbook <- tibble(
  sector_classification_system = "XYZ",
  sector_classification_direct_loantaker = "3511",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Alpine Knits India Pvt. Limited",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Yuamen Xinneng Thermal Power Co Ltd"
)

abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power"
)

match_name(loanbook, abcd)
#> # A tibble: 1 × 15
#>   sector_…¹ secto…² id_ul…³ name_…⁴ id_di…⁵ name_…⁶ id_2dii level sector secto…⁷
#>   <chr>     <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr> <chr>  <chr>  
#> 1 XYZ       3511    UP15    Alpine… C294    Yuamen… UP1     ulti… power  power  
#> # … with 5 more variables: name <chr>, name_abcd <chr>, score <dbl>,
#> #   source <chr>, borderline <lgl>, and abbreviated variable names
#> #   ¹​sector_classification_system, ²​sector_classification_direct_loantaker,
#> #   ³​id_ultimate_parent, ⁴​name_ultimate_parent, ⁵​id_direct_loantaker,
#> #   ⁶​name_direct_loantaker, ⁷​sector_abcd

# Cleanup
options(restore)