Match a loanbook to asset-based company data (abcd) by the name_* columns
Source: R/match_name.R
match_name() scores the match between names in a loanbook dataset (columns
can be name_direct_loantaker, name_intermediate_parent* and
name_ultimate_parent) with names in an asset-based company dataset (column
name_company). The raw names are first transformed internally, and aliases
are assigned. The similarity between the loanbook aliases and the abcd
aliases is then scored using stringdist::stringsim().
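As a rough illustration of the scoring step, the sketch below calls stringdist::stringsim() directly with the same method and p that match_name() uses by default. The alias strings are made up for this example; they are not the aliases that match_name() generates internally.

```r
library(stringdist)

# Hypothetical aliases, invented for illustration only.
loanbook_alias <- c("alpineknitsindiapvt ltd", "yuamenxinnengthermalpowerco ltd")
abcd_alias <- "alpineknitsindiapvt ltd"

# Jaro-Winkler similarity with prefix factor p = 0.1.
# The single abcd alias is recycled against each loanbook alias.
scores <- stringsim(loanbook_alias, abcd_alias, method = "jw", p = 0.1)

# Identical strings score 1; different strings score somewhere below 1.
data.frame(loanbook_alias, abcd_alias, score = scores)
```

With the default min_score = 0.8, only pairs whose score reaches that threshold would be kept as candidate matches.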
Usage
match_name(
  loanbook,
  abcd,
  by_sector = TRUE,
  min_score = 0.8,
  method = "jw",
  p = 0.1,
  overwrite = NULL,
  ald = deprecated(),
  ...
)
Arguments
- loanbook, abcd
Data frames structured like r2dii.data::loanbook_demo and r2dii.data::abcd_demo.
- by_sector
Should names only be compared if companies belong to the same
sector?
- min_score
A number between 0 and 1, to set the minimum
score threshold. A score of 1 is a perfect match.
- method
Method for distance calculation. One of
c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). See stringdist::stringdist-metrics.
- p
Prefix factor for Jaro-Winkler distance. The valid range for
p is 0 <= p <= 0.25. If p = 0, the Jaro distance is returned (match_name() defaults to p = 0.1, as shown in Usage). Applies only to method = "jw".
- overwrite
A data frame used to overwrite the
sector and/or name columns of a particular direct loantaker or ultimate parent. To overwrite only sector, the value in the name column should be NA, and vice versa. This data frame can be used to manually match loanbook companies to abcd.
- ald
Deprecated.
- ...
Arguments passed on to
stringdist::stringsim().
Value
A data frame with the same groups (if any) and columns as loanbook,
and the additional columns:
- id_2dii: an id used internally by match_name() to distinguish companies
- level: the level of granularity that the loan was matched at (e.g. direct_loantaker or ultimate_parent)
- sector: the sector of the loanbook company
- sector_abcd: the sector of the abcd company
- name: the name of the loanbook company
- name_abcd: the name of the abcd company
- score: the score of the match (manually set this to 1 prior to calling prioritize() to validate the match)
- source: determines the source of the match (equal to loanbook unless the match is from overwrite)
The returned rows depend on the argument min_score and the result of the
column score for each loan:
- If any row has score equal to 1, match_name() returns all rows where
score equals 1, dropping all other rows.
- If no row has score equal to 1, match_name() returns all rows where
score is equal to or greater than min_score.
- If there is no match, the output is a 0-row tibble with the expected
column names, for type stability.
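The row-selection rule above can be sketched in base R. This is only an illustration of the rule as described here, not the code match_name() runs: filter_matches() is a hypothetical helper, the scored data frame is invented, and grouping by id_loan is an assumption drawn from the phrase "for each loan".

```r
# Hypothetical helper, not part of r2dii.match.
filter_matches <- function(scored, min_score = 0.8) {
  # Split row indices by loan, so the rule is applied to each loan separately.
  rows_by_loan <- split(seq_len(nrow(scored)), scored$id_loan)
  keep <- lapply(rows_by_loan, function(rows) {
    score <- scored$score[rows]
    # Perfect matches win; otherwise keep everything at or above min_score.
    if (any(score == 1)) rows[score == 1] else rows[score >= min_score]
  })
  keep <- sort(as.integer(unlist(keep, use.names = FALSE)))
  scored[keep, , drop = FALSE]
}

# Invented scores for three loans.
scored <- data.frame(
  id_loan = c("L1", "L1", "L2", "L2", "L3"),
  name_abcd = c("a", "b", "c", "d", "e"),
  score = c(1, 0.95, 0.85, 0.5, 0.3)
)

result <- filter_matches(scored)
result
```

In this sketch, loan L1 keeps only its perfect match, loan L2 keeps the one candidate at or above min_score, and loan L3 contributes no rows. If no loan had a qualifying row, the result would have zero rows but the same columns.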
Package options
r2dii.match.sector_classifications: Allows you to use your own
sector_classifications instead of the default. This feature is
experimental and may be dropped and/or become a new argument to
match_name().
Assigning aliases
The transformation process used to compare names between loanbook and abcd datasets applies best practices commonly used in name matching algorithms:
- Remove special characters.
- Replace language specific characters.
- Abbreviate certain names to reduce their importance in the matching.
- Spell out numbers to increase their importance.
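The four steps above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the transformation r2dii.match applies: sketch_alias() is a hypothetical helper, and the character map, the abbreviations (limited to ltd, company to co) and the digit-by-digit spelling are choices made only for this example.

```r
# Hypothetical helper for illustration; not the r2dii.match implementation.
sketch_alias <- function(x) {
  x <- tolower(x)

  # Replace a few language-specific characters with ASCII look-alikes
  # (a-umlaut, o-umlaut, u-umlaut, e-acute, e-grave, n-tilde, c-cedilla).
  x <- chartr("\u00e4\u00f6\u00fc\u00e9\u00e8\u00f1\u00e7", "aoueenc", x)

  # Spell out digits, one digit at a time, so numbers carry more weight.
  digits <- c("zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine")
  for (i in 0:9) {
    x <- gsub(as.character(i), digits[i + 1], x, fixed = TRUE)
  }

  # Abbreviate common words so they carry less weight.
  x <- gsub("\\blimited\\b", "ltd", x)
  x <- gsub("\\bcompany\\b", "co", x)

  # Remove special characters, then collapse repeated whitespace.
  x <- gsub("[^a-z0-9 ]", "", x)
  gsub("\\s+", " ", trimws(x))
}

alias <- sketch_alias("M\u00fcller & S\u00f6hne 3 Limited")
alias
```

Comparing aliases rather than raw names means that differences in case, punctuation, accents and common legal suffixes have less influence on the similarity score.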
See also
Other main functions:
prioritize()
Examples
library(r2dii.data)
library(tibble)
# Small data for examples
loanbook <- head(loanbook_demo, 50)
abcd <- head(abcd_demo, 50)
match_name(loanbook, abcd)
#> # A tibble: 2 × 28
#> id_loan id_direct_lo…¹ name_…² id_in…³ name_…⁴ id_ul…⁵ name_…⁶ loan_…⁷ loan_…⁸
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 L14 C296 Yuasfn… NA NA UP3 Affini… 187577 EUR
#> 2 L15 C295 Yuanbs… NA NA UP196 Noshir… 192217 EUR
#> # … with 19 more variables: loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> # sector_classification_input_type <chr>,
#> # sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> # flag_project_finance_loan <chr>, name_project <lgl>,
#> # lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>, id_2dii <chr>,
#> # level <chr>, sector <chr>, sector_abcd <chr>, name <chr>, …
match_name(loanbook, abcd, min_score = 0.9)
#> # A tibble: 1 × 28
#> id_loan id_direct_lo…¹ name_…² id_in…³ name_…⁴ id_ul…⁵ name_…⁶ loan_…⁷ loan_…⁸
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 L14 C296 Yuasfn… NA NA UP3 Affini… 187577 EUR
#> # … with 19 more variables: loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> # sector_classification_input_type <chr>,
#> # sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> # flag_project_finance_loan <chr>, name_project <lgl>,
#> # lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>, id_2dii <chr>,
#> # level <chr>, sector <chr>, sector_abcd <chr>, name <chr>, …
# Use your own `sector_classifications`
your_classifications <- tibble(
  sector = "power",
  borderline = FALSE,
  code = "3511",
  code_system = "XYZ"
)
restore <- options(r2dii.match.sector_classifications = your_classifications)
loanbook <- tibble(
  sector_classification_system = "XYZ",
  sector_classification_direct_loantaker = "3511",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Alpine Knits India Pvt. Limited",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Yuamen Xinneng Thermal Power Co Ltd"
)
abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power"
)
match_name(loanbook, abcd)
#> # A tibble: 1 × 15
#> sector_…¹ secto…² id_ul…³ name_…⁴ id_di…⁵ name_…⁶ id_2dii level sector secto…⁷
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 XYZ 3511 UP15 Alpine… C294 Yuamen… UP1 ulti… power power
#> # … with 5 more variables: name <chr>, name_abcd <chr>, score <dbl>,
#> # source <chr>, borderline <lgl>, and abbreviated variable names
#> # ¹sector_classification_system, ²sector_classification_direct_loantaker,
#> # ³id_ultimate_parent, ⁴name_ultimate_parent, ⁵id_direct_loantaker,
#> # ⁶name_direct_loantaker, ⁷sector_abcd
# Cleanup
options(restore)