{
"cells": [
{
"cell_type": "markdown",
"id": "previous-guest",
"metadata": {},
"source": [
"# Entity Resolution in Japanese-American Internee Records\n",
"\n",
"## Dependencies\n",
"\n",
"Dependencies are installed via the Pipfile in the same directory."
]
},
{
"cell_type": "markdown",
"id": "palestinian-outside",
"metadata": {},
"source": [
"## The Resolver\n",
"\n",
"Here we define a function that take two data frame-like objects with our chosen column names and attempts to identify matches between them. It returns a list of matches."
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "indirect-export",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from IPython.display import display, HTML\n",
"# from datetime import datetime, date\n",
"from fuzzywuzzy import fuzz\n",
"\n",
"def resolveFamilyMembers( sourceDF, targetDF, debug=False ): # source, target\n",
" result = pd.DataFrame(None, index=sourceDF.index, columns=['match'])\n",
" threshold = float(.5)\n",
" xdf = pd.DataFrame(float(0), index=targetDF.index, columns=sourceDF.index )\n",
" if debug:\n",
" display(HTML('
grid of possible matches:
'), xdf)\n",
" # Visit every cell in the grid and plug in a similarity value between the FAR (cols) and WRA (index) axis..\n",
" for (id1, row1) in targetDF.iterrows():\n",
" for (id2, row2) in sourceDF.iterrows():\n",
" xdf.at[id1, id2] = compare(row1, row2)\n",
" if debug:\n",
" display(HTML('grid with similarity scores:
'),xdf)\n",
" maxMatches = xdf.idxmax() # now we want to find the best matches for each FAR record\n",
" if debug:\n",
" display(HTML('The first, best scores for each column:
'), maxMatches)\n",
" for id2 in maxMatches.index:\n",
" if xdf.loc[maxMatches[id2], id2] > threshold: # max could be zero, so check that we have acceptable score\n",
" # sourceDF.at[id2, 'pid'] = targetDF.loc[maxMatches[id2]]['pid'] # assign the pid to the FAR row\n",
" result.at[id2, 'match'] = maxMatches[id2]\n",
" #xdf.drop([maxMatches[id2]], inplace=True) # drop the WRA row we used\n",
" xdf.drop(columns=[id2], inplace=True) # drop the FAR column we used\n",
" if debug:\n",
" display(HTML('Possibilities remaining:
'), xdf)\n",
" if xdf.size == 1: # if only one possibility remains, choose it\n",
" if debug:\n",
" print('just one possibility left!')\n",
" id1 = xdf.index[0]\n",
" id2 = xdf.columns[0]\n",
" # sourceDF.at[id2, 'pid'] = targetDF.loc[id1]['pid']\n",
" result.at[id2, 'match'] = id1\n",
" return result\n",
" \n",
"# Scoring is weird when there are multiple values to match..\n",
"# Using fn/ofn, byear+-1 for matching so far..\n",
"def compare(a, b):\n",
" result = float(max( # use whichever score is higher in this list..\n",
" min(fn(a,b), by1(a,b)), # if either fn() or by1() return zero then this result is zero..\n",
" 0.5*(by1(a,b)), # if we only match the byear, then score is just .75\n",
" #### another compare method\n",
" ))\n",
" return result\n",
"\n",
"# returns score of 1 if fn matches or 1 if ofn matches fn\n",
"# now with fuzzywuzzy ratio adjusted to btw 0 and 1\n",
"def fn(a, b):\n",
" try:\n",
" result = float(max(\n",
" fuzz.ratio(str(a['fn']), str(b['fn'])),\n",
" fuzz.ratio(str(a['fn']), str(b['ofn'])), # includes matching with other names\n",
" fuzz.ratio(str(b['fn']), str(a['ofn'])),\n",
" fuzz.ratio(str(b['ofn']), str(a['ofn'])),\n",
" )/100)\n",
" return result\n",
" except TypeError as e:\n",
" import traceback\n",
" print(traceback.format_exc())\n",
" display(a['fn'], a['ofn'], b['fn'], b['ofn'])\n",
" return 0\n",
"\n",
"# returns a score of 1 if byears are within 1 year\n",
"def by1(a, b):\n",
" return ( abs( a['byear'] - b['byear'] ) <= 1)"
]
},
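{
"cell_type": "markdown",
"id": "resolver-usage-note",
"metadata": {},
"source": [
"As a minimal sketch of the calling convention (the real invocation against family 7423 appears later in the notebook), `resolveFamilyMembers()` returns a frame indexed like the source, with a `match` column holding the best target row label. The two tiny frames below are hypothetical:\n",
"\n",
"```python\n",
"src = pd.DataFrame({'fn': ['agnes', 'hama'], 'ofn': ['harumi', None], 'byear': [1920.0, 1900.0]}, index=[1, 2])\n",
"tgt = pd.DataFrame({'fn': ['agnes', 'hama'], 'ofn': [None, None], 'byear': [1920.0, 1900.0]}, index=[10, 11])\n",
"resolveFamilyMembers(src, tgt)\n",
"# -> 'match' column mapping source row 1 -> 10 and row 2 -> 11\n",
"```"
]
},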
{
"cell_type": "markdown",
"id": "noted-masters",
"metadata": {},
"source": [
"## Load the big datasets into Pandas data frames"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "italic-packaging",
"metadata": {},
"outputs": [],
"source": [
"farfile = './far_cleanup-09-26-2021.xlsx'\n",
"wrafile = './wra_cleanup-09-26-2021.csv'\n",
"na = ['Unknown', 'unknown', 'Unk.', 'UNK', 'unk.', 'Unk', 'unk', ' ', ' ', ' ']\n",
"fardf = pd.read_excel(farfile,dtype=str,na_values=na,keep_default_na=True)\n",
"wradf = pd.read_csv(wrafile,dtype=str,na_values=na,keep_default_na=True)"
]
},
{
"cell_type": "markdown",
"id": "lined-survey",
"metadata": {},
"source": [
"## Create consistent labels for the significant columns across datasets\n",
"\n",
"We are going to do a lot of work with these particular columns, let's use some short, easy labels and defined the data types we want:\n",
"\n",
"* m_pseudoid - str - a string that is unique to each row in the dataset (mostly) and used to verify results against a reference dataset.\n",
"* famid - str - WRA family id number (also used in FAR)\n",
"* fn - str - first name\n",
"* ofn - str - other first name(s)\n",
"* ln - str - last name\n",
"* byear - float - birth year\n",
"\n",
"We are also going to reindex, starting with 1 instead of zero. This makes it easier to create our unique 'id' column. The 'id' column is a globally unique url for the line within the source file."
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "parallel-rwanda",
"metadata": {},
"outputs": [],
"source": [
"# first step, create uniform column names across dataframes\n",
"wradf.rename(columns = {'NEW-m_lastname':'ln', 'NEW-m_firstname':'fn', 'NEW-m_familyno_normal':'famid', 'NEW-m_birthyear':'byear'}, inplace = True)\n",
"wradf['row'] = wradf.index.map(lambda x: x+1)\n",
"wradf['id'] = 'http://densho.org/data/WRA.csv#' + wradf['row'].map(str)\n",
"wradf['ofn'] = None # not in WRA data\n",
"wradf.set_index('row', inplace=True, verify_integrity=True)\n",
"fardf.rename(columns = {'NEW-last_name_corrected':'ln', 'NEW-first_name_corrected':'fn', 'NEW-other_names':'ofn', 'NEW-family_number':'famid', 'NEW-year_of_birth':'byear'}, inplace = True)\n",
"fardf['row'] = fardf.index.map(lambda x: x+1)\n",
"fardf['id'] = 'http://densho.org/data/FAR.csv#' + fardf['row'].map(str)\n",
"fardf['m_pseudoid'] = fardf['FAR Exit Dataset'] + \":\" + fardf['original_order'] + \":\" + fardf['far_line_id']\n",
"fardf.set_index('row', inplace=True, verify_integrity=True)\n",
"# fardf.dropna(subset=['m_pseudoid'], inplace=True) # drop rows without pseudoid\n",
"# fardf.dropna(subset=['fn'], inplace=True) # drop rows without first name\n",
"# fardf.dropna(subset=['famid'], inplace=True) # no longer dropping empty famid rows b/c trying soundex"
]
},
{
"cell_type": "markdown",
"id": "restricted-portugal",
"metadata": {},
"source": [
"## Gold Standard Reference Dataframe\n",
"\n",
"These matches were performed manually on a subset of FAR records, based on the list of internees that were transferred to North Dakota after the event described as a \"riot\". We are taking the Excel file and rearranging and simplifying that data in order to have a simple list of expected matched. Each line in Excel contains two FAR ids that are expect to match one WRA id.\n",
"\n",
"We will use this data as the \"ground truth\" to measure the accuracy of algorithmic entity resolution. The \"gold\" data frame is only the id columns. The additional columns are in \"golddf\"."
]
},
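{
"cell_type": "markdown",
"id": "gold-repeat-note",
"metadata": {},
"source": [
"The reshaping below leans on a small trick: every row is repeated twice, and `duplicated()` distinguishes the first copy from the second so each copy can take one of the two FAR id columns. A toy version with a hypothetical one-row frame:\n",
"\n",
"```python\n",
"toy = pd.DataFrame({'wraid': ['w1'], 'far_a': ['f1'], 'far_b': ['f2']})\n",
"toy = toy.loc[toy.index.repeat(2), :].reset_index(drop=True)\n",
"dup = toy.duplicated(keep='first')     # [False, True]\n",
"toy.loc[~dup, 'farid'] = toy['far_a']  # first copy takes the first FAR id\n",
"toy.loc[dup, 'farid'] = toy['far_b']   # second copy takes the second\n",
"toy[['wraid', 'farid']]                # w1 -> f1 and w1 -> f2\n",
"```"
]
},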
{
"cell_type": "code",
"execution_count": 47,
"id": "individual-yellow",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wraid | \n",
" farid | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 7-manzanar_ajisaka_1890_tatsuo | \n",
" tulelake1:208:198 | \n",
"
\n",
" \n",
" | 1 | \n",
" 7-manzanar_ajisaka_1890_tatsuo | \n",
" manzanar1:36:35 | \n",
"
\n",
" \n",
" | 2 | \n",
" 6-jerome_arichi_1894_saburo | \n",
" tulelake1:628:602 | \n",
"
\n",
" \n",
" | 3 | \n",
" 6-jerome_arichi_1894_saburo | \n",
" jerome1:293:289 | \n",
"
\n",
" \n",
" | 4 | \n",
" 2-poston_hagio_1889_takemats | \n",
" tulelake1:2715:2575 | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | 199 | \n",
" 1-topaz_yamasaki_1914_satoru | \n",
" topaz1:10959:10587 | \n",
"
\n",
" \n",
" | 200 | \n",
" 10-tulelake_yoshida_1911_riichi | \n",
" tulelake1:29776:28342 | \n",
"
\n",
" \n",
" | 201 | \n",
" 10-tulelake_yoshida_1911_riichi | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
" | 202 | \n",
" 3-gilariver_yoshioka_1918_kiyozo | \n",
" tulelake1:30146:28693 | \n",
"
\n",
" \n",
" | 203 | \n",
" 3-gilariver_yoshioka_1918_kiyozo | \n",
" gilariver1:16851:16379 | \n",
"
\n",
" \n",
"
\n",
"
204 rows × 2 columns
\n",
"
"
],
"text/plain": [
" wraid farid\n",
"0 7-manzanar_ajisaka_1890_tatsuo tulelake1:208:198\n",
"1 7-manzanar_ajisaka_1890_tatsuo manzanar1:36:35\n",
"2 6-jerome_arichi_1894_saburo tulelake1:628:602\n",
"3 6-jerome_arichi_1894_saburo jerome1:293:289\n",
"4 2-poston_hagio_1889_takemats tulelake1:2715:2575\n",
".. ... ...\n",
"199 1-topaz_yamasaki_1914_satoru topaz1:10959:10587\n",
"200 10-tulelake_yoshida_1911_riichi tulelake1:29776:28342\n",
"201 10-tulelake_yoshida_1911_riichi NOT IN FAR\n",
"202 3-gilariver_yoshioka_1918_kiyozo tulelake1:30146:28693\n",
"203 3-gilariver_yoshioka_1918_kiyozo gilariver1:16851:16379\n",
"\n",
"[204 rows x 2 columns]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"matchesfile = './Transfer_NDak_100-male_Jul-3-1945_CLEANED.xlsx'\n",
"matches_src_df = pd.read_excel(matchesfile,dtype=str,na_values=na,keep_default_na=True)\n",
"golddf = matches_src_df.loc[2:101, ['WRA_pseudoid', 'FAR_Exit_Dataset-original_order-far_line_id', 'FAR_Exit_Dataset-original_order-far_line_id.1', 'LastName', 'FirstName']]\n",
"golddf = golddf.merge(wradf, left_on=['WRA_pseudoid'], right_on=['m_pseudoid'], how='left')\n",
"golddf = golddf.loc[golddf.index.repeat(2),:].reset_index(drop=True)\n",
"idx_duplicate = golddf.duplicated(keep=\"first\")\n",
"golddf.loc[idx_duplicate, 'farid'] = golddf['FAR_Exit_Dataset-original_order-far_line_id.1']\n",
"golddf.loc[~idx_duplicate, 'farid'] = golddf['FAR_Exit_Dataset-original_order-far_line_id']\n",
"golddf.rename(columns={'WRA_pseudoid': 'wraid'}, inplace=True)\n",
"gold = golddf.loc[:,['wraid', 'farid']]\n",
"gold"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "intimate-brown",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wraid | \n",
" farid | \n",
"
\n",
" \n",
" \n",
" \n",
" | 51 | \n",
" 10-tulelake_kiyama_1924_jimmie | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
" | 53 | \n",
" 10-tulelake_kiyama_1926_tommie | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
" | 119 | \n",
" NOT IN WRA | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
" | 177 | \n",
" 10-tulelake_toyota_1919_yoshimi | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
" | 201 | \n",
" 10-tulelake_yoshida_1911_riichi | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" wraid farid\n",
"51 10-tulelake_kiyama_1924_jimmie NOT IN FAR\n",
"53 10-tulelake_kiyama_1926_tommie NOT IN FAR\n",
"119 NOT IN WRA NOT IN FAR\n",
"177 10-tulelake_toyota_1919_yoshimi NOT IN FAR\n",
"201 10-tulelake_yoshida_1911_riichi NOT IN FAR"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gold[gold.farid.str.contains('NOT') == True]"
]
},
{
"cell_type": "markdown",
"id": "stone-growth",
"metadata": {},
"source": [
"## Combine our FAR and WRA datasets."
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "naughty-ultimate",
"metadata": {},
"outputs": [],
"source": [
"# alldf = pd.concat([wradf, fardf], axis=0, keys=['wra', 'far'])"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "numerous-universal",
"metadata": {},
"outputs": [],
"source": [
"# alldf.head()"
]
},
{
"cell_type": "markdown",
"id": "incoming-eclipse",
"metadata": {},
"source": [
"## Use case-insensitive strings for names\n",
"\n",
"We want to treat most of our columns as case insensitive strings. `casefold()` is a function that is more consistent in different languages than simply calling `lower()`."
]
},
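{
"cell_type": "markdown",
"id": "casefold-demo",
"metadata": {},
"source": [
"A quick illustration of the difference (not part of the pipeline): German \"ß\" only folds under `casefold()`, so two spellings of the same word can match.\n",
"\n",
"```python\n",
"'ß'.lower()                                  # 'ß'  -- unchanged\n",
"'ß'.casefold()                               # 'ss'\n",
"'STRASSE'.casefold() == 'straße'.casefold()  # True\n",
"```"
]
},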
{
"cell_type": "code",
"execution_count": 51,
"id": "original-faith",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_261915/2963965774.py:14: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" alldf.replace(r'^\\s*$|^none$', value=np.nan, regex=True, inplace=True)\n"
]
}
],
"source": [
"# deal with empty and missing strings => None\n",
"def clean(alldf):\n",
" alldf['famid'] = alldf['famid'].astype(str)\n",
" alldf['fn'] = alldf['fn'].astype(str)\n",
" alldf['ln'] = alldf['ln'].astype(str)\n",
" alldf['ofn'] = alldf['ofn'].astype(str)\n",
"\n",
" # Casefold is like lowercase, but better for consistent case insensitive match in more languages\n",
" alldf['famid'] = alldf['famid'].str.casefold()\n",
" alldf['fn'] = alldf['fn'].str.casefold()\n",
" alldf['ln'] = alldf['ln'].str.casefold()\n",
" alldf['ofn'] = alldf['ofn'].str.casefold()\n",
"\n",
" alldf.replace(r'^\\s*$|^none$', value=np.nan, regex=True, inplace=True)\n",
" alldf.loc[:, 'ofn'] = alldf.loc[:, 'ofn'].where(alldf.ofn.notna(), None)\n",
" alldf.loc[:, 'fn'] = alldf.loc[:, 'fn'].where(alldf.fn.notna(), None)\n",
"\n",
"clean(fardf)\n",
"clean(wradf)"
]
},
{
"cell_type": "markdown",
"id": "skilled-reduction",
"metadata": {},
"source": [
"## Clean the family ids\n",
"\n",
"Some of the family ids in the source data may not make sense. Other family ids may have spaces at the start or the end that need to be trimmed. Sometimes the family id string is just a number. Others are a combination of number and letter code, with varying separator characters, such as hyphens and spaces. Others are two numbers separated by a slash. Perhaps this indicates that someone has a new family in addition to their original family ID.\n",
"\n",
"Let's first use a regex to find all the rows do not match our expectations."
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "outside-chart",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 40450 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 40451 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 40452 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 41944 | \n",
" 1933-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 41945 | \n",
" 1933-07-01 00:00:00 | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
"
\n",
" \n",
" | 151724 | \n",
" nan | \n",
"
\n",
" \n",
" | 151725 | \n",
" nan | \n",
"
\n",
" \n",
" | 151726 | \n",
" nan | \n",
"
\n",
" \n",
" | 151727 | \n",
" nan | \n",
"
\n",
" \n",
" | 151728 | \n",
" nan | \n",
"
\n",
" \n",
"
\n",
"
10606 rows × 1 columns
\n",
"
"
],
"text/plain": [
" famid\n",
"row \n",
"40450 1932-07-01 00:00:00\n",
"40451 1932-07-01 00:00:00\n",
"40452 1932-07-01 00:00:00\n",
"41944 1933-07-01 00:00:00\n",
"41945 1933-07-01 00:00:00\n",
"... ...\n",
"151724 nan\n",
"151725 nan\n",
"151726 nan\n",
"151727 nan\n",
"151728 nan\n",
"\n",
"[10606 rows x 1 columns]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"famid 10606\n",
"dtype: int64\n"
]
}
],
"source": [
"weird_famids = fardf[ fardf['famid'].str.contains('^\\d+$', na=True) == False ][['famid']]\n",
"display(weird_famids)\n",
"print(weird_famids.count())"
]
},
{
"cell_type": "markdown",
"id": "located-egypt",
"metadata": {},
"source": [
"So we have 11168 weird Fam IDs. \n",
"\n",
"We see all of these letter and number patterns in the data:\n",
"* xx-11-xx\n",
"* xx 11 xx\n",
"* 11-xx-11\n",
"* 11 xx 11\n",
"* xx-11\n",
"* xx 11\n",
"* 11-xx\n",
"* 11 xx\n",
"\n",
"These patterns are not consistent in the WRA or the FAR data. We'll have to make them consistent if we want them to match up. (The letter and number segments above can be longer than two characters.)\n",
"\n",
"Some of these weird values record that the family ID was unknown when the record was created; \"unk\", \"unk.\", or \"unknown\". Others may have been known at the time, but are now \\[illegible\\]. Only one row contains a note rather than a number.\n",
"\n",
"For our purposes today, we are going to count all of the \"unknown\" FIDs as NaN or \"not a number\", which says that effectively there is no useful value available."
]
},
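{
"cell_type": "markdown",
"id": "na-values-demo",
"metadata": {},
"source": [
"The \"unknown\" spellings are already mapped to `NaN` at load time by the `na_values` list we passed to `read_excel`/`read_csv`. A minimal illustration using that same `na` list on an in-memory CSV:\n",
"\n",
"```python\n",
"import io\n",
"sample = pd.read_csv(io.StringIO('famid\\nunk.\\n1234\\n'), dtype=str, na_values=na, keep_default_na=True)\n",
"sample['famid'].isna().tolist()  # [True, False]\n",
"```"
]
},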
{
"cell_type": "code",
"execution_count": 53,
"id": "parliamentary-travel",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"FAR Exit Dataset object\n",
"original_order object\n",
"far_line_id object\n",
"family_number object\n",
"famid object\n",
"last_name_corrected object\n",
"ln object\n",
"last_name_original object\n",
"first_name_corrected object\n",
"first_name_original object\n",
"fn object\n",
"other_names object\n",
"ofn object\n",
"date_of_birth object\n",
"NEW-date_of_birth object\n",
"year_of_birth object\n",
"byear object\n",
"sex object\n",
"NEW_marital_status object\n",
"citizenship object\n",
"alien_registration_no. object\n",
"type_of_original_entry object\n",
"NEW-type_of_original_entry object\n",
"NEW-pre-evacuation_address object\n",
"pre-evacuation_state object\n",
"date_of_original_entry object\n",
"type_of_final_departure object\n",
"NEW-type_of_final_departure object\n",
"date_of_final_departure object\n",
"final_departure_state object\n",
"camp_address_original object\n",
"camp_address_block object\n",
"camp_address_barracks object\n",
"camp_address_room object\n",
"reference object\n",
"notes object\n",
"id object\n",
"m_pseudoid object\n",
"dtype: object"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"foofid = fardf.famid.str.replace(r'^([a-zA-Z]+)[\\s-](\\d+)$', lambda m: m.group(1).lower()+'-'+m.group(2), case=False, regex=True)\n",
"foofid = foofid.str.replace(r'^(\\d+)[\\s-]([a-zA-Z]+)$', lambda m: m.group(2).lower()+'-'+m.group(1), case=False, regex=True)\n",
"foofid = foofid.str.replace(r'^(\\d+)[\\s-]([a-zA-Z]+)[\\s-](\\d+)$', lambda m: m.group(1)+'-'+m.group(2).lower()+'-'+m.group(3), case=False, regex=True)\n",
"foofid = foofid.str.replace(r'^([a-zA-Z]+)[\\s-](\\d+)[\\s-]([a-zA-Z]+)$', lambda m: m.group(1).lower()+'-'+m.group(2)+'-'+m.group(3).lower(), case=False, regex=True)\n",
"\n",
"#foofid[ foofid.str.contains(' ', na=False) ]\n",
"display(fardf.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "foster-history",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 40450 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 40451 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 40452 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 41944 | \n",
" 1933-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 41945 | \n",
" 1933-07-01 00:00:00 | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
"
\n",
" \n",
" | 151724 | \n",
" nan | \n",
"
\n",
" \n",
" | 151725 | \n",
" nan | \n",
"
\n",
" \n",
" | 151726 | \n",
" nan | \n",
"
\n",
" \n",
" | 151727 | \n",
" nan | \n",
"
\n",
" \n",
" | 151728 | \n",
" nan | \n",
"
\n",
" \n",
"
\n",
"
10606 rows × 1 columns
\n",
"
"
],
"text/plain": [
" famid\n",
"row \n",
"40450 1932-07-01 00:00:00\n",
"40451 1932-07-01 00:00:00\n",
"40452 1932-07-01 00:00:00\n",
"41944 1933-07-01 00:00:00\n",
"41945 1933-07-01 00:00:00\n",
"... ...\n",
"151724 nan\n",
"151725 nan\n",
"151726 nan\n",
"151727 nan\n",
"151728 nan\n",
"\n",
"[10606 rows x 1 columns]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fardf[ fardf['famid'].str.contains('^\\d+$', na=True) == False ][['famid']] ## weird Fam IDs should be gone"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "uniform-python",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fam ID NaN count: 0 / 151728\n"
]
}
],
"source": [
"print('Fam ID NaN count: %s / %s' % (fardf['famid'].isna().sum(), fardf['famid'].count()))"
]
},
{
"cell_type": "markdown",
"id": "honey-thinking",
"metadata": {},
"source": [
"## Clean birth year data\n",
"\n",
"Some of the birth years are considered to be strings by Pandas. This includes values such as \"17 y\". We can clean this up with a regular expression that extracts the numeric digits. Since some of the birth years are missing and therefore NaN or \"not a number\", we will set the data type of the series to \"float\", since we can do the necessary birth year arithmetic on a float and NaN is considered a float."
]
},
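{
"cell_type": "markdown",
"id": "byear-nan-note",
"metadata": {},
"source": [
"Using floats also keeps `by1()` safe on missing data: any comparison against `NaN` is `False`, so a missing birth year simply scores zero instead of raising.\n",
"\n",
"```python\n",
"abs(np.nan - 1920.0) <= 1  # False: a missing byear never counts as a match\n",
"abs(1921.0 - 1920.0) <= 1  # True\n",
"```"
]
},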
{
"cell_type": "code",
"execution_count": 56,
"id": "particular-needle",
"metadata": {},
"outputs": [],
"source": [
"fardf['byear'] = fardf['byear'].str.replace('(\\d+).*', lambda m: m.group(1), regex=True)\n",
"fardf['byear'] = fardf['byear'].astype(str).astype(float)\n",
"wradf['byear'] = wradf['byear'].str.replace('(\\d+).*', lambda m: m.group(1), regex=True)\n",
"wradf['byear'] = wradf['byear'].astype(str).astype(float)"
]
},
{
"cell_type": "markdown",
"id": "constant-gazette",
"metadata": {},
"source": [
"## Test the `resolveFamilyMembers()` function on a single family\n",
"\n",
"Here was are modifying the FAR dataframe using the matches returned by the function. Since the index of the matches is the same index from the original FAR DF, we can simply assign the match series to a new column and the indices will be aligned for us."
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "resistant-divorce",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
" ln | \n",
" fn | \n",
" ofn | \n",
" byear | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 27510 | \n",
" 7423 | \n",
" abe | \n",
" agnes | \n",
" harumi | \n",
" 1920.0 | \n",
"
\n",
" \n",
" | 27511 | \n",
" 7423 | \n",
" abe | \n",
" hama | \n",
" nan | \n",
" 1900.0 | \n",
"
\n",
" \n",
" | 27512 | \n",
" 7423 | \n",
" abe | \n",
" janice | \n",
" mikiye | \n",
" 1923.0 | \n",
"
\n",
" \n",
" | 27513 | \n",
" 7423 | \n",
" abe | \n",
" shuji | \n",
" nan | \n",
" 1878.0 | \n",
"
\n",
" \n",
" | 27514 | \n",
" 7423 | \n",
" abe | \n",
" shuzo | \n",
" nan | \n",
" 1874.0 | \n",
"
\n",
" \n",
" | 27515 | \n",
" 7423 | \n",
" abe | \n",
" tami | \n",
" nan | \n",
" 1874.0 | \n",
"
\n",
" \n",
" | 27516 | \n",
" 7423 | \n",
" abe | \n",
" tomee | \n",
" tomoye | \n",
" 1935.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" famid ln fn ofn byear\n",
"row \n",
"27510 7423 abe agnes harumi 1920.0\n",
"27511 7423 abe hama nan 1900.0\n",
"27512 7423 abe janice mikiye 1923.0\n",
"27513 7423 abe shuji nan 1878.0\n",
"27514 7423 abe shuzo nan 1874.0\n",
"27515 7423 abe tami nan 1874.0\n",
"27516 7423 abe tomee tomoye 1935.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"grid of possible matches:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | row | \n",
" 27510 | \n",
" 27511 | \n",
" 27512 | \n",
" 27513 | \n",
" 27514 | \n",
" 27515 | \n",
" 27516 | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 73 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 80 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 113 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 258 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 260 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 276 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 289 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"row 27510 27511 27512 27513 27514 27515 27516\n",
"row \n",
"73 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"80 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"113 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"258 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"260 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"276 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"289 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"grid with similarity scores:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | row | \n",
" 27510 | \n",
" 27511 | \n",
" 27512 | \n",
" 27513 | \n",
" 27514 | \n",
" 27515 | \n",
" 27516 | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 73 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 80 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 113 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 258 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 260 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 276 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 289 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"row 27510 27511 27512 27513 27514 27515 27516\n",
"row \n",
"73 0.0 1.0 0.0 0.0 0.0 0.0 0.0\n",
"80 1.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"113 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n",
"258 0.0 0.0 0.0 1.0 0.0 0.0 0.0\n",
"260 0.0 0.0 0.0 0.0 1.0 1.0 0.0\n",
"276 0.0 0.0 0.0 0.0 1.0 1.0 0.0\n",
"289 0.0 0.0 0.0 0.0 0.0 0.0 1.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"The first, best scores for each column:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"row\n",
"27510 80\n",
"27511 73\n",
"27512 113\n",
"27513 258\n",
"27514 260\n",
"27515 260\n",
"27516 289\n",
"dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Possibilities remaining:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | row | \n",
"
\n",
" \n",
" | row | \n",
"
\n",
" \n",
" \n",
" \n",
" | 73 | \n",
"
\n",
" \n",
" | 80 | \n",
"
\n",
" \n",
" | 113 | \n",
"
\n",
" \n",
" | 258 | \n",
"
\n",
" \n",
" | 260 | \n",
"
\n",
" \n",
" | 276 | \n",
"
\n",
" \n",
" | 289 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: [73, 80, 113, 258, 260, 276, 289]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" match | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 27510 | \n",
" 80 | \n",
"
\n",
" \n",
" | 27511 | \n",
" 73 | \n",
"
\n",
" \n",
" | 27512 | \n",
" 113 | \n",
"
\n",
" \n",
" | 27513 | \n",
" 258 | \n",
"
\n",
" \n",
" | 27514 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27515 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27516 | \n",
" 289 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" match\n",
"row \n",
"27510 80\n",
"27511 73\n",
"27512 113\n",
"27513 258\n",
"27514 260\n",
"27515 260\n",
"27516 289"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" match | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 27510 | \n",
" 80 | \n",
"
\n",
" \n",
" | 27511 | \n",
" 73 | \n",
"
\n",
" \n",
" | 27512 | \n",
" 113 | \n",
"
\n",
" \n",
" | 27513 | \n",
" 258 | \n",
"
\n",
" \n",
" | 27514 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27515 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27516 | \n",
" 289 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" match\n",
"row \n",
"27510 80\n",
"27511 73\n",
"27512 113\n",
"27513 258\n",
"27514 260\n",
"27515 260\n",
"27516 289"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
" ln | \n",
" fn | \n",
" ofn | \n",
" byear | \n",
" wra row | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 27510 | \n",
" 7423 | \n",
" abe | \n",
" agnes | \n",
" harumi | \n",
" 1920.0 | \n",
" 80 | \n",
"
\n",
" \n",
" | 27511 | \n",
" 7423 | \n",
" abe | \n",
" hama | \n",
" nan | \n",
" 1900.0 | \n",
" 73 | \n",
"
\n",
" \n",
" | 27512 | \n",
" 7423 | \n",
" abe | \n",
" janice | \n",
" mikiye | \n",
" 1923.0 | \n",
" 113 | \n",
"
\n",
" \n",
" | 27513 | \n",
" 7423 | \n",
" abe | \n",
" shuji | \n",
" nan | \n",
" 1878.0 | \n",
" 258 | \n",
"
\n",
" \n",
" | 27514 | \n",
" 7423 | \n",
" abe | \n",
" shuzo | \n",
" nan | \n",
" 1874.0 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27515 | \n",
" 7423 | \n",
" abe | \n",
" tami | \n",
" nan | \n",
" 1874.0 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27516 | \n",
" 7423 | \n",
" abe | \n",
" tomee | \n",
" tomoye | \n",
" 1935.0 | \n",
" 289 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" famid ln fn ofn byear wra row\n",
"row \n",
"27510 7423 abe agnes harumi 1920.0 80\n",
"27511 7423 abe hama nan 1900.0 73\n",
"27512 7423 abe janice mikiye 1923.0 113\n",
"27513 7423 abe shuji nan 1878.0 258\n",
"27514 7423 abe shuzo nan 1874.0 260\n",
"27515 7423 abe tami nan 1874.0 260\n",
"27516 7423 abe tomee tomoye 1935.0 289"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Look at our person matching data problem w/in a family\n",
"fam_far = fardf[fardf['famid']=='7423'][['famid','ln','fn','ofn','byear']]\n",
"fam_wra = wradf[wradf['famid']=='7423'][['famid','ln','fn','ofn','byear']]\n",
"display(fam_far)\n",
"all_matchesdf2 = pd.DataFrame()\n",
"matches = resolveFamilyMembers(fam_far, fam_wra, debug=True)\n",
"display(matches)\n",
"all_matchesdf2 = pd.concat([all_matchesdf2, matches])\n",
"display(all_matchesdf2)\n",
"fardf['wra row'] = all_matchesdf2.match\n",
"fardf[fardf['famid']=='7423'][['famid','ln','fn','ofn','byear', 'wra row']]"
]
},
{
"cell_type": "markdown",
"id": "spread-presentation",
"metadata": {},
"source": [
"## The Whole Enchilada\n",
"\n",
"Let's see if we can perform matching against the entire dataset. What could go wrong?"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "supreme-polyester",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fams = 10\n",
"skip = 5000\n",
"count = 0\n",
"all_matchesdf = pd.DataFrame()\n",
"unmatched_famids = pd.DataFrame(columns=['dataset', 'famid'])\n",
"for famid, frame in fardf.groupby(['famid']): # , 'FAR Exit Dataset']):\n",
"# count = count + 1\n",
"# if count < skip:\n",
"# continue\n",
"# if count >= skip + fams:\n",
"# break\n",
" try:\n",
" wra_fam = wradf[wradf['famid'] == famid]\n",
" if wra_fam.empty:\n",
" unmatched_famids.loc[len(unmatched_famids.index)] = ['far', famid]\n",
" continue\n",
" matches = resolveFamilyMembers(frame, wra_fam, debug=False)\n",
" all_matchesdf = pd.concat([all_matchesdf, matches])\n",
" except KeyError as e:\n",
" display(e)\n",
"unmatched_famids.to_csv('./unmatched_famids.csv')\n",
"#fardf['wra row'] = all_matchesdf.match\n",
"all_matchesdf.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "coastal-fighter",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"match 119953\n",
"dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#\n",
"display(all_matchesdf.count())\n",
"#fardf['wra row'] = all_matchesdf[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "common-problem",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
"
\n",
" \n",
" | dataset | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | far | \n",
" 1490 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" famid\n",
"dataset \n",
"far 1490"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"unmatched_famids.groupby(['dataset']).count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "unavailable-interview",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
" ln | \n",
" fn | \n",
" ofn | \n",
" byear | \n",
" wra row | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 7000 | \n",
" 2324 | \n",
" yamada | \n",
" mitsuzo | \n",
" yugetsu | \n",
" 1903.0 | \n",
" 100780 | \n",
"
\n",
" \n",
" | 7001 | \n",
" 2324 | \n",
" yamada | \n",
" mitzi | \n",
" mitsue | \n",
" 1925.0 | \n",
" 100782 | \n",
"
\n",
" \n",
" | 7002 | \n",
" 2324 | \n",
" yamada | \n",
" yuriko | \n",
" nan | \n",
" 1929.0 | \n",
" 101035 | \n",
"
\n",
" \n",
" | 7003 | \n",
" 2325 | \n",
" segawa | \n",
" chiyoko | \n",
" angie | \n",
" 1918.0 | \n",
" 76628 | \n",
"
\n",
" \n",
" | 7004 | \n",
" 2325 | \n",
" segawa | \n",
" chiyoko | \n",
" nan | \n",
" 1918.0 | \n",
" 76628 | \n",
"
\n",
" \n",
" | 7005 | \n",
" 2325 | \n",
" segawa | \n",
" sharleen | \n",
" chizuko | \n",
" 1944.0 | \n",
" NaN | \n",
"
\n",
" \n",
" | 7006 | \n",
" 2325 | \n",
" segawa | \n",
" ted | \n",
" jr. | \n",
" 1941.0 | \n",
" 76647 | \n",
"
\n",
" \n",
" | 7007 | \n",
" 2325 | \n",
" segawa | \n",
" theodore | \n",
" soichi | \n",
" 1941.0 | \n",
" 76647 | \n",
"
\n",
" \n",
" | 7008 | \n",
" 2325 | \n",
" segawa | \n",
" theodore | \n",
" nan | \n",
" 1914.0 | \n",
" 76648 | \n",
"
\n",
" \n",
" | 7009 | \n",
" 2326 | \n",
" osugi | \n",
" harumi | \n",
" nan | \n",
" 1914.0 | \n",
" 70154 | \n",
"
\n",
" \n",
" | 7010 | \n",
" 2326 | \n",
" osugi | \n",
" mikio | \n",
" donald | \n",
" 1934.0 | \n",
" 70161 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" famid ln fn ofn byear wra row\n",
"row \n",
"7000 2324 yamada mitsuzo yugetsu 1903.0 100780\n",
"7001 2324 yamada mitzi mitsue 1925.0 100782\n",
"7002 2324 yamada yuriko nan 1929.0 101035\n",
"7003 2325 segawa chiyoko angie 1918.0 76628\n",
"7004 2325 segawa chiyoko nan 1918.0 76628\n",
"7005 2325 segawa sharleen chizuko 1944.0 NaN\n",
"7006 2325 segawa ted jr. 1941.0 76647\n",
"7007 2325 segawa theodore soichi 1941.0 76647\n",
"7008 2325 segawa theodore nan 1914.0 76648\n",
"7009 2326 osugi harumi nan 1914.0 70154\n",
"7010 2326 osugi mikio donald 1934.0 70161"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fardf.dropna(subset=['famid'], inplace=True)\n",
"fardf.loc[7000:7010, :][['famid','ln','fn','ofn','byear', 'wra row']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "antique-circular",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unmatched with famid: 31775 / 151728\n",
"unmatched with famid percent: 20.94208056522198\n"
]
}
],
"source": [
"print( 'unmatched with famid: ' + str(fardf['wra row'].isna().sum()) + ' / '+ str(fardf['famid'].count()))\n",
"print( 'unmatched with famid percent: ' + str(fardf['wra row'].isna().sum()/fardf['famid'].count()*100))"
]
},
{
"cell_type": "markdown",
"id": "southwest-persian",
"metadata": {},
"source": [
"# Measure Accuracy against Gold Standard Dataset of 204 FAR records\n",
"\n",
"199 or the 204 FAR records were able to be matched up by hand with their WRA records. Now we want to see how many of these same records were matched by the algorithm so far."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "informed-injury",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"matched: 155 / 204\n",
"gold matched: 199 / 204\n"
]
}
],
"source": [
"cnt = 0\n",
"matched = 0\n",
"for index, row in gold.iterrows():\n",
" cnt += 1\n",
" try:\n",
" far_row = fardf[fardf['m_pseudoid'] == row['farid']]\n",
" wra_row = wradf.index[wradf['m_pseudoid'] == row['wraid']].tolist()[0]\n",
" if (wra_row != None) & (far_row['wra row'].tolist()[0] == wra_row):\n",
" matched += 1\n",
" except:\n",
" pass\n",
"print('matched: %s / %s'%(matched, cnt))\n",
"\n",
"gold_unmatched = gold.farid.str.contains('NOT').sum()\n",
"print('gold matched: %s / %s'%(cnt-gold_unmatched, cnt))"
]
},
{
"cell_type": "markdown",
"id": "aboriginal-palestine",
"metadata": {},
"source": [
"# Try resolving the remainder with family name soundex segments"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "korean-passing",
"metadata": {},
"outputs": [],
"source": [
"from libindic.soundex import Soundex\n",
"instance = Soundex()\n",
"fardf['ln_soundex'] = fardf[fardf['ln'].isna() == False]['ln'].apply(instance.soundex)\n",
"wradf['ln_soundex'] = wradf[wradf['ln'].isna() == False]['ln'].apply(instance.soundex)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "extensive-addition",
"metadata": {},
"outputs": [],
"source": [
"sndx_matchesdf = pd.DataFrame()\n",
"unmatched_lnsx = pd.DataFrame(columns=['dataset', 'lnsx'])\n",
"unmatched_far = fardf[fardf['wra row'].isna()]\n",
"for lnsx, frame in unmatched_far.groupby(['ln_soundex']):\n",
" try:\n",
" wra = wradf[wradf['ln_soundex'] == lnsx]\n",
" if wra.empty:\n",
" continue;\n",
" matches = resolveFamilyMembers(frame, wra, debug=False)\n",
" sndx_matchesdf = pd.concat([sndx_matchesdf, matches])\n",
" # display(fardf[fardf['famid']==fam][['famid','pid','fn','ofn','byear', 'wra row']])\n",
" except KeyError as e:\n",
" # display(e)\n",
" unmatched_lnsx.loc[len(unmatched_lnsx.index)] = ['far', lnsx]\n",
" continue\n",
"unmatched_lnsx.to_csv('./unmatched_lnsx.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "female-timing",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"match 11837\n",
"dtype: int64"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sndx_matchesdf.count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "outdoor-machine",
"metadata": {},
"outputs": [],
"source": [
"fardf['wra row sndx'] = sndx_matchesdf.match"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "smart-harbor",
"metadata": {},
"outputs": [],
"source": [
"fardf.to_csv('./MATCHED FAR.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "posted-there",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"FAR Exit Dataset 131790\n",
"original_order 131790\n",
"far_line_id 131130\n",
"family_number 131669\n",
"famid 131790\n",
"last_name_corrected 131789\n",
"ln 131790\n",
"last_name_original 131694\n",
"first_name_corrected 131783\n",
"first_name_original 131783\n",
"fn 131790\n",
"other_names 48898\n",
"ofn 131790\n",
"date_of_birth 131200\n",
"NEW-date_of_birth 131200\n",
"year_of_birth 131217\n",
"byear 131223\n",
"sex 131697\n",
"NEW_marital_status 131683\n",
"citizenship 131707\n",
"alien_registration_no. 49199\n",
"type_of_original_entry 131718\n",
"NEW-type_of_original_entry 131710\n",
"NEW-pre-evacuation_address 129817\n",
"pre-evacuation_state 129503\n",
"date_of_original_entry 131720\n",
"type_of_final_departure 131177\n",
"NEW-type_of_final_departure 131176\n",
"date_of_final_departure 131182\n",
"final_departure_state 117194\n",
"camp_address_original 52351\n",
"camp_address_block 62573\n",
"camp_address_barracks 62137\n",
"camp_address_room 62105\n",
"reference 565\n",
"notes 2893\n",
"id 131790\n",
"m_pseudoid 131130\n",
"wra row 131790\n",
"ln_soundex 131790\n",
"wra row sndx 11837\n",
"dtype: int64"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fardf[~fardf['wra row'].isna() | ~fardf['wra row sndx'].isna()].count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "artificial-adjustment",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.13140620056944008"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1- 131790 / 151728"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "raised-boutique",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"19938"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"151728 - 131790"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "defensive-frame",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"matched: 168 / 204\n",
"famid errors: 183 / 204\n",
"soundex errors: 0 / 204\n",
"gold matched: 199 / 204\n"
]
}
],
"source": [
"cnt = 0\n",
"matched = 0\n",
"famid_errors = 0\n",
"soundex_errors = 0\n",
"for index, row in gold.iterrows():\n",
" cnt += 1\n",
" try:\n",
" far_row = fardf[fardf['m_pseudoid'] == row['farid']]\n",
" wraidx = wradf.index[wradf['m_pseudoid'] == row['wraid']].tolist()[0]\n",
" if (far_row['wra row'].tolist()[0] == wraidx) | (far_row['wra row sndx'].tolist()[0] == wraidx):\n",
" matched += 1\n",
" if (far_row['wra row'].tolist()[0] != wraidx): # & (~far_row['wra row'].isna()):\n",
" famid_errors += 1\n",
" if (far_row['wra row sndx'].tolist()[0] != wraidx): # & (~far_row['wra row sndx'].isna()):\n",
" famid_errors += 1\n",
" except:\n",
" pass\n",
"print('matched: %s / %s'%(matched, cnt))\n",
"print('famid errors: %s / %s'%(famid_errors, cnt))\n",
"print('soundex errors: %s / %s'%(soundex_errors, cnt))\n",
"\n",
"gold_unmatched = gold.farid.str.contains('NOT').sum()\n",
"print('gold matched: %s / %s'%(cnt-gold_unmatched, cnt))"
]
},
{
"cell_type": "markdown",
"id": "secondary-savings",
"metadata": {},
"source": [
"## Family ID discrepancies\n",
"\n",
"We can see above that many family IDs were found in one dataset, but not found in the other dataset. If you examine these IDs you can see that formatting differences are common, with hypens and spaces used interchangeably. Also the letter codes are sometimes before or after the numeric portion of the ID.\n",
"\n",
"We are going to need to create consistency for the resolved to work on these families. Then we'll see how many unmatched family IDs are left.\n",
"\n",
"To create consistency we need to capture the letter and the number portions of the IDs and reformat them one consistent way. We'll choose xx-1111 as that target format. We're only going to reformat IDs that have a letter portion, leaving the plain integers alone."
]
},
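{
"cell_type": "markdown",
"id": "famid-normalize-sketch",
"metadata": {},
"source": [
"A minimal sketch of that normalization, using a hypothetical `normalize_famid` helper on made-up IDs (the actual pass over `fardf.famid` earlier in the notebook uses chained `str.replace` calls):\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def normalize_famid(famid: str) -> str:\n",
"    # xx 1111, xx-1111, 1111 xx, 1111-xx -> xx-1111; plain integers pass through\n",
"    m = re.match(r'^([a-z]+)[\\s-]+(\\d+)$|^(\\d+)[\\s-]+([a-z]+)$', famid.strip(), re.I)\n",
"    if m:\n",
"        letters = (m.group(1) or m.group(4)).lower()\n",
"        digits = m.group(2) or m.group(3)\n",
"        return f'{letters}-{digits}'\n",
"    return famid.strip()\n",
"\n",
"[normalize_famid(s) for s in ['CH 1234', '1234-ch', ' 5678 ']]  # hypothetical IDs\n",
"# ['ch-1234', 'ch-1234', '5678']\n",
"```"
]
},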
{
"cell_type": "code",
"execution_count": null,
"id": "defensive-tucson",
"metadata": {},
"outputs": [],
"source": [
"alldf[ alldf['famid'] =='15170'][['fn', 'famid', 'ofn']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "differential-extra",
"metadata": {},
"outputs": [],
"source": [
"alldf[ alldf['fn'] =='harry'][['fn', 'famid', 'ofn', 'sn']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "entertaining-authorization",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}