{
"cells": [
{
"cell_type": "markdown",
"id": "previous-guest",
"metadata": {},
"source": [
"# Entity Resolution in Japanese-American Internee Records\n",
"\n",
"## Dependencies\n",
"\n",
"Dependencies are installed via the Pipfile in the same directory."
]
},
{
"cell_type": "markdown",
"id": "palestinian-outside",
"metadata": {},
"source": [
"## The Resolver\n",
"\n",
"Here we define a function that take two data frame-like objects with our chosen column names and attempts to identify matches between them. It returns a list of matches."
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "indirect-export",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from IPython.display import display, HTML\n",
"# from datetime import datetime, date\n",
"from fuzzywuzzy import fuzz\n",
"\n",
"def resolveFamilyMembers( sourceDF, targetDF, debug=False ): # source, target\n",
" result = pd.DataFrame(None, index=sourceDF.index, columns=['match'])\n",
" threshold = float(.5)\n",
" xdf = pd.DataFrame(float(0), index=targetDF.index, columns=sourceDF.index )\n",
" if debug:\n",
" display(HTML('
grid of possible matches:
'), xdf)\n",
" # Visit every cell in the grid and plug in a similarity value between the FAR (cols) and WRA (index) axis..\n",
" for (id1, row1) in targetDF.iterrows():\n",
" for (id2, row2) in sourceDF.iterrows():\n",
" xdf.at[id1, id2] = compare(row1, row2)\n",
" if debug:\n",
" display(HTML('grid with similarity scores:
'),xdf)\n",
" maxMatches = xdf.idxmax() # now we want to find the best matches for each FAR record\n",
" if debug:\n",
" display(HTML('The first, best scores for each column:
'), maxMatches)\n",
" for id2 in maxMatches.index:\n",
" if xdf.loc[maxMatches[id2], id2] > threshold: # max could be zero, so check that we have acceptable score\n",
" # sourceDF.at[id2, 'pid'] = targetDF.loc[maxMatches[id2]]['pid'] # assign the pid to the FAR row\n",
" result.at[id2, 'match'] = maxMatches[id2]\n",
" #xdf.drop([maxMatches[id2]], inplace=True) # drop the WRA row we used\n",
" xdf.drop(columns=[id2], inplace=True) # drop the FAR column we used\n",
" if debug:\n",
" display(HTML('Possibilities remaining:
'), xdf)\n",
" if xdf.size == 1: # if only one possibility remains, choose it\n",
" if debug:\n",
" print('just one possibility left!')\n",
" id1 = xdf.index[0]\n",
" id2 = xdf.columns[0]\n",
" # sourceDF.at[id2, 'pid'] = targetDF.loc[id1]['pid']\n",
" result.at[id2, 'match'] = id1\n",
" return result\n",
" \n",
"# Scoring is weird when there are multiple values to match..\n",
"# Using fn/ofn, byear+-1 for matching so far..\n",
"def compare(a, b):\n",
" result = float(max( # use whichever score is higher in this list..\n",
" min(fn(a,b), by1(a,b)), # if either fn() or by1() return zero then this result is zero..\n",
" 0.5*(by1(a,b)), # if we only match the byear, then score is just .75\n",
" #### another compare method\n",
" ))\n",
" return result\n",
"\n",
"# returns score of 1 if fn matches or 1 if ofn matches fn\n",
"# now with fuzzywuzzy ratio adjusted to btw 0 and 1\n",
"def fn(a, b):\n",
" try:\n",
" result = float(max(\n",
" fuzz.ratio(str(a['fn']), str(b['fn'])),\n",
" fuzz.ratio(str(a['fn']), str(b['ofn'])), # includes matching with other names\n",
" fuzz.ratio(str(b['fn']), str(a['ofn'])),\n",
" fuzz.ratio(str(b['ofn']), str(a['ofn'])),\n",
" )/100)\n",
" return result\n",
" except TypeError as e:\n",
" import traceback\n",
" print(traceback.format_exc())\n",
" display(a['fn'], a['ofn'], b['fn'], b['ofn'])\n",
" return 0\n",
"\n",
"# returns a score of 1 if byears are within 1 year\n",
"def by1(a, b):\n",
" return ( abs( a['byear'] - b['byear'] ) <= 1)"
]
},
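{
"cell_type": "markdown",
"id": "resolver-usage-note",
"metadata": {},
"source": [
"As a minimal sketch of the calling convention (the real invocation against family 7423 appears later in the notebook), `resolveFamilyMembers()` returns a frame indexed like the source, with a `match` column holding the best target row label. The two tiny frames below are hypothetical:\n",
"\n",
"```python\n",
"src = pd.DataFrame({'fn': ['agnes', 'hama'], 'ofn': ['harumi', None], 'byear': [1920.0, 1900.0]}, index=[1, 2])\n",
"tgt = pd.DataFrame({'fn': ['agnes', 'hama'], 'ofn': [None, None], 'byear': [1920.0, 1900.0]}, index=[10, 11])\n",
"resolveFamilyMembers(src, tgt)\n",
"# -> 'match' column mapping source row 1 -> 10 and row 2 -> 11\n",
"```"
]
},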
{
"cell_type": "markdown",
"id": "noted-masters",
"metadata": {},
"source": [
"## Load the big datasets into Pandas data frames"
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "italic-packaging",
"metadata": {},
"outputs": [],
"source": [
"farfile = './far_cleanup-09-26-2021.xlsx'\n",
"wrafile = './wra_cleanup-09-26-2021.csv'\n",
"na = ['Unknown', 'unknown', 'Unk.', 'UNK', 'unk.', 'Unk', 'unk', ' ', ' ', ' ']\n",
"fardf = pd.read_excel(farfile,dtype=str,na_values=na,keep_default_na=True)\n",
"wradf = pd.read_csv(wrafile,dtype=str,na_values=na,keep_default_na=True)"
]
},
{
"cell_type": "markdown",
"id": "lined-survey",
"metadata": {},
"source": [
"## Create consistent labels for the significant columns across datasets\n",
"\n",
"We are going to do a lot of work with these particular columns, let's use some short, easy labels and defined the data types we want:\n",
"\n",
"* m_pseudoid - str - a string that is unique to each row in the dataset (mostly) and used to verify results against a reference dataset.\n",
"* famid - str - WRA family id number (also used in FAR)\n",
"* fn - str - first name\n",
"* ofn - str - other first name(s)\n",
"* ln - str - last name\n",
"* byear - float - birth year\n",
"\n",
"We are also going to reindex, starting with 1 instead of zero. This makes it easier to create our unique 'id' column. The 'id' column is a globally unique url for the line within the source file."
]
},
{
"cell_type": "code",
"execution_count": 46,
"id": "parallel-rwanda",
"metadata": {},
"outputs": [],
"source": [
"# first step, create uniform column names across dataframes\n",
"wradf.rename(columns = {'NEW-m_lastname':'ln', 'NEW-m_firstname':'fn', 'NEW-m_familyno_normal':'famid', 'NEW-m_birthyear':'byear'}, inplace = True)\n",
"wradf['row'] = wradf.index.map(lambda x: x+1)\n",
"wradf['id'] = 'http://densho.org/data/WRA.csv#' + wradf['row'].map(str)\n",
"wradf['ofn'] = None # not in WRA data\n",
"wradf.set_index('row', inplace=True, verify_integrity=True)\n",
"fardf.rename(columns = {'NEW-last_name_corrected':'ln', 'NEW-first_name_corrected':'fn', 'NEW-other_names':'ofn', 'NEW-family_number':'famid', 'NEW-year_of_birth':'byear'}, inplace = True)\n",
"fardf['row'] = fardf.index.map(lambda x: x+1)\n",
"fardf['id'] = 'http://densho.org/data/FAR.csv#' + fardf['row'].map(str)\n",
"fardf['m_pseudoid'] = fardf['FAR Exit Dataset'] + \":\" + fardf['original_order'] + \":\" + fardf['far_line_id']\n",
"fardf.set_index('row', inplace=True, verify_integrity=True)\n",
"# fardf.dropna(subset=['m_pseudoid'], inplace=True) # drop rows without pseudoid\n",
"# fardf.dropna(subset=['fn'], inplace=True) # drop rows without first name\n",
"# fardf.dropna(subset=['famid'], inplace=True) # no longer dropping empty famid rows b/c trying soundex"
]
},
{
"cell_type": "markdown",
"id": "restricted-portugal",
"metadata": {},
"source": [
"## Gold Standard Reference Dataframe\n",
"\n",
"These matches were performed manually on a subset of FAR records, based on the list of internees that were transferred to North Dakota after the event described as a \"riot\". We are taking the Excel file and rearranging and simplifying that data in order to have a simple list of expected matched. Each line in Excel contains two FAR ids that are expect to match one WRA id.\n",
"\n",
"We will use this data as the \"ground truth\" to measure the accuracy of algorithmic entity resolution. The \"gold\" data frame is only the id columns. The additional columns are in \"golddf\"."
]
},
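{
"cell_type": "markdown",
"id": "gold-repeat-note",
"metadata": {},
"source": [
"The reshaping below leans on a small trick: every row is repeated twice, and `duplicated()` distinguishes the first copy from the second so each copy can take one of the two FAR id columns. A toy version with a hypothetical one-row frame:\n",
"\n",
"```python\n",
"toy = pd.DataFrame({'wraid': ['w1'], 'far_a': ['f1'], 'far_b': ['f2']})\n",
"toy = toy.loc[toy.index.repeat(2), :].reset_index(drop=True)\n",
"dup = toy.duplicated(keep='first')     # [False, True]\n",
"toy.loc[~dup, 'farid'] = toy['far_a']  # first copy takes the first FAR id\n",
"toy.loc[dup, 'farid'] = toy['far_b']   # second copy takes the second\n",
"toy[['wraid', 'farid']]                # w1 -> f1 and w1 -> f2\n",
"```"
]
},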
{
"cell_type": "code",
"execution_count": 47,
"id": "individual-yellow",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wraid | \n",
" farid | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 7-manzanar_ajisaka_1890_tatsuo | \n",
" tulelake1:208:198 | \n",
"
\n",
" \n",
" | 1 | \n",
" 7-manzanar_ajisaka_1890_tatsuo | \n",
" manzanar1:36:35 | \n",
"
\n",
" \n",
" | 2 | \n",
" 6-jerome_arichi_1894_saburo | \n",
" tulelake1:628:602 | \n",
"
\n",
" \n",
" | 3 | \n",
" 6-jerome_arichi_1894_saburo | \n",
" jerome1:293:289 | \n",
"
\n",
" \n",
" | 4 | \n",
" 2-poston_hagio_1889_takemats | \n",
" tulelake1:2715:2575 | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | 199 | \n",
" 1-topaz_yamasaki_1914_satoru | \n",
" topaz1:10959:10587 | \n",
"
\n",
" \n",
" | 200 | \n",
" 10-tulelake_yoshida_1911_riichi | \n",
" tulelake1:29776:28342 | \n",
"
\n",
" \n",
" | 201 | \n",
" 10-tulelake_yoshida_1911_riichi | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
" | 202 | \n",
" 3-gilariver_yoshioka_1918_kiyozo | \n",
" tulelake1:30146:28693 | \n",
"
\n",
" \n",
" | 203 | \n",
" 3-gilariver_yoshioka_1918_kiyozo | \n",
" gilariver1:16851:16379 | \n",
"
\n",
" \n",
"
\n",
"
204 rows × 2 columns
\n",
"
"
],
"text/plain": [
" wraid farid\n",
"0 7-manzanar_ajisaka_1890_tatsuo tulelake1:208:198\n",
"1 7-manzanar_ajisaka_1890_tatsuo manzanar1:36:35\n",
"2 6-jerome_arichi_1894_saburo tulelake1:628:602\n",
"3 6-jerome_arichi_1894_saburo jerome1:293:289\n",
"4 2-poston_hagio_1889_takemats tulelake1:2715:2575\n",
".. ... ...\n",
"199 1-topaz_yamasaki_1914_satoru topaz1:10959:10587\n",
"200 10-tulelake_yoshida_1911_riichi tulelake1:29776:28342\n",
"201 10-tulelake_yoshida_1911_riichi NOT IN FAR\n",
"202 3-gilariver_yoshioka_1918_kiyozo tulelake1:30146:28693\n",
"203 3-gilariver_yoshioka_1918_kiyozo gilariver1:16851:16379\n",
"\n",
"[204 rows x 2 columns]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"matchesfile = './Transfer_NDak_100-male_Jul-3-1945_CLEANED.xlsx'\n",
"matches_src_df = pd.read_excel(matchesfile,dtype=str,na_values=na,keep_default_na=True)\n",
"golddf = matches_src_df.loc[2:101, ['WRA_pseudoid', 'FAR_Exit_Dataset-original_order-far_line_id', 'FAR_Exit_Dataset-original_order-far_line_id.1', 'LastName', 'FirstName']]\n",
"golddf = golddf.merge(wradf, left_on=['WRA_pseudoid'], right_on=['m_pseudoid'], how='left')\n",
"golddf = golddf.loc[golddf.index.repeat(2),:].reset_index(drop=True)\n",
"idx_duplicate = golddf.duplicated(keep=\"first\")\n",
"golddf.loc[idx_duplicate, 'farid'] = golddf['FAR_Exit_Dataset-original_order-far_line_id.1']\n",
"golddf.loc[~idx_duplicate, 'farid'] = golddf['FAR_Exit_Dataset-original_order-far_line_id']\n",
"golddf.rename(columns={'WRA_pseudoid': 'wraid'}, inplace=True)\n",
"gold = golddf.loc[:,['wraid', 'farid']]\n",
"gold"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "intimate-brown",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" wraid | \n",
" farid | \n",
"
\n",
" \n",
" \n",
" \n",
" | 51 | \n",
" 10-tulelake_kiyama_1924_jimmie | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
" | 53 | \n",
" 10-tulelake_kiyama_1926_tommie | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
" | 119 | \n",
" NOT IN WRA | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
" | 177 | \n",
" 10-tulelake_toyota_1919_yoshimi | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
" | 201 | \n",
" 10-tulelake_yoshida_1911_riichi | \n",
" NOT IN FAR | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" wraid farid\n",
"51 10-tulelake_kiyama_1924_jimmie NOT IN FAR\n",
"53 10-tulelake_kiyama_1926_tommie NOT IN FAR\n",
"119 NOT IN WRA NOT IN FAR\n",
"177 10-tulelake_toyota_1919_yoshimi NOT IN FAR\n",
"201 10-tulelake_yoshida_1911_riichi NOT IN FAR"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gold[gold.farid.str.contains('NOT') == True]"
]
},
{
"cell_type": "markdown",
"id": "stone-growth",
"metadata": {},
"source": [
"## Combine our FAR and WRA datasets."
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "naughty-ultimate",
"metadata": {},
"outputs": [],
"source": [
"# alldf = pd.concat([wradf, fardf], axis=0, keys=['wra', 'far'])"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "numerous-universal",
"metadata": {},
"outputs": [],
"source": [
"# alldf.head()"
]
},
{
"cell_type": "markdown",
"id": "incoming-eclipse",
"metadata": {},
"source": [
"## Use case-insensitive strings for names\n",
"\n",
"We want to treat most of our columns as case insensitive strings. `casefold()` is a function that is more consistent in different languages than simply calling `lower()`."
]
},
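{
"cell_type": "markdown",
"id": "casefold-demo",
"metadata": {},
"source": [
"A quick illustration of the difference (not part of the pipeline): German \"ß\" only folds under `casefold()`, so two spellings of the same word can match.\n",
"\n",
"```python\n",
"'ß'.lower()                                  # 'ß'  -- unchanged\n",
"'ß'.casefold()                               # 'ss'\n",
"'STRASSE'.casefold() == 'straße'.casefold()  # True\n",
"```"
]
},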
{
"cell_type": "code",
"execution_count": 51,
"id": "original-faith",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_261915/2963965774.py:14: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" alldf.replace(r'^\\s*$|^none$', value=np.nan, regex=True, inplace=True)\n"
]
}
],
"source": [
"# deal with empty and missing strings => None\n",
"def clean(alldf):\n",
" alldf['famid'] = alldf['famid'].astype(str)\n",
" alldf['fn'] = alldf['fn'].astype(str)\n",
" alldf['ln'] = alldf['ln'].astype(str)\n",
" alldf['ofn'] = alldf['ofn'].astype(str)\n",
"\n",
" # Casefold is like lowercase, but better for consistent case insensitive match in more languages\n",
" alldf['famid'] = alldf['famid'].str.casefold()\n",
" alldf['fn'] = alldf['fn'].str.casefold()\n",
" alldf['ln'] = alldf['ln'].str.casefold()\n",
" alldf['ofn'] = alldf['ofn'].str.casefold()\n",
"\n",
" alldf.replace(r'^\\s*$|^none$', value=np.nan, regex=True, inplace=True)\n",
" alldf.loc[:, 'ofn'] = alldf.loc[:, 'ofn'].where(alldf.ofn.notna(), None)\n",
" alldf.loc[:, 'fn'] = alldf.loc[:, 'fn'].where(alldf.fn.notna(), None)\n",
"\n",
"clean(fardf)\n",
"clean(wradf)"
]
},
{
"cell_type": "markdown",
"id": "skilled-reduction",
"metadata": {},
"source": [
"## Clean the family ids\n",
"\n",
"Some of the family ids in the source data may not make sense. Other family ids may have spaces at the start or the end that need to be trimmed. Sometimes the family id string is just a number. Others are a combination of number and letter code, with varying separator characters, such as hyphens and spaces. Others are two numbers separated by a slash. Perhaps this indicates that someone has a new family in addition to their original family ID.\n",
"\n",
"Let's first use a regex to find all the rows do not match our expectations."
]
},
{
"cell_type": "code",
"execution_count": 52,
"id": "outside-chart",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 40450 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 40451 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 40452 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 41944 | \n",
" 1933-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 41945 | \n",
" 1933-07-01 00:00:00 | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
"
\n",
" \n",
" | 151724 | \n",
" nan | \n",
"
\n",
" \n",
" | 151725 | \n",
" nan | \n",
"
\n",
" \n",
" | 151726 | \n",
" nan | \n",
"
\n",
" \n",
" | 151727 | \n",
" nan | \n",
"
\n",
" \n",
" | 151728 | \n",
" nan | \n",
"
\n",
" \n",
"
\n",
"
10606 rows × 1 columns
\n",
"
"
],
"text/plain": [
" famid\n",
"row \n",
"40450 1932-07-01 00:00:00\n",
"40451 1932-07-01 00:00:00\n",
"40452 1932-07-01 00:00:00\n",
"41944 1933-07-01 00:00:00\n",
"41945 1933-07-01 00:00:00\n",
"... ...\n",
"151724 nan\n",
"151725 nan\n",
"151726 nan\n",
"151727 nan\n",
"151728 nan\n",
"\n",
"[10606 rows x 1 columns]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"famid 10606\n",
"dtype: int64\n"
]
}
],
"source": [
"weird_famids = fardf[ fardf['famid'].str.contains('^\\d+$', na=True) == False ][['famid']]\n",
"display(weird_famids)\n",
"print(weird_famids.count())"
]
},
{
"cell_type": "markdown",
"id": "located-egypt",
"metadata": {},
"source": [
"So we have 11168 weird Fam IDs. \n",
"\n",
"We see all of these letter and number patterns in the data:\n",
"* xx-11-xx\n",
"* xx 11 xx\n",
"* 11-xx-11\n",
"* 11 xx 11\n",
"* xx-11\n",
"* xx 11\n",
"* 11-xx\n",
"* 11 xx\n",
"\n",
"These patterns are not consistent in the WRA or the FAR data. We'll have to make them consistent if we want them to match up. (The letter and number segments above can be longer than two characters.)\n",
"\n",
"Some of these weird values record that the family ID was unknown when the record was created; \"unk\", \"unk.\", or \"unknown\". Others may have been known at the time, but are now \\[illegible\\]. Only one row contains a note rather than a number.\n",
"\n",
"For our purposes today, we are going to count all of the \"unknown\" FIDs as NaN or \"not a number\", which says that effectively there is no useful value available."
]
},
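{
"cell_type": "markdown",
"id": "na-values-demo",
"metadata": {},
"source": [
"The \"unknown\" spellings are already mapped to `NaN` at load time by the `na_values` list we passed to `read_excel`/`read_csv`. A minimal illustration using that same `na` list on an in-memory CSV:\n",
"\n",
"```python\n",
"import io\n",
"sample = pd.read_csv(io.StringIO('famid\\nunk.\\n1234\\n'), dtype=str, na_values=na, keep_default_na=True)\n",
"sample['famid'].isna().tolist()  # [True, False]\n",
"```"
]
},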
{
"cell_type": "code",
"execution_count": 53,
"id": "parliamentary-travel",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"FAR Exit Dataset object\n",
"original_order object\n",
"far_line_id object\n",
"family_number object\n",
"famid object\n",
"last_name_corrected object\n",
"ln object\n",
"last_name_original object\n",
"first_name_corrected object\n",
"first_name_original object\n",
"fn object\n",
"other_names object\n",
"ofn object\n",
"date_of_birth object\n",
"NEW-date_of_birth object\n",
"year_of_birth object\n",
"byear object\n",
"sex object\n",
"NEW_marital_status object\n",
"citizenship object\n",
"alien_registration_no. object\n",
"type_of_original_entry object\n",
"NEW-type_of_original_entry object\n",
"NEW-pre-evacuation_address object\n",
"pre-evacuation_state object\n",
"date_of_original_entry object\n",
"type_of_final_departure object\n",
"NEW-type_of_final_departure object\n",
"date_of_final_departure object\n",
"final_departure_state object\n",
"camp_address_original object\n",
"camp_address_block object\n",
"camp_address_barracks object\n",
"camp_address_room object\n",
"reference object\n",
"notes object\n",
"id object\n",
"m_pseudoid object\n",
"dtype: object"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"foofid = fardf.famid.str.replace(r'^([a-zA-Z]+)[\\s-](\\d+)$', lambda m: m.group(1).lower()+'-'+m.group(2), case=False, regex=True)\n",
"foofid = foofid.str.replace(r'^(\\d+)[\\s-]([a-zA-Z]+)$', lambda m: m.group(2).lower()+'-'+m.group(1), case=False, regex=True)\n",
"foofid = foofid.str.replace(r'^(\\d+)[\\s-]([a-zA-Z]+)[\\s-](\\d+)$', lambda m: m.group(1)+'-'+m.group(2).lower()+'-'+m.group(3), case=False, regex=True)\n",
"foofid = foofid.str.replace(r'^([a-zA-Z]+)[\\s-](\\d+)[\\s-]([a-zA-Z]+)$', lambda m: m.group(1).lower()+'-'+m.group(2)+'-'+m.group(3).lower(), case=False, regex=True)\n",
"\n",
"#foofid[ foofid.str.contains(' ', na=False) ]\n",
"display(fardf.dtypes)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "foster-history",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 40450 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 40451 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 40452 | \n",
" 1932-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 41944 | \n",
" 1933-07-01 00:00:00 | \n",
"
\n",
" \n",
" | 41945 | \n",
" 1933-07-01 00:00:00 | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
"
\n",
" \n",
" | 151724 | \n",
" nan | \n",
"
\n",
" \n",
" | 151725 | \n",
" nan | \n",
"
\n",
" \n",
" | 151726 | \n",
" nan | \n",
"
\n",
" \n",
" | 151727 | \n",
" nan | \n",
"
\n",
" \n",
" | 151728 | \n",
" nan | \n",
"
\n",
" \n",
"
\n",
"
10606 rows × 1 columns
\n",
"
"
],
"text/plain": [
" famid\n",
"row \n",
"40450 1932-07-01 00:00:00\n",
"40451 1932-07-01 00:00:00\n",
"40452 1932-07-01 00:00:00\n",
"41944 1933-07-01 00:00:00\n",
"41945 1933-07-01 00:00:00\n",
"... ...\n",
"151724 nan\n",
"151725 nan\n",
"151726 nan\n",
"151727 nan\n",
"151728 nan\n",
"\n",
"[10606 rows x 1 columns]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fardf[ fardf['famid'].str.contains('^\\d+$', na=True) == False ][['famid']] ## weird Fam IDs should be gone"
]
},
{
"cell_type": "code",
"execution_count": 55,
"id": "uniform-python",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fam ID NaN count: 0 / 151728\n"
]
}
],
"source": [
"print('Fam ID NaN count: %s / %s' % (fardf['famid'].isna().sum(), fardf['famid'].count()))"
]
},
{
"cell_type": "markdown",
"id": "honey-thinking",
"metadata": {},
"source": [
"## Clean birth year data\n",
"\n",
"Some of the birth years are considered to be strings by Pandas. This includes values such as \"17 y\". We can clean this up with a regular expression that extracts the numeric digits. Since some of the birth years are missing and therefore NaN or \"not a number\", we will set the data type of the series to \"float\", since we can do the necessary birth year arithmetic on a float and NaN is considered a float."
]
},
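{
"cell_type": "markdown",
"id": "byear-nan-note",
"metadata": {},
"source": [
"Using floats also keeps `by1()` safe on missing data: any comparison against `NaN` is `False`, so a missing birth year simply scores zero instead of raising.\n",
"\n",
"```python\n",
"abs(np.nan - 1920.0) <= 1  # False: a missing byear never counts as a match\n",
"abs(1921.0 - 1920.0) <= 1  # True\n",
"```"
]
},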
{
"cell_type": "code",
"execution_count": 56,
"id": "particular-needle",
"metadata": {},
"outputs": [],
"source": [
"fardf['byear'] = fardf['byear'].str.replace('(\\d+).*', lambda m: m.group(1), regex=True)\n",
"fardf['byear'] = fardf['byear'].astype(str).astype(float)\n",
"wradf['byear'] = wradf['byear'].str.replace('(\\d+).*', lambda m: m.group(1), regex=True)\n",
"wradf['byear'] = wradf['byear'].astype(str).astype(float)"
]
},
{
"cell_type": "markdown",
"id": "constant-gazette",
"metadata": {},
"source": [
"## Test the `resolveFamilyMembers()` function on a single family\n",
"\n",
"Here was are modifying the FAR dataframe using the matches returned by the function. Since the index of the matches is the same index from the original FAR DF, we can simply assign the match series to a new column and the indices will be aligned for us."
]
},
{
"cell_type": "code",
"execution_count": 62,
"id": "resistant-divorce",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
" ln | \n",
" fn | \n",
" ofn | \n",
" byear | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 27510 | \n",
" 7423 | \n",
" abe | \n",
" agnes | \n",
" harumi | \n",
" 1920.0 | \n",
"
\n",
" \n",
" | 27511 | \n",
" 7423 | \n",
" abe | \n",
" hama | \n",
" nan | \n",
" 1900.0 | \n",
"
\n",
" \n",
" | 27512 | \n",
" 7423 | \n",
" abe | \n",
" janice | \n",
" mikiye | \n",
" 1923.0 | \n",
"
\n",
" \n",
" | 27513 | \n",
" 7423 | \n",
" abe | \n",
" shuji | \n",
" nan | \n",
" 1878.0 | \n",
"
\n",
" \n",
" | 27514 | \n",
" 7423 | \n",
" abe | \n",
" shuzo | \n",
" nan | \n",
" 1874.0 | \n",
"
\n",
" \n",
" | 27515 | \n",
" 7423 | \n",
" abe | \n",
" tami | \n",
" nan | \n",
" 1874.0 | \n",
"
\n",
" \n",
" | 27516 | \n",
" 7423 | \n",
" abe | \n",
" tomee | \n",
" tomoye | \n",
" 1935.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" famid ln fn ofn byear\n",
"row \n",
"27510 7423 abe agnes harumi 1920.0\n",
"27511 7423 abe hama nan 1900.0\n",
"27512 7423 abe janice mikiye 1923.0\n",
"27513 7423 abe shuji nan 1878.0\n",
"27514 7423 abe shuzo nan 1874.0\n",
"27515 7423 abe tami nan 1874.0\n",
"27516 7423 abe tomee tomoye 1935.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"grid of possible matches:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | row | \n",
" 27510 | \n",
" 27511 | \n",
" 27512 | \n",
" 27513 | \n",
" 27514 | \n",
" 27515 | \n",
" 27516 | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 73 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 80 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 113 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 258 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 260 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 276 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 289 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"row 27510 27511 27512 27513 27514 27515 27516\n",
"row \n",
"73 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"80 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"113 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"258 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"260 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"276 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"289 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"grid with similarity scores:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | row | \n",
" 27510 | \n",
" 27511 | \n",
" 27512 | \n",
" 27513 | \n",
" 27514 | \n",
" 27515 | \n",
" 27516 | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 73 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 80 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 113 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 258 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 260 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 276 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" | 289 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"row 27510 27511 27512 27513 27514 27515 27516\n",
"row \n",
"73 0.0 1.0 0.0 0.0 0.0 0.0 0.0\n",
"80 1.0 0.0 0.0 0.0 0.0 0.0 0.0\n",
"113 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n",
"258 0.0 0.0 0.0 1.0 0.0 0.0 0.0\n",
"260 0.0 0.0 0.0 0.0 1.0 1.0 0.0\n",
"276 0.0 0.0 0.0 0.0 1.0 1.0 0.0\n",
"289 0.0 0.0 0.0 0.0 0.0 0.0 1.0"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"The first, best scores for each column:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"row\n",
"27510 80\n",
"27511 73\n",
"27512 113\n",
"27513 258\n",
"27514 260\n",
"27515 260\n",
"27516 289\n",
"dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"Possibilities remaining:
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | row | \n",
"
\n",
" \n",
" | row | \n",
"
\n",
" \n",
" \n",
" \n",
" | 73 | \n",
"
\n",
" \n",
" | 80 | \n",
"
\n",
" \n",
" | 113 | \n",
"
\n",
" \n",
" | 258 | \n",
"
\n",
" \n",
" | 260 | \n",
"
\n",
" \n",
" | 276 | \n",
"
\n",
" \n",
" | 289 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: [73, 80, 113, 258, 260, 276, 289]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" match | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 27510 | \n",
" 80 | \n",
"
\n",
" \n",
" | 27511 | \n",
" 73 | \n",
"
\n",
" \n",
" | 27512 | \n",
" 113 | \n",
"
\n",
" \n",
" | 27513 | \n",
" 258 | \n",
"
\n",
" \n",
" | 27514 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27515 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27516 | \n",
" 289 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" match\n",
"row \n",
"27510 80\n",
"27511 73\n",
"27512 113\n",
"27513 258\n",
"27514 260\n",
"27515 260\n",
"27516 289"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" match | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 27510 | \n",
" 80 | \n",
"
\n",
" \n",
" | 27511 | \n",
" 73 | \n",
"
\n",
" \n",
" | 27512 | \n",
" 113 | \n",
"
\n",
" \n",
" | 27513 | \n",
" 258 | \n",
"
\n",
" \n",
" | 27514 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27515 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27516 | \n",
" 289 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" match\n",
"row \n",
"27510 80\n",
"27511 73\n",
"27512 113\n",
"27513 258\n",
"27514 260\n",
"27515 260\n",
"27516 289"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
" ln | \n",
" fn | \n",
" ofn | \n",
" byear | \n",
" wra row | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 27510 | \n",
" 7423 | \n",
" abe | \n",
" agnes | \n",
" harumi | \n",
" 1920.0 | \n",
" 80 | \n",
"
\n",
" \n",
" | 27511 | \n",
" 7423 | \n",
" abe | \n",
" hama | \n",
" nan | \n",
" 1900.0 | \n",
" 73 | \n",
"
\n",
" \n",
" | 27512 | \n",
" 7423 | \n",
" abe | \n",
" janice | \n",
" mikiye | \n",
" 1923.0 | \n",
" 113 | \n",
"
\n",
" \n",
" | 27513 | \n",
" 7423 | \n",
" abe | \n",
" shuji | \n",
" nan | \n",
" 1878.0 | \n",
" 258 | \n",
"
\n",
" \n",
" | 27514 | \n",
" 7423 | \n",
" abe | \n",
" shuzo | \n",
" nan | \n",
" 1874.0 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27515 | \n",
" 7423 | \n",
" abe | \n",
" tami | \n",
" nan | \n",
" 1874.0 | \n",
" 260 | \n",
"
\n",
" \n",
" | 27516 | \n",
" 7423 | \n",
" abe | \n",
" tomee | \n",
" tomoye | \n",
" 1935.0 | \n",
" 289 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" famid ln fn ofn byear wra row\n",
"row \n",
"27510 7423 abe agnes harumi 1920.0 80\n",
"27511 7423 abe hama nan 1900.0 73\n",
"27512 7423 abe janice mikiye 1923.0 113\n",
"27513 7423 abe shuji nan 1878.0 258\n",
"27514 7423 abe shuzo nan 1874.0 260\n",
"27515 7423 abe tami nan 1874.0 260\n",
"27516 7423 abe tomee tomoye 1935.0 289"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Look at our person matching data problem w/in a family\n",
"fam_far = fardf[fardf['famid']=='7423'][['famid','ln','fn','ofn','byear']]\n",
"fam_wra = wradf[wradf['famid']=='7423'][['famid','ln','fn','ofn','byear']]\n",
"display(fam_far)\n",
"all_matchesdf2 = pd.DataFrame()\n",
"matches = resolveFamilyMembers(fam_far, fam_wra, debug=True)\n",
"display(matches)\n",
"all_matchesdf2 = pd.concat([all_matchesdf2, matches])\n",
"display(all_matchesdf2)\n",
"fardf['wra row'] = all_matchesdf2.match\n",
"fardf[fardf['famid']=='7423'][['famid','ln','fn','ofn','byear', 'wra row']]"
]
},
{
"cell_type": "markdown",
"id": "spread-presentation",
"metadata": {},
"source": [
"## The Whole Enchilada\n",
"\n",
"Let's see if we can perform matching against the entire dataset. What could go wrong?"
]
},
{
"cell_type": "code",
"execution_count": 63,
"id": "supreme-polyester",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: []\n",
"Index: []"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fams = 10\n",
"skip = 5000\n",
"count = 0\n",
"all_matchesdf = pd.DataFrame()\n",
"unmatched_famids = pd.DataFrame(columns=['dataset', 'famid'])\n",
"for famid, frame in fardf.groupby(['famid']): # , 'FAR Exit Dataset']):\n",
"# count = count + 1\n",
"# if count < skip:\n",
"# continue\n",
"# if count >= skip + fams:\n",
"# break\n",
" try:\n",
" wra_fam = wradf[wradf['famid'] == famid]\n",
" if wra_fam.empty:\n",
" unmatched_famids.loc[len(unmatched_famids.index)] = ['far', famid]\n",
" continue\n",
" matches = resolveFamilyMembers(frame, wra_fam, debug=False)\n",
" all_matchesdf = pd.concat([all_matchesdf, matches])\n",
" except KeyError as e:\n",
" display(e)\n",
"unmatched_famids.to_csv('./unmatched_famids.csv')\n",
"#fardf['wra row'] = all_matchesdf.match\n",
"all_matchesdf.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "coastal-fighter",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"match 119953\n",
"dtype: int64"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#\n",
"display(all_matchesdf.count())\n",
"#fardf['wra row'] = all_matchesdf[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "common-problem",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
"
\n",
" \n",
" | dataset | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | far | \n",
" 1490 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" famid\n",
"dataset \n",
"far 1490"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"unmatched_famids.groupby(['dataset']).count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "unavailable-interview",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" famid | \n",
" ln | \n",
" fn | \n",
" ofn | \n",
" byear | \n",
" wra row | \n",
"
\n",
" \n",
" | row | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 7000 | \n",
" 2324 | \n",
" yamada | \n",
" mitsuzo | \n",
" yugetsu | \n",
" 1903.0 | \n",
" 100780 | \n",
"
\n",
" \n",
" | 7001 | \n",
" 2324 | \n",
" yamada | \n",
" mitzi | \n",
" mitsue | \n",
" 1925.0 | \n",
" 100782 | \n",
"
\n",
" \n",
" | 7002 | \n",
" 2324 | \n",
" yamada | \n",
" yuriko | \n",
" nan | \n",
" 1929.0 | \n",
" 101035 | \n",
"
\n",
" \n",
" | 7003 | \n",
" 2325 | \n",
" segawa | \n",
" chiyoko | \n",
" angie | \n",
" 1918.0 | \n",
" 76628 | \n",
"
\n",
" \n",
" | 7004 | \n",
" 2325 | \n",
" segawa | \n",
" chiyoko | \n",
" nan | \n",
" 1918.0 | \n",
" 76628 | \n",
"
\n",
" \n",
" | 7005 | \n",
" 2325 | \n",
" segawa | \n",
" sharleen | \n",
" chizuko | \n",
" 1944.0 | \n",
" NaN | \n",
"
\n",
" \n",
" | 7006 | \n",
" 2325 | \n",
" segawa | \n",
" ted | \n",
" jr. | \n",
" 1941.0 | \n",
" 76647 | \n",
"
\n",
" \n",
" | 7007 | \n",
" 2325 | \n",
" segawa | \n",
" theodore | \n",
" soichi | \n",
" 1941.0 | \n",
" 76647 | \n",
"
\n",
" \n",
" | 7008 | \n",
" 2325 | \n",
" segawa | \n",
" theodore | \n",
" nan | \n",
" 1914.0 | \n",
" 76648 | \n",
"
\n",
" \n",
" | 7009 | \n",
" 2326 | \n",
" osugi | \n",
" harumi | \n",
" nan | \n",
" 1914.0 | \n",
" 70154 | \n",
"
\n",
" \n",
" | 7010 | \n",
" 2326 | \n",
" osugi | \n",
" mikio | \n",
" donald | \n",
" 1934.0 | \n",
" 70161 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" famid ln fn ofn byear wra row\n",
"row \n",
"7000 2324 yamada mitsuzo yugetsu 1903.0 100780\n",
"7001 2324 yamada mitzi mitsue 1925.0 100782\n",
"7002 2324 yamada yuriko nan 1929.0 101035\n",
"7003 2325 segawa chiyoko angie 1918.0 76628\n",
"7004 2325 segawa chiyoko nan 1918.0 76628\n",
"7005 2325 segawa sharleen chizuko 1944.0 NaN\n",
"7006 2325 segawa ted jr. 1941.0 76647\n",
"7007 2325 segawa theodore soichi 1941.0 76647\n",
"7008 2325 segawa theodore nan 1914.0 76648\n",
"7009 2326 osugi harumi nan 1914.0 70154\n",
"7010 2326 osugi mikio donald 1934.0 70161"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fardf.dropna(subset=['famid'], inplace=True)\n",
"fardf.loc[7000:7010, :][['famid','ln','fn','ofn','byear', 'wra row']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "antique-circular",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unmatched with famid: 31775 / 151728\n",
"unmatched with famid percent: 20.94208056522198\n"
]
}
],
"source": [
"print( 'unmatched with famid: ' + str(fardf['wra row'].isna().sum()) + ' / '+ str(fardf['famid'].count()))\n",
"print( 'unmatched with famid percent: ' + str(fardf['wra row'].isna().sum()/fardf['famid'].count()*100))"
]
},
{
"cell_type": "markdown",
"id": "southwest-persian",
"metadata": {},
"source": [
"# Measure Accuracy against Gold Standard Dataset of 204 FAR records\n",
"\n",
"199 or the 204 FAR records were able to be matched up by hand with their WRA records. Now we want to see how many of these same records were matched by the algorithm so far."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "informed-injury",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"matched: 155 / 204\n",
"gold matched: 199 / 204\n"
]
}
],
"source": [
"cnt = 0\n",
"matched = 0\n",
"for index, row in gold.iterrows():\n",
" cnt += 1\n",
" try:\n",
" far_row = fardf[fardf['m_pseudoid'] == row['farid']]\n",
" wra_row = wradf.index[wradf['m_pseudoid'] == row['wraid']].tolist()[0]\n",
" if (wra_row != None) & (far_row['wra row'].tolist()[0] == wra_row):\n",
" matched += 1\n",
" except:\n",
" pass\n",
"print('matched: %s / %s'%(matched, cnt))\n",
"\n",
"gold_unmatched = gold.farid.str.contains('NOT').sum()\n",
"print('gold matched: %s / %s'%(cnt-gold_unmatched, cnt))"
]
},
{
"cell_type": "markdown",
"id": "aboriginal-palestine",
"metadata": {},
"source": [
"# Try resolving the remainder with family name soundex segments"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "korean-passing",
"metadata": {},
"outputs": [],
"source": [
"from libindic.soundex import Soundex\n",
"instance = Soundex()\n",
"fardf['ln_soundex'] = fardf[fardf['ln'].isna() == False]['ln'].apply(instance.soundex)\n",
"wradf['ln_soundex'] = wradf[wradf['ln'].isna() == False]['ln'].apply(instance.soundex)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "extensive-addition",
"metadata": {},
"outputs": [],
"source": [
"sndx_matchesdf = pd.DataFrame()\n",
"unmatched_lnsx = pd.DataFrame(columns=['dataset', 'lnsx'])\n",
"unmatched_far = fardf[fardf['wra row'].isna()]\n",
"for lnsx, frame in unmatched_far.groupby(['ln_soundex']):\n",
" try:\n",
" wra = wradf[wradf['ln_soundex'] == lnsx]\n",
" if wra.empty:\n",
" continue;\n",
" matches = resolveFamilyMembers(frame, wra, debug=False)\n",
" sndx_matchesdf = pd.concat([sndx_matchesdf, matches])\n",
" # display(fardf[fardf['famid']==fam][['famid','pid','fn','ofn','byear', 'wra row']])\n",
" except KeyError as e:\n",
" # display(e)\n",
" unmatched_lnsx.loc[len(unmatched_lnsx.index)] = ['far', lnsx]\n",
" continue\n",
"unmatched_lnsx.to_csv('./unmatched_lnsx.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "female-timing",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"match 11837\n",
"dtype: int64"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sndx_matchesdf.count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "outdoor-machine",
"metadata": {},
"outputs": [],
"source": [
"fardf['wra row sndx'] = sndx_matchesdf.match"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "smart-harbor",
"metadata": {},
"outputs": [],
"source": [
"fardf.to_csv('./MATCHED FAR.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "posted-there",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"FAR Exit Dataset 131790\n",
"original_order 131790\n",
"far_line_id 131130\n",
"family_number 131669\n",
"famid 131790\n",
"last_name_corrected 131789\n",
"ln 131790\n",
"last_name_original 131694\n",
"first_name_corrected 131783\n",
"first_name_original 131783\n",
"fn 131790\n",
"other_names 48898\n",
"ofn 131790\n",
"date_of_birth 131200\n",
"NEW-date_of_birth 131200\n",
"year_of_birth 131217\n",
"byear 131223\n",
"sex 131697\n",
"NEW_marital_status 131683\n",
"citizenship 131707\n",
"alien_registration_no. 49199\n",
"type_of_original_entry 131718\n",
"NEW-type_of_original_entry 131710\n",
"NEW-pre-evacuation_address 129817\n",
"pre-evacuation_state 129503\n",
"date_of_original_entry 131720\n",
"type_of_final_departure 131177\n",
"NEW-type_of_final_departure 131176\n",
"date_of_final_departure 131182\n",
"final_departure_state 117194\n",
"camp_address_original 52351\n",
"camp_address_block 62573\n",
"camp_address_barracks 62137\n",
"camp_address_room 62105\n",
"reference 565\n",
"notes 2893\n",
"id 131790\n",
"m_pseudoid 131130\n",
"wra row 131790\n",
"ln_soundex 131790\n",
"wra row sndx 11837\n",
"dtype: int64"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fardf[~fardf['wra row'].isna() | ~fardf['wra row sndx'].isna()].count()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "artificial-adjustment",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.13140620056944008"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1- 131790 / 151728"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "raised-boutique",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"19938"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"151728 - 131790"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "defensive-frame",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"matched: 168 / 204\n",
"famid errors: 183 / 204\n",
"soundex errors: 0 / 204\n",
"gold matched: 199 / 204\n"
]
}
],
"source": [
"cnt = 0\n",
"matched = 0\n",
"famid_errors = 0\n",
"soundex_errors = 0\n",
"for index, row in gold.iterrows():\n",
" cnt += 1\n",
" try:\n",
" far_row = fardf[fardf['m_pseudoid'] == row['farid']]\n",
" wraidx = wradf.index[wradf['m_pseudoid'] == row['wraid']].tolist()[0]\n",
" if (far_row['wra row'].tolist()[0] == wraidx) | (far_row['wra row sndx'].tolist()[0] == wraidx):\n",
" matched += 1\n",
" if (far_row['wra row'].tolist()[0] != wraidx): # & (~far_row['wra row'].isna()):\n",
" famid_errors += 1\n",
" if (far_row['wra row sndx'].tolist()[0] != wraidx): # & (~far_row['wra row sndx'].isna()):\n",
" famid_errors += 1\n",
" except:\n",
" pass\n",
"print('matched: %s / %s'%(matched, cnt))\n",
"print('famid errors: %s / %s'%(famid_errors, cnt))\n",
"print('soundex errors: %s / %s'%(soundex_errors, cnt))\n",
"\n",
"gold_unmatched = gold.farid.str.contains('NOT').sum()\n",
"print('gold matched: %s / %s'%(cnt-gold_unmatched, cnt))"
]
},
{
"cell_type": "markdown",
"id": "secondary-savings",
"metadata": {},
"source": [
"## Family ID discrepancies\n",
"\n",
"We can see above that many family IDs were found in one dataset, but not found in the other dataset. If you examine these IDs you can see that formatting differences are common, with hypens and spaces used interchangeably. Also the letter codes are sometimes before or after the numeric portion of the ID.\n",
"\n",
"We are going to need to create consistency for the resolved to work on these families. Then we'll see how many unmatched family IDs are left.\n",
"\n",
"To create consistency we need to capture the letter and the number portions of the IDs and reformat them one consistent way. We'll choose xx-1111 as that target format. We're only going to reformat IDs that have a letter portion, leaving the plain integers alone."
]
},
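{
"cell_type": "markdown",
"id": "famid-normalize-sketch",
"metadata": {},
"source": [
"A minimal sketch of that normalization, using a hypothetical `normalize_famid` helper on made-up IDs (the actual pass over `fardf.famid` earlier in the notebook uses chained `str.replace` calls):\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def normalize_famid(famid: str) -> str:\n",
"    # xx 1111, xx-1111, 1111 xx, 1111-xx -> xx-1111; plain integers pass through\n",
"    m = re.match(r'^([a-z]+)[\\s-]+(\\d+)$|^(\\d+)[\\s-]+([a-z]+)$', famid.strip(), re.I)\n",
"    if m:\n",
"        letters = (m.group(1) or m.group(4)).lower()\n",
"        digits = m.group(2) or m.group(3)\n",
"        return f'{letters}-{digits}'\n",
"    return famid.strip()\n",
"\n",
"[normalize_famid(s) for s in ['CH 1234', '1234-ch', ' 5678 ']]  # hypothetical IDs\n",
"# ['ch-1234', 'ch-1234', '5678']\n",
"```"
]
},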
{
"cell_type": "code",
"execution_count": null,
"id": "defensive-tucson",
"metadata": {},
"outputs": [],
"source": [
"alldf[ alldf['famid'] =='15170'][['fn', 'famid', 'ofn']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "differential-extra",
"metadata": {},
"outputs": [],
"source": [
"alldf[ alldf['fn'] =='harry'][['fn', 'famid', 'ofn', 'sn']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "entertaining-authorization",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}