{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Library and Archives Canada, Naturalization Records, 1915-1946\n",
"## Harvest records by country\n",
"\n",
"See the LAC site for more [details of the database](http://www.bac-lac.gc.ca/eng/discover/immigration/citizenship-naturalization-records/naturalized-records-1915-1951/Pages/introduction.aspx) and how it was created.\n",
"\n",
"This notebook helps you create a dataset with records of people from a specific country.\n",
"\n",
"**Problems and limitations:**\n",
"\n",
"* the database returns a **maximum of 2000 results** for any query;\n",
"* I'd thought you might be able to get around this by using wildcard searches in the `surname` field, but despite the example given on the search page, there are no wildcard searches, instead all search terms are treated as substrings, matching anywhere in a field — so a surname search for 'Lee' returns 'Batslee', 'Fleenor' etc;\n",
"* it seems that results are ordered by the `item id`, which appear to be assigned alphabetically by date, but there's no way of being sure about this — so the first 2000 results are *probably* the earliest results;\n",
"* there doesn't seem to be any way of finding out what country names (or variations thereof) are in use, so you need to play around with the web interface first to find out which values work;\n",
"* wives and children of a naturalised man are not assigned a `country` value, so they won't be picked up by a `country` search — see below for a way of possibly overcoming this...\n",
"\n",
"**Results:**\n",
"\n",
"Once harvested you can save the harvested data as a CSV file. The CSV file will contain the following columns:\n",
"\n",
"* `item_id`\n",
"* `surname`\n",
"* `given_names`\n",
"* `country`\n",
"* `relation`\n",
"* `year`\n",
"* `reference`\n",
"* `page`\n",
"* `pdf_id`\n",
"* `pdf_url`\n",
"\n",
"Here's the results for a harvest of 'China':\n",
"\n",
"* Search for 'China' — [lac-naturalisations-China.csv](lac-naturalisations-China.csv)\n",
"* Search for 'China' supplemented with family members — [lac-naturalisations-China-with-families.csv](lac-naturalisations-China-with-families.csv)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting things up"
]
},
{
"cell_type": "code",
"execution_count": 162,
"metadata": {},
"outputs": [],
"source": [
"# Import the bits and pieces that we need\n",
"import re\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"import time\n",
"from tqdm import tqdm_notebook\n",
"import pandas as pd\n",
"from IPython.display import display, HTML, FileLink"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Set some variables\n",
"s = requests.Session()\n",
"SEARCH_URL = 'http://www.bac-lac.gc.ca/eng/discover/immigration/citizenship-naturalization-records/naturalized-records-1915-1951/Pages/list-naturalization-1915-1939.aspx'\n",
"ITEM_URL = 'http://www.bac-lac.gc.ca/eng/discover/immigration/citizenship-naturalization-records/naturalized-records-1915-1951/Pages/item-naturalization-1915-1939.aspx'"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"# Define some functions\n",
"\n",
"def process_page(soup):\n",
" '''\n",
" Extract data from a page of results.\n",
" '''\n",
" results = []\n",
" try:\n",
" for row in soup.find('table', class_='result_table').find('tbody').find_all('tr'):\n",
" cells = row.find_all('td')\n",
" results.append({\n",
" 'item_id': cells[0].get_text(),\n",
" 'surname': cells[1].string.strip(),\n",
" 'given_names': cells[2].string.strip(),\n",
" 'country': cells[3].string.strip()\n",
" })\n",
" except AttributeError:\n",
" pass\n",
" # No results\n",
" return results\n",
"\n",
"def process_row(soup, label):\n",
" '''\n",
" Get value from the row with the given label.\n",
" '''\n",
" try:\n",
" value = soup.find('div', class_='genapp_item_display_label', string=re.compile(label)).find_next_sibling('div').string.strip()\n",
" except AttributeError:\n",
" value = ''\n",
" return value\n",
"\n",
"def process_item(item):\n",
" '''\n",
" Get data from an individual item page.\n",
" '''\n",
" response = s.get(ITEM_URL, params={'IdNumber': item['item_id']})\n",
" soup = BeautifulSoup(response.text, 'lxml')\n",
" for label in ['Year', 'Page', 'Reference', 'Relation']:\n",
" item[label.lower()] = process_row(soup, label)\n",
" pdf_link = soup.find('a', href=re.compile(r'&op=pdf'))\n",
" item['pdf_id'] = pdf_link.string.strip()\n",
" item['pdf_url'] = pdf_link['href']\n",
" return item\n",
"\n",
"def get_total_results(soup):\n",
" '''\n",
" Get the toal number of results for a search.\n",
" '''\n",
" results_info = soup.find('div', class_='search_term_value').string.strip()\n",
" total_results = re.search(r'^\\d+', results_info).group(0)\n",
" return int(total_results)\n",
"\n",
"def harvest_results_by_country(country):\n",
" '''\n",
" Harvest search results for the supplied country.\n",
" Return a maximum of 2000 results.\n",
" '''\n",
" items = []\n",
" params = {\n",
" 'CountryEn': country,\n",
" 'p_ID': 0\n",
" }\n",
" response = s.get(SEARCH_URL, params=params)\n",
" soup = BeautifulSoup(response.text, 'lxml')\n",
" total_results = get_total_results(soup)\n",
" with tqdm_notebook(total=total_results) as pbar:\n",
" for page in range(0, total_results, 15):\n",
" params['p_ID'] = page\n",
" response = s.get(SEARCH_URL, params=params)\n",
" soup = BeautifulSoup(response.text, 'lxml')\n",
" results = process_page(soup)\n",
" for result in tqdm_notebook(results, leave=False):\n",
" items.append(process_item(result))\n",
" time.sleep(0.5)\n",
" time.sleep(0.5)\n",
" pbar.update(len(results))\n",
" return items "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Running the harvest"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ebef5bc194a04830bba1050f4de492e2",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(IntProgress(value=0, max=482), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(IntProgress(value=0, max=15), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(IntProgress(value=0, max=2), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Start the harvest\n",
"# Substitute your own country value here\n",
"country = 'China'\n",
"items = harvest_results_by_country(country)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Viewing the results"
]
},
{
"cell_type": "code",
"execution_count": 155,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(items)"
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" given_names | \n",
" item_id | \n",
" page | \n",
" pdf_id | \n",
" pdf_url | \n",
" reference | \n",
" relation | \n",
" surname | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" China | \n",
" Charlie | \n",
" 2711 | \n",
" 364 | \n",
" P22-23_364 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P22-23_... | \n",
" Canadian Gazette 1922-1923 | \n",
" | \n",
" Fern | \n",
" 1922-1923 | \n",
"
\n",
" \n",
" 1 | \n",
" China | \n",
" Mah Qong | \n",
" 3997 | \n",
" 389 | \n",
" P22-23_389 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P22-23_... | \n",
" Canadian Gazette 1922-1923 | \n",
" | \n",
" Hing | \n",
" 1922-1923 | \n",
"
\n",
" \n",
" 2 | \n",
" China | \n",
" Jim Lee | \n",
" 4910 | \n",
" 406 | \n",
" P22-23_406 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P22-23_... | \n",
" Canadian Gazette 1922-1923 | \n",
" | \n",
" Ko | \n",
" 1922-1923 | \n",
"
\n",
" \n",
" 3 | \n",
" China | \n",
" Frank Ho | \n",
" 5426 | \n",
" 416 | \n",
" P22-23_416 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P22-23_... | \n",
" Canadian Gazette 1922-1923 | \n",
" | \n",
" Lem | \n",
" 1922-1923 | \n",
"
\n",
" \n",
" 4 | \n",
" China | \n",
" Chin Jeng | \n",
" 5560 | \n",
" 419 | \n",
" P22-23_419 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P22-23_... | \n",
" Canadian Gazette 1922-1923 | \n",
" | \n",
" Ling | \n",
" 1922-1923 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country given_names item_id page pdf_id \\\n",
"0 China Charlie 2711 364 P22-23_364 \n",
"1 China Mah Qong 3997 389 P22-23_389 \n",
"2 China Jim Lee 4910 406 P22-23_406 \n",
"3 China Frank Ho 5426 416 P22-23_416 \n",
"4 China Chin Jeng 5560 419 P22-23_419 \n",
"\n",
" pdf_url \\\n",
"0 http://central.bac-lac.gc.ca/.item/?id=P22-23_... \n",
"1 http://central.bac-lac.gc.ca/.item/?id=P22-23_... \n",
"2 http://central.bac-lac.gc.ca/.item/?id=P22-23_... \n",
"3 http://central.bac-lac.gc.ca/.item/?id=P22-23_... \n",
"4 http://central.bac-lac.gc.ca/.item/?id=P22-23_... \n",
"\n",
" reference relation surname year \n",
"0 Canadian Gazette 1922-1923 Fern 1922-1923 \n",
"1 Canadian Gazette 1922-1923 Hing 1922-1923 \n",
"2 Canadian Gazette 1922-1923 Ko 1922-1923 \n",
"3 Canadian Gazette 1922-1923 Lem 1922-1923 \n",
"4 Canadian Gazette 1922-1923 Ling 1922-1923 "
]
},
"execution_count": 156,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 157,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"482"
]
},
"execution_count": 157,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# How many results?\n",
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save as CSV"
]
},
{
"cell_type": "code",
"execution_count": 164,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"lac-naturalisations-China.csv
"
],
"text/plain": [
"/Users/tim/mycode/glam-workbench/lac/notebooks/lac-naturalisations-China.csv"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Reorder columns\n",
"df = df[['item_id', 'surname', 'given_names', 'country', 'relation', 'year', 'reference', 'page', 'pdf_id', 'pdf_url']]\n",
"# Change id to numeric so we can use it to order\n",
"df['item_id'] = pd.to_numeric(df['item_id'])\n",
"df = df.replace('NULL', '')\n",
"df = df.sort_values(by=['item_id'])\n",
"df.to_csv('lac-naturalisations-{}.csv'.format(country), index=False)\n",
"display(FileLink('lac-naturalisations-{}.csv'.format(country)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adding wives and children\n",
"\n",
"As noted above, the wives and children of a naturalised man aren't assigned a `country` value and so will be missing from the harvested data. While there are no explicit links between a naturalised man and his family, by making a couple of assumptions we can attempt to add data for wives and children.\n",
"\n",
"Assumptions:\n",
"\n",
"* wives and children will have the same surname as the naturalised man;\n",
"* wives and children will appear immediately after tha naturalised man in the registers and will therefore be assigned sequential item ids.\n",
"\n",
"These assumptions are based on some random poking around in the PDFs, but I can't be sure they will hold in every case.\n",
"\n",
"Based on these assumptions, the methodology for adding records is:\n",
"\n",
"* loop through all harvested records\n",
"* for each record search for the surname\n",
"* process the search results — as everything's treated as a substring and there's no relevance ranking we have to go through all the results which seems really inefficient, but...\n",
"* if surnames match, ids are sequential, and the country field is empty, then it looks like we have a family member — grab the full details for this record\n",
"\n",
"This is really slow and inefficient because if a naturalised man has no family we end up looking through all (or the first 2000) results for their surname. Also, if there are more than 2000 results for a surname then we might miss family members — there's no way of knowing..."
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [],
"source": [
"def harvest_families(items):\n",
" '''\n",
" Attempts to add family members to the search results for a given country.\n",
" '''\n",
" new_items = items.copy()\n",
" for item in tqdm_notebook(items):\n",
" # Are there more records to processs?\n",
" more = True\n",
" # Have family members been found?\n",
" found = False\n",
" # Look for the id following the man's id\n",
" current_id = int(item['item_id']) + 1\n",
" page = 0\n",
" while more:\n",
" response = s.get(SEARCH_URL, params={'Surname': item['surname'], 'p_ID': page})\n",
" soup = BeautifulSoup(response.text)\n",
" # Check for results\n",
" try:\n",
" rows = soup.find('table', class_='result_table').find('tbody').find_all('tr')\n",
" except AttributeError:\n",
" more = False\n",
" else:\n",
" # Process the rows on a page\n",
" for row in rows:\n",
" cells = row.find_all('td')\n",
" # Check that the record has a sequential id and no country\n",
" if (int(cells[0].get_text()) == current_id) and (cells[1].string.strip() == item['surname']) and (cells[3].string == None):\n",
" new_item = {\n",
" 'item_id': cells[0].get_text(),\n",
" 'surname': cells[1].string.strip(),\n",
" 'given_names': cells[2].string.strip(),\n",
" 'country': ''\n",
" }\n",
" new_item = process_item(new_item)\n",
" new_items.append(new_item)\n",
" current_id += 1\n",
" # We've found a family member\n",
" found = True\n",
" time.sleep(0.5)\n",
" else:\n",
" # If we've already found family members\n",
" if found:\n",
" # It seems there are no more family members, so let's get out of the loops\n",
" more = False\n",
" break\n",
" page += 15\n",
" time.sleep(0.5)\n",
" return new_items"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b2d1848fd288471085b6ea0db1688cdf",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(IntProgress(value=0, max=482), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"new_items = harvest_families(items)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's look at the new dataset"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.DataFrame(new_items)"
]
},
{
"cell_type": "code",
"execution_count": 143,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"626"
]
},
"execution_count": 143,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# How many altogether now?\n",
"len(df2)"
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"144"
]
},
"execution_count": 145,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# How many wives and children are there\n",
"len(df2.loc[df2['country'] == ''])"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" item_id | \n",
" surname | \n",
" given_names | \n",
" country | \n",
" relation | \n",
" year | \n",
" reference | \n",
" page | \n",
" pdf_id | \n",
" pdf_url | \n",
"
\n",
" \n",
" \n",
" \n",
" 482 | \n",
" 6485 | \n",
" Mow | \n",
" Chewkong Wong | \n",
" | \n",
" Minor child | \n",
" 1922-1923 | \n",
" Canadian Gazette 1922-1923 | \n",
" 437 | \n",
" P22-23_437 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P22-23_... | \n",
"
\n",
" \n",
" 483 | \n",
" 7539 | \n",
" Poy | \n",
" Hong Auk | \n",
" | \n",
" Minor child | \n",
" 1922-1923 | \n",
" Canadian Gazette 1922-1923 | \n",
" 457 | \n",
" P22-23_457 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P22-23_... | \n",
"
\n",
" \n",
" 484 | \n",
" 13606 | \n",
" Foo | \n",
" Mah Kwack Hong | \n",
" | \n",
" Minor child | \n",
" 1923-1924 | \n",
" Canadian Gazette 1923-1924 | \n",
" 282 | \n",
" P23-24_282 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P23-24_... | \n",
"
\n",
" \n",
" 485 | \n",
" 13607 | \n",
" Foo | \n",
" Mah Kwack Kee | \n",
" | \n",
" Minor child | \n",
" 1923-1924 | \n",
" Canadian Gazette 1923-1924 | \n",
" 282 | \n",
" P23-24_282 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P23-24_... | \n",
"
\n",
" \n",
" 486 | \n",
" 13608 | \n",
" Foo | \n",
" Mah Kwack Lem | \n",
" | \n",
" Minor child | \n",
" 1923-1924 | \n",
" Canadian Gazette 1923-1924 | \n",
" 282 | \n",
" P23-24_282 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P23-24_... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" item_id surname given_names country relation year \\\n",
"482 6485 Mow Chewkong Wong Minor child 1922-1923 \n",
"483 7539 Poy Hong Auk Minor child 1922-1923 \n",
"484 13606 Foo Mah Kwack Hong Minor child 1923-1924 \n",
"485 13607 Foo Mah Kwack Kee Minor child 1923-1924 \n",
"486 13608 Foo Mah Kwack Lem Minor child 1923-1924 \n",
"\n",
" reference page pdf_id \\\n",
"482 Canadian Gazette 1922-1923 437 P22-23_437 \n",
"483 Canadian Gazette 1922-1923 457 P22-23_457 \n",
"484 Canadian Gazette 1923-1924 282 P23-24_282 \n",
"485 Canadian Gazette 1923-1924 282 P23-24_282 \n",
"486 Canadian Gazette 1923-1924 282 P23-24_282 \n",
"\n",
" pdf_url \n",
"482 http://central.bac-lac.gc.ca/.item/?id=P22-23_... \n",
"483 http://central.bac-lac.gc.ca/.item/?id=P22-23_... \n",
"484 http://central.bac-lac.gc.ca/.item/?id=P23-24_... \n",
"485 http://central.bac-lac.gc.ca/.item/?id=P23-24_... \n",
"486 http://central.bac-lac.gc.ca/.item/?id=P23-24_... "
]
},
"execution_count": 139,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.loc[df2['country'] == ''].head()"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" item_id | \n",
" surname | \n",
" given_names | \n",
" country | \n",
" relation | \n",
" year | \n",
" reference | \n",
" page | \n",
" pdf_id | \n",
" pdf_url | \n",
"
\n",
" \n",
" \n",
" \n",
" 499 | \n",
" 34211 | \n",
" Chin | \n",
" Lee Shee | \n",
" | \n",
" Wife | \n",
" 1925-1926 | \n",
" Canadian Gazette 1925-1926 | \n",
" 336 | \n",
" P25-26_336 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P25-26_... | \n",
"
\n",
" \n",
" 500 | \n",
" 34810 | \n",
" Dai | \n",
" Mar Sea | \n",
" | \n",
" Wife | \n",
" 1925-1926 | \n",
" Canadian Gazette 1925-1926 | \n",
" 347 | \n",
" P25-26_347 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P25-26_... | \n",
"
\n",
" \n",
" 501 | \n",
" 36165 | \n",
" Fong | \n",
" Lee See | \n",
" | \n",
" Wife | \n",
" 1925-1926 | \n",
" Canadian Gazette 1925-1926 | \n",
" 373 | \n",
" P25-26_373 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P25-26_... | \n",
"
\n",
" \n",
" 502 | \n",
" 36168 | \n",
" Food | \n",
" Lem | \n",
" | \n",
" Wife | \n",
" 1925-1926 | \n",
" Canadian Gazette 1925-1926 | \n",
" 373 | \n",
" P25-26_373 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P25-26_... | \n",
"
\n",
" \n",
" 503 | \n",
" 37593 | \n",
" Hing | \n",
" Aarr Pong | \n",
" | \n",
" Wife | \n",
" 1925-1926 | \n",
" Canadian Gazette 1925-1926 | \n",
" 400 | \n",
" P25-26_400 | \n",
" http://central.bac-lac.gc.ca/.item/?id=P25-26_... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" item_id surname given_names country relation year \\\n",
"499 34211 Chin Lee Shee Wife 1925-1926 \n",
"500 34810 Dai Mar Sea Wife 1925-1926 \n",
"501 36165 Fong Lee See Wife 1925-1926 \n",
"502 36168 Food Lem Wife 1925-1926 \n",
"503 37593 Hing Aarr Pong Wife 1925-1926 \n",
"\n",
" reference page pdf_id \\\n",
"499 Canadian Gazette 1925-1926 336 P25-26_336 \n",
"500 Canadian Gazette 1925-1926 347 P25-26_347 \n",
"501 Canadian Gazette 1925-1926 373 P25-26_373 \n",
"502 Canadian Gazette 1925-1926 373 P25-26_373 \n",
"503 Canadian Gazette 1925-1926 400 P25-26_400 \n",
"\n",
" pdf_url \n",
"499 http://central.bac-lac.gc.ca/.item/?id=P25-26_... \n",
"500 http://central.bac-lac.gc.ca/.item/?id=P25-26_... \n",
"501 http://central.bac-lac.gc.ca/.item/?id=P25-26_... \n",
"502 http://central.bac-lac.gc.ca/.item/?id=P25-26_... \n",
"503 http://central.bac-lac.gc.ca/.item/?id=P25-26_... "
]
},
"execution_count": 140,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Just wives\n",
"df2.loc[df2['relation'] == 'Wife'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save to CSV"
]
},
{
"cell_type": "code",
"execution_count": 163,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"lac-naturalisations-China-with-families.csv
"
],
"text/plain": [
"/Users/tim/mycode/glam-workbench/lac/notebooks/lac-naturalisations-China-with-families.csv"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df2 = df2[['item_id', 'surname', 'given_names', 'country', 'relation', 'year', 'reference', 'page', 'pdf_id', 'pdf_url']]\n",
"df2['item_id'] = pd.to_numeric(df2['item_id'])\n",
"df2 = df2.replace('NULL', '')\n",
"df2 = df2.sort_values(by=['item_id'])\n",
"df2.to_csv('lac-naturalisations-{}-with-families.csv'.format(country), index=False)\n",
"display(FileLink('lac-naturalisations-{}-with-families.csv'.format(country)))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}