{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# NSW State Archives Index Explorer\n", "\n", "NSW State Archives provides a lot of rich descriptive data in its [online indexes](https://www.records.nsw.gov.au/archives/collections-and-research/guides-and-indexes/indexes-a-z). But there's so much data it can be hard to understand what's actually in each index. This notebook tries to help by generating an overview of an index, summarising the contents of each field.\n", "\n", "When you select an index from the dropdown list, the Index Explorer loads a CSV file containing [data harvested from the index](harvest-indexes.ipynb). It then looks at each column in the dataset, tries to identify the type of data inside, and attempts to tell you something useful about it.\n", "\n", "Given all the possible variations in recording and formatting data, there will be oddities and errors. But hopefully this will provide you with a useful starting point for further exploration.\n", "\n", "The Index Explorer is a slightly-modified version of the [GLAM CSV Explorer](https://glam-workbench.github.io/csv-explorer/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%capture\n", "import os\n", "import statistics\n", "import warnings\n", "from urllib.error import HTTPError\n", "from urllib.parse import urljoin\n", "\n", "import altair as alt\n", "import ipywidgets as widgets\n", "import pandas as pd\n", "from IPython.display import HTML, display\n", "from pandas.errors import ParserError\n", "from slugify import slugify\n", "from wordcloud import WordCloud\n", "\n", "# alt.renderers.enable('notebook')\n", "# alt.data_transformers.enable('json', urlpath='files')\n", "alt.data_transformers.enable(\"data_server\")\n", "# alt.data_transformers.enable('data_server_proxied', urlpath='.')\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%capture\n", "# Load environment variables if available\n", "%load_ext dotenv\n", "%dotenv" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%javascript\n", "// This is necessary to stop the output area folding up\n", "IPython.OutputArea.prototype._should_scroll = function(lines) {return false}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is where the results go...\n", "results = widgets.Output()\n", "status = widgets.Output()\n", "pd.set_option(\"display.max_columns\", 10)\n", "\n", "\n", "def read_csv(url, header=0, encoding=0):\n", " \"\"\"\n", " Loop through some encoding/parsing options to see if we can get the CSV to open properly.\n", " \"\"\"\n", " encodings = [\"ISO-8859-1\", \"latin-1\"]\n", " headers = [None]\n", " try:\n", " if encoding > 0 and header > 0:\n", " df = pd.read_csv(\n", " url,\n", " sep=None,\n", " engine=\"python\",\n", " na_values=[\"-\", \" \"],\n", " encoding=encodings[encoding - 1],\n", " header=headers[header - 1],\n", " )\n", " elif encoding > 0:\n", " df = pd.read_csv(\n", " url,\n", " sep=None,\n", " engine=\"python\",\n", " na_values=[\"-\", \" \"],\n", " encoding=encodings[encoding - 1],\n", " )\n", " elif header > 0:\n", " df = pd.read_csv(\n", " url,\n", " sep=None,\n", " engine=\"python\",\n", " na_values=[\"-\", \" \"],\n", " header=headers[header - 1],\n", " )\n", " else:\n", " df = pd.read_csv(url, sep=None, engine=\"python\", na_values=[\"-\", \" \"])\n", " except UnicodeDecodeError:\n", " if encoding == len(encodings):\n", " raise\n", " else:\n", " return read_csv(url=url, header=header, encoding=encoding + 1)\n", " except ParserError:\n", " if header == len(headers):\n", " raise\n", " else:\n", " return read_csv(url=url, header=header + 1, encoding=encoding)\n", " else:\n", " return df\n", "\n", "\n", "def analyse_csv(b):\n", " \"\"\"\n", " Try to open the CSV file, and start the analysis.\n", " \"\"\"\n", " results.clear_output()\n", " status.clear_output()\n", " error = \"\"\n", " with results:\n", " index = select_csv.value\n", " title = index[\"title\"]\n", " filename = f\"{index['slug']}.csv\"\n", " url = urljoin(\n", " \"https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/\",\n", " filename,\n", " )\n", " # url = f\"indexes/{filename}\"\n", " html = f\"
{url} ({row[\"file_size\"]})
'\n", " display(HTML(html))\n", " # display(status)\n", " status.append_stdout(\"Downloading data...\")\n", " try:\n", " df = read_csv(url)\n", " except UnicodeDecodeError:\n", " error = \"Unicode error: unable to read the CSV!\"\n", " except ParserError:\n", " error = \"Parser error: unable to read the CSV!\"\n", " except HTTPError:\n", " error = \"File not found!\"\n", " except:\n", " error = \"Unable to read the CSV!\"\n", " status.clear_output()\n", " if error:\n", " display(HTML(f'{error}
'))\n", " else:\n", " rows, cols = df.shape\n", " size = \"This columns contains a small number of words that combine letters and numbers. They're probably collection identifiers. Here's some examples:
This look like it contains categories. Let's look at the {category_count} most common.
\"\n", " )\n", " )\n", " display_categories(df, col)\n", " else:\n", " try:\n", " display(HTML(\"This look like it contains text.
\"))\n", " display_wordcloud(df, col, collocates)\n", " except ValueError:\n", " pass\n", " if unique_ratio < unique_cutoff:\n", " display(\n", " HTML(\n", " f\"Less than {unique_cutoff:.2%} of the values are unique, let's look at the {category_count} most common.
\"\n", " )\n", " )\n", " display_categories(df, col)\n", " has_number = df[col].str.contains(r\"\\b\\d+\\b\", regex=True)\n", " # Check for dates\n", " if has_year.sum() / value_count > cutoff and mixed is False:\n", " html = \"Most of the values in this column include a number that looks like a year. It might be useful to convert them to dates.
\"\n", " df[\"{}_years_extracted\".format(col)] = df[col].str.extract(\n", " r\"\\b(1[7-9]{1}\\d{2}|20[0-1]{1}\\d{1})\\b\"\n", " )\n", " if df[\"{}_years_extracted\".format(col)].nunique(dropna=True) > 1:\n", " df[\"{}_date_converted\".format(col)] = pd.to_datetime(\n", " df[\"{}_years_extracted\".format(col)], format=\"%Y\", utc=True\n", " )\n", " html += \"{:,} of {:,} values in this column were successfully parsed as dates.
\".format(\n", " df[\"{}_date_converted\".format(col)].dropna().size, value_count\n", " )\n", " details = {}\n", " details[\"Earliest date\"] = (\n", " df[\"{}_date_converted\".format(col)].min().strftime(\"%Y-%m-%d\")\n", " )\n", " details[\"Latest date\"] = (\n", " df[\"{}_date_converted\".format(col)].max().strftime(\"%Y-%m-%d\")\n", " )\n", " display(HTML(html))\n", " display_details(details)\n", " display_dates(df, \"{}_date_converted\".format(col))\n", " # Check for numbers\n", " elif has_number.sum() / value_count > cutoff and mixed is False:\n", " html = \"Most of the values in this column include a number. It might be useful to extract the values.
\"\n", " df[\"{}_numbers_extracted\".format(col)] = df[col].str.extract(r\"\\b(\\d+)\\b\")\n", " if df[\"{}_numbers_extracted\".format(col)].nunique(dropna=True) > 2:\n", " df[\"{}_numbers_extracted\".format(col)] = pd.to_numeric(\n", " df[\"{}_numbers_extracted\".format(col)],\n", " errors=\"coerce\",\n", " downcast=\"integer\",\n", " )\n", " details = {}\n", " details[\"Highest value\"] = df[\"{}_numbers_extracted\".format(col)].max()\n", " details[\"Lowest value\"] = (\n", " df[\"{}_numbers_extracted\".format(col)].dropna().min()\n", " )\n", " display(HTML(html))\n", " display_details(details)\n", " display_numbers(df, \"{}_numbers_extracted\".format(col), unique_count)\n", "\n", "\n", "def date_field(df, col, value_count, year_count, details, html):\n", " default_dates = pd.to_datetime(\n", " df[col], infer_datetime_format=True, errors=\"coerce\", utc=True\n", " )\n", " default_dates_count = default_dates.dropna().size\n", " dayfirst_dates = pd.to_datetime(\n", " df[col],\n", " infer_datetime_format=True,\n", " errors=\"coerce\",\n", " dayfirst=True,\n", " yearfirst=True,\n", " utc=True,\n", " )\n", " dayfirst_dates_count = dayfirst_dates.dropna().size\n", " if (default_dates_count / value_count > date_cutoff) and (\n", " default_dates_count >= dayfirst_dates_count\n", " ):\n", " df[\"{}_date_converted\".format(col)] = default_dates\n", " elif (dayfirst_dates_count / value_count > date_cutoff) and (\n", " dayfirst_dates_count >= default_dates_count\n", " ):\n", " df[\"{}_date_converted\".format(col)] = dayfirst_dates\n", " else:\n", " # It's not a known date format, so let's just get the years\n", " df[\"{}_years_extracted\".format(col)] = df[col].str.extract(\n", " r\"\\b(1[7-9]{1}\\d{2}|20[0-1]{1}\\d{1})\\b\"\n", " )\n", " df[\"{}_date_converted\".format(col)] = pd.to_datetime(\n", " df[\"{}_years_extracted\".format(col)], format=\"%Y\", utc=True\n", " )\n", " html += \"This looks like it contains dates.
\"\n", " html += \"{:,} of {:,} values in this column were successfully parsed as dates.
\".format(\n", " df[\"{}_date_converted\".format(col)].dropna().size, value_count\n", " )\n", " details[\"Earliest date\"] = (\n", " df[\"{}_date_converted\".format(col)].min().strftime(\"%Y-%m-%d\")\n", " )\n", " details[\"Latest date\"] = (\n", " df[\"{}_date_converted\".format(col)].max().strftime(\"%Y-%m-%d\")\n", " )\n", " display(HTML(html))\n", " display_details(details)\n", " display_dates(df, \"{}_date_converted\".format(col))\n", "\n", "\n", "def url_field(df, col, details, html):\n", " display_details(details)\n", " html += (\n", " \"It looks like this column contains urls. Here are some examples:
This column only contains one value:
\"\n", " html += \"{}\".format(\n", " df[col].loc[df[col].first_valid_index()]\n", " )\n", " display(HTML(html))\n", "\n", "\n", "def number_field(df, col, value_count, unique_count, unique_ratio, details, html):\n", " has_year = df.loc[(df[col] >= 1700) & (df[col] <= 2019)]\n", " if (has_year.size / value_count) > date_cutoff:\n", " df[\"{}_date_converted\".format(col)] = pd.to_datetime(\n", " df[col], format=\"%Y\", utc=True, errors=\"coerce\"\n", " )\n", " html += \"
This looks like it contains dates.
\"\n", " html += \"{:,} of {:,} values in this column were successfully parsed as dates.
\".format(\n", " df[\"{}_date_converted\".format(col)].dropna().size, value_count\n", " )\n", " details[\"Earliest date\"] = (\n", " df[\"{}_date_converted\".format(col)].dropna().min().strftime(\"%Y-%m-%d\")\n", " )\n", " details[\"Latest date\"] = (\n", " df[\"{}_date_converted\".format(col)].dropna().max().strftime(\"%Y-%m-%d\")\n", " )\n", " display(HTML(html))\n", " display_details(details)\n", " display_dates(df, \"{}_date_converted\".format(col))\n", " else:\n", " details[\"Highest value\"] = df[col].max()\n", " details[\"Lowest value\"] = df[col].dropna().min()\n", " display_details(details)\n", " if unique_ratio > cutoff:\n", " html = \"{:.2%} of the values in this column are unique, so it's probably some sort of identifier.\".format(\n", " unique_ratio\n", " )\n", " display(HTML(html))\n", " if unique_count <= 20:\n", " display_categories(df, col)\n", " else:\n", " display_numbers(df, col, unique_count)\n", " # Check for geocoordinates?\n", "\n", "\n", "def display_details(details):\n", " details_df = pd.DataFrame.from_dict(details, orient=\"index\", columns=[\" \"])\n", " details_df.rename_axis(\"Summary\", axis=\"columns\", inplace=True)\n", " details_df = details_df.style.set_table_styles(\n", " [dict(selector=\"th\", props=[(\"text-align\", \"left\")])]\n", " )\n", " display(details_df)\n", "\n", "\n", "def analyse_columns(df):\n", " enriched_df = df.copy()\n", " # out = widgets.Output()\n", " for index, col in enumerate(enriched_df.columns):\n", " display(\n", " HTML(\n", " '{}
All the values in this column are unique, perhaps it\"\n", " \"s some form of identifier.
\"\n", " )\n", " if unique_count == 1:\n", " unique_field(enriched_df, col, details, html)\n", " # Check it's a string field\n", " elif enriched_df[col].dtype == \"object\":\n", " word_counts = enriched_df[col].dropna().str.split().str.len().fillna(0)\n", " # median_word_count = statistics.median(word_counts)\n", " # Check for the presence of years\n", " # year_count = enriched_df[col].str.count(r'\\b1[7-9]{1}\\d{2}\\b|\\b20[0-1]{1}\\d{1}\\b').sum()\n", " if enriched_df[col].str.startswith(\"http\", na=False).sum() > 1:\n", " url_field(enriched_df, col, details, html)\n", " # elif median_word_count <= 4:\n", " # How many have words that combine letters and numbers?\n", " else:\n", " # How many start with words (and no numbers in the first two words)?\n", " starts_with_words = enriched_df[col].str.contains(\n", " r\"^[a-zA-Z]+$|^(?:\\b[a-zA-Z]{2,}\\b\\W*){2}\", regex=True\n", " )\n", " # How many have patterns that look like years?\n", " has_year = enriched_df[col].str.contains(\n", " r\"\\b1[7-9]{1}\\d{2}|20[0-1]{1}\\d{1}\\b\", regex=True\n", " )\n", " # If most don't start with words...\n", " # This filters out titles or names that might include dates.\n", " if (\n", " value_count - starts_with_words.sum()\n", " ) / value_count > date_cutoff:\n", " # If most contain years...\n", " if (has_year.sum() / value_count) > date_cutoff:\n", " date_field(\n", " enriched_df,\n", " col,\n", " value_count,\n", " has_year.sum(),\n", " details,\n", " html,\n", " )\n", " else:\n", " textplus_field(\n", " enriched_df,\n", " col,\n", " value_count,\n", " unique_count,\n", " unique_ratio,\n", " word_counts,\n", " has_year,\n", " details,\n", " html,\n", " )\n", " else:\n", " textplus_field(\n", " enriched_df,\n", " col,\n", " value_count,\n", " unique_count,\n", " unique_ratio,\n", " word_counts,\n", " has_year,\n", " details,\n", " html,\n", " )\n", " elif enriched_df[col].dtype in [\"int64\", \"float64\"]:\n", " number_field(\n", " enriched_df,\n", " col,\n", " value_count,\n", " unique_count,\n", " unique_ratio,\n", " details,\n", " html,\n", " )\n", " else:\n", " html = \"This column is empty.\"\n", " display(HTML(html))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "csvs = pd.read_csv(\n", " \"https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/indexes.csv\"\n", ")\n", "# csvs = pd.read_csv(\"indexes.csv\")\n", "csvs.sort_values(by=[\"title\"], inplace=True)\n", "options = []\n", "for row in csvs.itertuples():\n", " slug = row.url.strip(\"/\").split(\"/\")[-1]\n", " options.append((row.title, {\"title\": row.title, \"slug\": slug}))\n", "\n", "\n", "def clear_all(b):\n", " select_csv.value = options[0][1]\n", " results.clear_output()\n", "\n", "\n", "select_csv = widgets.Dropdown(\n", " options=options, description=\"\", disabled=False, layout=widgets.Layout(width=\"80%\")\n", ")\n", "\n", "clear_button = widgets.Button(\n", " description=\"Clear\",\n", " disabled=False,\n", " button_style=\"\", # 'success', 'info', 'warning', 'danger' or ''\n", " tooltip=\"Clear current data\",\n", " icon=\"\",\n", ")\n", "\n", "analyse_button = widgets.Button(\n", " description=\"Analyse CSV\",\n", " disabled=False,\n", " button_style=\"primary\", # 'success', 'info', 'warning', 'danger' or ''\n", " tooltip=\"Analyse CSV\",\n", " icon=\"\",\n", ")\n", "\n", "clear_button.on_click(clear_all)\n", "analyse_button.on_click(analyse_csv)\n", "select_note = widgets.HTML(\"Select an index:\")\n", "select_tab = widgets.VBox([select_note, select_csv])\n", "# tab = widgets.Tab(children=[select_tab])\n", "# tab.set_title(0, 'Select CSV')\n", "display(\n", " widgets.VBox([select_tab, widgets.HBox([analyse_button, clear_button]), results])\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# IGNORE -- THIS CELL IS FOR AUTOMATED TESTING ONLY\n", "if os.getenv(\"GW_STATUS\") == \"dev\":\n", " select_csv.value = options[2][1]\n", " analyse_button.click()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "\n", "Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.net/).\n", "\n", "Work on this notebook was supported by the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" }, "voila": { "template": "material" } }, "nbformat": 4, "nbformat_minor": 4 }