{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Step - 1 : Legacy of Slavery Certificates of Freedom Collection - Context Based Data Exploration and Cleaning Step\n", "### Applying Computational Thinking and a Historical, Cultural Context Based Approach to an Archival Dataset Collection\n", "* **Student Contributors:** Rajesh GNANASEKARAN, Alexis HILL, Phillip NICHOLAS, Lori PERINE\n", "* **Faculty Mentor:** Richard MARCIANO\n", "* **Community Mentors:** Maya DAVIS, Christopher HALEY (Maryland State Archives), Lyneise WILLIAMS (VERA Collaborative), Mark CONRAD (NARA)\n", "* **Source Available:** https://github.com/cases-umd/Legacy-of-Slavery\n", "* **License:** [Creative Commons - Attribution 4.0 Intl](https://creativecommons.org/licenses/by/4.0/)\n", "* [Lesson Plan for Instructors](./lesson-plan.ipynb)\n", "* **Related Publications:**\n", " * **IEEE Big Data 2020 CAS Workshop:** [Computational Treatments to Recover Erased Heritage: A Legacy of Slavery Case Study (CT-LoS)](https://ai-collaboratory.net/wp-content/uploads/2020/11/Perine.pdf)\n", "* **More Information:**\n", " * **SAA Outlook March/April 2021:** (Coming Soon)\n", "\n", "## Introduction\n", "This module is based on a case study involving [The \"Legacy of Slavery Project\"](http://slavery.msa.maryland.gov/) archival records from the Maryland State Archives. The Legacy of Slavery in Maryland is a major initiative of the Maryland State Archives. The program seeks to preserve and promote the vast universe of experiences that have shaped the lives of Maryland’s African American population. Over the last 18 years, some 420,000 individuals have been identified and data has been assembled into 16 major databases. 
The [DCIC](http://dcic.umd.edu) has now partnered with the Maryland State Archives to help interpret this data and reveal hidden stories.\n", "\n", "As a team of students taking part in a 2-day [Datathon 2019 at Maryland State Archives](https://ai-collaboratory.net/projects/ct-los/student-led-datathon-at-the-maryland-state-archives/), we interacted with the historical dataset collection \"Certificates of Freedom\" from the Maryland State Archives compiled database.\n", "\n", "We organized the data exploration and cleaning around [David Weintrop’s model of computational thinking](https://link.springer.com/content/pdf/10.1007%2Fs10956-015-9581-5.pdf) and worked based on a [questionnaire](TNA_Questionnaire.ipynb) developed by The National Archives, London, UK, to document each step of our process. \n", "\n", "\n", "\n", "### **C**omputational Thinking Practices\n", "* Data Practices\n", " * Collecting Data\n", " * Creating Data\n", " \n", "### **E**thics and Values Considerations\n", " * Historical and Cultural Context Based Exploration and Cleaning\n", " * Understanding the sensitivity of the data\n", "\n", "### **A**rchival Practices\n", " * Digital Records and Access Systems\n", "\n", "### Learning Goals\n", "A step-by-step understanding of using computational thinking practices on a digitally archived Maryland State Archives Legacy of Slavery dataset collection\n", "\n", "## Step 1: Context Based Data Exploration and Cleaning Process\n", "\n", "We followed a case study methodology for this project to achieve the objective of exploring, analyzing, and visualizing the dataset collections downloaded from the Maryland State Archives database. As the dataset collections were available as downloadable csv files, the technical tasks addressed by our group were to identify the right tools to consume the csv files for exploratory analysis, cleaning, and visualization purposes. 
Below are the steps of the data exploration and cleaning process using the Python programming language on the Certificates of Freedom dataset. \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Acquiring or Accessing the Data\n", "The data for this project was originally crawled from the Maryland State Archives **Legacy of Slavery** collections.\n", "The data source is included in this module as a comma-separated values file. The link below will take you to a view of the data file:\n", "* [LoS_CoF.csv](Datasets/LoS_CoF.csv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To process a csv file in Python, one of the first steps is to import a Python library called 'pandas', which helps the program convert the csv file into a dataframe, commonly called a table format. We import the library into the program as below:" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [], "source": [ "# Importing libraries - pandas is used for data science/data analysis and machine learning tasks, and numpy provides support for multi-dimensional arrays\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the pandas library, create a new dataframe named 'df' using the read_csv function as shown below. After creating the dataframe, use the print() function to display the first 10 rows loaded in the dataframe." 
] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " DataID DataItem County Owner_FirstName Owner_LastName \\\n", "0 AR7-46 1 AA Ann Ailsworth \n", "1 AR7-46 2 AA Ann Ailsworth \n", "2 AR7-46 3 AA Ann Ailsworth \n", "3 AR7-46 4 AA William Alexander \n", "4 AR7-46 5 AA Thomas Allen \n", "5 AR7-46 6 AA Thomas Allen \n", "6 AR7-46 7 AA James Alleson \n", "7 AR7-46 8 AA Mary Alwell \n", "8 AR7-46 9 AA Mary Armiger \n", "9 AR7-46 10 AA Mary Atcock \n", "\n", " Witness Date Freed_FirstName Freed_LastName Alias ... \\\n", "0 NaN NaN Keziah Cromwell NaN ... \n", "1 Zachariah Duvall 1811-06-24 Resiah Cromwell NaN ... \n", "2 Jenifer Duvall 1811-06-24 Kesiah Cromwell NaN ... \n", "3 NaN 1815-03-28 Handy McCeomey NaN ... \n", "4 NaN 1837-07-10 Nancy Ennis NaN ... \n", "5 NaN 1837-08-03 Jim Sharpe NaN ... \n", "6 NaN 1826-10-28 Belly NaN NaN ... \n", "7 NaN 1844-11-08 Howard Davis NaN ... \n", "8 NaN 1819-01-27 Abigail NaN NaN ... \n", "9 Jacob Franklin, Jr. 1812-12-30 Ned NaN NaN ... \n", "\n", " Folder Document Page Entry DatasetName \\\n", "0 NaN NaN 42686.0 12.0 FF \n", "1 NaN NaN 24.0 3.0 FF \n", "2 55.0 NaN NaN NaN FF \n", "3 NaN NaN 50.0 2.0 FF \n", "4 NaN NaN 257.0 1.0 FF \n", "5 NaN NaN 257.0 2.0 FF \n", "6 NaN NaN 242.0 1.0 FF \n", "7 NaN NaN 372.0 1.0 FF \n", "8 NaN NaN 126.0 2.0 FF \n", "9 NaN NaN 31.0 3.0 FF \n", "\n", " Notes isWorking isError \\\n", "0 NaN 0 0 \n", "1 NaN 0 0 \n", "2 Freed by will of Mrs. Ann Ailsworth. 0 0 \n", "3 Freed by manumission, dated 27 March 1815. Rai... 0 0 \n", "4 Freed by petition to Anne Arundel County Court... 0 0 \n", "5 Freed by petition to Anne Arundel County Court... 0 0 \n", "6 Freed by manumission, dated 28 Oct 1826. Raise... 0 0 \n", "7 son of Nelly. Freed by manumission, dated 12 A... 0 0 \n", "8 along with Richard G. Stetton. Freed by manumi... 
0 0 \n", "9 NaN 0 0 \n", "\n", " ChangeDate CreateDate \n", "0 39:20.3 39:20.3 \n", "1 39:20.3 39:20.3 \n", "2 39:20.3 39:20.3 \n", "3 39:20.3 39:20.3 \n", "4 39:20.3 39:20.3 \n", "5 39:20.3 39:20.3 \n", "6 39:20.3 39:20.3 \n", "7 39:20.3 39:20.3 \n", "8 39:20.3 39:20.3 \n", "9 39:20.3 39:20.3 \n", "\n", "[10 rows x 28 columns]\n" ] } ], "source": [ "# creating a data frame, which is a table-like data structure that can read csv files, flat files, and other delimited data.\n", "# Converting input data into a data frame is a key starting point with the Python programming language for big data analytics\n", "# Below command reads in the Certificates of Freedom dataset, which should already be loaded in a folder called 'Datasets' as LoS_CoF.csv\n", "df = pd.read_csv(\"Datasets/LoS_CoF.csv\") \n", "# Below command prints the first 10 records after the data is copied from the csv file\n", "print(df.head(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We anticipated errors and misinterpretation of names, numbers, etc., since this database was mostly transcribed by hand from the physical or scanned copies of the Certificates of Freedom. Our approach was to individually explore and clean the data column-by-column, utilizing the text and numerical operation functions of the Python programming language. We looked at the dataset holistically at first, identifying features that allowed us to generate meaningful stories or visualizations. 
Upon confirmation of the features list, we analyzed each feature in detail to document bad data and, where possible, eliminate it, modify data types, or exclude invalid values from the final visualizations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Context Based Data Exploration and Cleaning\n", "As the team members came from diverse technology, history, and archives backgrounds, we could have worked individually or as one group throughout; instead, we adopted a hybrid setup of analyzing individually and reporting the results back to the group for discussion. With respect to the analysis performed on the dataset, decisions were driven either by the data or by historical facts. For instance, to address the 'PriorStatus' column in the CoF dataset, research was conducted to determine the prior status of those who were categorized as a “Descendant of a white female woman”, one of the unique categories shown below (source: Wikipedia - History of slavery in Maryland). This research was beneficial in identifying what group certain observations belonged to." ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [], "source": [ "# df is the data frame variable which stores the entire dataset in a table form. 
The below command converts the column or feature 'PriorStatus' to a Categorical type instead of String for easier manipulation\n", "df[\"PriorStatus\"]=df[\"PriorStatus\"].astype('category')" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{nan, 'slave', 'Free', 'Slave ', 'Unknown; Free Born', 'Free; Slave', 'Free born', 'Slave; Slave', 'Enslaved', 'born free', 'Freeborn', 'Free ', 'Born Free', '?', 'BornFree', 'free born', 'Unknown', 'Descendant of a white female woman', 'Slave', 'Unknown; Slave', 'Free born ', 'John', 'Free Born', 'S;ave', 'Born free'}\n" ] } ], "source": [ "# After conversion, let's print the set of unique categories available for that particular feature in the dataset\n", "print(set(df[\"PriorStatus\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Slave' 'Born Free' nan 'Unknown' 'Free']\n" ] } ], "source": [ "# As can be seen above, there are various spellings of Prior Status that are similar in nature. The value 'nan' in Python means the record has no value.\n", "# The below set of commands defines a component in Python called a function. 
Functions are a block of commands that can be used to perform the same action every time they are called.\n", "# The below function converts the input parameter to the right Prior Status category based on conditional statements.\n", "def fix_prior_status(status):\n", " # guard against missing values: 'nan' is a float, not a string, so pass it through unchanged\n", " if not isinstance(status, str):\n", " return status\n", " # initiate variables to hold the literal values to search for\n", " free = \"free\"\n", " born = \"born\"\n", " # 'ave' matches 'Slave', 'slave', and even the transcription typo 'S;ave'\n", " slave = \"ave\"\n", " descend = \"Descend\"\n", " # conditional statements use the built-in 'find' function to check whether the prior status contains the literal being checked; if so, the status is replaced by the value\n", " # in the 'return' statement\n", " if status.find(born) != -1:\n", " # note that indentation is a key requirement in Python: observe where the return statement starts after the 'if'\n", " return \"Born Free\"\n", " else:\n", " # nested if's are possible in Python to conditionally control the else logic\n", " if status.find(slave) != -1:\n", " return \"Slave\"\n", " else:\n", " if status.find(descend) != -1:\n", " return \"Born Free\"\n", " else:\n", " if status.find(free) != -1:\n", " return \"Free\"\n", " else:\n", " return \"Unknown\"\n", "# The below command starts at the left margin, indicating a new set of commands outside of the function, even if it is in the same cell block as shown here.\n", "# The 'apply' function applies the function defined above to each record's Prior Status field value. 
\n", "df[\"PriorStatus\"] = df[\"PriorStatus\"].apply(fix_prior_status)\n", "# The built-in 'unique' function prints out the distinct values of the transformed prior status column of the data frame\n", "print(df[\"PriorStatus\"].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Through research of the literature, conversations with historians and experts in the field, and discussions with archivists from the Maryland State Archives, the team followed a set of steps in which unique characteristics of a particular feature were identified and shared with the entire group for input before finalizing the results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other examples include identifying issues with columns such as the date the CoF was issued and the county, as explained below:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Issues with Date Issued for the CoF\n", "Through healthy discussion of what-if scenarios, with each of us bringing our expertise into the conversation about this historical data, several insights were gleaned for specific columns that were vital to this project. There were also discussions on how data should be presented, collected, and analyzed without compromising the sensitivity of the people involved, especially since this collection was unique.\n", "\n", "One such column is the date, which indicates when the certificate of freedom was prepared and signed. There were a number of issues with this field in the original dataset, chief among them inconsistent formats: around 600 records had a NULL value, some had only a YYYYMM value, and most were in YYYY-MM-DD or YYYYMMDD format. 
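The format mix described above can be tallied programmatically before any conversion is attempted. The snippet below is a minimal sketch and not part of the original notebook: the sample values and the `classify_date_format` helper are hypothetical stand-ins for the real 'Date' column.

```python
import re

import pandas as pd

# Hypothetical sample standing in for the transcribed 'Date' column,
# covering the formats observed: NULL, YYYY-MM-DD, YYYYMMDD, YYYYMM, and garbage.
dates = pd.Series([None, "1811-06-24", "18430912", "184006", "189390417"])

def classify_date_format(value):
    """Label a raw date string by the transcription pattern it matches."""
    if pd.isna(value):
        return "missing"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return "YYYY-MM-DD"
    if re.fullmatch(r"\d{8}", value):
        return "YYYYMMDD"
    if re.fullmatch(r"\d{6}", value):
        return "YYYYMM"
    return "other"

# Count how many records fall into each format bucket
print(dates.apply(classify_date_format).value_counts())
```

Run against the full column, a tally like this is one way to verify claims such as "around 600 records with a NULL value" before deciding on a cleaning strategy.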
" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 23057\n", "unique 9956\n", "top 1832-05-28\n", "freq 296\n", "Name: Date, dtype: object" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Below command prints out the descriptive details of the column 'Date'\n", "df[\"Date\"].describe()" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "598" ] }, "execution_count": 139, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Below command lists the number of null or NA values in the 'Date' column of the data frame\n", "df[\"Date\"].isna().sum()" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([nan, '1811-06-24', '1815-03-28', ..., '18430912', '18430913',\n", " '18430916'], dtype=object)" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Below command displays an array of unique date values in the 'Date' column\n", "df[\"Date\"].unique()" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|       | DataID | DataItem | County | Owner_FirstName | Owner_LastName | Witness | Date | Freed_FirstName | Freed_LastName | Alias | ... | Folder | Document | Page | Entry | DatasetName | Notes | isWorking | isError | ChangeDate | CreateDate |
|-------|--------|----------|--------|-----------------|----------------|---------|------|-----------------|----------------|-------|-----|--------|----------|------|-------|-------------|-------|-----------|---------|------------|------------|
| 23307 | AR7-46 | 23310 | BA | Geo | Gillingham | NaN | 184006 | Jeremiah W. | Brown | Jerry | ... | NaN | NaN | 224.0 | 5.0 | FF | Freed by manumission, dated 15 June 1824, reco... | 0 | 0 | 37:45.8 | 03:44.1 |
| 23308 | AR7-46 | 23311 | BA | NaN | NaN | NaN | 184006 | Rachael | Brown | NaN | ... | NaN | NaN | 224.0 | 6.0 | FF | NaN | 0 | 0 | 37:45.8 | 05:54.2 |

2 rows × 28 columns
|       | Date | DateFormatted |
|-------|------|---------------|
| 0 | None | NaT |
| 1 | 1811-06-24 | 1811-06-24 |
| 2 | 1811-06-24 | 1811-06-24 |
| 3 | 1815-03-28 | 1815-03-28 |
| 4 | 1837-07-10 | 1837-07-10 |
| ... | ... | ... |
| 23650 | 18430826 | 1843-08-26 |
| 23651 | 18430905 | 1843-09-05 |
| 23652 | 18430912 | 1843-09-12 |
| 23653 | 18430913 | 1843-09-13 |
| 23654 | 18430916 | 1843-09-16 |

23655 rows × 2 columns
|       | DataID | DataItem | County | Owner_FirstName | Owner_LastName | Witness | Date | Freed_FirstName | Freed_LastName | Alias | ... | Document | Page | Entry | DatasetName | Notes | isWorking | isError | ChangeDate | CreateDate | DateFormatted |
|-------|--------|----------|--------|-----------------|----------------|---------|------|-----------------|----------------|-------|-----|----------|------|-------|-------------|-------|-----------|---------|------------|------------|---------------|
| 22633 | AR7-46 | 22635 | BA | NaN | NaN | Oakley Haddoway | 189390417 | Joseph | Caldwell | NaN | ... | NaN | 130.0 | 3.0 | FF | Raised in Talbot County. Thos kell, clerk | 0 | 0 | 37:47.1 | 38:31.2 | NaT |

1 rows × 29 columns
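The DateFormatted column seen in the output above could be derived along the following lines. This is a hedged sketch under the assumption of a two-pass parse (ISO format first, then compact YYYYMMDD), since the notebook's own conversion code is not visible in this excerpt; the sample values are hypothetical stand-ins for the real 'Date' column.

```python
import pandas as pd

# Hypothetical sample of raw 'Date' values covering the observed formats.
raw = pd.Series([None, "1811-06-24", "18430826", "184006", "189390417"], name="Date")

# Pass 1: try the ISO format; values that do not match become NaT.
iso = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
# Pass 2: fall back to the compact YYYYMMDD format for the remaining values.
compact = pd.to_datetime(raw, format="%Y%m%d", errors="coerce")
# Combine the two passes; anything matching neither format stays NaT.
date_formatted = iso.fillna(compact)

print(pd.DataFrame({"Date": raw, "DateFormatted": date_formatted}))
```

Using errors="coerce" means unparseable entries, such as the nine-digit value 189390417 in the record above, quietly become NaT instead of raising an exception, which matches the NaT shown for that record.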