# Step - 2: Legacy of Slavery Certificates of Freedom Collection - Context Based Data Preparation Step
### Applying Computational Thinking and a Historical, Cultural Context-Based Approach to an Archival Dataset Collection
* **Student Contributors:** Rajesh GNANASEKARAN, Alexis HILL, Phillip NICHOLAS, Lori PERINE
* **Faculty Mentor:** Richard MARCIANO
* **Community Mentor:** Maya DAVIS, Christopher HALEY (Maryland State Archives), Lyneise WILLIAMS (VERA Collaborative), Mark CONRAD (NARA)
* **Source Available:** https://github.com/cases-umd/Legacy-of-Slavery
* **License:** [Creative Commons - Attribution 4.0 Intl](https://creativecommons.org/licenses/by/4.0/)
* [Lesson Plan for Instructors](./lesson-plan.ipynb)
* **Related Publications:**
 * **IEEE Big Data 2020 CAS Workshop:** [Computational Treatments to Recover Erased Heritage: A Legacy of Slavery Case Study (CT-LoS)](https://ai-collaboratory.net/wp-content/uploads/2020/11/Perine.pdf)
* **More Information:**
 * **SAA Outlook March/April 2019:** [Turning Data into People in Maryland's Slave Records](https://twitter.com/archivists_org/status/1116132520255479809)

We organized the data preparation step around [David Weintrop’s model of computational thinking](https://link.springer.com/content/pdf/10.1007%2Fs10956-015-9581-5.pdf) and documented it using a [questionnaire](TNA_Questionnaire.ipynb) developed by The National Archives, London, UK.

![CT-STEM taxonomy](taxonomy.png "David W.'s CT Taxonomy")

### **C**omputational Thinking Practices
* Data Practices
 * Creating Data
 * Manipulating Data
* Systems Thinking Practices
 * Thinking in Levels

### **E**thics and Values Considerations
 * Historical and Cultural Context Based Exploration and Cleaning
 * Understanding the sensitivity of the data

### **A**rchival Practices
 * Digital Records and Access Systems

### Learning Goals
A step-by-step understanding of how to apply computational thinking practices to a digitally archived Maryland State Archives Legacy of Slavery dataset collection.

## Context Based data exploration and cleaning
Drawing on the contextual knowledge of our diverse team, we prepared the cleaned data from the previous step as described below. The most notable columns prepared were the height and the age of the enslaved person.

In [1]:
import pandas as pd
import numpy as np

In [26]:
# Import the CSV saved in the previous step (forward slash avoids the invalid "\L" escape)
df = pd.read_csv("Datasets/LoS_Clean_Output.csv")

## Height Feature 
This field is to indicate the height of the individual freed in feet and inches. 

In [27]:
# Display the Height column as it was transcribed
df["Height"]

0          5' 3"
1          5' 3"
2           5'3"
3        5'7.75"
4         4'9.5"
          ...   
23650    5'8.25"
23651       5'9"
23652     5'7.5"
23653       5'7"
23654       5'5"
Name: Height, Length: 23655, dtype: object

In [172]:
# Summarize the Height column: non-null count, number of unique values, and most frequent value
df["Height"].describe()

count     22325
unique      533
top        5'4"
freq       1134
Name: Height, dtype: object

In [182]:
# Convert the Height text into a single numeric value in inches.
# As seen above, heights were transcribed in several formats. The value 'nan' in Python means the record has no value.
# The commands below define functions. A function is a block of commands that performs the same action every time it is called.
# format_height handles values written as feet followed by whole or decimal inches, such as 5'3" or 5'7.75".
import re
from fractions import Fraction

r = re.compile(r"([0-9]+)'([0-9]*\.?[0-9]+|)")

def format_height(el):
    el_new = el.replace(" ", "")
    m = r.match(el_new)
    if m is None:
        # No feet marker (') was found, so the value cannot be parsed
        return float('NaN')
    else:
        return int(m.group(1))*12 + (0 if not m.group(2) else float(m.group(2)))

# Some records were transcribed as mixed fractions rather than decimal values.
# format_height_type2 converts these to inches. It expects feet, whole inches, then a fraction such as 1/2.
def format_height_type2(el):
    el_new = el.replace('"', "")
    el_new = el_new.split('\'')
    el_new = [word for line in el_new for word in line.split()]
    if not el_new:
        return float('NaN')
    else:
        return int(el_new[0])*12 + (float(el_new[1]) + float(Fraction(el_new[2])))

# The command below starts at the left margin, which places it outside the functions even though it is in the same cell.
# 'apply' runs the matching function on the Height value of every record; values containing '/' are treated as mixed fractions.
df["Height_Inches"] = df["Height"].astype(str).apply(lambda x: format_height_type2(x) if x.find('/') != -1 else format_height(x))
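
The two conversion functions above could also be folded into one stricter parser. The following is a minimal sketch, not the notebook's method: `HEIGHT_RE` and `height_to_inches` are hypothetical names, and the pattern assumes a height is written as feet, a single quote, optional whole or decimal inches, an optional fraction, and an optional double quote. Anything that does not fit this shape returns NaN so it can be checked against the scanned certificate.

```python
import math
import re
from fractions import Fraction

# Assumed shape: feet ' [inches] [fraction] ["]  (hypothetical pattern, not from the notebook)
HEIGHT_RE = re.compile(r"""^\s*(\d+)\s*'\s*(\d*\.?\d+)?\s*(\d+/\d+)?\s*"?\s*$""")

def height_to_inches(text):
    """Return the height in inches, or NaN when the text does not fit the assumed shape."""
    m = HEIGHT_RE.match(str(text))
    if m is None:
        return math.nan
    feet, inches, fraction = m.groups()
    total = int(feet) * 12
    if inches:
        total += float(inches)
    if fraction:
        total += float(Fraction(fraction))
    return float(total)

print(height_to_inches("5' 3\""))      # 63.0
print(height_to_inches("5'7.75\""))    # 67.75
print(height_to_inches("5' 4 1/2\""))  # 64.5
print(height_to_inches("illegible"))   # nan
```

Because the whole string must match, a value such as `5'2 1.4"` comes back as NaN here rather than 81.4. That trades a few more manual checks for fewer silently wrong numbers.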

In [176]:
df["Height_Inches"].describe()

count    22320.00000
mean        64.36112
std          4.13348
min         21.50000
25%         62.00000
50%         64.50000
75%         67.00000
max        145.00000
Name: Height_Inches, dtype: float64

In [177]:
df["Height_Inches"].unique()

array([ 63.        ,  67.75      ,  57.5       ,  61.5       ,
        66.5       ,  61.        ,  66.25      ,  66.        ,
                nan,  70.        ,  62.        ,  63.25      ,
        63.75      ,  64.5       ,  65.25      ,  65.75      ,
        67.5       ,  67.        ,  68.        ,  68.5       ,
        60.        ,  69.        ,  73.        ,  63.5       ,
        61.25      ,  64.        ,  65.        ,  71.        ,
        65.5       ,  64.75      ,  59.        ,  62.25      ,
        62.75      ,  69.25      ,  58.        ,  59.25      ,
        70.5       ,  62.5       ,  60.5       ,  64.25      ,
        59.5       ,  61.75      ,  72.        ,  69.75      ,
        59.75      ,  60.75      ,  60.125     ,  61.24      ,
        67.125     ,  66.75      ,  67.25      ,  69.5       ,
        68.75      ,  60.25      ,  61.375     ,  57.        ,
        71.25      ,  56.        ,  58.25      ,  57.75      ,
        68.25      ,  71.5       ,  66.6       ,  64.12

A study by Margo & Steckel (1982) analyzed height versus age using slave manifest data for more than 50,000 enslaved people shipped between 1811 and 1861 to ports such as Baltimore, Richmond and other cities from the Port of Savannah. According to this study, the average height of the tallest enslaved people was around 67 inches. In the same study, a separate set of appraisal records of enslaved people showed a maximum height of around 72 inches. The images below, taken from this study, show the different heights by age.
![Height_by_age](Pics\Height_Enslaved_Manifests.PNG "Height_Enslaved_Manifests")

![Max_Height_by_Weight](Pics\Height_Mississippi_Chart.PNG "Height_Mississippi_Chart")

These charts cast doubt on the values we observed above 80 inches or below 5 inches. Filtering the dataframe for these records, as done below, shows how the height was misrepresented during transcription.

In [191]:
df.loc[(df["Height_Inches"]>80)|(df["Height_Inches"]<5),['DataItem','Height','Height_Inches']]

| | DataItem | Height | Height_Inches |
|---|---|---|---|
| 5032 | 5034 | 4' 44.75" | 92.75 |
| 5625 | 5627 | 5'2 1.4" | 81.4 |
| 6197 | 6199 | 5'85." | 145.0 |
| 15694 | 15696 | 9'.75" | 108.75 |


These values have to be handled manually by checking the scanned documents for the correct values, as discussed below:

One of the entries shown above, with the height recorded as 4 feet 44.75 inches, belongs to the enslaved person Milly Farmer (c477-2, page 200). The scanned document shows that the height was actually recorded as 4 feet 11.75 inches:

![HeightIssue4](Pics\CoF_Data_Preparation_Height_incorrect4.PNG "CoF Height Error4")

Note also that one record was entered with a height of 9'.75", which is clearly an impossible value. We tried to resolve it by manually checking the Certificate of Freedom (CoF) record in the scanned documents, but found no scanned CoF document for this person (CoF ID: 15696). The record's note says that this person was manumitted, but we could not find the documents under the Manumitted records either. Hence, the height for this record was set to NaN.

In [187]:
df.loc[(df["Height_Inches"].isna())&(df["Height"].notna()),["DataItem","Height"]]

| | DataItem | Height |
|---|---|---|
| 7859 | 7861 | 5 5" |
| 11492 | 11494 | illegible |
| 12837 | 12839 | 5" |
| 15175 | 15177 | 5" |
| 16964 | 16966 | 5"1" |


Other data capture issues were corrected by looking at the original scanned CoF, as shown below. Here the height was transcribed as 5 5", which was in fact 5' 5" (5 feet 5 inches).
![HeightIssue2](Pics\CoF_Data_Preparation_Height_incorrect2.PNG "CoF Height Error2")

The result of the code above shows several invalid representations of height, where the transcribers did not follow the convention of a single quote for feet and a double quote for inches. These have to be handled manually as well.

In [195]:
# Manually update the issues identified above with the corrected value in inches.
# We use the DataItem id shown above to select each record; np.nan marks heights that could not be verified.
df.loc[df["DataItem"] == 5034, "Height_Inches"] = 59.75
df.loc[df["DataItem"] == 5627, "Height_Inches"] = 63.40
df.loc[df["DataItem"] == 6199, "Height_Inches"] = np.nan
df.loc[df["DataItem"] == 15696, "Height_Inches"] = np.nan
df.loc[df["DataItem"] == 7861, "Height_Inches"] = 65.00
df.loc[df["DataItem"] == 11494, "Height_Inches"] = np.nan
df.loc[df["DataItem"] == 12839, "Height_Inches"] = 60.00
df.loc[df["DataItem"] == 15177, "Height_Inches"] = 60.00
df.loc[df["DataItem"] == 16966, "Height_Inches"] = 61.00
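
The manual corrections above could also be kept in a single lookup table, which makes them easier to review in one place. This is a minimal sketch under stated assumptions: `sample` is a small hypothetical stand-in for `df`, `corrections` is a hypothetical name, and only three of the corrected values from the cell above are reused for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; DataItem 9999 is an invented row that needs no correction
sample = pd.DataFrame({
    "DataItem": [5034, 6199, 7861, 9999],
    "Height_Inches": [92.75, 145.0, np.nan, 64.0],
})

# DataItem id -> corrected height in inches (np.nan where no value could be verified)
corrections = {5034: 59.75, 6199: np.nan, 7861: 65.0}

# Update only the rows whose DataItem appears in the lookup table
mask = sample["DataItem"].isin(list(corrections))
sample.loc[mask, "Height_Inches"] = sample.loc[mask, "DataItem"].map(corrections)

print(sample)
```

Rows that are not listed in `corrections` keep their original value.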

In [197]:
df.loc[(df["Height_Inches"]>80)|(df["Height_Inches"]<5),['DataItem','Height','Height_Inches']]
df.loc[(df["Height_Inches"].isna())&(df["Height"].notna()),["DataItem","Height","Height_Inches"]]

| | DataItem | Height | Height_Inches |
|---|---|---|---|
| 6197 | 6199 | 5'85." | NaN |
| 11492 | 11494 | illegible | NaN |
| 15694 | 15696 | 9'.75" | NaN |


The height feature is now prepared for further usage.

## Age feature 
The Age field was originally stored as text and was converted to a number. In the original documents, some ages were listed in months, and these were entered as-is after the decimal point. We converted such values into a decimal fraction of a 12-month year. For example, where the original CoF noted the enslaved person as 18 months old, the dataset held 0.18 in the age column, which should be 1.5 years.

In [None]:
# code to pull the above 

In [None]:
# code to convert the age errors
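
The month-to-year conversion described above can be sketched as follows. This is an illustration only, not the code the team ran: `months_decimal_to_years` is a hypothetical helper, and it assumes the age is still available as text and that every digit after the decimal point is a month count (so a genuine decimal age such as 2.5 would be read as 2 years 5 months).

```python
import math

def months_decimal_to_years(age_text):
    """Read 'years.months' text and return the age in years (assumed encoding)."""
    text = str(age_text).strip()
    whole, _, frac = text.partition(".")
    try:
        years = int(whole) if whole else 0
        months = int(frac) if frac else 0
    except ValueError:
        # Non-numeric entries cannot be converted
        return math.nan
    return years + months / 12

print(months_decimal_to_years("0.18"))       # 1.5
print(months_decimal_to_years("35"))         # 35.0
print(months_decimal_to_years("illegible"))  # nan
```

Working from the text rather than a float matters here, because a float would drop trailing zeros and make 0.10 (ten months) indistinguishable from 0.1 (one month).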

For one case listed as 100 years old, the original CoF document is unclear: it shows something like eighty & twenty years, as highlighted below. This is also recorded in the notes section as “Age given as eighty and twenty years. Could potentially be 28 years, not 100.”

![AgeIssue](Pics\CoF_Data_Preparation_Age_incorrect.PNG "CoF Age Error")

In [4]:
# Save the prepared output to a CSV file (forward slash avoids the invalid "\L" escape)
df.to_csv('Datasets/LoS_Prep_Output.csv', index=False)

# Notebooks

This module is organized as a sequential set of Python notebooks that let you interact with the Legacy of Slavery's Certificates of Freedom collection by exploring, cleaning, preparing, visualizing and analyzing it from a historical context perspective.

3. [Certificates Of Freedom: Context Based Data Visualization and Analysis](LoS_CoF_Data_Viz.ipynb)