# Appendix: Datathon Questionnaire developed by the TNA (The National Archives, UK)

## Introduction
Appendix 2: Applying the Template for Documentation to the Certificate of Freedom (CoF) Collection

### General Questions
* **What is the overall topic/area being addressed by the group?**
We extracted, analyzed, and cleaned the Certificates of Freedom dataset obtained from the Maryland State Archives’ Legacy of Slavery Database, and created visualizations from it.

* **What are the specific challenges raised by this topic?** 
We anticipated errors and misinterpretations of names, numbers, and other values, since this database was mostly transcribed by hand from the physical or scanned copies of the Certificates of Freedom.

* **Which challenges are you addressing?**
We first looked at the dataset holistically, identifying features that would let us generate meaningful stories or visualizations. Once the feature list was confirmed, we analyzed each feature in detail: documenting bad data and eliminating it where possible, modifying data types, and excluding values found to be invalid from the final visualizations.

### Approach and decision making
* **What is the approach being taken?**
Consider methodological aspects, technical aspects, etc.
We followed a case study methodology for this project, with the objective of exploring, analyzing, and visualizing the dataset collections downloaded from the Maryland State Archives database.

* **In particular, what technical questions/tasks are being addressed in your group?**
As the dataset collections were available as downloadable CSV files, the technical task addressed by our group was to identify the right tools to consume the CSV files for exploratory analysis, cleaning, and visualization. The group decided to use the open-source tool OpenRefine for exploration and cleaning, based on previous success with it, and Tableau for visualization.

* **How was the approach selected? Were there alternatives that were considered but not followed up?**
Our approach was to (1) individually clean the data column by column using the text and numeric facet features in OpenRefine, (2) combine the cleaned columns in Google Sheets, and (3) visualize the cleaned data file in Tableau. Our original plan for step 2 was to save all of our cleaned files in the UMD Box service and assign one person to combine the columns from the multiple cleaned files at once. However, Google Sheets offered a more collaborative space in which all group members could work simultaneously.

### Document each stage of the decision process:
* **What decisions are taken?**
As the dataset collections contained features that could be worked on individually, the features were divided among us, based on the expertise of each team member, to research, explore, clean, and report on the findings from the assigned modules. Later, as a group, we brainstormed from the individual results, provided feedback, and gleaned insights from the findings.

* **What are the options?**
As the team members came from diverse technology, history, and archival backgrounds, we had the options of working individually throughout or working as a group throughout. We decided on a hybrid setup: analyzing alone and reporting the results back to the group for discussion.

* **On what grounds is a particular decision made (evidence, criteria, ...)**
With respect to the analysis performed on the dataset, decisions were driven by the data or by historical facts. For instance, for the Prior Status column in the CoF dataset, research was conducted to determine the prior status of those who were categorized as a “Descendant of a white female woman” (source: Wikipedia, History of slavery in Maryland). This research was beneficial in identifying which group certain observations belong to.


* **What specific steps is your team taking?**
Through literature research, conversations with historians and experts in the field, and discussions with archivists from the Maryland State Archives, the team followed a set of steps in which, for instance, the unique characteristics of a particular feature were identified and shared with the entire group for input before the results were finalized.

### What was your experience of combining methodologies from different disciplines?
* **Did you note any incompatibilities?**
We did not note any incompatibilities between team members, although there were healthy discussions of what-if scenarios, since most of the data were historical and each of us brought our own expertise into the conversations.

* **Did you develop any new methodological pathways?**
No.

* **How did you divide up the work within your group? Was this division related to the disciplinary background?**
Every group member worked individually on cleaning one to three columns. However, some tasks were allocated to group members based on disciplinary background, as outlined below:
  * Rajesh: Height, Date, Age (OpenRefine, Tableau, and Jupyter Notebooks)
  * Jeneva: Complexion (Jupyter Notebooks, OpenRefine, and Tableau)
  * Philip: First and last names of formerly enslaved people and owners; provided historical knowledge and context on prior status and complexion (OpenRefine)
  * Alexis: Prior Status (OpenRefine)

### Obstacles and issues - What obstacles/issues are you encountering?
* **Organizational issues**
We mostly collaborated in person for this project at the Maryland State Archives facility, which strongly supported the activity and gave us access to its data for exploration and analysis. As such, there were no organizational issues with this project.

* **Interdisciplinary issues**
Because members came from different backgrounds, there were discussions on how the data should be presented, collected, and analyzed with sensitivity toward the people involved, especially since this collection was unique.

* **Issues arising from the data**
Date Feature -- This field indicates the date on which the certificate of freedom was prepared and signed. The original dataset had a number of issues with this field.
Different date formats -- Around 600 records had a NULL value, some had only a YYYYMM value, and most were in either YYYY-MM-DD or YYYYMMDD format.
Data entry errors -- For Charles W Jones (c290, page 185), the date was captured as 1840516 instead of 18400516.

In one case there was a clear data entry error: the date was captured as 189390417, but the actual date, identified from the scanned copy of the CoF for Joseph Caldwell (c290, page 130), is 18390417. In the same record, the county is Talbot according to the original ad, but it was entered as Baltimore County in the CoF data; only the Census data captured it correctly as Talbot County.

In other instances, the original CoF itself recorded only the month and year, with no day. One example is Jeremiah Brown (c290, page 224).

An important limitation of Excel is that dates earlier than 01/01/1900 are not calculated or converted correctly. Hence, after proper formatting, we loaded the dates into Tableau and created a calculated field to handle them.
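
Separately from the Tableau calculated field the team used, the date formats described in this section could also be checked with a short Python sketch. It assumes the raw values arrive as strings in YYYY-MM-DD, YYYYMMDD, or YYYYMM form, or as blank/NULL; the function name `parse_cof_date` is hypothetical and is not part of the team's workflow.

```python
from datetime import date


def parse_cof_date(raw):
    """Return (year, month, day), or None if the value is missing or malformed.

    day is None when only year and month are present (YYYYMM).
    Assumption: raw values are strings such as '1839-04-17', '18390417', '183904'.
    """
    text = (raw or "").strip()
    if not text or text.upper() == "NULL":
        return None  # missing value
    digits = text.replace("-", "")
    if not digits.isdecimal() or len(digits) not in (6, 8):
        # e.g. 1840516 (7 digits) or 189390417 (9 digits): needs manual review
        return None
    year, month = int(digits[:4]), int(digits[4:6])
    day = int(digits[6:]) if len(digits) == 8 else None
    try:
        # Python's date type handles years before 1900; use day 1 only to
        # validate the month when the day is not recorded.
        date(year, month, 1 if day is None else day)
    except ValueError:
        return None
    return year, month, day
```

Values that come back as `None` without being blank or NULL, such as the 7-digit and 9-digit entries above, would still need to be compared against the scanned certificate, as the team did.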

Height Feature -- This field indicates the height of the freed individual in feet and inches. The main issues with this field were invalid values and data entry format errors.
We used a delimiter option to split the field into separate feet and inches columns.
Data entry issues -- We found a record with a height of 9 feet, for a person with the first name Abraham, witnessed by James Wetheral. On checking the original CoFs under series c931, we found no CoF for this person, although there were certificates for other people named Abraham. The note on the record says that this person was manumitted, but we could not find the documents under the Manumitted records either.
Another issue was illegible information on the original CoF, for example for Mahala Robinson (c165, page 35). It is also unclear why this whole certificate was struck out.

Other data capture issues were corrected by looking at the original scanned CoF. In one case, the height was noted as 5 5” but was in fact 5’ 5” (5 feet 5 inches).

We also cleaned up invalid characters in the inches values, such as “, :, and ;. Inch values greater than 12 were corrected, and inches reported as fractions were converted to their numeric values.
We also found a case where inches had been reported as feet.

In another entry, for Milly Farmer (c477-2, page 200), the height was recorded as 4 feet 44.75 inches, whereas the original document shows 4 feet 11.75 inches.
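
Separately from the scanned sources, the height clean-up rules described in this section can be sketched in Python. This is an illustration under stated assumptions, not the team's OpenRefine steps: it assumes the raw field holds feet followed by inches, separated by whitespace and possibly decorated with quote marks or stray punctuation. The names `parse_height` and `needs_review`, and the plausibility bounds of 1 to 7 feet, are hypothetical.

```python
import re
from fractions import Fraction

# Assumed plausibility bounds for the feet value; not taken from the source data.
MIN_FEET, MAX_FEET = 1, 7


def parse_height(raw):
    """Split a raw height string into (feet, inches), or return None if unparseable."""
    # Replace quote marks and stray punctuation with spaces,
    # keeping only digits, dots, slashes and spaces.
    cleaned = re.sub(r"[^0-9./ ]", " ", raw)
    tokens = cleaned.split()
    if not tokens:
        return None
    try:
        feet = int(tokens[0])
        # Inches may be written as '4', '11.75' or '4 1/2'; sum all the parts.
        inches = float(sum(Fraction(t) for t in tokens[1:]))
    except (ValueError, ZeroDivisionError):
        return None
    return feet, inches


def needs_review(feet, inches):
    """Flag values that should be checked against the scanned certificate."""
    return not (MIN_FEET <= feet <= MAX_FEET) or not (0 <= inches < 12)
```

With these assumptions, a record of 9 feet or of 4 feet 44.75 inches is flagged for review rather than corrected automatically, since the correct value can only be confirmed from the original document.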

Age Feature -- The Age field was originally stored as text and was converted to a number. Ages given in months on the original document had been entered as is after the decimal point, so we converted them to decimal years on a 12-months-per-year basis. For example, where the original CoF noted the enslaved person as 18 months old, the dataset had 0.18 in the age column, which should actually be 1.5 years.
One person was listed as 100 years old. On checking the original CoF, the age is unclear: the document shows something like “eighty & twenty years”. The notes section also records this: “Age given as eighty and twenty years. Could potentially be 28 years, not 100.”
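
The months-to-years conversion described for the Age field can be illustrated with a minimal Python sketch. It assumes that the digits after the decimal point always record months, as in the 0.18 example, and it works on the original text value so that "0.10" (10 months) is not confused with "0.1" (1 month). The function name `age_in_years` is hypothetical.

```python
def age_in_years(raw):
    """Convert a text age of the form 'years.months' into decimal years.

    Assumption: digits after the decimal point are months, not a decimal
    fraction of a year, so '0.18' means 18 months.
    """
    text = (raw or "").strip()
    if not text:
        return None  # missing value
    years_part, _, months_part = text.partition(".")
    try:
        years = int(years_part) if years_part else 0
        months = int(months_part) if months_part else 0
    except ValueError:
        return None  # not a plain number: needs manual review
    return years + months / 12
```

For example, `age_in_years("0.18")` gives 1.5. Any record where the decimal part was a true fraction of a year would be misread by this rule, so such a conversion should be checked against the original certificates.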

* **Technical or tool issues**
There were issues related to the limitations of the tools used. For example, Microsoft Excel was unable to properly process dates prior to 1/1/1900, which led us to use Tableau for date formatting. OpenRefine also had issues with date formatting, as it could not parse the date into day, month, and year from a character field.

### Working in interdisciplinary teams
* **What disciplines are represented in your working group?**
Archivists, historians, information technology professionals, computer scientists, researchers, students, and professors.

* **What benefits/opportunities arising from interdisciplinarity have you noticed in your group?**
We noticed a number of aha moments during the project: by virtue of working in a multidisciplinary team, we were able to uncover insights unique to the collections that we would not have found otherwise. These led to bonding between team members from diverse groups who would have had no chance to meet and discuss these topics if not for this opportunity. The project also helped team members understand different perspectives on data and historical analysis.

* **What challenges arising from interdisciplinarity have you noticed in your group?**
We had a good number of back-and-forth discussions between team members, especially about the contextual background of the data being analyzed, as these collections were historical.

* **Are there any terminological or other confusions arising from working in interdisciplinary teams?**
For team members who were not historians or archivists, terms like ‘provenance’ were new, and learning them was a good knowledge-sharing experience. For non-IT team members, understanding how data could be sliced and diced to create visualizations that unearthed new insights was a good learning experience.

### Moving forward - Do you see specific possibilities arising from your investigations?
* **Ideas for next steps**
As next steps, we plan to learn more about linking the data collections so that we can create networks of connected data elements that could yield insights not seen or understood before.

* **Things you might want to try but haven’t**
We would like to do more natural language processing analysis on certain features, such as notes and comments, as the transcribers entered valuable information into them. We would also like to do more in-depth research on the reasons and rationale behind the different words used to determine “Prior Status” and “Complexion”.

* **Out-of-the-box**
Please capture any ideas or discussions that arose from your group that do not fit into any of the previous sections.
Questions from the discussion related to the collection:
  * Were the scars used for identification purposes, in terms of determining which slave belonged to which owner?
  * There was a spike in the number of Certificates of Freedom (CoF) from 1831 to 1832, and CoF issuance then ceased around 1860. Is this because slavery was coming to an end?
  * What is the significance of the differences in the prior status column? There are many values, including “Born Free”, “Free Born”, “Slave”, “Enslaved”, and “Descendant of a white female woman”. Are there differences between these statuses?
  * Skin complexion is very subjective, so how should we divide and classify the multiple different skin tones recorded?

Resources to read:
  * A Guide to the History of Slavery in Maryland (MSA), sections:
    * III. Africans to African Americans
    * VI. Slavery and Freedom in the New Nation
