{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Context Based Data Manipulation and Analysis Process - Part 1\n",
    "\n",
    "According to (Weintrop et al., 2015) \"Data manipulation includes sorting, filtering, cleaning, normalizing, and joining disparate datasets. There are many strategies that can be employed when analyzing data for use in a scientific or mathematical context, including looking for patterns or anomalies, defining rules to categorize data, and identifying trends and correlations.\" Below are the steps performed for data manipulation and analysis process using Python programming language on the Certificates of Freedom dataset. The modules are split into two parts - Part 1 and Part 2.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Acquiring or Accessing the Data\n",
    "The data for this project was originally crawled from the Maryland State Archives **Legacy of Data** collections. The data source is included in this module as a comma-separated values file. The link below will take you to a view the data file:\n",
    "* [LoS_CoF.csv](Datasets/LoS_CoF.csv)\n",
    "\n",
    "The dataset has 23,655 rows of data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To process a csv file in Python, one of the first steps is to import a Python library called as 'pandas' which would help the program convert the csv file into a dataframe format or commonly called as a table format. We import the library into the program as below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Importing libraries - pandas used for data science/data analysis and machine learning tasks and numpy - which provides support for multi-dimensional arrays\n",
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the pandas library, we created a new dataframe in the name 'df' using read_csv function as shown below: After creating the dataframe, the print() function is used to display the top 10 rows loaded in the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>DataID</th>\n",
       "      <th>DataItem</th>\n",
       "      <th>County</th>\n",
       "      <th>Owner_FirstName</th>\n",
       "      <th>Owner_LastName</th>\n",
       "      <th>Witness</th>\n",
       "      <th>Date</th>\n",
       "      <th>Freed_FirstName</th>\n",
       "      <th>Freed_LastName</th>\n",
       "      <th>Alias</th>\n",
       "      <th>...</th>\n",
       "      <th>Folder</th>\n",
       "      <th>Document</th>\n",
       "      <th>Page</th>\n",
       "      <th>Entry</th>\n",
       "      <th>DatasetName</th>\n",
       "      <th>Notes</th>\n",
       "      <th>isWorking</th>\n",
       "      <th>isError</th>\n",
       "      <th>ChangeDate</th>\n",
       "      <th>CreateDate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>1</td>\n",
       "      <td>AA</td>\n",
       "      <td>Ann</td>\n",
       "      <td>Ailsworth</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Keziah</td>\n",
       "      <td>Cromwell</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>42686.0</td>\n",
       "      <td>12.0</td>\n",
       "      <td>FF</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39:20.3</td>\n",
       "      <td>39:20.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>2</td>\n",
       "      <td>AA</td>\n",
       "      <td>Ann</td>\n",
       "      <td>Ailsworth</td>\n",
       "      <td>Zachariah Duvall</td>\n",
       "      <td>1811-06-24</td>\n",
       "      <td>Resiah</td>\n",
       "      <td>Cromwell</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>24.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>FF</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39:20.3</td>\n",
       "      <td>39:20.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>3</td>\n",
       "      <td>AA</td>\n",
       "      <td>Ann</td>\n",
       "      <td>Ailsworth</td>\n",
       "      <td>Jenifer Duvall</td>\n",
       "      <td>1811-06-24</td>\n",
       "      <td>Kesiah</td>\n",
       "      <td>Cromwell</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>55.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>FF</td>\n",
       "      <td>Freed by will of Mrs. Ann Ailsworth.</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39:20.3</td>\n",
       "      <td>39:20.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>4</td>\n",
       "      <td>AA</td>\n",
       "      <td>William</td>\n",
       "      <td>Alexander</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1815-03-28</td>\n",
       "      <td>Handy</td>\n",
       "      <td>McCeomey</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>50.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>FF</td>\n",
       "      <td>Freed by manumission, dated 27 March 1815. Rai...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39:20.3</td>\n",
       "      <td>39:20.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>5</td>\n",
       "      <td>AA</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Allen</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1837-07-10</td>\n",
       "      <td>Nancy</td>\n",
       "      <td>Ennis</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>257.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>FF</td>\n",
       "      <td>Freed by petition to Anne Arundel County Court...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39:20.3</td>\n",
       "      <td>39:20.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>6</td>\n",
       "      <td>AA</td>\n",
       "      <td>Thomas</td>\n",
       "      <td>Allen</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1837-08-03</td>\n",
       "      <td>Jim</td>\n",
       "      <td>Sharpe</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>257.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>FF</td>\n",
       "      <td>Freed by petition to Anne Arundel County Court...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39:20.3</td>\n",
       "      <td>39:20.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>7</td>\n",
       "      <td>AA</td>\n",
       "      <td>James</td>\n",
       "      <td>Alleson</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1826-10-28</td>\n",
       "      <td>Belly</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>242.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>FF</td>\n",
       "      <td>Freed by manumission, dated 28 Oct 1826. Raise...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39:20.3</td>\n",
       "      <td>39:20.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>8</td>\n",
       "      <td>AA</td>\n",
       "      <td>Mary</td>\n",
       "      <td>Alwell</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1844-11-08</td>\n",
       "      <td>Howard</td>\n",
       "      <td>Davis</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>372.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>FF</td>\n",
       "      <td>son of Nelly. Freed by manumission, dated 12 A...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39:20.3</td>\n",
       "      <td>39:20.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>9</td>\n",
       "      <td>AA</td>\n",
       "      <td>Mary</td>\n",
       "      <td>Armiger</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1819-01-27</td>\n",
       "      <td>Abigail</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>126.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>FF</td>\n",
       "      <td>along with Richard G. Stetton. Freed by manumi...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39:20.3</td>\n",
       "      <td>39:20.3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>10</td>\n",
       "      <td>AA</td>\n",
       "      <td>Mary</td>\n",
       "      <td>Atcock</td>\n",
       "      <td>Jacob Franklin, Jr.</td>\n",
       "      <td>1812-12-30</td>\n",
       "      <td>Ned</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>31.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>FF</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>39:20.3</td>\n",
       "      <td>39:20.3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10 rows × 28 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   DataID  DataItem County Owner_FirstName Owner_LastName  \\\n",
       "0  AR7-46         1     AA             Ann      Ailsworth   \n",
       "1  AR7-46         2     AA             Ann      Ailsworth   \n",
       "2  AR7-46         3     AA             Ann      Ailsworth   \n",
       "3  AR7-46         4     AA         William      Alexander   \n",
       "4  AR7-46         5     AA        Thomas            Allen   \n",
       "5  AR7-46         6     AA          Thomas          Allen   \n",
       "6  AR7-46         7     AA           James        Alleson   \n",
       "7  AR7-46         8     AA            Mary         Alwell   \n",
       "8  AR7-46         9     AA            Mary        Armiger   \n",
       "9  AR7-46        10     AA            Mary         Atcock   \n",
       "\n",
       "               Witness        Date Freed_FirstName Freed_LastName Alias  ...  \\\n",
       "0                  NaN         NaN          Keziah       Cromwell   NaN  ...   \n",
       "1     Zachariah Duvall  1811-06-24          Resiah       Cromwell   NaN  ...   \n",
       "2       Jenifer Duvall  1811-06-24          Kesiah       Cromwell   NaN  ...   \n",
       "3                  NaN  1815-03-28           Handy       McCeomey   NaN  ...   \n",
       "4                  NaN  1837-07-10           Nancy          Ennis   NaN  ...   \n",
       "5                  NaN  1837-08-03             Jim         Sharpe   NaN  ...   \n",
       "6                  NaN  1826-10-28           Belly            NaN   NaN  ...   \n",
       "7                  NaN  1844-11-08          Howard          Davis   NaN  ...   \n",
       "8                  NaN  1819-01-27         Abigail            NaN   NaN  ...   \n",
       "9  Jacob Franklin, Jr.  1812-12-30             Ned            NaN   NaN  ...   \n",
       "\n",
       "  Folder Document     Page Entry DatasetName  \\\n",
       "0    NaN      NaN  42686.0  12.0          FF   \n",
       "1    NaN      NaN     24.0   3.0          FF   \n",
       "2   55.0      NaN      NaN   NaN          FF   \n",
       "3    NaN      NaN     50.0   2.0          FF   \n",
       "4    NaN      NaN    257.0   1.0          FF   \n",
       "5    NaN      NaN    257.0   2.0          FF   \n",
       "6    NaN      NaN    242.0   1.0          FF   \n",
       "7    NaN      NaN    372.0   1.0          FF   \n",
       "8    NaN      NaN    126.0   2.0          FF   \n",
       "9    NaN      NaN     31.0   3.0          FF   \n",
       "\n",
       "                                               Notes isWorking  isError  \\\n",
       "0                                                NaN         0        0   \n",
       "1                                                NaN         0        0   \n",
       "2               Freed by will of Mrs. Ann Ailsworth.         0        0   \n",
       "3  Freed by manumission, dated 27 March 1815. Rai...         0        0   \n",
       "4  Freed by petition to Anne Arundel County Court...         0        0   \n",
       "5  Freed by petition to Anne Arundel County Court...         0        0   \n",
       "6  Freed by manumission, dated 28 Oct 1826. Raise...         0        0   \n",
       "7  son of Nelly. Freed by manumission, dated 12 A...         0        0   \n",
       "8  along with Richard G. Stetton. Freed by manumi...         0        0   \n",
       "9                                                NaN         0        0   \n",
       "\n",
       "   ChangeDate  CreateDate  \n",
       "0     39:20.3     39:20.3  \n",
       "1     39:20.3     39:20.3  \n",
       "2     39:20.3     39:20.3  \n",
       "3     39:20.3     39:20.3  \n",
       "4     39:20.3     39:20.3  \n",
       "5     39:20.3     39:20.3  \n",
       "6     39:20.3     39:20.3  \n",
       "7     39:20.3     39:20.3  \n",
       "8     39:20.3     39:20.3  \n",
       "9     39:20.3     39:20.3  \n",
       "\n",
       "[10 rows x 28 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# creating a data frame which is a table-like data structure that could read csv files, flat files, and other delimited data.\n",
    "# Converting input data into a data frame is a key starting point with Python programming language for big data analytics\n",
    "# Below command reads in the Certificates of Freedom dataset which should already be loaded in a folder called 'Datasets' as LoS_CoF.csv\n",
    "df = pd.read_csv(\"Datasets/LoS_CoF.csv\") \n",
    "# Below command prints the first 10 records after the data is copied from the csv file\n",
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of these features, the below ones were chosen to be cleaned and manipulated for use in the following steps for simplicity purposes. \n",
    "1. Date -- This indicates the date of issue of Certificate of Freedom\n",
    "2. Prior Status -- Prior status of the Enslaved person before issue of the CoF document\n",
    "3. Height -- Height of the Enslaved person\n",
    "4. Age -- Age at the time of document issue"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We anticipated errors and misinterpretation of names, numbers, etc. since this database was mostly transcribed manually by hand from the physical or scanned copies of the Certificates of Freedom. Our approach was to individually explore and clean the aforementioned columns utilizing the text and numerical operation functions in Python programming language for this purpose mostly. We looked at the dataset holistically at first, identifying features that allowed us to generate meaningful stories or visualizations. Upon confirmation of the features list, we analyzed each of them in detail to document bad data and eliminate them if possible, modify data types, exclude them from the final visualizations if found to be invalid, etc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This project involved team members from a diverse group of technology, historical, and archivist background. There were opportunities to work individually or to work in groups, but we decided to do a hybrid setup of analyzing alone and reporting the results back to the group for discussion. With respect to the analysis performed on the dataset, decisions were data-driven or historical facts driven."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Through researching the literature, conversations with historians and experts in the field, discussions with archivists from the Maryland State Archives, the team members followed a set of steps where certain unique characteristics of a particular feature for instance were identified and shared with the entire group for their inputs before finalizing the results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Date Feature (Issue of CoF)\n",
    "Through healthy discussions on what-if scenarios as most of the data were historical and we were bringing each of our expertise into the conversations, several insights were gleaned for specific columns which were vital to this Project. Also there were discussions on how data should be presented, collected, and analyzed without impacting the sensitivity of the people involved, especially since this set of collection was unique.\n",
    "\n",
    " One of them is the date, there were different formats of date captured in the transcribed collection. This field is to indicate the date when the certificate of freedom was prepared and signed. There were a number of issues with this date field in the original dataset. Different date formats -- There were around 600 records with NULL value, a bunch of them with just YYYYMM format, most of them in the format YYYY-MM-DD and YYYYMMDD format. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count          23057\n",
       "unique          9956\n",
       "top       1832-05-28\n",
       "freq             296\n",
       "Name: Date, dtype: object"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Below command prints out the descriptive details of the column 'Date'\n",
    "df[\"Date\"].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "598"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Below command list the number of null or na values in the 'Date' column of the data frame\n",
    "df[\"Date\"].isna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([nan, '1811-06-24', '1815-03-28', ..., '18430912', '18430913',\n",
       "       '18430916'], dtype=object)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Below command displays an array of unique date values in the 'Date' column\n",
    "df[\"Date\"].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As could be seen above, there are different formats for the date column, some with missing month etc, some of these were manually verified for accuracy by checking the scanned documents from the MSA database as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['None', '1811-06-24', '1815-03-28', ..., '18430912', '18430913',\n",
       "       '18430916'], dtype=object)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Below command replaces all Null or nan values to the literal 'None' for ease of manipulation later in the process\n",
    "df[\"Date\"]=df[\"Date\"].fillna('None')\n",
    "df[\"Date\"].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Below is a key function that identifies the bad date records and places a 'NaT' (Not a good date) value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Below command creates a new column 'DateFormatted' on-the-fly (one of the cool things I like about python) and is copied with the results from the 'Date' column using a \n",
    "# transformation function called 'to_datetime()' by passing in the parameter 'error=coerce' which converts all erroneous date values into a string called 'NaT'\n",
    "df['DateFormatted'] = pd.to_datetime(df[\"Date\"], errors=\"coerce\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([                          'NaT', '1811-06-24T00:00:00.000000000',\n",
       "       '1815-03-28T00:00:00.000000000', ...,\n",
       "       '1843-08-22T00:00:00.000000000', '1843-08-24T00:00:00.000000000',\n",
       "       '1843-09-12T00:00:00.000000000'], dtype='datetime64[ns]')"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Below command prints the unique converted date values from the newly created column and also displays 'NaT' for errorneous date values.\n",
    "df[\"DateFormatted\"].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>DateFormatted</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>None</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1811-06-24</td>\n",
       "      <td>1811-06-24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1811-06-24</td>\n",
       "      <td>1811-06-24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1815-03-28</td>\n",
       "      <td>1815-03-28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1837-07-10</td>\n",
       "      <td>1837-07-10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23650</th>\n",
       "      <td>18430826</td>\n",
       "      <td>1843-08-26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23651</th>\n",
       "      <td>18430905</td>\n",
       "      <td>1843-09-05</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23652</th>\n",
       "      <td>18430912</td>\n",
       "      <td>1843-09-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23653</th>\n",
       "      <td>18430913</td>\n",
       "      <td>1843-09-13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23654</th>\n",
       "      <td>18430916</td>\n",
       "      <td>1843-09-16</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>23655 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "             Date DateFormatted\n",
       "0            None           NaT\n",
       "1      1811-06-24    1811-06-24\n",
       "2      1811-06-24    1811-06-24\n",
       "3      1815-03-28    1815-03-28\n",
       "4      1837-07-10    1837-07-10\n",
       "...           ...           ...\n",
       "23650    18430826    1843-08-26\n",
       "23651    18430905    1843-09-05\n",
       "23652    18430912    1843-09-12\n",
       "23653    18430913    1843-09-13\n",
       "23654    18430916    1843-09-16\n",
       "\n",
       "[23655 rows x 2 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Below command prints a sample of the output for the new columns 'Date' and 'DateFormatted' side-by-side to show how the original field values were transformed to a proper date\n",
    "# format and the bad values are given a 'NaT'\n",
    "df[['Date','DateFormatted']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'1845-19-05', '1856-01-00', '1836-09-31', '1821-00-00', '1855-01-00', '1829-04-31', '1819-04-00', '1819-12-00', '184004', '1819-00-00', '184006', '1823-00-00', 'Kamer', '1822-00-00', '1854-00-00', '1831-04-31', '1816-19-06', '1853-00-00', '1858-09-31', '1816-17-23', '1840516', '1859-00-00', '1815-16-20', '189390417', '1802-30-30', '1830-04-31', '1846-05-00', '1819-07-00', '1842-00-00', '1819-03-00', '1855-00-00', 'None', '1858-12-00', '1819-15-04', '1856-02-30', '18404024', '1847-00-00', '1858-08-00', '1854-08-00', '1860-09-00', '1807-07-33', '1819-15-10'}\n",
      "Number of Bad date records 657\n",
      "Number of unique items in the Bad date 42\n"
     ]
    }
   ],
   "source": [
    "x = 0\n",
    "bad_date=[]\n",
    "# Below function is a loop function which processes each value of the new column 'DateFormatted' to check for invalid value marker 'NaT' and if found, it picks up the original \n",
    "# value from the 'Date' column and appends to a list. Once all the records are checked, it prints the unique values of this list using the 'set' function and the total number of \n",
    "# bad ones\n",
    "for i in range(len(df['DateFormatted'])):\n",
    "    if pd.isna(df['DateFormatted'][i]):\n",
    "        bad_date.append(df['Date'][i])\n",
    "        x += 1\n",
    "print(set(bad_date))\n",
    "print(\"Number of Bad date records\", x)\n",
    "print(\"Number of unique items in the Bad date\", len(set(bad_date)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>DataID</th>\n",
       "      <th>DataItem</th>\n",
       "      <th>County</th>\n",
       "      <th>Owner_FirstName</th>\n",
       "      <th>Owner_LastName</th>\n",
       "      <th>Witness</th>\n",
       "      <th>Date</th>\n",
       "      <th>Freed_FirstName</th>\n",
       "      <th>Freed_LastName</th>\n",
       "      <th>Alias</th>\n",
       "      <th>...</th>\n",
       "      <th>Document</th>\n",
       "      <th>Page</th>\n",
       "      <th>Entry</th>\n",
       "      <th>DatasetName</th>\n",
       "      <th>Notes</th>\n",
       "      <th>isWorking</th>\n",
       "      <th>isError</th>\n",
       "      <th>ChangeDate</th>\n",
       "      <th>CreateDate</th>\n",
       "      <th>DateFormatted</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>23307</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>23310</td>\n",
       "      <td>BA</td>\n",
       "      <td>Geo</td>\n",
       "      <td>Gillingham</td>\n",
       "      <td>NaN</td>\n",
       "      <td>184006</td>\n",
       "      <td>Jeremiah W.</td>\n",
       "      <td>Brown</td>\n",
       "      <td>Jerry</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>224.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>FF</td>\n",
       "      <td>Freed by manumission, dated 15 June 1824, reco...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>37:45.8</td>\n",
       "      <td>03:44.1</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23308</th>\n",
       "      <td>AR7-46</td>\n",
       "      <td>23311</td>\n",
       "      <td>BA</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>184006</td>\n",
       "      <td>Rachael</td>\n",
       "      <td>Brown</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>224.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>FF</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>37:45.8</td>\n",
       "      <td>05:54.2</td>\n",
       "      <td>NaT</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2 rows × 29 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       DataID  DataItem County Owner_FirstName Owner_LastName Witness    Date  \\\n",
       "23307  AR7-46     23310     BA             Geo     Gillingham     NaN  184006   \n",
       "23308  AR7-46     23311     BA             NaN            NaN     NaN  184006   \n",
       "\n",
       "      Freed_FirstName Freed_LastName  Alias  ... Document   Page  Entry  \\\n",
       "23307     Jeremiah W.          Brown  Jerry  ...      NaN  224.0    5.0   \n",
       "23308         Rachael          Brown    NaN  ...      NaN  224.0    6.0   \n",
       "\n",
       "      DatasetName                                              Notes  \\\n",
       "23307          FF  Freed by manumission, dated 15 June 1824, reco...   \n",
       "23308          FF                                                NaN   \n",
       "\n",
       "      isWorking isError  ChangeDate  CreateDate  DateFormatted  \n",
       "23307         0       0     37:45.8     03:44.1            NaT  \n",
       "23308         0       0     37:45.8     05:54.2            NaT  \n",
       "\n",
       "[2 rows x 29 columns]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Below command displays the specific records that was identified as erroneously entered. The inner command 'df[]' first converts the 'Date' feature to a 'String' data type, and then uses another\n",
    "# in-built function to filter the records that match with the supplied criteria and the outer 'df[]' displays the results of that filtered records from the inner dataframe.\n",
    "df[df['Date'].astype(str).str.strip()==\"184006\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In two of the instances, as seen below, the day of issue has not been found to be legible or visible, hence the MSA transcriber may have not been able recorded the date. There was no date but only month and year captured on the original CoF itself for c290 page 224 - Jeremiah Brown\n",
    "\n",
    "![DateIssue1](Pics/CoF_Data_Clean_Empty_Date1.PNG \"CoF Date Issue 1\")\n",
    "![DateIssue2](Pics/CoF_Data_Clean_Empty_Date1.PNG \"CoF Date Issue 2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another instance of data entry error was for c290 page 185 Charles W Jones as shown below with the date captured as 1840516 instead of 18400516\n",
    "\n",
    "![DateIssue3](Pics/CoF_Data_Clean_Incorrect_Date.PNG \"CoF Date Issue 3\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All these 657 bad date records were identified and marked with 'NaT' and would not be used for further processing. However, the entire record was not removed as there could be other useful features that could provide us with some good insights. These 'NaT' date records would be shared with the MSA, to fix their Source of Record. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prior Status \n",
    "This feature indicates the previous status of the enslaved person while the CoF document was being issued. This holds historical importance and needed carefule inputs from the historians. To address the issues with this feature in CoF dataset - Prior Status Column: Research was conducted to determine the prior status of those who were categorized as a “Descendant of a white female woman” as shown below from the set of unique categories. Source: Wikipedia - History of slavery in Maryland. This research was beneficial in identifying what group certain observations belong to."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![PriorStatus](Pics/CoF_Data_Clean_Prior_Status.PNG \"CoF Prior Status\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# df is the data frame variable which stores the entire dataset in a table form. Below command converts the specific column or feature 'PriorStatus' as Categorical type instead of String for manipulation\n",
    "df[\"PriorStatus\"]=df[\"PriorStatus\"].astype('category')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below list shows the different formats of the Prior Status as transcribed and it shows that these entries have to be grouped together by customized cleaning process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{nan, 'slave', 'Born Free', 'Free born', 'free born', 'Free Born', 'Descendant of a white female woman', 'S;ave', 'Slave; Slave', 'Enslaved', 'Free', '?', 'born free', 'Unknown; Free Born', 'Unknown', 'Freeborn', 'Slave', 'Slave ', 'Free; Slave', 'John', 'Unknown; Slave', 'Born free', 'Free  ', 'Free born ', 'BornFree'}\n"
     ]
    }
   ],
   "source": [
    "# After conversion, let's print the number of categories available for that particular feature from the dataset\n",
    "print(set(df[\"PriorStatus\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Below is a key IF function that tries to find for a good prior status value in each column and converts the value to a standard value. Others are given a 'Unknown'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Unknown' 'Slave' 'Born Free' nan 'Free']\n"
     ]
    }
   ],
   "source": [
    "# As could be seen above, there are various types of Prior Status that are similar in nature. the value 'nan' in Python means it has no values.\n",
    "# Below set of commands form a component in Python called as a function. Functions are a block of commands which could be used to perform the same action every time they are called.\n",
    "# The below function converts the input parameter to the right Prior Status category based on some conditional statements.\n",
    "def fix_prior_status(status):\n",
    "    # initiate variables to hold the literal value\n",
    "    free = \"free\"\n",
    "    born = \"born\"\n",
    "    enslaved = \"slave\"\n",
    "    descend = \"Descend\"\n",
    "    # conditional statements to use in-built 'find' function to check if the prior status passed has the value of the literal checked, and if so the status would be modified as mentioned\n",
    "    # in the 'return' statement\n",
    "    if status.find(born) != -1:\n",
    "        # it should also be noted that indentation is a key requirement with Python, not where the return statement starts after the 'if'\n",
    "        return \"Born Free\"\n",
    "    else:\n",
    "        # nested if's are possible in Python to conditionally control the else logic\n",
    "        if status.find(enslaved) != -1:\n",
    "            return \"Slave\"\n",
    "        else:\n",
    "            if status.find(descend) != -1:\n",
    "                return \"Born Free\"\n",
    "            else:\n",
    "                if status.find(free) != -1:\n",
    "                    return \"Free\"\n",
    "                else:\n",
    "                    return \"Unknown\"\n",
    "# Below command starts with the beginning indentation indicating a new set of commands outside of the function, even if its in the same cell block like shown here.\n",
    "# The 'apply' function applies the function definted above to the data frame's each records' Prior Status field avlue. \n",
    "df[\"PriorStatusFormatted\"] = df[\"PriorStatus\"].apply(fix_prior_status)\n",
    "# The 'unique' in-built function prints out the distinct values of the transformed or modified prior status of the data frame\n",
    "print(df[\"PriorStatusFormatted\"].unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As with the date field, those values that did not fall into one of these good categories they were assigned a 'Unknown' value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Below commands help us to save the modified dataframe into a new output csv file which could be used in further steps of processing in the next notebook modules.\n",
    "dfo = pd.DataFrame(df)\n",
    "dfo.to_csv('Datasets/LoS_Clean_Output_Mod1.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebooks\n",
    "\n",
    "The below module is organized into a sequential set of Python Notebooks that allows to interact with the Legacy of Slavery's Certificates of Freedom collection by exploring, cleaning, preparing, visualizing and analysing it from historical context perspective. \n",
    "\n",
    "2. [Certificates Of Freedom: Context Based Data Manipulation and Analysis -Part 2](LoS_CoF_Module2.ipynb)\n",
    "3. [Certificates Of Freedom: Context Based Data Visualization](LoS_CoF_Module3.ipynb)**(If you get a 500 Internal Server Error, please refresh the browser to render this module again)**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Click here to go the Next Module](LoS_CoF_Module2.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}