# MHC_V20 Data Cleaning and Wrangling
## -- as a part of the OSCR project
Leo (Lizhou) Fan
Acknowledgements: Dr. Ashley Sanders Garcia & Dr. Miles Chen

### Import packages and data

In [2]:
import pandas as pd
import numpy as np
import re

In [3]:
# read the OCR text file into a list of lines; the file is closed automatically
with open('MHC-v20.txt', 'r') as f:
    mhc20 = f.readlines()

In [87]:
mhc20[0:50]

['\ufeffMichigan historical collections I Michigan Historical Commission. ',
 'Lansing : The Commission, 1915-1929. ',
 'http://hdl.handle.net/2027/ucl.a0002813681 ',
 'HathiTrust ',
 'www.hathitrust.org ',
 'Public Domain, Google-digitized ',
 'http://www.hathitrust.0rg/access_use#pd-g00gle ',
 'We have determined this work to be in the public domain, meaning that it is not subject to copyright. Users are free to copy, use, and redistribute the work in part or in whole. It is possible that current copyright holders, heirs or the estate of the authors of individual portions of the work, such as illustrations or photographs, assert copyrights over these portions. Depending on the nature of subsequent use that is made, additional rights may need to be obtained independently of anything we can address. The digital images and OCR of this work were produced by Google, Inc. (indicated by a watermark on each page in the PageTurner). Google requests that the images and OCR not be re-hosted, re
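
The first string in the output above begins with `'\ufeff'`, a byte order mark (BOM) left at the start of the file. A minimal sketch of one way to drop it while reading, assuming the file is UTF-8 encoded (the source does not state the encoding): Python's `utf-8-sig` codec strips the BOM during decoding. The temporary file below is a made-up stand-in for `MHC-v20.txt`.

```python
import os
import tempfile

# Write a tiny stand-in file that starts with a UTF-8 byte order mark (BOM).
# This temporary file is only for illustration; it is not the real input.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'\xef\xbb\xbfMichigan historical collections\nLansing : The Commission\n')
    sample_path = tmp.name

# Plain 'utf-8' keeps the BOM as '\ufeff' at the start of the first line.
with open(sample_path, 'r', encoding='utf-8') as f:
    with_bom = f.readlines()

# 'utf-8-sig' drops the BOM while decoding.
with open(sample_path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.readlines()

os.remove(sample_path)

print(repr(with_bom[0]))
print(repr(without_bom[0]))
```

If the real file turns out to use a different encoding, the `encoding` argument would need to change accordingly.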

- Original PDF refer to
- Our goal/philosophy of cleaning:
    - How much information should be retained?
    - What should the focus be?
    - Should manual cleaning be the last step?

Planned pipeline: join lines -> find a pattern to detect documents -> extract metadata [table of contents, beginning of each document] -> remove footnotes

Projects and their questions:
- Research question: what role do emotion and sentiment play in decision making in the American Middle West, 1800?
    - role of individual actors
- "Big volumes": split them into documents

References:
- https://programminghistorian.org/en/lessons/cleaning-ocrd-text-with-regular-expressions
- https://programminghistorian.org/en/lessons/generating-an-ordered-data-set-from-an-OCR-text-file

### Extract Metadata and Basic Cleaning
Regarding metadata, two things stand out:
- lines in all capital letters (titles)
- place + time combinations

We can use these traits as clues for both extracting metadata and separating documents.

In [30]:
len(mhc20)

13352

In [33]:
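# initiate (Text refers to the same list object as mhc20)
Text = mhc20

# replace every \n and \t with " ", writing the result back into the list
# (reassigning the loop variable alone would leave the list unchanged)
for i, line in enumerate(Text):
    Text[i] = re.sub(r'\n|\t', ' ', line)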

In [35]:
# Create an empty list to fill with lines of corrected metadata
CleanTitle = []
CleanTitleNo = [] # line number of the title extracted
CleanTimePlace = []
CleanTimePlaceNo = [] # line number of the time and place extracted

# check each line of the imported text against the following patterns
for i in range(len(Text)):
    
    ### title with the people involved
    # a line containing the capitalized word " TO " could be a title,
    # as the extracts are all letters from one person to one or more others
    title_match = re.search(r' TO ', Text[i])
    if title_match:
        CleanTitle.append(Text[i])
        CleanTitleNo.append(i)
    
    ### time and place as a whole
    # the time is ideally of the format Month Day Year or Day Month Year,
    # so four consecutive digits (the year) can serve as the signal;
    # the word(s) before the time in the same line give the place;
    # only short lines (fewer than 50 characters) are kept
    time_place_match = re.search(r' \d{4}', Text[i])
    if time_place_match and len(Text[i]) < 50:
        CleanTimePlace.append(Text[i])
        CleanTimePlaceNo.append(i)

In [86]:
CleanTitle[0:50]

['PREFACE TO REPRINT ',
 'COPIES OF PAPERS ON FILE IN THE DOMINION ARCHIVES AT OTTAWA, CANADA, PERTAINING TO MICHIGAN, AS FOUND IN THE HALDIMAND AND OTHER OFFICIAL PAPERS ',
 'CAPT. ALEXANDER GRANT TO BRIG. GEN. II. WATSON POWELL1 ',
 'SIR HENRY CLINTON TO GEN. FREDERICK HALDIMAND ',
 'MAJOR ARENT S. DE PEYSTER TO BRIG. GEN. H. WATSON POWELL ',
 'FRANCIS BROWN TO CAPT. GRANT ',
 'FRANCIS BROWN1 TO CAPT. ALEXANDER GRANT ',
 'BRIG. GEN. H. WATSON POWELL TO GEN. FREDERICK HALDIMAND ',
 'MAJOR ARENT S. DE PEYSTER TO BRIG. GEN. H. WATSON POWELL ',
 'CAPT. ALEXANDER GRANT TO BRIG. GEN. H. WATSON POWELL ',
 'GEN. FREDERICK HALDIMAND TO CHARLES BEMBRIDGE. WARRANT TO CAPT. LA MOTTE FOR PAYMENT TO DETROIT VOLUNTEERS ',
 'COMMISSION TO JEHU HAY ',
 'MAJOR GENERAL R1EDESEL* 2 TO GEN. FREDERICK HALDIMAND ',
 'GEN. FREDERICK HALDIMAND TO SIR HENRY CLINTON ',
 'GEN. FREDERICK HALDIMAND TO BRIG. GEN. H. WATSON POWELL ',
 'GEN. FREDERICK HALDIMAND TO BRIG. GEN. II. WATSON POWELL ',
 'BRIG. GEN. II. WAT
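
Some of the extracted titles above carry footnote markers that OCR has fused onto a name, such as `POWELL1` and `BROWN1`. A minimal sketch of one possible cleanup, assuming a marker is a run of digits attached to the end of an all-caps word; `strip_footnote_markers` is a hypothetical helper, not part of the notebook, and the sample titles are copied from the output above.

```python
import re

# Sample titles copied from the CleanTitle output above.
sample_titles = [
    'CAPT. ALEXANDER GRANT TO BRIG. GEN. II. WATSON POWELL1 ',
    'FRANCIS BROWN1 TO CAPT. ALEXANDER GRANT ',
    'SIR HENRY CLINTON TO GEN. FREDERICK HALDIMAND ',
]

def strip_footnote_markers(title):
    """Remove digits glued to the end of an all-caps word (hypothetical helper)."""
    # (?<=[A-Z]) : the digits must directly follow a capital letter
    # \d+\b      : one or more digits that end the word
    # .strip()   : also drops the leading/trailing spaces left by line cleaning
    return re.sub(r'(?<=[A-Z])\d+\b', '', title).strip()

cleaned_titles = [strip_footnote_markers(t) for t in sample_titles]
for t in cleaned_titles:
    print(t)
```

This rule deliberately leaves digits inside a word alone (as in `R1EDESEL`), since those are OCR misreadings of letters rather than footnote markers and need a different fix.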

In [89]:
CleanTimePlace[0:50]

['Lansing : The Commission, 1915-1929. ',
 'REPRINT 1912 ',
 'Lansing, Mich., Dec. 15. 1892 ',
 '“ 1773   324 ',
 '“ 1794   325 ',
 'Sir 1782 ',
 'Detroit 20h March 1782. ',
 'Detroit 12 April 1782 Francis Brow n ',
 'Niagara April 14h 1782. Sir ',
 'Endorsed From A 1782 ',
 'Detroit April 22nd 1782. Sir ',
 'Sorel April 26h 1782 Sir ',
 'Sir, 28h April 1782. ',
 'To B. Genl. Powell 1782. 28h April ',
 'Niagara Apl 30h 1782 ',
 'Sir, Montreal 7th May 1782 ',
 '(Copy) 1782 ',
 'Niagara May 7h 1782. ',
 'Detroit May 15th 1782. Sir ',
 'Montreal 15th May 1782 ',
 'Sir 16 May 1782 ',
 'Three Rivers 17h May Sir 1782. ',
 'Sir Montreal 27h May 1782. ',
 'Endorsed To 1782 ',
 'Sir, 31st May 1782. ',
 '-This occurred Dec., 1782. ',
 'Chumbly 1 June 1782. 503.50 98     3995     ',
 'Oswego 26 Mar 1782 30.  2000         ',
 'Niagara 30 Apl 1782. 133.30  12322         ',
 'Montreal, 3d June 1782. ',
 'Detroit June 6h 1782. ',
 'Detroit June the 7h 1782 Sir ',
 ' Montreal, June 9h 1782. ',
 'Sandu
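
The four-digit test alone lets through lines that are not datelines, such as `'REPRINT 1912 '`, `'Sir 1782 '` and table-of-contents residue. A rough sketch of a stricter filter, assuming a real dateline also names a month; the month list (including the `Apl` spelling seen above) and the helper `parse_dateline` are illustrative assumptions, and the sample lines are copied from the output above.

```python
import re

# Candidate lines copied from the CleanTimePlace output above.
candidates = [
    'Lansing : The Commission, 1915-1929. ',
    'REPRINT 1912 ',
    '“ 1773   324 ',
    'Sir 1782 ',
    'Detroit 20h March 1782. ',
    'Niagara Apl 30h 1782 ',
    'Sir, Montreal 7th May 1782 ',
    '(Copy) 1782 ',
]

# Full month names and common abbreviations (assumed list, case-sensitive).
MONTH = re.compile(
    r'\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|Apl|May|June?|July?|'
    r'Aug(?:ust)?|Sept?(?:ember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b'
)
# A standalone four-digit year starting with 1.
YEAR = re.compile(r'\b(1[0-9]{3})\b')

def parse_dateline(line):
    """Return month, year and cleaned line if both a month and a year are found."""
    month = MONTH.search(line)
    year = YEAR.search(line)
    if month and year:
        return {'month': month.group(0), 'year': int(year.group(1)), 'line': line.strip()}
    return None

kept = [parsed for parsed in map(parse_dateline, candidates) if parsed]
for row in kept:
    print(row)
```

The place name is not extracted here because its position varies in the output above (for example `'Sir, Montreal 7th May 1782 '` versus `'Detroit 20h March 1782. '`), so that step would still need manual checks or further rules.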

In [41]:
len(CleanTimePlace)-len(CleanTitle) # there are more candidate time-and-place lines than titles

211

In [42]:
len(CleanTitle) # about 580 candidate documents; the matches have not been checked yet

580

In [45]:
### collect the metadata in DataFrames, ready to be saved as csv files
Title = pd.DataFrame({"title": CleanTitle, "line": CleanTitleNo})
TimePlace = pd.DataFrame({"time_place": CleanTimePlace, "line": CleanTimePlaceNo})

In [90]:
Title.head()

| | title | line |
|---|---|---|
| 0 | PREFACE TO REPRINT | 36 |
| 1 | COPIES OF PAPERS ON FILE IN THE DOMINION ARCHI... | 90 |
| 2 | CAPT. ALEXANDER GRANT TO BRIG. GEN. II. WATSON... | 95 |
| 3 | SIR HENRY CLINTON TO GEN. FREDERICK HALDIMAND | 108 |
| 4 | MAJOR ARENT S. DE PEYSTER TO BRIG. GEN. H. WAT... | 154 |


In [91]:
TimePlace.head()

| | time_place | line |
|---|---|---|
| 0 | Lansing : The Commission, 1915-1929. | 1 |
| 1 | REPRINT 1912 | 18 |
| 2 | Lansing, Mich., Dec. 15. 1892 | 31 |
| 3 | “ 1773 324 | 67 |
| 4 | “ 1794 325 | 75 |


### Separate Documents
Using the titles as signals, combine the lines that follow each title into a single document and save all documents to the same file.

In [78]:
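# combine the lines that follow each title into a single string
CobText = []
cobtext = ''
title_lines = set(CleanTitleNo)  # set membership is faster than a list lookup
start = CleanTitleNo[0] + 1      # first line after the first title (line 37)
for i in range(start, len(Text)):
    if i in title_lines:
        # a new title starts: store the text collected for the previous one
        CobText.append(cobtext)
        cobtext = ''
    else:
        cobtext += Text[i]
CobText.append(cobtext)  # store the last document, including the final line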

In [79]:
len(CobText)

580

In [80]:
len(CleanTitleNo)

580

In [81]:
# save the start line, the title, and the combined text into a dataframe
CleanText = pd.DataFrame()
CleanText["start_line"]= CleanTitleNo
CleanText["title"]= CleanTitle
CleanText["text"]= CobText

In [92]:
CleanText.head()

| | start_line | title | text |
|---|---|---|---|
| 0 | 36 | PREFACE TO REPRINT | In reprinting this volume an effort has been m... |
| 1 | 90 | COPIES OF PAPERS ON FILE IN THE DOMINION ARCHI... | Note.—Care has been taken in publishing the fo... |
| 2 | 95 | CAPT. ALEXANDER GRANT TO BRIG. GEN. II. WATSON... | Extracts of a Letter from Capt. Grant2 to Brig... |
| 3 | 108 | SIR HENRY CLINTON TO GEN. FREDERICK HALDIMAND | Quadruplicate } of Letter sent in > Cypher ove... |
| 4 | 154 | MAJOR ARENT S. DE PEYSTER TO BRIG. GEN. H. WAT... | Extracts from Major De Peyster’s letter dated ... |


In [83]:
CleanText.to_csv("MHC_V20_Cleaned.csv",index=False)

### More TO DO
- Solve minor problems
    - matched lines that are not titles
    - extract time and place based on their position relative to the title
- Remove footnotes
- More cleaning of irregular/wrong characters
- Extract more metadata from the preface and content lists
- Analysis work...
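
One of the to-do items above is to locate the time and place by their position relative to a title. A rough sketch of that idea, assuming the dateline appears within a few lines after its title (an assumption that has not been checked against the volume). The line numbers below are made up for illustration, and `first_dateline_after` and `MAX_GAP` are hypothetical names; in the notebook the inputs would come from `CleanTitleNo`, `CleanTimePlaceNo` and `CleanTimePlace`.

```python
from bisect import bisect_right

# Hypothetical line numbers, made up for illustration only.
title_lines = [95, 108, 154]
time_place = [(31, 'Lansing, Mich., Dec. 15. 1892 '),
              (98, 'Detroit 20h March 1782. '),
              (160, 'Niagara April 14h 1782. Sir ')]

MAX_GAP = 10  # assumed window: how many lines after the title to search

def first_dateline_after(title_line, candidates, max_gap=MAX_GAP):
    """Return the first candidate within max_gap lines after title_line, else None."""
    lines = [no for no, _ in candidates]    # candidate line numbers, already sorted
    pos = bisect_right(lines, title_line)   # index of the first candidate after the title
    if pos < len(lines) and lines[pos] - title_line <= max_gap:
        return candidates[pos][1].strip()
    return None

matched = {t: first_dateline_after(t, time_place) for t in title_lines}
for title_line, dateline in matched.items():
    print(title_line, dateline)
```

Titles that get `None` would be the ones to inspect by hand, either because the dateline sits further away or because the four-digit pattern missed it.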

Leo (Lizhou) Fan, Jan 20th, 2020. All rights reserved.