# A Computational Story of Crime through the Morgenthau Press Conference Collection 
by JB, KK, JO Group Bad Data Battalion 

## Data Overview, Telling Our Story
In 1934, Henry Morgenthau, Jr. was appointed Secretary of the Treasury by President Franklin D. Roosevelt. Morgenthau used this position to investigate organized crime and government corruption, but the federal law enforcement system was fragmented and uncoordinated (Wikipedia, 2022a). Morgenthau's investigations eventually led to the prosecution of Al Capone and of political bosses such as Thomas Pendergast and Charles Carrollo (Reppetto, 2004, p. 195).

President Roosevelt did not support Prohibition, and on December 5, 1933 the 21st Amendment was ratified, repealing the 18th Amendment and once again legalizing the manufacture, sale, and transportation of liquor in the United States. To help regulate the newly legal alcohol industry, Roosevelt used an executive order to create the Federal Alcohol Control Administration (FACA). The Department of Agriculture and Department of the Treasury helped guide the process until the Federal Alcohol Administration (FAA) Act went into effect in 1935. The Department of the Treasury established itself as a vital component of alcohol regulation, upholding the sentiment of the FAA Act into the present day while allowing the FAA to operate independently within the Treasury Department (TTB, 2013). Morgenthau was intent on stopping the manufacture, sale, and import of illegal substances into the United States. One of the three collection index cards under the subject ‘Wire Tapping’ records him stating in 1934 that “any method will be used to get dope peddlers, smugglers, etc.” (FDR Library, 2022).

Largely due to Prohibition’s failure to enforce its laws, Morgenthau elected to combine Treasury agencies in a way that concentrated efforts to stop the import of illegal alcohol-related substances as well as narcotics, which is reflected in our dataset as early as 1934. The outcome was the creation of the Committee for the Coordination of Treasury Law Enforcement Activities in 1935, made up of the leaders and sub-leaders of the Coast Guard, the Customs Service, the Alcohol Tax Unit of the Bureau of Internal Revenue, the Bureau of Narcotics, and the Secret Service (Phillips, 1963, pp. 369-370). 

The Bureau of Investigation was renamed the Federal Bureau of Investigation (FBI) in 1935 under J. Edgar Hoover, who had directed the Bureau since 1924. Hoover was not one to share power and feared that the consolidated Treasury agencies would overshadow *his* agency, the FBI. Roosevelt, as well as much of the rest of the country, was increasingly terrified of communism and felt that communist sympathizers had no place in the United States. Extensive measures were put in place to investigate government employees’ loyalties, especially with regard to paying income tax.

The Treasury years between 1937 and 1945 focused heavily on taxation and tax evasion, the latter being the reason for Al Capone’s imprisonment in 1931. In 1937, President Roosevelt addressed Congress on the topic of tax evasion. It was reported that large amounts of income went unreported, as evidenced by a 1936 study conducted by Morgenthau. Secretary Morgenthau wrote a letter to President Roosevelt outlining eight types of tax avoidance, which he felt were “sufficient to show that there [was] a well-defined purpose and practice on the part of some taxpayers to defeat the intent of Congress to tax incomes in accordance with ability to pay” (Roosevelt, “Message to Congress” 1937). Morgenthau ended his letter to the President by stating that he felt Congress would make the right decision and give the Treasury the authority to complete the tax evasion investigation. Granting that authority ultimately increased the power of the federal government, a common theme during Roosevelt’s administration. To keep the country moving after the Great Depression, Morgenthau believed in the importance of reducing the overall deficit by increasing taxes for individuals who could afford it (Wikipedia, 2022b).

Morgenthau later donated his 840-volume diary and press conference transcripts to the Franklin D. Roosevelt Presidential Library & Museum. The press conference index was used to visualize connections between said historical information and the mention of crime throughout Morgenthau’s press conference collection. (JO, KK)


## Dataset Exploration related to Scale and Levels within Collection 
Isabella Diamond, the Treasury Librarian, created a subject index of hundreds of Morgenthau’s press conferences beginning in 1936. 

The digital press conference transcripts are arranged in a single, chronological series of 27 volumes that mimic the physical collection. Each volume begins with a title page featuring the volume number and its inclusive dates. The title page is followed by a volume-specific table of contents (TOC) that reflects the “various subject headings, sub-headings, and cross-references assigned by Isabella Diamond according to her custom schema” (Carter et al., 2022, p. 841). Diamond also created index card files according to this **custom schema**, which provide document-level access across the volumes by subject. Diamond’s index cards are arranged alphabetically and feature the following characteristics or anatomy (as shown in **Figure 1**):

- A subject entry
- A citation for a Press Conference Volume number
  -  Some index cards feature the volume number in the upper right-hand corner
  -  In some instances, ‘volume’ is referred to as ‘book.’ Additionally, both Arabic and Roman numerals are used to denote a specific volume/book number. It is important to note that this story uses the nomenclature ‘book’ to describe specific volumes.
- A summary of book context related to the subject
- Relevant page numbers corresponding to the book
- Document date(s)
- Occasional cross references to other index subjects, indicated by a “see” note. (JB)

**Figure 1**. An Example Index Card with its Characteristics Labeled. 
<img src="BDBFIG1.png" alt="labeled card" title="FIGURE 1">

After performing a collection analysis, the group determined that our primary research objective would be to manipulate the dataset of index cards to effectively **determine** patterns related to crime during Morgenthau’s tenure, and then create visual representations of the data to **analyze** patterns in crime that reflect the historical context of the collection. 

The group used the following research questions to aid in the analysis of this collection:

- Which crime subject is the most prevalent?
- What year experienced the most crime-related index cards?
- Is there a trend in the amount of crime-related index cards over time? (JO)

## Data Cleaning and Preparation 
To meet the research objectives, a five-phase research methodology was implemented (as shown in **Figure 2**). First, a collection survey was completed to determine which index cards in the collection were directly related to crime or criminal activities. Once all related cards were found, their raw data was extracted in the form of a text (.txt) file. Though the FDR Library used Adobe Acrobat Pro as optical character recognition (OCR) software for the index card PDFs, this did not guarantee 100% accuracy when extracting the text from the cards (Carter et al., 2022). As such, a quality check (Phase 3) was performed on the text file to ensure that the extraction process was accurate. Our text file was then parsed and manipulated using OpenRefine, which is an open-source software package for cleaning, manipulating, and transforming data (Delpeuch, 2022). The last phase of our methodology, visualization, used the manipulated data from OpenRefine to create visualizations that aided in analyzing the data and allowed patterns to emerge that tied back to the background context for this collection. (JB)

**Figure 2**. The Five-Phase Research Methodology Process
<img src="BDBFIG2.png" alt="research methodology process" title="FIGURE 2">

The following table (**Table 1**) details the software and tools used to execute this research methodology. (JB, KK)

<img src="BDBFIG3.jpg" alt="Table of software and tools" title="TABLE 1">

## Modeling: Computation and Transformation

<div style="text-align: center"> Phase 1: Collection Survey </div>
The objective of the collection survey phase was to determine what index cards in the collection were directly related to crime. Via systematic review of the seven Series 1 PDFs, totaling 2,275 individual index cards, 55 cards related to crime were identified and organized in the following table (**Table 2**).

<img src="BDBFIG4.jpg" alt="crime cards" title="TABLE 2">

When the collection survey was complete and all crime-related index cards found, a PDF document was created that only featured our 55 cards. To aid in text recognition, a batch process in Adobe Photoshop was used to increase the contrast of each PDF page by +75. (JB)

<div style="text-align: center"> Phase 2: Extraction </div>
The objective of the extraction phase was to extract the raw data from the PDF of crime-related index cards into a file format that could be parsed by OpenRefine. Multiple file formats can be loaded into OpenRefine, such as comma-separated values (CSV), Excel spreadsheets (XLS or XLSX), and text files (.txt) (Delpeuch, 2022). It was decided to convert the raw data from the PDF into a text file due to that file type’s stability and simplicity. An open-source website called PDF to Text was used to convert the PDF into a text file. (JB)

<div style="text-align: center"> Phase 3: Quality Check </div>
The objective of the quality check phase was to ensure the accuracy of the raw data that was extracted from the PDF. As previously mentioned, the PDF files in the collection were OCR’d by the Franklin D. Roosevelt Presidential Library & Museum. Increasing the contrast of the PDF before text extraction did aid text recognition, but many small manual corrections were still needed. Additionally, many of the book numbers located in the upper right-hand corner of the index cards were missed during the extraction. Any such numbers that did remain were deleted from the text file, and the card’s book number from the subject entry (i.e., “Book: X”) was kept instead.

After reviewing the OpenRefine documentation, it became clear that data manipulation inside the software was more efficient when the data were separated into columns. To mimic this column structure in the text file, each index card’s data was placed on one line and the desired columns separated by the | symbol. The following standard format was used for each card in the text file:

<div style="text-align: center"> [Subject Term] | [Description/Text] | [Book and Page No.] | [Date] </div>

**Figure 3** illustrates the difference between the raw four-line output of the index card data and the edited one-line output with column separator. (JB)

**Figure 3**. Increasing the Usability of the Text File by Condensing the Raw Four-Line Output into One Line and Designating Columns Using the | Symbol.

<img src="BDBFIG5.png" alt="card usability" title="FIGURE 3">
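The one-line format above can also be read outside OpenRefine. The following is a minimal Python sketch, not part of the group's workflow, that splits one such line into its four columns. The `IndexCard` type, the `parse_card_line` function, and the sample line are hypothetical and used for illustration only.

```python
from typing import NamedTuple


class IndexCard(NamedTuple):
    """One index card in the four-column layout described above."""
    subject: str
    description: str
    book_and_page: str
    date: str


def parse_card_line(line: str) -> IndexCard:
    """Split a pipe-delimited line into four trimmed fields."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, found {len(fields)}: {line!r}")
    return IndexCard(*fields)


# Hypothetical sample line, not copied from the collection.
sample = "Liquor | Example description text | Book II, p. 15 | 3-12-34"
card = parse_card_line(sample)
```

Raising an error on lines without exactly four fields is a design choice that would surface cards where a separator was missed during the manual quality check.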

<div style="text-align: center"> Phase 4: Manipulation </div>
The goal of the manipulation phase was to tidy the data to aid in visualization. According to the Computational Archival Science Education System (CASES), data manipulation can consist of “sorting, filtering, cleaning, normalizing, and joining disparate datasets” (CASES, 2022). Before the manipulation process began, the text file was imported into OpenRefine, using the default settings as shown in **Figure 4**. (JB)

**Figure 4**. Import Settings.

<img src="BDBFIG6.png" alt="import settings" title="FIGURE 4">

The following steps detail the manipulation process conducted in OpenRefine: (JB)

1. Split Column 1 into several columns according to a separator (Figure 5-offline). 
2. Rename the resulting columns: Subject | Description | Book and Page No. | Date
3. Use a text filter on the Subject column to remove any blank rows (Figure 6-offline). -The regular expression ^ was used to accomplish this.
4. Complete a value.trim() text transform on all columns to remove any leading and trailing white space. 
5. Split the Book and Page No. cell into two cells, one cell for the book number and one cell for the page number (Figure 7-offline).
6. Rename the Book and Page No. 1 column to ‘Book No.’
7. Rename the Book and Page No. 2 column to ‘Page No.’
8. Use the replace function in OpenRefine to edit the Book No. column.
9. Complete a custom text transform on the Book No. column to change the Roman numerals to Arabic numerals (Figure 9-offline).
10. Complete a value.trim() text transform on the Book No. column to remove any leading and trailing white space.
11. Complete a custom text transform on the Date column to make the date format uniform (Figure 10-offline). -The following GREL expression was used: `value.replace("-", "/")`
12. Change the years in the Date column from two digits to four digits (Figure 11-offline). For example, 34 was changed to 1934. 
13. Change the data type of the Book No. column to a number.
14. Replace the - symbol in the Subject column for uniformity.
15. Isolate the blank cells in the Date column. Then use the information from the Book column to determine the correct year for each blank. -For example, Book 2 only contains information for the year 1934. If the date column in this row was blank, the year 1934 was inserted into the blank.
16. Split the Date column to extract only the year. -The following separator was used: `\b(\d{1,2}\/\d{1,2}\/)`. -Only the year was extracted because the group decided to visualize the data only using the year and not the corresponding month and day.
17. Rename the newly separated column ‘Year.’
18. Complete a value.trim() text transform on all columns to remove any leading and trailing white space.
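For readers without OpenRefine, several of the steps above (Roman-to-Arabic conversion, date normalization, four-digit years, and year extraction with a fallback for blank dates) can be approximated in Python. This is a hedged sketch rather than the expressions the group actually ran; all function names are hypothetical, and the century expansion assumes every date falls in the 1900s, as it does for this collection.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_arabic(numeral: str) -> int:
    """Convert a Roman numeral such as 'XIV' to an integer (step 9)."""
    total = 0
    previous = 0
    # Read right to left: a smaller symbol before a larger one is subtracted.
    for symbol in reversed(numeral.strip().upper()):
        value = ROMAN_VALUES[symbol]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


def normalize_book(book: str) -> int:
    """Return the book number as an integer, whichever numeral system was used."""
    book = book.strip()
    return int(book) if book.isdigit() else roman_to_arabic(book)


def normalize_date(date: str) -> str:
    """Use '/' separators and a four-digit year (steps 11 and 12)."""
    date = date.strip().replace("-", "/")
    # Assumption: a trailing two-digit year belongs to the 1900s.
    return re.sub(r"/(\d{2})$", r"/19\1", date)


def extract_year(date: str, fallback_year: str = "") -> str:
    """Keep only the year; use the book-derived fallback for blank dates (steps 15 and 16)."""
    match = re.search(r"(\d{4})$", date)
    return match.group(1) if match else fallback_year


book_number = normalize_book("XIV")
clean_date = normalize_date("3-12-34")
year = extract_year(clean_date)
filled_year = extract_year("", fallback_year="1934")
```

The sample values are illustrative and not taken from a specific card.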

<div style="text-align: center"> Phase 5: Visualization </div>

**Figure 12** illustrates the tidied data that resulted from the manipulation phase. The manipulations resulted in six columns, including the following:

- Subject
- Description
- Book No.
- Page No.
- Date
- Year

Recall that the group decided to use the Year column for visualization purposes, and not the corresponding month and day. Our tidy data was exported from OpenRefine directly into an Excel file for visualization purposes.

**Figure 12**. Tidy Data in OpenRefine.

<img src="BDBFIG14.png" alt="tidy data" title="FIGURE 12">

The following three visualizations resulted from the tidy data: 

* A word cloud (**Figure 13**). -Related research question: Which crime subject is the most prevalent?
* A pie chart (**Figure 14**). -Related research question: Which crime subject is the most prevalent?
* A histogram (**Figure 15**). -Related research questions: What year experienced the most crime-related index cards? Is there a trend in the amount of crime-related index cards over time?

The first visualization is a word cloud, which “displays how frequently words appear in a given body of text, by making the size of each word proportional to its frequency” (The Data Visualisation Catalogue, 2022). The word cloud in Figure 13 was created using Word Cloud Generator, which is an open-source web software, and shows a visual representation of the subject headings of the crime-related index cards. The visual provides an efficient way to show how often the word “liquor” was used compared to another word, like “silver.”

**Figure 13**. A Word Cloud Generated from the Crime-Related Index Cards.

<img src="BDBFIG15.jpg" alt="crime word cloud" title="FIGURE 13">
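The frequencies behind a word cloud can be computed directly. The sketch below is an illustration only and is not the Word Cloud Generator's own algorithm: it counts how often each word appears across a list of subject headings. The `term_frequencies` function is hypothetical, and the four headings are subject entries named elsewhere in this story, not the full set of 55 cards.

```python
import re
from collections import Counter


def term_frequencies(subjects):
    """Count lowercase word occurrences across all subject headings."""
    counts = Counter()
    for subject in subjects:
        counts.update(re.findall(r"[a-z]+", subject.lower()))
    return counts


# A small subset of subject entries, for illustration.
subjects = ["Liquor", "Liquor, evasion of tax on", "Whiskey", "Tax, Evasion"]
frequencies = term_frequencies(subjects)
```

Note that this sketch does not remove common words such as "of" and "on", which most word cloud tools filter out before sizing the terms.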

The next visualization is a pie chart (**Figure 14**), which helps “show proportions and percentages between categories by dividing a circle into proportional segments” (The Data Visualisation Catalogue, 2022). This type of chart was chosen to provide the viewer with a quick idea of the proportional distribution of the crime-related index cards. It is important to note that this pie chart features percentages of crime categories that were assigned by the group to increase comprehension of the visualization. The subject entries that make up each crime category are explained in the Data Analysis section below.

**Figure 14**. Percentage of Assigned Categories of Crime.

<img src="BDBFIG16.png" alt="pie chart percentages" title="FIGURE 14">

A histogram “visualizes the distribution of data over a continuous interval” (The Data Visualisation Catalogue, 2022). As such, this type of graph was chosen because the group wanted to determine the distribution of crime by year. Our histogram (**Figure 15**) displays the number of crime-related index cards per year.

**Figure 15**. Number of Crime-Related Index Cards Per Year

<img src="BDBFIG17.png" alt="histogram cards" title="FIGURE 15">
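The counts plotted in a chart like this one amount to tallying the Year column. A minimal Python sketch of that tally follows; the `cards_per_year` function and the list of years are hypothetical and do not reproduce the group's data.

```python
from collections import Counter


def cards_per_year(years):
    """Return the number of cards for each year, ordered chronologically."""
    return dict(sorted(Counter(years).items()))


# Hypothetical Year values, one per index card.
years = ["1934", "1936", "1935", "1936", "1937", "1936"]
counts = cards_per_year(years)
busiest_year = max(counts, key=counts.get)
```

Sorting the years before plotting keeps the bars in chronological order, which makes any trend over time easier to read.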

<div style="text-align: center"> Data Analysis </div>
The visualizations presented above effectively summarize the group’s research objectives. Additionally, review of the individual crime-related index cards established a link between the cards and the historical context. For example, subject entries regarding ‘Political Activities’ mean in part “prohibiting political activities by Treasury employees” (FDR Library, 2022). Stopping corruption in government mattered as much to Morgenthau as dismantling organized crime. The cards for two of the three Senators referenced under the subject entry ‘Political Activities’ mention crimes related to drawing reelection campaign funding from illegal sources. The third Senator evidently closed the Curtis Bay plant in California; more information from the transcripts of the press conferences is needed for further analysis.

The two Series 1 index cards in 1936 with the subject entry of ‘Agency-Related’ reflect the combination of these five agencies’ resources to enforce the country’s laws. Also in 1936, a shift occurred in the focus for illicit vices as Morgenthau publicly announced that alcohol smuggling had been wiped out and opiate narcotics, such as heroin and opium, must be stopped (Schaffer Library of Drug Policy, 2022).

The first visualization presented is a word cloud, as shown in **Figure 13**, that demonstrates the frequency of the index cards’ subject entry terms. It is clear that ‘liquor’, ‘evasion’, and ‘tax’ are the three most prominent terms, which coincides with the two general categories of this dataset: organized crime and government corruption.

The group attempted to visualize the subject entry categories on a pie chart, but the number of unique entries resulted in a messy visual; therefore the group decided to assign categories to various entries that were related to the same type of crime and present those categories on a pie chart (**Figure 14**). In some cases, descriptions were added to the index cards that enabled further categorization of the already labeled subject entries to see the larger patterns that were present. **Table 3** below details which subject entries fit into each crime category.

<img src="BDBFIG15.jpg" alt="Assigned Crime Categories" title="TABLE 3">

**Figure 14** helps show the frequency of occurrence for the assigned categories in the dataset. It is clear that alcohol-related index cards account for almost half (45.5%) of the total number of cards; however, the percentages of tax and investigations, 18.2% and 14.5% respectively, are also substantial. The investigations referred to here concern tax evasion crimes, as Morgenthau fought against government corruption and strongly believed in the 16th Amendment.
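As a quick arithmetic check, the reported percentages are consistent with 25, 10, and 8 cards out of the 55 crime-related cards. The Python sketch below computes category shares from counts; the counts are an assumption back-calculated from the percentages and the 55-card total, not figures taken from Table 3, and 'other' simply lumps together the remaining 12 cards.

```python
def category_shares(counts):
    """Return each category's share of all cards as a percentage, to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Assumed counts, inferred from the reported percentages (not from Table 3).
assumed_counts = {"alcohol": 25, "tax": 10, "investigations": 8, "other": 12}
shares = category_shares(assumed_counts)
```

With a total this small, a single card shifts a category by about 1.8 percentage points, so modest differences between categories should be read with caution.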

When the frequency of crime-related index cards is plotted over time (**Figure 15**), the year 1936 has the greatest frequency at 17 index cards. The year 1936 seems to be a transition point where the Treasury shifts its focus from alcohol and drugs to income tax evasion. It was during this time that President Roosevelt delivered an address to Congress regarding the seriousness of tax evasion, and Morgenthau strongly supported consolidating five agencies under one umbrella to combat the issues of organized crime and evasion of taxes. (KK, JO)

## Ethics and Values Considerations
Our computational story puts into practice the Society of American Archivists (SAA) Code of Ethics and Core Values, specifically by:
- Interpreting documentation of past events through the use of primary source materials; 
- Promoting the use and understanding of the historical record (Advocacy), and 
- Recognizing that primary sources allow people to examine past events and gain insight into human experiences (History and Memory) (SAA, 2020). (KK)

## Summary and Suggestions
The Press Conferences Collection was analyzed to determine patterns in the index cards related to crime. The group’s research objectives centered around the discovery of patterns in crime (speeches) during Morgenthau’s tenure as well as how such information coincides with historical context. The crime-related index cards were extracted from the collection, manipulated, and then visualized to view patterns in the data. Visualizations clearly show that the two most prevalent crime-related topics during Morgenthau’s tenure dealt with alcohol and taxes. Historical context supports that finding with Morgenthau’s position on tax evasion and the illegal manufacture of alcohol.

To expand upon our analysis, the group recommends the extraction of Series 2 Press Conference transcripts that are related to crime. Their added details will likely clarify inconsistencies in the index cards and result in a more thorough data analysis. For example, Diamond’s schema for the index cards distinguishes terms such as ‘Liquor’ from ‘Liquor, evasion of tax on’ and ‘Whiskey.’ Additional research needs to be done to determine if these categories should remain separate or be combined into an overarching category. For the sake of this project, these alcohol-related entries were combined into the crime category of ‘alcohol.’ A few more involved entries, such as ‘Tax, Evasion’ and ‘Political Activities,’ require more background research and support from the press conference transcripts.

While the data visualizations show a definite decline in crime-related index cards after 1936, which correlates with the actions Morgenthau influenced in his campaign against corruption, more research is needed to enumerate the specific actions and laws that helped reduce the instances of crime. Moreover, the types of crime in this dataset were limited, and it may be assumed that because of Morgenthau’s narrow focus on fiscal responsibility and the Treasury, other types of crime, such as violent crime, sexual assault and sex trafficking, domestic violence, and child abuse, were handled at the local level. (JB, JO, KK)

## References
* Alcohol and Tobacco Tax and Trade Bureau (TTB). (2013). “Federal Alcohol Administration Act of 1935.” TTBgov. https://www.ttb.gov/trade-practices/federal-alcohol-administration-act-historical-background
* Carter, K. S., Gondek, A., Underwood, W., Randby, T., & Marciano, R. (2022). Using AI and ML to optimize information discovery in under-utilized, Holocaust-related records. *AI & Society*, 37: 837-858. https://doi.org/10.1007/s00146-021-01368-w
* CASES. (2022). CASE projects. Computational Archival Science Educational System. https://cases.umd.edu/
* Delpeuch, A. (2022, July 22). “Starting a project.” OpenRefine documentation. https://docs.openrefine.org/manual/starting
* Franklin D. Roosevelt Presidential Library & Museum (FDR Library). (2022). Press Conferences of Henry Morgenthau, Jr., 1933-1945 [Data set]. Franklin D. Roosevelt Presidential Library & Museum. http://www.fdrlibrary.marist.edu/archives/collections/franklin/index.php?p=collections/findingaid&id=536
* Phillips, M. (1963, September). A study of the Office of Law Enforcement Coordination, U.S. Treasury Department. *Journal of Criminal Law and Criminology* 54(3): 369-377. Retrieved October 17, 2022 from https://scholarlycommons.law.northwestern.edu/jclc/vol54/iss3/15/
* Reppetto, T. (2004). *American mafia: A history of its rise to power*. H. Holt.
* Roosevelt, F. (1937, June 1). “Message to Congress on tax evasion prevention” [Transcript]. https://www.presidency.ucsb.edu/documents/message-congress-tax-evasion-prevention
* Schaffer Library of Drug Policy. (2022). “New York Times August 14, 1936.” Drug Library. https://druglibrary.net/schaffer/History/e1920/nyt08141936.htm
* Society of American Archivists (SAA). (2020). SAA Core Values statement and Code of Ethics. Society of American Archivists. Retrieved October 19, 2022, from https://www2.archivists.org/statements/saa-core-values-statement-and-code-of-ethics
* The Data Visualisation Catalogue. (2022). “Histogram.” The Data Visualisation Catalogue. https://datavizcatalogue.com/methods/histogram.html
* Wikipedia. (2022a, September 29). Henry Morgenthau Jr. https://en.wikipedia.org/wiki/Henry_Morgenthau_Jr.#cite_note-gunther1950-40 (reference 40 to [Roosevelt in Retrospect](https://archive.org/details/rooseveltinretro00gunt) by Gunther, 1950, p. 102, about his donation to the FDR Library).
* Wikipedia. (2022b, September 29). Henry Morgenthau Jr. https://en.wikipedia.org/wiki/Henry_Morgenthau_Jr.#cite_note-diary-19 (reference 19 to Morgenthau Diary [#50 May 9, 1939](http://www.burtfolsom.com/wp-content/uploads/2011/Morgenthau.pdf)).
