# Datasets

This is a growing repository of datasets broadly related to culture and the humanities. The sources of the datasets as well as brief descriptions and example uses can be found below.

If you'd like to add a dataset or an example use case to this page, please open [an issue on GitHub](https://github.com/melaniewalsh/Intro-Cultural-Analytics/issues) or email me at melwalsh@uw.edu.

## Film 🎬


### Hollywood Film Dialogue By Character Gender and Age
(1925-2015)

Get the data: {download}`Download Hollywood Film Dialogue data <../data/Pudding/Pudding-Film-Dialogue-Clean.csv>`  
Original source: Hannah Anderson and Matt Daniels, *The Pudding*

Brief description:  

This {download}`CSV file <../data/Pudding/Pudding-Film-Dialogue-Clean.csv>` is a consolidated and slightly modified version of data shared by Hannah Anderson and Matt Daniels for their *Pudding* article, ["Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age"](https://pudding.cool/2017/03/film-dialogue/). The original datasets can be found [on GitHub here](https://github.com/matthewfdaniels/scripts/). 

For 2,000 films from 1925 to 2015, the dataset includes each character's name, gender, and age, and the number of words they spoke in each film, as well as each film's release year and how much money it grossed. Anderson and Daniels determined character age and gender (which they code as binary) from the IMDb information for the corresponding actors. They acknowledge that this is an imperfect approach. For more on the compilation of the dataset and their methodology, see [FAQ for the “Film Dialogue, By Gender” Project](https://medium.com/@matthew_daniels/faq-for-the-film-dialogue-by-gender-project-40078209f751).
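
As a rough illustration of how the `gender` and `words` columns can be combined, the sketch below totals the words spoken by each gender. It uses only the standard library and three rows copied from the preview, so it runs without the data file. The helper name `words_by_gender` is hypothetical, and the resulting shares describe only these three rows, not the full dataset.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the dataset preview; the real file has many more.
SAMPLE = """title,release_year,character,gender,words
Noah,2014,Shem,man,547
Star Trek: Generations,1994,Lursa,woman,130
10 Things I Hate About You,1999,Kat Stratford,woman,4718
"""


def words_by_gender(rows):
    """Sum the `words` column for each value of the `gender` column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["gender"]] += int(row["words"])
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
totals = words_by_gender(rows)

# Convert the raw totals into each gender's share of all words in the sample.
grand_total = sum(totals.values())
shares = {gender: count / grand_total for gender, count in totals.items()}

print(totals)
print(shares)
```

The same idea carries over to the full CSV by reading the file instead of `SAMPLE`.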

Example uses:
- ["Film Dialogue from 2,000 screenplays, Broken Down by Gender and Age"](https://pudding.cool/2017/03/film-dialogue/), *The Pudding*
- ["Pandas Basics Part 3"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part3.html), *Introduction to Cultural Analytics & Python*

Preview dataset

In [1]:
import pandas as pd
pd.set_option("display.max_rows", 150)
pd.read_csv("../data/Pudding/Pudding-Film-Dialogue-Clean.csv").sample(10)

Unnamed: 0,title,release_year,character,gender,words,proportion_of_dialogue,age,gross,script_id
19885,Noah,2014,Shem,man,547,0.089997,22.0,107.0,7659
15107,Star Trek: Generations,1994,Lursa,woman,130,0.014183,,157.0,5035
21000,Star Wars: Episode VII - The Force Awakens,2015,C-3Po,man,138,0.020702,69.0,927.0,8099
600,Broken Arrow,1996,Secretary Of De,man,291,0.034809,53.0,138.0,779
21607,Brick Mansions,2014,Grandfather Col,man,182,0.021237,78.0,21.0,8410
10975,Terminator Salvation,2009,Morrison,man,133,0.032102,,145.0,3516
20247,Die Hard: With a Vengeance,1995,Simon Gruber,man,1185,0.178814,47.0,200.0,7818
10402,S1m0ne,2002,Viktor Taransky,man,10520,0.408766,62.0,14.0,3345
3600,10 Things I Hate About You,1999,Kat Stratford,woman,4718,0.239736,18.0,65.0,1512
9879,Reindeer Games,2000,Jumpy,man,964,0.04214,56.0,37.0,3190


---

## Literature 📚

### *Lost in the City* (1993), Edward P. Jones
HathiTrust Extracted Features

Get the data: {download}`Download Lost in the City data <../texts/literature/Lost-in-the-City-HTRC-Extracted-Features.zip>`   
Original source: [HathiTrust Digital Library](https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329)

Brief description:  

The {download}`Lost in the City HTRC Extracted Features zip file <../texts/literature/Lost-in-the-City-HTRC-Extracted-Features.zip>` contains word frequencies per page (or "extracted features") made available by the [HathiTrust Digital Library](https://wiki.htrc.illinois.edu/pages/viewpage.action?pageId=79069329) for Edward P. Jones's short story collection *Lost in the City*. I have also added the title of the corresponding short story to each page in the dataset.

There are three CSV files in the zip file. "Lost-in-the-City-HTRC-Extracted-Features(PerPage).csv" contains lowercased word frequencies per page as well as part-of-speech information (49,330 rows). "Lost-in-the-City-HTRC-Extracted-Features(PerStory).csv" contains lowercased word frequencies per story as well as part-of-speech information (20,748 rows). "Lost-in-the-City-HTRC-Extracted-Features(5PerStory).csv" contains pre-processed lowercased word frequencies (with stopwords and punctuation removed) for words that appear more than 5 times in a story as well as part-of-speech information (1,364 rows).
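
The filtering behind the third file can be pictured with a small sketch. This is not the code that produced the file: the stopword set below is a hypothetical stand-in (the original stopword list is not stated), the helper `keep` is an illustrative name, and two of the sample rows are made up to show what gets dropped.

```python
import string

# Hypothetical stand-in: the stopword list actually used is not stated.
STOPWORDS = {"the", "and", "a", "of"}

# (story, lowercase, pos, count)
rows = [
    ("05: The Store", "around", "IN", 14),    # from the preview
    ("13: A Dark Night", "quiet", "JJ", 5),   # from the preview
    ("04: Young Lions", "stay", "VB", 2),     # from the preview
    ("05: The Store", "the", "DT", 300),      # made-up stopword row
    ("05: The Store", ",", ",", 250),         # made-up punctuation row
]


def keep(row, min_count=5):
    """Keep words that appear more than `min_count` times in a story
    and are neither stopwords nor pure punctuation."""
    story, word, pos, count = row
    is_punctuation = all(ch in string.punctuation for ch in word)
    return count > min_count and word not in STOPWORDS and not is_punctuation


filtered = [row for row in rows if keep(row)]
print(filtered)
```

Note that "more than 5 times" is a strict comparison, so the row for "quiet" (exactly 5) is dropped.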

Example uses:
- ["TF-IDF With HathiTrust Data"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/TF-IDF-HathiTrust.html), *Introduction to Cultural Analytics & Python*

Preview dataset

In [78]:
pd.read_csv("../texts/literature/Lost-in-the-City-HTRC-Extracted-Features/Lost-in-the-City-HTRC-Extracted-Features(PerStory).csv").sample(10)

Unnamed: 0,story,lowercase,pos,count
5836,04: Young Lions,stay,VB,2
20317,Back Matter,poignant,JJ,1
15767,11: Gospel,own,JJ,2
16088,11: Gospel,stool,NN,1
10747,07: The Sunday Following Mother's Day,seeing,VBG,1
6441,05: The Store,born,VBN,1
6310,05: The Store,around,IN,14
13443,09: His Mother's House,outta,RB,1
16317,11: Gospel,windows,NNS,1
18492,13: A Dark Night,quiet,JJ,5


___

### African American Literature
(1853-1923)

Get the data: {download}`Download African American Literature data <../texts/literature/African-American-Literature-1853-1923.zip>`  
Source: Amardeep Singh

Brief description:  

The {download}`African American Literature (1853-1923) zip file <../texts/literature/African-American-Literature-1853-1923.zip>` contains 100 works of fiction and poetry by African American writers, published between 1853 and 1923. It also contains an Excel file of metadata about the publisher, publication date, and publication place of each text. This corpus was compiled and shared by Amardeep Singh. You can read more about this corpus and its creation in [Singh's blog post about the African American Literature corpus](http://www.electrostani.com/2020/07/announcing-open-access-african-american.html).

Lastly, Singh asks users of this data to adhere to the spirit of the [Colored Conventions Project's principles](https://coloredconventions.org/about-records/ccp-corpus/).

Preview dataset

In [57]:
pd.read_excel("../texts/literature/African American Literature Text Corpus/African American Literature Corpus Metadata-Amardeep Singh.xlsx")

Unnamed: 0,"Author (last, first)",Title,Year Published,Genre,Publisher,Location of Publisher,Location signed by author,Keywords,Derived From,Status and Links,Unnamed: 10
0,"Adams, Clayton",Ethiopia: the Land of Promise; A Book With a P...,1917,Fiction,Cosmopolitan Press,New York,,Black utopia; segregation; reconstruction,HathiTrust,https://catalog.hathitrust.org/Record/008407122,
1,"Anderson, William and Walter H. Stowers",Appointed: An American Novel,1894,Fiction,Detroit Law Printing Co.,Detroit,,Interracial friendship; Northerners going south,HathiTrust,https://catalog.hathitrust.org/Record/005568825,
2,"Andrews, W.T.","A Waif--A Prince; or, A Mother's Triumph",1895,Fiction,"Publishing House, Methodist Episcopal Church S...","Nashville, Tennessee",,Religious allegory; Egypt (Hebrews as oppresse...,History of Black Writing Corpus,Also see LOC: https://www.loc.gov/item/06002450/,
3,"Ashby, William M.",Redder Blood,1915,Fiction,Cosmopolitan Press,New York,,Passing; Interracial desire,History of Black Writing Corpus,https://catalog.hathitrust.org/Record/004237253,
4,"Bennett, John","Madam Margot, a Grotesque Legend of Old Charle...",1917,Fiction,Century Co.,New York,,Supernatural; Romance,History of Black Writing Corpus,https://catalog.hathitrust.org/Record/00858464...,
5,"Bibb, Eloise A.",Poems,1895,Poetry,Monthly Review Press,"Boston, Massachusetts",,Mentions Alice Dunbar-Nelson; Poem to Frederic...,Digital Schomburg,Also see American Verse Project: https://quod....,
6,"Blackson, Lorenzo D.",Rise and Progress of the Kingdoms of Light and...,1867,Fiction,"J. Nicholas, Printer","Philadelphia, Pennsylvania",,Christian; Allegory,History of Black Writing Corpus,Also see Archive.org: https://archive.org/deta...,
7,"Braithwaite, William Stanley",Lyrics of Life and Love,1904,Poetry,Herbert B. Turner co.,"Boston, Massachusetts",,,U-Michigan American Verse Project,,
8,"Braithwaite, William Stanley","House of Falling Leaves, With Other Poems",1908,Poetry,John W. Luce and Co,"Boston, Massachusetts",,,U-Michigan American Verse Project,,
9,"Brown, William Wells","Clotel; Or, the President's Daughter: A Narrat...",1853,Fiction,Partridge and Oakey,"London, England",,Slavery; Passing; Interracial; Fugitive Slave ...,History of Black Writing Corpus,Also see Documenting the American South: https...,


---

### Colonial South Asian Literature
(1850-1923)

Get the data: {download}`Download Colonial South Asian Literature data <../texts/literature/Colonial-South-Asian-Literature-1850-1923.zip>`  
Source: Amardeep Singh

Brief description:  

The {download}`Colonial South Asian Literature (1850-1923) zip file <../texts/literature/Colonial-South-Asian-Literature-1850-1923.zip>` contains ~100 works of literature by South Asian and British writers, published between 1850 and 1923. It also contains an Excel file of metadata about each text's publication and the nationality of each author. This corpus was compiled and shared by Amardeep Singh. You can read more about this corpus and its creation in [Singh's blog post about the Colonial South Asian Literature corpus](http://www.electrostani.com/2020/08/text-corpus-colonial-south-asian.html).

Preview dataset

In [58]:
pd.read_excel("../texts/literature/Colonial South Asian Literature, 1850-1923/Colonial South Asian Literature Corpus.xlsx")

Unnamed: 0,Author,Title,Year,Original (if translated),Mode,Publisher Location,Publisher Name,Nationality of Author,Keywords,Origin,Unnamed: 10
0,"Arnold, W.D.","Oakfield; Or, Fellowship in the East",1855,,Fiction,Cambridge,"Metcalf and Company, Printers to the University",British,,HathiTrust,
1,"Bain, F.W.",A Hindoo Love Story,1898,,Fiction,London,Methuen & Co. Ltd.,British,"Translation from Sanskrti (""The Churning of th...",HathiTrust,
2,"Banerjea, S.B.",Tales of Bengal,1910,,Fiction,"New York, Bombay, Calcutta","Longmans, Green, and Co.",South Asian,Introduction by Francis H. Skrine,Gutenberg,
3,"Candler, Edmund",The Sepoy,1919,,Fiction,London,John Murray,British,,Gutenberg,
4,"Candler, Edmund",Abdication,1922,,Fiction,"London, Bombay",Constable & Co.,British,,HathiTrust,
5,"Candler, Edmund","Siri Ram, Revolutionist",1911,,Fiction,"London, Bombay",Constable & Co.,British,,HathiTrust,
6,"Candler, Edmund",Mantle of the East,1910,,Nonfiction,London,William Blackwood and Sons,British,,HathiTrust,
7,"Candler, Edmund",Year of Chivalry,1916,,Nonfiction,London,"Simpkin, Marshall, Hamilton, Kent & Co.",British,,HathiTrust,
8,"Chatterji, Bankim Chandra",Anandamath: Dawn Over india,1941,1882.0,Fiction,New York,Devin-Adair Company,South Asian,Translation; Translated by Basanta Koomar Roy....,HathiTrust,https://catalog.hathitrust.org/Record/000988658
9,"Chatterji, Bankim Chandra",Kapalkundala,1919,1866.0,Fiction,Calcutta,K.M. Bagchi,South Asian,Translation; translated by Devendra nath Ghose,Wikisource,


___

### txtLAB's Multilingual Novels
(1771-1932)

Get the data: [Link to Multilingual Novels data](https://figshare.com/articles/txtlab_Novel450/2062002/3)   
Source: Andrew Piper, McGill [txtLAB](https://txtlab.org/data-sets/)

Brief description:  

The [txtLAB multilingual novels link](https://figshare.com/articles/txtlab_Novel450/2062002/3) takes you to a repository where you can download a directory of 150 English-language novels, 150 German-language novels, and 150 French-language novels, which span from 1771 to 1932. Authors featured include Goethe, Franz Kafka, Herman Melville, Mary Shelley, Kate Chopin, Virginia Woolf, Victor Hugo, Alexandre Dumas, and many more. These text files were compiled and shared by Andrew Piper and the txtLAB.

Example uses:
- ["Novel Devotions: Conversional Reading, Computational Modeling, and the Modern Novel,"](http://piperlab.mcgill.ca/pdfs/Piper_NovelConversions.pdf) Andrew Piper

Preview dataset

In [51]:
print(open("../texts/literature/McGill_txtalb_Novel450/English/EN_1925_Woolf,Virginia_Mrs.Dalloway_Novel.txt").read())

MRS. DALLOWAY

 

by

 

Virginia Woolf

 

 

 

Mrs. Dalloway said she would buy the flowers herself.

For Lucy had her work cut out for her. The doors would be taken off their hinges; Rumpelmayer's men were coming. And then, thought Clarissa Dalloway, what a morning--fresh as if issued to children on a beach.

What a lark! What a plunge! For so it had always seemed to her, when, with a little squeak of the hinges, which she could hear now, she had burst open the French windows and plunged at Bourton into the open air. How fresh, how calm, stiller than this of course, the air was in the early morning; like the flap of a wave; the kiss of a wave; chill and sharp and yet (for a girl of eighteen as she then was) solemn, feeling as she did, standing there at the open window, that something awful was about to happen; looking at the flowers, at the trees with the smoke winding off them and the rooks rising, falling; standing and looking until Peter Walsh said, "Musing among the vegetables?

---

### Modernist Journal Data
(1890s-1920s)

Get the data: [Link to Modernist Journal data](https://sourceforge.net/projects/mjplab/files/)   
Source: [The Modernist Journals Project](https://modjourn.org/)

Brief description:  

The [Modernist journal data link](https://sourceforge.net/projects/mjplab/files/) takes you to a repository where you can download publication metadata for 14 modernist journals from the 1890s to the 1920s, such as *Poetry Magazine*, *The Little Review*, and *The Crisis*. The Modernist Journals Project, which has digitized these journals, provides CSV and tab-separated text files that contain information for every contributor and every work published in the journals.

Example uses:
- [Comparative Charts―Involving Contributions to The Egoist, The Little Review, and Others (1915-1919)](https://modjourn.org/comparative-charts%e2%80%95involving-contributions-to-the-egoist-the-little-review-and-others-1915-1919/), The Modernist Journals Project

Preview dataset

In [20]:
pd.read_csv("../data/Crisis_3.everycontributor.txt", delimiter="|").sample(5)

Unnamed: 0,contributor,title,genre,pages,volume,issue,date,journal title,journal subtitle,issue name,journal editor,publisher,journal location,issue length (pp),issue height (cm),issue width (cm)
268,"Clarana, José",The Schooling of the Negro,articles,133-136,6,3,1913-07-01,Crisis,A Record of the Darker Races,Educational Number,"Du Bois, W. E. Burghardt",National Association for the Advancement of C...,New York,52,22.9,15.2
1415,"Johnson, Georgia Douglas",A Sonnet in Memory of John Brown,poetry,169-169,24,4,1922-08-01,Crisis,A Record of the Darker Races,,"Du Bois, W. E. Burghardt",National Association for the Advancement of C...,New York,48,22.9,15.2
964,"Jackson, Virginia P.",Africa,poetry,166-166,17,4,1919-02-01,Crisis,A Record of the Darker Races,Reconstruction Number,"Du Bois, W. E. Burghardt",National Association for the Advancement of C...,New York,52,22.9,15.2
1208,"Jordan, Winifred Virginia",Values,poetry,15-15,21,1,1920-11-01,Crisis,A Record of the Darker Races,,"Du Bois, W. E. Burghardt",National Association for the Advancement of C...,New York,52,22.9,15.2
639,"Stevenson, Helen",How I Grew My Corn,articles,273-274,12,6,1916-10-01,Crisis,A Record of the Darker Races,[Children's Number],"Du Bois, W. E. Burghardt",National Association for the Advancement of C...,New York,52,22.9,15.2


---

### Seattle Public Library Circulation Data
(2005-present)

Get the data: {download}`Download Seattle Public Library Circulation Data (Books checked out > 10 times between 2010-2017) <../data/Seattle_Book_Checkouts_2010_2017.zip>` // [Link to All Seattle Public Library Circulation Data (2005-present)](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6/data)  
Source: [City of Seattle](http://www.seattle.gov/tech/initiatives/open-data/about-the-open-data-program)

Brief description:  
The [Seattle Public Library check-out data link](https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6/data) takes you to a database that contains circulation data about the Seattle Public Library system from 2005 until the present. You can filter this database or search for keywords (e.g., "James Baldwin") and then export the filtered data by clicking "Export" and choosing your desired file type.

Example uses:
- ["James Baldwin Checkouts from the SPL,"](http://tweetsofanativeson.com/Seattle-Public-Library/) Melanie Walsh

Preview dataset

In [21]:
pd.read_csv("../data/Seattle_Book_Checkouts_2010_2017.csv").sample(5)

Unnamed: 0,UsageClass,CheckoutType,MaterialType,CheckoutYear,CheckoutMonth,Checkouts,Title,Creator,Subjects,Publisher,PublicationYear
259350,Physical,Horizon,BOOK,2012,3,14,The growing story / by Ruth Krauss ; illustrat...,"Krauss, Ruth","Seasons Juvenile fiction, Growth Juvenile fiction","HarperCollins,",2007
580809,Physical,Horizon,BOOK,2014,2,10,Frommers easyguide to New Orleans 2014,,New Orleans La Guidebooks,,
116767,Physical,Horizon,BOOK,2010,3,32,The privileges : a novel / Jonathan Dee.,"Dee, Jonathan","Rich people Fiction, Families Fiction, New Yor...","Random House,",2010
241495,Physical,Horizon,BOOK,2015,8,12,ABC school's for me! / written by Susan B. Kat...,"Katz, Susan B., 1971-","Alphabet Juvenile fiction, Schools Juvenile fi...","Scholastic Press,",[2015]
548316,Physical,Horizon,BOOK,2014,3,100,Uncataloged Folder or Bag--GWD,,,,


---

### Game of Thrones Character Relationships

Get the data: {download}`Download Game of Thrones Character data <../data/game-of-thrones-characters.zip>`  
Source: A. Beveridge and J. Shan

Brief description:  

The {download}`Game of Thrones Character Relationships zip file <../data/game-of-thrones-characters.zip>` contains network data for character relationships within George R. R. Martin's *A Storm of Swords*, the third novel in his series *A Song of Ice and Fire* (adapted for television by HBO as *Game of Thrones*). This data was originally compiled by A. Beveridge and J. Shan for their article, ["Network of Thrones"](https://www.maa.org/sites/default/files/pdf/Mathhorizons/NetworkofThrones%20%281%29.pdf).

The nodes csv contains 107 different characters, and the edges csv contains 353 weighted relationships between those characters, which were calculated based on how many times two characters' names appeared within 15 words of one another in the novel. For more on the methodology, see Beveridge and Shan's [original article](https://www.maa.org/sites/default/files/pdf/Mathhorizons/NetworkofThrones%20%281%29.pdf).
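
To make the idea of a co-occurrence weight concrete, here is a minimal sketch that counts how often two names fall within 15 words of each other. It is an illustration only, not Beveridge and Shan's actual procedure: the function name `cooccurrence_weights`, the whitespace tokenization, and the restriction to single-word names are all simplifying assumptions, and the example sentence is made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(tokens, names, window=15):
    """Count pairs of different names whose positions are at most
    `window` tokens apart. Each pair is stored in alphabetical order."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in names]
    weights = Counter()
    for (i, first), (j, second) in combinations(positions, 2):
        if first != second and j - i <= window:
            weights[tuple(sorted((first, second)))] += 1
    return weights


# Made-up sentence, tokenized naively on whitespace.
tokens = "Jon spoke to Aemon while Jon waited".split()
weights = cooccurrence_weights(tokens, {"Jon", "Aemon"})
print(weights)
```

Each pair of name mentions counts once, so two mentions of Jon near one mention of Aemon yield a weight of 2. Multi-word names such as "Robert Arryn" would need extra handling that this sketch leaves out.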

Example uses:
- ["Network of Thrones,"](https://www.maa.org/sites/default/files/pdf/Mathhorizons/NetworkofThrones%20%281%29.pdf) A. Beveridge and J. Shan
- ["Network Analysis,"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Network-Analysis/Network-Analysis.html) *Introduction to Cultural Analytics & Python*

Preview dataset

In [60]:
pd.read_csv("../data/got-edges.csv").sample(10)

Unnamed: 0,Source,Target,Weight
15,Arya,Jaime,11
154,Jon,Aemon,30
150,Joffrey,Tommen,9
170,Jon,Stannis,9
152,Jojen,Meera,33
140,Joffrey,Gregor,5
241,Robb,Theon,11
250,Robert Arryn,Marillion,4
2,Aerys,Jaime,18
274,Sansa,Loras,14


---

## Politics 🗳️ & History 📜


### *The New York Times* Obituaries
(1852-2007)

Get the data: {download}`Download *New York Times* Obituaries data <../texts/history/NYT-Obituaries.zip>`     
Original source: [Matthew Lavin](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf#lesson-dataset)

Brief description:  

The {download}`*New York Times* obituaries zip file (.zip) of text files (.txt) <../texts/history/NYT-Obituaries.zip>` contains 379 *New York Times* obituaries (1852-2007) based on those collected by Matt Lavin for his *Programming Historian* tutorial, [Analyzing Documents with TF-IDF](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf#lesson-dataset).

I re-scraped the 366 obituaries included in Lavin's tutorial so that the obituary subject's name and death year are included in each text file name. I also added 13 more ["Overlooked"](https://www.nytimes.com/interactive/2018/obituaries/overlooked.html) obituaries: belated obituaries of remarkable women and minorities who did not receive a *NYT* obituary at the time of their death. Obituary subjects include academics, military generals, artists, athletes, activists, politicians, and businesspeople, such as Ada Lovelace, Ulysses Grant, Marilyn Monroe, Virginia Woolf, Jackie Robinson, Marsha P. Johnson, Cesar Chavez, John F. Kennedy, Ray Kroc, and many more.

Example uses:
- [Analyzing Documents with TF-IDF](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf), Matt Lavin
- ["Topic Modeling Text Files"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling-Text-Files.html), *Introduction to Cultural Analytics & Python*

Preview dataset

In [26]:
print(open("../texts/history/NYT-Obituaries/1852-Ada-Lovelace.txt").read())

A gifted mathematician who is now recognized as the first computer programmer.By CLAIRE CAIN MILLER

 A century before the dawn of the computer age, Ada Lovelace imagined the modern-day, general-purpose computer. It could be programmed to follow instructions, she wrote in 1843. It could not just calculate but also create, as it “weaves algebraic patterns just as the Jacquard loom weaves flowers and leaves.” The computer she was writing about, the British inventor Charles Babbage’s Analytical Engine, was never built. But her writings about computing have earned Lovelace – who died of uterine cancer in 1852 at 36 – recognition as the first computer programmer. 

 The program she wrote for the Analytical Engine was to calculate the seventh Bernoulli number. (Bernoulli numbers, named after the Swiss mathematician Jacob Bernoulli, are used in many different areas of mathematics.) But her deeper influence was to see the potential of computing. The machines could go beyond calculati

---
        
### U.S. Inaugural Addresses
(1789-2017)

Get the data: {download}`Download U.S. Inaugural Addresses data <../texts/history/US_Inaugural_Addresses.zip>`

Brief description: 

The {download}`U.S. Inaugural Addresses zip file (.zip) of text files (.txt) <../texts/history/US_Inaugural_Addresses.zip>` contains U.S. Inaugural Addresses ranging from President George Washington (1789) to President Donald Trump (2017). Each text file is titled with a number, the corresponding last name of the U.S. President, and the corresponding year of the Inaugural Address.
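
Because each file name encodes a number, a last name, and a year, those fields can be recovered without opening the file. The sketch below assumes the pattern seen in the preview path (`56_obama_2009.txt`); the helper name `parse_inaugural_filename` and the returned keys are hypothetical.

```python
from pathlib import Path


def parse_inaugural_filename(filename):
    """Split a name like '56_obama_2009.txt' into number, president, and year.

    The first underscore separates the number and the last one separates
    the year, so a last name that itself contains underscores stays intact.
    """
    stem = Path(filename).stem
    number, rest = stem.split("_", 1)
    last_name, year = rest.rsplit("_", 1)
    return {"number": int(number), "president": last_name, "year": int(year)}


info = parse_inaugural_filename(
    "../texts/history/US_Inaugural_Addresses/56_obama_2009.txt"
)
print(info)  # {'number': 56, 'president': 'obama', 'year': 2009}
```

Parsing every file name this way gives a small metadata table that can be used to sort the addresses by year or group them by president.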

Example uses:
- ["jsLDA: In-browser topic modeling,"](https://github.com/mimno/jsLDA) David Mimno

Preview dataset

In [65]:
print(open("../texts/history/US_Inaugural_Addresses/56_obama_2009.txt").read())

Barack Obama	1/20/2009	My fellow citizens, I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our Nation, as well as the generosity and cooperation he has shown throughout this transition.  Forty-four Americans have now taken the Presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet every so often, the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because we the people have remained faithful to the ideals of our forebears and true to our founding documents.  So it has been; so it must be with this generation of Americans.  That we are in the midst of crisis is now well understood. Our Nation is at war against a far-reaching network of violence and hatred. Our economy is badl

---

### Nobel Prize Winners
(1901-2017)

Get the data: {download}`Download Nobel Prize Winners data <../data/nobel-prize-winners.zip>`  
Original source: The European Data Portal and [the official Nobel Prize API](https://www.nobelprize.org/about/developer-zone-2/)

Brief description:  

The {download}`Nobel Prize winners CSV file (.csv) <../data/nobel-prize-winners.zip>` contains information about 957 Nobel Prize winners from 1901 to 2017. This information includes each Nobel laureate's name, birth and death dates (if applicable), birth and death locations (plus **latitude and longitude coordinates** for the locations), the year they won the Nobel Prize, the category of the prize, and the "motivation" for the prize.
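
The combined coordinate columns store latitude and longitude together in one string. In the preview's `diedCoordinates` column, the order appears to be latitude first, then longitude. A minimal sketch for splitting such a string follows; the function name `parse_coordinates` is hypothetical, and the "lat, long" order is an assumption based on the preview rows.

```python
def parse_coordinates(text):
    """Split a 'lat, long' string into a (latitude, longitude) tuple of floats.

    Returns None for empty values, such as laureates with no death location.
    """
    if text is None or not text.strip():
        return None
    lat_text, long_text = text.split(",")
    latitude, longitude = float(lat_text), float(long_text)
    # Guard against swapped or malformed values.
    if not (-90 <= latitude <= 90 and -180 <= longitude <= 180):
        raise ValueError(f"coordinates out of range: {text!r}")
    return latitude, longitude


# Value copied from the `diedCoordinates` column in the preview.
print(parse_coordinates("-33.4488897, -70.6692655"))
```

The range check is a cheap way to catch a swapped pair before plotting points on a map, although it only catches swaps where the longitude lies outside the range from -90 to 90.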

Nobel laureates include Marie Curie, Johannes Stark, Woodrow Wilson, Jane Addams, Rabindranath Tagore, John Steinbeck, Gabriel García Márquez, Karl Ziegler, Toni Morrison, and many more.

Example uses:
- [Google Maps, QGIS, and Palladio maps](https://github.com/melaniewalsh/geospatial-lab) from a WUSTL digital humanities graduate seminar that I taught

Preview dataset

In [75]:
pd.read_csv("../data/nobel-prize-winners/nobel-prize-winners.csv").sample(10)

Unnamed: 0,name,born,died,bornCountry_original,bornCountry_now,bornCity_original,bornCity_now,bornLocation,bornLong,bornLat,...,diedLong,diedLat,diedCoordinates,gender,year,category,motivation,institutionName,institutionCity,institutionCountry
899,Pablo Neruda,1904-07-12,1973-09-23,Chile,Chile,Parral,Parral,"Parral, Chile",-71.822753,-36.140648,...,-70.669266,-33.44889,"-33.4488897, -70.6692655",male,1971,literature,"""for a poetry that with the action of an eleme...",,,
450,William Bradford Shockley,1910-02-13,1989-08-12,United Kingdom,United Kingdom,London,London,"London, United Kingdom",-0.127758,51.507351,...,-122.14302,37.441883,"37.4418834, -122.1430195",male,1956,physics,"""for their researches on semiconductors and th...",Semiconductor Laboratory of Beckman Instrument...,"Mountain View, CA",USA
790,Ludwig Quidde,1858-03-23,1941-03-04,Germany,Germany,Bremen,Bremen,"Bremen, Germany",8.801694,53.079296,...,6.143158,46.204391,"46.2043907, 6.1431577",male,1927,peace,,,,
741,Rabindranath Tagore,1861-05-07,1941-08-07,India,India,Calcutta,Calcutta,"Calcutta, India",88.363895,22.572646,...,88.363895,22.572646,"22.572646, 88.363895",male,1913,literature,"""because of his profoundly sensitive, fresh an...",,,
846,Henri Bergson,1859-10-18,1941-01-04,France,France,Paris,Paris,"Paris, France",2.352222,48.856614,...,2.352222,48.856614,"48.856614, 2.3522219",male,1927,literature,"""in recognition of his rich and vitalizing ide...",,,
20,James Franck,1882-08-26,1964-05-21,Germany,Germany,Hamburg,Hamburg,"Hamburg, Germany",9.993682,53.551085,...,9.915804,51.54128,"51.5412804, 9.9158035",male,1925,physics,"""for their discovery of the laws governing the...",Goettingen University,Göttingen,Germany
763,Wolfgang Paul,1913-08-10,1993-12-07,Germany,Germany,Lorenzkirch,Lorenzkirch,"Lorenzkirch, Germany",13.243844,51.356003,...,7.098207,50.73743,"50.73743, 7.0982068",male,1989,physics,"""for the development of the ion trap technique""",University of Bonn,Bonn,Federal Republic of Germany
926,Count Maurice (Mooris) Polidore Marie Bernhard...,1862-08-29,1949-05-06,Belgium,Belgium,Ghent,Ghent,"Ghent, Belgium",3.717424,51.054342,...,7.261953,43.710173,"43.7101728, 7.2619532",male,1911,literature,"""in appreciation of his many-sided literary ac...",,,
773,Ernst Otto Fischer,1918-11-10,2007-07-23,Germany,Germany,Munich,Munich,"Munich, Germany",11.58198,48.135125,...,11.58198,48.135125,"48.1351253, 11.5819805",male,1973,chemistry,"""for their pioneering work, performed independ...",Technical University,Munich,Federal Republic of Germany
296,Ferid Murad,1936-09-14,0000-00-00,USA,USA,"Whiting, IN","Whiting, IN","Whiting, IN, USA",-87.494487,41.679758,...,,,,male,1998,medicine,"""for their discoveries concerning nitric oxide...",University of Texas Medical School at Houston,"Houston, TX",USA


---

### Refugee Arrivals to the U.S.
(2005-2015)

Get the data: {download}`Download U.S. Refugee Arrivals data <../data/us-refugee-arrivals.zip>`  
Original source: Department of State's Refugee Processing Center and [Jeremy Singer-Vine](https://github.com/BuzzFeedNews/2015-11-refugees-in-the-united-states)

Brief description:   

The {download}`U.S. Refugee Arrivals zip file <../data/us-refugee-arrivals.zip>` contains data about refugee arrivals to the United States between 2005 and 2015. This data was originally compiled from the Department of State's Refugee Processing Center by Jeremy Singer-Vine for his BuzzFeed article ["Where U.S. Refugees Come From – And Go – In Charts"](https://www.buzzfeednews.com/article/jsvine/where-us-refugees-come-from-and-go-in-charts#.vooNwy74jO).

The "refugee-arrivals-by-destination" csv contains information about the number of refugees who arrived in each U.S. city and state, the year that they arrived, and the country from which they arrived. The "refugee-arrivals-by-religion" csv contains information about the number of refugees who arrived in the U.S., the year in which they arrived, and their religious affiliation.

Example uses:
- ["Where U.S. Refugees Come From – And Go"](https://www.buzzfeednews.com/article/jsvine/where-us-refugees-come-from-and-go-in-charts#.vooNwy74jO), Jeremy Singer-Vine
- [Tableau map](https://public.tableau.com/profile/melanie.walsh#!/vizhome/RefugeeArrivalstotheU_S_2005-2015/TotalRefugeeArrivalstoU_S_2005-2015) from a WUSTL digital humanities graduate seminar that I taught

Preview dataset

In [76]:
pd.read_csv("../data/us-refugee-arrivals/refugee-arrivals-by-destination.csv").sample(10)

Unnamed: 0,year,origin,dest_state,dest_city,arrivals
23778,2012,Bosnia and Herzegovina,Florida,Orlando,0
118184,2005,Sudan,Washington,Federal Way,2
73846,2008,Afghanistan,New York,Bronx,0
80869,2013,Yemen,New York,New York,0
92126,2006,China,Pennsylvania,Philadelphia,0
114051,2008,Bhutan,Washington,Des Moines,0
32673,2008,Nigeria,Georgia,Decatur,15
98287,2007,Colombia,Tennessee,Nashville,0
51835,2008,Burma,Massachusetts,Lexington,0
38155,2012,Iraq,Illinois,Tinley Park,0


---

### Irish Immigrants Admitted to NYC's Bellevue Almshouse
(1840s)

Get the data: [Link to Bellevue Almshouse data](https://docs.google.com/spreadsheets/d/1uf8uaqicknrn0a6STWrVfVMScQQMtzYf5I_QyhB9r7I/edit#gid=2057113261)  
Source: Anelise Shrout, [Digital Almshouse Project](https://www.nyuirish.net/almshouse/)

Brief description:   

The [Bellevue Almshouse link](https://docs.google.com/spreadsheets/d/1uf8uaqicknrn0a6STWrVfVMScQQMtzYf5I_QyhB9r7I/edit#gid=2057113261) takes you to a Google spreadsheet that contains data about Irish-born immigrants who were admitted to the Bellevue Almshouse in the 1840s. The Bellevue Almshouse was part of New York City's public health system, a place where poor, sick, homeless, and otherwise marginalized people were sent, sometimes voluntarily and sometimes forcibly. This dataset was transcribed from the almshouse's own admissions records by Anelise Shrout. For more information about this dataset, see [The Almshouse Records](https://www.nyuirish.net/almshouse/the-almshouse-records/).

Example uses:
- <a href="http://crdh.rrchnm.org/essays/v01-10-(re)-humanizing-data/">“(Re)Humanizing Data: Digitally Navigating the Bellevue Almshouse”</a>, Anelise Hanson Shrout
- ["Pandas Basics Part 1"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part1.html), *Introduction to Cultural Analytics & Python*

Preview dataset

```python
import pandas as pd
pd.read_csv("../data/bellevue_almshouse_modified.csv").sample(5)
```

| | date_in | first_name | last_name | age | disease | profession | gender | children |
|---|---|---|---|---|---|---|---|---|
| 6247 | 1847-08-05 | Jerry | Donohue | 30.0 | sickness | laborer | m | |
| 9406 | 1846-11-09 | James | Wilson | 4.0 | | laborer | m | |
| 7489 | 1847-11-23 | Thomas | Wood | 38.0 | sickness | laborer | m | |
| 8361 | 1847-06-01 | Rose | Hall | 32.0 | sickness | married | w | |
| 7933 | 1847-07-08 | Mary | Creighton | 62.0 | sickness | widow | w | |

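Because the Bellevue Almshouse data is shared as a Google spreadsheet, it can be convenient to read it straight into pandas rather than downloading a file first. The sketch below converts the sharing link into a CSV export link. It assumes the spreadsheet is publicly viewable and that Google's `export?format=csv&gid=...` URL pattern applies to it; the helper name `sheet_csv_url` is hypothetical and not part of any library.

```python
import re


def sheet_csv_url(edit_url):
    """Turn a Google Sheets 'edit' link into a CSV export link (assumed URL pattern)."""
    # The spreadsheet ID is the path segment that follows /spreadsheets/d/
    sheet_id = re.search(r"/spreadsheets/d/([^/]+)", edit_url).group(1)
    # The gid identifies the individual tab; fall back to the first tab if absent
    gid_match = re.search(r"gid=(\d+)", edit_url)
    gid = gid_match.group(1) if gid_match else "0"
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}"


bellevue_url = sheet_csv_url(
    "https://docs.google.com/spreadsheets/d/1uf8uaqicknrn0a6STWrVfVMScQQMtzYf5I_QyhB9r7I/edit#gid=2057113261"
)
print(bellevue_url)

# Reading the data needs network access, so it is left commented out here:
# import pandas as pd
# bellevue_df = pd.read_csv(bellevue_url)
```

If the spreadsheet's sharing settings change, the export link may stop working, in which case downloading the sheet manually as a CSV file is the fallback.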

---

## Social Media 🕸️


### Donald Trump's Tweets
(2009-2021)

Get the data: {download}`Download Donald Trump tweets data <../texts/social-media/Trump-Tweets_2009-2021.csv>`  
Original source: [Trump Twitter Archive](https://www.thetrumparchive.com/)

Brief description:   

The {download}`Donald Trump tweets CSV file (.csv) <../texts/social-media/Trump-Tweets_2009-2021.csv>` contains nearly 30,000 tweets from Donald Trump's account from 2009 to January 2021. The information about each tweet includes the source, tweet text, date of tweet, as well as retweet and favorite counts. Data can be downloaded at [Trump Twitter Archive](https://www.thetrumparchive.com/).
 
Example uses:
- ["Topic Modeling Time Series"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/11-Topic-Modeling-Time-Series.html), *Introduction to Cultural Analytics & Python*
- ["How Trump Reshaped the Presidency in 11,000 Tweets"](https://www.nytimes.com/interactive/2019/11/02/us/politics/trump-twitter-presidency.html), *New York Times*

Preview dataset

```python
import pandas as pd
pd.read_csv("../texts/social-media/Trump-Tweets_2009-2021.csv").sample(5)
```

| | id | text | isRetweet | isDeleted | device | favorites | retweets | date | isFlagged |
|---|---|---|---|---|---|---|---|---|---|
| 50174 | 3.11868e+17 | @heyavampirebat Good idea--I wish him luck. | f | f | Twitter Web Client | 0 | 0 | 3/13/13 15:54 | f |
| 44483 | 4.01686e+17 | The real J.P.Morgan is spinning in his grave a... | f | f | Twitter for Android | 85 | 94 | 11/16/13 12:19 | f |
| 479 | 1.33739e+18 | Now that the Biden Administration will be a sc... | f | f | Twitter for iPhone | 266423 | 51993 | 12/11/20 13:16 | f |
| 3426 | 1.30923e+18 | On my way to North Carolina! https://t.co/IHrU... | f | f | Twitter for iPhone | 53446 | 10612 | 9/24/20 20:29 | f |
| 22739 | 1.00142e+18 | Why aren’t the 13 Angry and heavily conflicted... | f | f | Twitter for iPhone | 67370 | 14586 | 5/29/18 11:09 | f |

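The `date` column in the tweets data is stored as text, so a common first step for any time-based analysis is to convert it to real timestamps. The sketch below does this for three rows copied from the preview and counts tweets per year. The format string is an assumption based on how dates appear in the preview (month/day/two-digit year with 24-hour time), and the `sample` DataFrame is a stand-in for the full CSV file.

```python
import pandas as pd

# Three rows copied from the preview; the real file has many more.
sample = pd.DataFrame({
    "text": [
        "@heyavampirebat Good idea--I wish him luck.",
        "On my way to North Carolina! https://t.co/IHrU...",
        "Now that the Biden Administration will be a sc...",
    ],
    "retweets": [0, 10612, 51993],
    "date": ["3/13/13 15:54", "9/24/20 20:29", "12/11/20 13:16"],
})

# Assumed format: month/day/two-digit year, then hours:minutes on a 24-hour clock
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%y %H:%M")

# Count how many tweets fall in each calendar year
tweets_per_year = sample.groupby(sample["date"].dt.year).size()
print(tweets_per_year)
```

To run the same steps on the full dataset, replace `sample` with the DataFrame returned by `pd.read_csv`, and check that the format string matches the dates in the file.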

---

### "Am I The Asshole?" Reddit Posts

Get the data: {download}`Download "Am I The Asshole?" Reddit data <../data/top-reddit-aita-posts.csv>`

Brief description:  

The {download}`"Am I The Asshole?" Reddit posts CSV file (.csv) <../data/top-reddit-aita-posts.csv>` contains 2,932 Reddit posts from the subreddit "Am I the Asshole?" that have an upvote score of at least 2,000. The information in the dataset includes the date of the post, title, body text, url, upvote score, number of comments, and number of crossposts. This data was collected with [PSAW](https://github.com/dmarx/psaw), a wrapper for the [Pushshift API](https://github.com/pushshift/api).

Example uses:
- ["Topic Modeling CSV Files"](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling-CSV.html), *Introduction to Cultural Analytics & Python*

Preview dataset

```python
import pandas as pd
pd.read_csv("../data/top-reddit-aita-posts.csv").sample(5)
```

| | author | full_date | date | title | selftext | url | subreddit | upvote_score | num_comments | num_crossposts |
|---|---|---|---|---|---|---|---|---|---|---|
| 1887 | carmelacorleone | 2019-07-02 20:54:36+00:00 | 2019-07-02 | AITA for cussing at my racist MIL | Hello. I am a 24F engaged to be married. My fi... | https://www.reddit.com/r/AmItheAsshole/comment... | AmItheAsshole | 2581 | 654 | 0 |
| 2086 | Maximum_Silver | 2019-06-10 23:35:03+00:00 | 2019-06-10 | WIBTA for forbidding my Father in law from see... | My father in law is the king of passive aggres... | https://www.reddit.com/r/AmItheAsshole/comment... | AmItheAsshole | 16249 | 1970 | 0 |
| 2197 | GeneralJen8 | 2019-05-31 16:28:55+00:00 | 2019-05-31 | AITA for shouting at protesters? | My (f26) mum died of cancer when I was 15. For... | https://www.reddit.com/r/AmItheAsshole/comment... | AmItheAsshole | 2924 | 636 | 0 |
| 2718 | CrazyBirboLady | 2019-03-29 21:55:46+00:00 | 2019-03-29 | AITA for telling a “friend” that at least “my ... | I’m a “small” chested woman. Not even that sma... | https://www.reddit.com/r/AmItheAsshole/comment... | AmItheAsshole | 16049 | 2641 | 0 |
| 560 | Concerned_boyfriend7 | 2019-10-22 20:04:20+00:00 | 2019-10-22 | AITA for purposely scaring my girlfriend when ... | My (20m) girlfriend (21f) commutes to and from... | https://www.reddit.com/r/AmItheAsshole/comment... | AmItheAsshole | 2018 | 799 | 0 |

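The titles in the preview begin with either "AITA" or "WIBTA", which suggests one simple way to categorize posts before further analysis. The sketch below labels each post by its opening acronym and sorts by upvote score. It uses three rows copied from the preview as a stand-in for the full file, and the `post_type` column name is hypothetical; titles in the full dataset may start with other prefixes, which this pattern would leave unlabeled.

```python
import pandas as pd

# Three rows copied from the preview; the real file has 2,932 posts.
posts = pd.DataFrame({
    "title": [
        "AITA for cussing at my racist MIL",
        "WIBTA for forbidding my Father in law from see...",
        "AITA for shouting at protesters?",
    ],
    "upvote_score": [2581, 16249, 2924],
    "num_comments": [654, 1970, 636],
})

# Label each post by the acronym its title opens with (NaN if neither matches)
posts["post_type"] = posts["title"].str.extract(r"^(AITA|WIBTA)", expand=False)

# Put the highest-scoring posts first
top_posts = posts.sort_values("upvote_score", ascending=False)
print(top_posts[["title", "post_type", "upvote_score"]])
```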

---

## Food 🍔


### The New York Public Library's Menu Dataset
(1840s-present)

Get the data: [Link to The New York Public Library's Menu Dataset](http://menus.nypl.org/data)  
Source: The New York Public Library

Brief description:   

The [New York Public Library's menu dataset link](http://menus.nypl.org/data) takes you to a web page where you can download data from the New York Public Library's massive menu collection: tens of thousands of transcribed menus and menu items from the 1840s to the present. Click "Download the latest data export in CSV format" for the most up-to-date menu data.

Example uses:
- [*Curating Menus*](http://curatingmenus.org/), Katie Rawson and Trevor Muñoz

Preview dataset

```python
import pandas as pd
pd.read_csv("../data/NYPL_Menus_2020_11_01_07/Dish.csv").sample(5)
```

| | id | name | description | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price |
|---|---|---|---|---|---|---|---|---|---|
| 370442 | 456603 | Sliced Cucumber salad | | 2 | 2 | 0 | 0 | 0.0 | 0.0 |
| 88857 | 111607 | Quail Pie, New England Style | | 1 | 1 | 1899 | 1899 | 0.0 | 0.0 |
| 37386 | 48141 | Grape Sherbet | | 7 | 9 | 1901 | 1973 | 0.25 | 0.25 |
| 212711 | 270784 | Pressed Lamb's Tongue, Chow Chow | | 1 | 1 | 1961 | 1961 | 0.0 | 0.0 |
| 176554 | 223557 | Sundaes Chocolate | | 1 | 1 | 1965 | 1965 | 0.75 | 0.75 |

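One row in the preview has `first_appeared` and `last_appeared` values of 0, which is worth handling before doing any analysis by year. The sketch below drops such rows and computes how many years separate a dish's first and last appearance. Treating 0 as "year unknown" is an assumption based only on the preview, not on NYPL documentation, and the `years_on_menus` column name is hypothetical. The `dishes` DataFrame uses three rows copied from the preview as a stand-in for `Dish.csv`.

```python
import pandas as pd

# Three rows copied from the preview; the real Dish.csv is far larger.
dishes = pd.DataFrame({
    "name": ["Sliced Cucumber salad", "Quail Pie, New England Style", "Grape Sherbet"],
    "first_appeared": [0, 1899, 1901],
    "last_appeared": [0, 1899, 1973],
})

# Assumption: a year of 0 means the year is unknown, so keep only rows with real years
has_years = (dishes["first_appeared"] > 0) & (dishes["last_appeared"] > 0)
dated = dishes[has_years].copy()

# Number of years between a dish's first and last recorded appearance
dated["years_on_menus"] = dated["last_appeared"] - dated["first_appeared"]
print(dated[["name", "first_appeared", "last_appeared", "years_on_menus"]])
```

The same caution may apply to the price columns, where 0.0 could mean either a free item or a missing price; the preview alone does not settle which.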

---

## Other Dataset Compilations

Below are some other great compilations of cultural and humanities-related datasets:
- Jeremy Singer-Vine's [*Data Is Plural* archive](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0) (you can [subscribe to his excellent dataset newsletter here](https://tinyletter.com/data-is-plural))
- *The Pudding*'s [GitHub repository](https://github.com/the-pudding/data)
- Alan Liu's [DH Toychest](http://dhresourcesforprojectbuilding.pbworks.com/w/page/69244469/Data%20Collections%20and%20Datasets)
- Reddit's [r/datasets subreddit](https://www.reddit.com/r/datasets/)

## Suggestions?

If you'd like to add a dataset or an example use case, please open [an issue on GitHub](https://github.com/melaniewalsh/Intro-Cultural-Analytics/issues) or email me at melwalsh@uw.edu