{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "IZtnwJbt6oJT"
},
"source": [
"Created by [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)\n",
"\n",
"**For questions/comments/improvements, email nathan.kelber@ithaka.org.**\n",
"\n",
"____"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring Metadata and Pre-Processing\n",
"\n",
"**Description of methods in this notebook:**\n",
"This [notebook](https://docs.tdm-pilot.org/key-terms/#jupyter-notebook) shows how to explore and pre-process the [metadata](https://docs.tdm-pilot.org/key-terms/#metadata) of a [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) using [Pandas](https://docs.tdm-pilot.org/key-terms/#pandas). \n",
"\n",
"The following processes are described:\n",
"\n",
"* Importing a [CSV file](https://docs.tdm-pilot.org/key-terms/#csv-file) containing the [metadata](https://docs.tdm-pilot.org/key-terms/#metadata) for a given dataset ID\n",
"* Creating a [Pandas](https://docs.tdm-pilot.org/key-terms/#pandas) dataframe to view the [metadata](https://docs.tdm-pilot.org/key-terms/#metadata)\n",
"* Pre-processing your [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) by filtering out unwanted texts\n",
"* Exporting a list of relevant IDs to a [CSV file](https://docs.tdm-pilot.org/key-terms/#csv-file)\n",
"* Visualizing the metadata of your pre-processed [dataset](https://docs.tdm-pilot.org/key-terms/#dataset) by the number of documents/year and pages/year\n",
"\n",
"**Use Case:** For Learners (Detailed explanation, not ideal for researchers)\n",
"\n",
"**Difficulty:** Intermediate\n",
"\n",
"**Completion time:** 45 minutes\n",
"\n",
"**Knowledge Required:** \n",
"* Python Basics Series ([Start Python Basics I](./python-basics-1.ipynb))\n",
"\n",
"**Knowledge Recommended:**\n",
"\n",
"* [Pandas I](./pandas-1.ipynb)\n",
"\n",
"**Data Format:** [CSV file](https://docs.tdm-pilot.org/key-terms/#csv-file)\n",
"\n",
"**Libraries Used:**\n",
"* [tdm_client](https://docs.tdm-pilot.org/key-terms/#tdm-client) to retrieve the [metadata](https://docs.tdm-pilot.org/key-terms/#metadata) in a [CSV file](https://docs.tdm-pilot.org/key-terms/#csv-file)\n",
"* [Pandas](https://docs.tdm-pilot.org/key-terms/#pandas) to manipulate and visualize the metadata\n",
"\n",
"**Research Pipeline:** None\n",
"____"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import your dataset\n",
"\n",
"We'll use the `tdm_client` library to automatically retrieve the [metadata](https://docs.tdm-pilot.org/key-terms/#metadata) for a [dataset](https://docs.tdm-pilot.org/key-terms/#dataset). We can retrieve [metadata](https://docs.tdm-pilot.org/key-terms/#metadata) in a [CSV file](https://docs.tdm-pilot.org/key-terms/#csv-file) using the `get_metadata` method.\n",
"\n",
"Enter a [dataset ID](https://docs.tdm-pilot.org/key-terms/#dataset-ID) in the next code cell. \n",
"\n",
"If you don't have a dataset ID, you can:\n",
"* Use the sample dataset ID already in the code cell\n",
"* [Create a new dataset](https://tdm-pilot.org/builder)\n",
"* [Use a dataset ID from other pre-built sample datasets](https://tdm-pilot.org/dataset/dashboard)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Creating a variable `dataset_id` to hold our dataset ID\n",
"# The default dataset is Shakespeare Quarterly, 1950-present\n",
"dataset_id = \"7e41317e-740f-e86a-4729-20dab492e925\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, import the `tdm_client` and pass the `dataset_id` as an argument to the `get_metadata` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Import the `tdm_client`\n",
"import tdm_client\n",
"\n",
"# Pull in our dataset metadata CSV using the `get_metadata` method\n",
"dataset_metadata = tdm_client.get_metadata(dataset_id)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are ready to import pandas for our analysis and create a dataframe. We will use the `read_csv()` method to create our dataframe from the CSV file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import pandas \n",
"import pandas as pd\n",
"\n",
"# Create our dataframe\n",
"df = pd.read_csv(dataset_metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can confirm the size of our dataset using the `len()` function on our dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"original_document_count = len(df)\n",
"print('Total original documents:', original_document_count)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's take a look at the data in our dataframe `df`. We will set pandas to show all columns using `set_option()` then get a preview using `head()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set the pandas option to show all columns\n",
"pd.set_option(\"display.max_columns\", None)\n",
"\n",
"# Show the first five rows of our dataframe\n",
"df.head() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Metadata Type by Column Name\n",
"\n",
"Here are descriptions for the metadata types found in each column:\n",
"\n",
"|Column Name|Description|\n",
"|---|---|\n",
"|id|a unique item ID (In JSTOR, this is a stable URL)|\n",
"|title|the title for the item|\n",
"|isPartOf|the larger work that holds this title (for example, a journal title)|\n",
"|publicationYear|the year of publication|\n",
"|doi|the digital object identifier for an item|\n",
"|docType|the type of document (for example, article or book)|\n",
"|provider|the source or provider of the dataset|\n",
"|datePublished|the publication date in yyyy-mm-dd format|\n",
"|issueNumber|the issue number for a journal publication|\n",
"|volumeNumber|the volume number for a journal publication|\n",
"|url|a URL for the item and/or the item's metadata|\n",
"|creator|the author or authors of the item|\n",
"|publisher|the publisher for the item|\n",
"|language|the language or languages of the item (eng is the ISO 639 code for English)|\n",
"|pageStart|the first page number of the print version|\n",
"|pageEnd|the last page number of the print version|\n",
"|placeOfPublication|the city of the publisher|\n",
"|wordCount|the number of words in the item|\n",
"|pageCount|the number of print pages in the item|\n",
"|outputFormat|what data is available ([unigrams](https://docs.tdm-pilot.org/key-terms/#unigram), [bigrams](https://docs.tdm-pilot.org/key-terms/#bigram), [trigrams](https://docs.tdm-pilot.org/key-terms/#trigram), and/or full-text)|\n",
"\n",
"If there are any columns you would like to drop from your analysis, you can drop them with:\n",
"\n",
"`df = df.drop(['column_name1', 'column_name2', ...], axis=1)`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Drop each of these named columns\n",
"df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished', 'language'], axis=1)\n",
"\n",
"# Show the first five rows of our updated dataframe\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would like to know if a particular id is in the dataframe, you can use the `in` operator to return a boolean value (True or False). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check if a particular item id is in the `id` column\n",
"'http://www.jstor.org/stable/2868641' in df.id.values"
]
},
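{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check several ids at once, you could use the `isin()` method, which returns a boolean for each row. A minimal sketch using a small example dataframe (the second id here is hypothetical):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"demo = pd.DataFrame({'id': ['http://www.jstor.org/stable/2868641', 'http://www.jstor.org/stable/1234567']})\n",
"\n",
"# True for each row whose id appears in the list\n",
"mask = demo['id'].isin(['http://www.jstor.org/stable/2868641'])\n",
"```"
]
},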
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering Out Unwanted Texts\n",
"\n",
"Now that we have filtered out unwanted metadata columns, we can begin filtering out any texts that may not match our research interests. Let's examine the first and last twenty rows of the dataframe to see if we can identify texts that we would like to remove. We are looking for patterns in the metadata that could help us remove many texts at once."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview the first twenty items in the dataframe\n",
"df.head(20) # Change 20 to view a greater or lesser number of rows"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview the last twenty items in the dataframe\n",
"df.tail(20) # Change 20 to view a greater or lesser number of rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove all rows without data for a particular column\n",
"\n",
"For example, we may wish to remove any texts that do not have authors. (In the case of journals, this may be helpful for removing paratextual sections such as the table of contents, indices, etc.) The column of interest in this case is `creator`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Remove all texts without an author\n",
"df = df.dropna(subset=['creator']) # Drop each row that has no value under 'creator'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Print the total original documents followed by the current number\n",
"print('Total original documents:', original_document_count)\n",
"print('Total current documents: ', len(df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove rows based on the content of a particular column\n",
"\n",
"We can also remove texts that have a particular value in a column. Here are a few examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Remove all items with a particular title\n",
"df = df[df.title != 'Review Article'] # Change `Review Article` to your desired title"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Remove all items with 3000 words or fewer\n",
"df = df[df.wordCount > 3000] # Change `3000` to your desired number"
]
},
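{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several conditions can also be combined into a single filter using `&` (and) or `|` (or), wrapping each condition in parentheses. A minimal sketch using a small example dataframe (the rows here are illustrative):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"demo = pd.DataFrame({\n",
"    'title': ['Review Article', 'Hamlet Study', 'Front Matter'],\n",
"    'wordCount': [500, 8000, 120]\n",
"})\n",
"\n",
"# Keep rows that are not titled 'Review Article' AND have more than 3000 words\n",
"filtered = demo[(demo.title != 'Review Article') & (demo.wordCount > 3000)]\n",
"```"
]
},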
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Print the total original documents followed by the current number\n",
"print('Total original documents:', original_document_count)\n",
"print('Total current documents: ', len(df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Take a final look at your dataframe to make sure the current texts fit your research goals. In the next step, we will save the IDs of your pre-processed dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Preview the first 50 lines of your dataset\n",
"df.head(50)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving a list of IDs to a CSV file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write the column \"id\" to a CSV file called `pre-processed_###.csv` where ### is the `dataset_id`\n",
"df[\"id\"].to_csv('data/pre-processed_' + dataset_id + '.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the \"pre-processed_###.csv\" file (where ### is the `dataset_id`) for future analysis. You can use this file in combination with the dataset ID to automatically filter your texts and reduce the processing time of your analyses."
]
},
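{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch, assuming the file path used above, the saved IDs could later be loaded back into a Python list like this:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Read the saved CSV back in and pull out the `id` column as a list\n",
"saved = pd.read_csv('data/pre-processed_' + dataset_id + '.csv')\n",
"relevant_ids = saved['id'].tolist()\n",
"```"
]
},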
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Visualizing the Pre-Processed Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "upxLwnZr6oJs"
},
"outputs": [],
"source": [
"# Group the data by publication year and plot the count of ids as a bar chart\n",
"df.groupby(['publicationYear'])['id'].agg('count').plot.bar(title='Documents by year', figsize=(20, 5), fontsize=12); \n",
"\n",
"# Read more about Pandas dataframe plotting here: \n",
"# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "fNp2wN8I6oJt"
},
"source": [
"Now let's look at the total page count by year."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "uifT8Ocy6oJu"
},
"outputs": [],
"source": [
"# Group the data by publication year and plot the sum of page counts as a bar chart\n",
"df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(title='Pages by year', figsize=(20, 5), fontsize=12);"
]
}
],
"metadata": {
"colab": {
"name": "2-metadata.ipynb",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}