{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/kasparvonbeelen/ghi_python/main?labpath=8_-_Data_Exploration_with_Pandas_I.ipynb)\n", "\n", "# Lecture 8: Exploring Tabular Data\n", "\n", "## Data Science for Historians (with Python)\n", "## A Gentle Introduction to Working with Data in Python\n", "\n", "### Created by Kaspar Beelen and Luke Blaxill\n", "\n", "### For the German Historical Institute, London\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8.0 Overview\n", "\n", "In this notebook, we have a closer look at how to work with metadata. Using an example taken from the British Library Catalogue, this notebook demonstrates how to work with tabular data in a programmatic way using the Python library Pandas. \n", "\n", "More precisely, this lecture covers how to:\n", "\n", "- load a .CSV file as a Pandas data frame\n", "- select rows in a data frame\n", "- manipulate values in a data frame\n", "- sort data frames by column\n", "- make simple plots\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8.1 Introduction\n", "\n", "In this lecture we turn to working with (semi-)structured data. \n", "\n", "We referred to text as 'unstructured' because Python initially reads the document as a sequence of characters. Most of our effort went into wrangling 'raw' text into more meaningful representations, for example by detecting and counting words.\n", "\n", "In the coming lectures, we will inspect tabular or structured data. Tabular data consists of **rows** and **columns**. The rows represent individual **records**, which can be basically anything: a book, a measurement, a person... The columns are the **attributes** of these records; they capture the **features** of each observation. 
\n", "\n", "Spreadsheets are a common format for tabular data: documents which you can open and edit with programs such as Microsoft Excel.\n", "\n", "Without further ado, let's look at a concrete example: structured metadata from the British Library [catalogue](https://www.bl.uk/collection-metadata/metadata-services)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8.2 CSV Data: Metadata on the British Library Books Corpus\n", "\n", "In this notebook, we will have a closer look at the [British Library Book corpus](https://www.bl.uk/collection-guides/digitised-printed-books) (BLB). This corpus contains around 60,000 books dating primarily from the 19th century. Its contents are freely accessible and have proved a rich resource for previous and ongoing research projects. \n", "\n", "One problem with this corpus, however, is its composition: while it is large, it remains unclear which types of content have been selected. The criteria for inclusion remain somewhat of a mystery, and understanding the contours of the corpus is a non-trivial task that requires additional research at the level of corpus metadata.\n", "\n", "We therefore focus on exploring the metadata of this collection, which is available as a CSV file. In this notebook, we show you how to explore the BLB metadata and get a better grip on the composition and contours of a large corpus. \n", "\n", "The data is available by following this link: `https://bl.iro.bl.uk/downloads/e1be1324-8b1a-4712-96a7-783ac209ddef?locale=en`. We first inspect the tabular format and, later on, explore data frames as a data structure in more detail.\n", "\n", "In the code below we use the `requests` library to download the data and save it in the `data` variable. We then print the first 400 characters." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import requests # requests library\n", "link = 'https://bl.iro.bl.uk/downloads/e1be1324-8b1a-4712-96a7-783ac209ddef?locale=en' # define location of data with url\n", "data = requests.get(link).text # get text" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'BL record ID,Type of resource,Name,Dates associated with name,Type of name,Role,All names,Title,Variant titles,Series title,Number within series,Country of publication,Place of publication,Publisher,Date of publication,Edition,Physical description,Dewey classification,BL shelfmark,Topics,Genre,Languages,Notes,BL record ID for physical resource\\n014602826,Monograph,\"Yearsley, Ann\",1753-1806,person,,'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[:400]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you may notice, these data are just text, i.e. the metadata is initially just a string. We can confirm this by printing the data `type`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But hold on. Didn't you tell us previously we'd be working with **structured** data? Yes, but let's have a look at the data in its 'raw' format first. \n", "\n", "What we printed in the cell above are the column names, followed by the start of the first record. You can observe how each name is separated by a comma (`,`). Also, spot the newline character `\\n`, which marks the end of a row. \n", "\n", "While our data is, initially, just a text file, you notice that the BLB metadata has an implicit structure, determined by commas (cell boundaries) and hard returns (row boundaries). This format is commonly referred to as CSV, i.e. 
'Comma Separated Values'. You will encounter this format regularly when working with data in the Digital Humanities. \n", "\n", "The first row of a CSV file usually contains the column headers, which provide semantic information about the content of a column or, put differently, the attributes of each record. \n", "\n", "The BL books data contains the following columns:\n", "\n", "```\n", "BL record ID,Type of resource,Name,Dates associated with name,Type of name,Role,All names,Title,Variant titles,Series title,Number within series,Country of publication,Place of publication,Publisher,Date of publication,Edition,Physical description,Dewey classification,BL shelfmark,Topics,Genre,Languages,Notes,BL record ID for physical resource\n", "```\n", "\n", "The first record looks as follows:\n", "\n", "```\n", "014602826,Monograph,\"Yearsley, Ann\",1753-1806,person,,\"More, Hannah, 1745-1833 [person] ; Yearsley, Ann, 1753-1806 [person]\",Poems on several occasions [With a prefatory letter by Hannah More.],,,,England,London,,1786,Fourth edition MANUSCRIPT note,,,Digital Store 11644.d.32,,,English,,003996603\n", "```\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, the first column (`BL record ID`) captures the identifier of a record. The identifier for the first record is `014602826`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8.3 Exploring CSV files as Pandas DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While you could write a script to 'parse' these data, i.e. make the comma-separated structure explicit (remember `text.split(',')`?), there are quite a few tools to help you explore and analyse tabular CSV data. A naive split would also break quoted values that themselves contain commas, such as `\"Yearsley, Ann\"` in the record above. \n", "\n", "In this course, we will be working with Pandas, a popular Python library that covers many data science functionalities.\n", "\n", "Below we import Pandas using the `pd` abbreviation. This is just for convenience, to save us typing characters. 
If we want to call any tool from this library, we just have to prefix it with `pd` instead of `pandas`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can read the CSV file by providing the link to the online document to the `pd.read_csv` function." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\n", " \"https://bl.iro.bl.uk/downloads/e1be1324-8b1a-4712-96a7-783ac209ddef?locale=en\",\n", " index_col='BL record ID'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`read_csv()` takes a string as an argument. This string can either represent a path (e.g. the location of a file on your local hard drive) or a URL (e.g. a link to an online repository). In our case, we provide the URL as an argument. We added one more named argument, `index_col`, to specify the column whose values we want to use as the index for our rows.\n", "\n", "We save the output of this function call in a variable with the name `df` (short for data frame). The function returns a Pandas `DataFrame` object." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A data frame consists of rows and columns. The `.shape` attribute gives you the dimensionality of the data frame, i.e. the number of rows and columns." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(52695, 23)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can observe, the BLB corpus contains 52695 records. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To inspect the column names, print the `.columns` attribute attached to the DataFrame object `df`. This returns the metadata attributes present in the CSV file." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Type of resource', 'Name', 'Dates associated with name',\n", " 'Type of name', 'Role', 'All names', 'Title', 'Variant titles',\n", " 'Series title', 'Number within series', 'Country of publication',\n", " 'Place of publication', 'Publisher', 'Date of publication', 'Edition',\n", " 'Physical description', 'Dewey classification', 'BL shelfmark',\n", " 'Topics', 'Genre', 'Languages', 'Notes',\n", " 'BL record ID for physical resource'],\n", " dtype='object')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, we have rich and detailed metadata on each book in the BLB collection: dates, author names, genre etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the `.head()` method to print the first rows. The code below prints the first three rows." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Type of resourceNameDates associated with nameType of nameRoleAll namesTitleVariant titlesSeries titleNumber within series...Date of publicationEditionPhysical descriptionDewey classificationBL shelfmarkTopicsGenreLanguagesNotesBL record ID for physical resource
BL record ID
14602826MonographYearsley, Ann1753-1806personNaNMore, Hannah, 1745-1833 [person] ; Yearsley, A...Poems on several occasions [With a prefatory l...NaNNaNNaN...1786Fourth edition MANUSCRIPT noteNaNNaNDigital Store 11644.d.32NaNNaNEnglishNaN3996603
14602830MonographA, T.NaNpersonNaNOldham, John, 1653-1683 [person] ; A, T. [person]A Satyr against Vertue. (A poem: supposed to b...NaNNaNNaN...1679NaN15 pages (4°)NaNDigital Store 11602.ee.10. (2.)NaNNaNEnglishNaN1143
14602831MonographNaNNaNNaNNaNNaNThe Aeronaut, a poem; founded almost entirely,...NaNNaNNaN...1816NaN17 pages (8°)NaNDigital Store 992.i.12. (3.)Dublin (Ireland)NaNEnglishNaN22782
\n", "

3 rows × 23 columns

\n", "
" ], "text/plain": [ " Type of resource Name Dates associated with name \\\n", "BL record ID \n", "14602826 Monograph Yearsley, Ann 1753-1806 \n", "14602830 Monograph A, T. NaN \n", "14602831 Monograph NaN NaN \n", "\n", " Type of name Role \\\n", "BL record ID \n", "14602826 person NaN \n", "14602830 person NaN \n", "14602831 NaN NaN \n", "\n", " All names \\\n", "BL record ID \n", "14602826 More, Hannah, 1745-1833 [person] ; Yearsley, A... \n", "14602830 Oldham, John, 1653-1683 [person] ; A, T. [person] \n", "14602831 NaN \n", "\n", " Title \\\n", "BL record ID \n", "14602826 Poems on several occasions [With a prefatory l... \n", "14602830 A Satyr against Vertue. (A poem: supposed to b... \n", "14602831 The Aeronaut, a poem; founded almost entirely,... \n", "\n", " Variant titles Series title Number within series ... \\\n", "BL record ID ... \n", "14602826 NaN NaN NaN ... \n", "14602830 NaN NaN NaN ... \n", "14602831 NaN NaN NaN ... \n", "\n", " Date of publication Edition \\\n", "BL record ID \n", "14602826 1786 Fourth edition MANUSCRIPT note \n", "14602830 1679 NaN \n", "14602831 1816 NaN \n", "\n", " Physical description Dewey classification \\\n", "BL record ID \n", "14602826 NaN NaN \n", "14602830 15 pages (4°) NaN \n", "14602831 17 pages (8°) NaN \n", "\n", " BL shelfmark Topics Genre \\\n", "BL record ID \n", "14602826 Digital Store 11644.d.32 NaN NaN \n", "14602830 Digital Store 11602.ee.10. (2.) NaN NaN \n", "14602831 Digital Store 992.i.12. (3.) Dublin (Ireland) NaN \n", "\n", " Languages Notes BL record ID for physical resource \n", "BL record ID \n", "14602826 English NaN 3996603 \n", "14602830 English NaN 1143 \n", "14602831 English NaN 22782 \n", "\n", "[3 rows x 23 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may notice the many `NaN` values in the header of the data frame. 
`NaN` stands for 'not a number' and indicates that information is missing: we don't have information on 'Genre' for the book with id '14602826'.\n", "\n", "To quickly get an estimate of the completeness of our data, call the `.info()` method." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 52695 entries, 14602826 to 16289062\n", "Data columns (total 23 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Type of resource 52695 non-null object\n", " 1 Name 47552 non-null object\n", " 2 Dates associated with name 10825 non-null object\n", " 3 Type of name 47552 non-null object\n", " 4 Role 1680 non-null object\n", " 5 All names 49633 non-null object\n", " 6 Title 52695 non-null object\n", " 7 Variant titles 5867 non-null object\n", " 8 Series title 260 non-null object\n", " 9 Number within series 111 non-null object\n", " 10 Country of publication 36460 non-null object\n", " 11 Place of publication 51923 non-null object\n", " 12 Publisher 27487 non-null object\n", " 13 Date of publication 52517 non-null object\n", " 14 Edition 4198 non-null object\n", " 15 Physical description 39846 non-null object\n", " 16 Dewey classification 78 non-null object\n", " 17 BL shelfmark 52428 non-null object\n", " 18 Topics 3136 non-null object\n", " 19 Genre 1973 non-null object\n", " 20 Languages 52637 non-null object\n", " 21 Notes 6576 non-null object\n", " 22 BL record ID for physical resource 52695 non-null int64 \n", "dtypes: int64(1), object(22)\n", "memory usage: 9.6+ MB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As often when working with \"real\" data, completeness is an issue. 
For example, you can see that we have a title for each book (`52695 non-null`) while the `Genre` column is mostly empty (`1973 non-null`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, `pd.read_csv` converted the 'raw' text to a tabular format, segmenting and properly identifying rows and columns.\n", "\n", "\n", "So far we used a few Pandas functionalities (the `.head()` and `.info()` methods and the `.shape` and `.columns` attributes attached to the data frame) to explore the structure of the metadata. \n", "\n", "But Pandas includes many more tools for accessing, manipulating, and analysing tabular content. We first discuss how to access and retrieve content and then turn to manipulating information and producing basic analytics. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8.3.1 Accessing Rows and Columns in DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most straightforward method for access is via the data frame index. In the code above, we specified that `BL record ID` should serve as the row index. This allows us to inspect a record related to a specific identifier. For example, if we want to inspect the book with identifier `14602831`, we pass this identifier to `.loc`. \n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Type of resource Monograph\n", "Name NaN\n", "Dates associated with name NaN\n", "Type of name NaN\n", "Role NaN\n", "All names NaN\n", "Title The Aeronaut, a poem; founded almost entirely,...\n", "Variant titles NaN\n", "Series title NaN\n", "Number within series NaN\n", "Country of publication Ireland\n", "Place of publication Dublin\n", "Publisher Richard Milliken\n", "Date of publication 1816\n", "Edition NaN\n", "Physical description 17 pages (8°)\n", "Dewey classification NaN\n", "BL shelfmark Digital Store 992.i.12. 
(3.)\n", "Topics Dublin (Ireland)\n", "Genre NaN\n", "Languages English\n", "Notes NaN\n", "BL record ID for physical resource 22782\n", "Name: 14602831, dtype: object" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[14602831]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The syntax resembles accessing values by key in a Python dictionary: the item between square brackets is the key via which we retrieve the corresponding value. Similarly, you can read the line above as: retrieve the record (value) with the identifier (key) `14602831`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also retrieve rows by their positional index using `.iloc[]` (which is similar to the type of indexing we used previously in Python lists). The code below returns the record at position 7 (or the 8th row)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Type of resource Monograph\n", "Name NaN\n", "Dates associated with name NaN\n", "Type of name NaN\n", "Role NaN\n", "All names NaN\n", "Title Confessions of a Coquette, while staying at Sc...\n", "Variant titles NaN\n", "Series title NaN\n", "Number within series NaN\n", "Country of publication England\n", "Place of publication Scarborough\n", "Publisher E. T. W. Dennis\n", "Date of publication 1888\n", "Edition NaN\n", "Physical description 42 pages (8°)\n", "Dewey classification NaN\n", "BL shelfmark Digital Store 11602.ee.17. (8.)\n", "Topics NaN\n", "Genre NaN\n", "Languages English\n", "Notes NaN\n", "BL record ID for physical resource 156011\n", "Name: 14602836, dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[7]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both `.loc` and `.iloc` allow slicing operations. 
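The slice notation is similar to lists, where a colon separates the start and end positions of the slice we want. One difference to keep in mind: `.loc` slices by label and includes the end label, whereas `.iloc` slices by position and, like lists, excludes the end position. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "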
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Type of resourceNameDates associated with nameType of nameRoleAll namesTitleVariant titlesSeries titleNumber within series...Date of publicationEditionPhysical descriptionDewey classificationBL shelfmarkTopicsGenreLanguagesNotesBL record ID for physical resource
BL record ID
14602831MonographNaNNaNNaNNaNNaNThe Aeronaut, a poem; founded almost entirely,...NaNNaNNaN...1816NaN17 pages (8°)NaNDigital Store 992.i.12. (3.)Dublin (Ireland)NaNEnglishNaN22782
14602832MonographAlbert, Prince Consort, consort of Victoria, Q...1819-1861personNaNPlimsoll, Joseph [person] ; Albert, Prince Con...The Prince Albert, a poem [By Joseph Plimsoll.]AppendixNaNNaN...1868NaN16 pages (8°)NaNDigital Store 11602.ee.17. (1.)NaNNaNEnglishNaN39775
14602833MonographAnslow, RobertNaNpersonNaNAnslow, Robert [person]The Defeat of the Spanish Armada, A.D. 1588. A...NaNNaNNaN...1888NaN40 pages (8°)NaNDigital Store 11602.ee.17. (7.)NaNNaNEnglishNaN92666
14602834MonographNaNNaNNaNNaNSwift, Jonathan, 1667-1745 [person]A Familiar Answer to a Familiar Letter [In ver...Appendix. I. Contemporary Satires, Eulogies, etcNaNNaN...1720NaN7 pages (4°)NaNDigital Store 11602.ee.10. (5.)NaNNaNEnglishNaN93359
14602835MonographNaNNaNNaNNaNNaNThe Irish Home Rule Bill. A poetical pamphlet,...NaNNaNNaN...1893NaN4 pages (8°)NaNDigital Store 11601.g.28. (3.)NaNNaNEnglishNaN150273
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " Type of resource \\\n", "BL record ID \n", "14602831 Monograph \n", "14602832 Monograph \n", "14602833 Monograph \n", "14602834 Monograph \n", "14602835 Monograph \n", "\n", " Name \\\n", "BL record ID \n", "14602831 NaN \n", "14602832 Albert, Prince Consort, consort of Victoria, Q... \n", "14602833 Anslow, Robert \n", "14602834 NaN \n", "14602835 NaN \n", "\n", " Dates associated with name Type of name Role \\\n", "BL record ID \n", "14602831 NaN NaN NaN \n", "14602832 1819-1861 person NaN \n", "14602833 NaN person NaN \n", "14602834 NaN NaN NaN \n", "14602835 NaN NaN NaN \n", "\n", " All names \\\n", "BL record ID \n", "14602831 NaN \n", "14602832 Plimsoll, Joseph [person] ; Albert, Prince Con... \n", "14602833 Anslow, Robert [person] \n", "14602834 Swift, Jonathan, 1667-1745 [person] \n", "14602835 NaN \n", "\n", " Title \\\n", "BL record ID \n", "14602831 The Aeronaut, a poem; founded almost entirely,... \n", "14602832 The Prince Albert, a poem [By Joseph Plimsoll.] \n", "14602833 The Defeat of the Spanish Armada, A.D. 1588. A... \n", "14602834 A Familiar Answer to a Familiar Letter [In ver... \n", "14602835 The Irish Home Rule Bill. A poetical pamphlet,... \n", "\n", " Variant titles Series title \\\n", "BL record ID \n", "14602831 NaN NaN \n", "14602832 Appendix NaN \n", "14602833 NaN NaN \n", "14602834 Appendix. I. Contemporary Satires, Eulogies, etc NaN \n", "14602835 NaN NaN \n", "\n", " Number within series ... Date of publication Edition \\\n", "BL record ID ... \n", "14602831 NaN ... 1816 NaN \n", "14602832 NaN ... 1868 NaN \n", "14602833 NaN ... 1888 NaN \n", "14602834 NaN ... 1720 NaN \n", "14602835 NaN ... 
1893 NaN \n", "\n", " Physical description Dewey classification \\\n", "BL record ID \n", "14602831 17 pages (8°) NaN \n", "14602832 16 pages (8°) NaN \n", "14602833 40 pages (8°) NaN \n", "14602834 7 pages (4°) NaN \n", "14602835 4 pages (8°) NaN \n", "\n", " BL shelfmark Topics Genre \\\n", "BL record ID \n", "14602831 Digital Store 992.i.12. (3.) Dublin (Ireland) NaN \n", "14602832 Digital Store 11602.ee.17. (1.) NaN NaN \n", "14602833 Digital Store 11602.ee.17. (7.) NaN NaN \n", "14602834 Digital Store 11602.ee.10. (5.) NaN NaN \n", "14602835 Digital Store 11601.g.28. (3.) NaN NaN \n", "\n", " Languages Notes BL record ID for physical resource \n", "BL record ID \n", "14602831 English NaN 22782 \n", "14602832 English NaN 39775 \n", "14602833 English NaN 92666 \n", "14602834 English NaN 93359 \n", "14602835 English NaN 150273 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[14602831:14602835] # get records with BL record ID 14602831 to 14602835" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Type of resourceNameDates associated with nameType of nameRoleAll namesTitleVariant titlesSeries titleNumber within series...Date of publicationEditionPhysical descriptionDewey classificationBL shelfmarkTopicsGenreLanguagesNotesBL record ID for physical resource
BL record ID
14603039MonographStanhope, H., pseudonym [i.e. William Bond?]NaNpersonNaNStanhope, H., pseudonym [i.e. William Bond?] [...The Patriot: an epistle [in verse] to ... Phil...NaNNaNNaN...1733NaN8 pages (folio)NaNDigital Store 11642.i.9. (2.)NaNNaNEnglishNaN3477622
14603040MonographNaNNaNNaNNaNNaNVae Victis. Duty, and other poemsNaNNaNNaN...1850NaNNaNNaNDigital Store 11645.g.45NaNPoetry or verseEnglishNaN3745155
14603041MonographWebber, ThomasNaNpersonNaNWebber, Thomas [person]Stockton: an historical, biographical and desc...NaNNaNNaN...1830NaN40 pages (8°)NaNDigital Store 11643.bbb.25. (4.)NaNNaNEnglishNaN3871500
14603042MonographWells, E. T.NaNpersonNaNWells, E. T. [person]A Few VersesNaNNaNNaN...1895NaN11 pages (8°)NaNDigital Store 11601.f.36. (5.)NaNNaNEnglishNaN3885435
14603043MonographWilson, J. GordonNaNpersonNaNWilson, J. Gordon [person]Descriptive Poem. The Death of ... F. Burnaby ...NaNNaNNaN...1885NaNNaNNaNDigital Store 11643.bbb.25. (9.)NaNNaNEnglishNaN3945042
\n", "

5 rows × 23 columns

\n", "
" ], "text/plain": [ " Type of resource Name \\\n", "BL record ID \n", "14603039 Monograph Stanhope, H., pseudonym [i.e. William Bond?] \n", "14603040 Monograph NaN \n", "14603041 Monograph Webber, Thomas \n", "14603042 Monograph Wells, E. T. \n", "14603043 Monograph Wilson, J. Gordon \n", "\n", " Dates associated with name Type of name Role \\\n", "BL record ID \n", "14603039 NaN person NaN \n", "14603040 NaN NaN NaN \n", "14603041 NaN person NaN \n", "14603042 NaN person NaN \n", "14603043 NaN person NaN \n", "\n", " All names \\\n", "BL record ID \n", "14603039 Stanhope, H., pseudonym [i.e. William Bond?] [... \n", "14603040 NaN \n", "14603041 Webber, Thomas [person] \n", "14603042 Wells, E. T. [person] \n", "14603043 Wilson, J. Gordon [person] \n", "\n", " Title \\\n", "BL record ID \n", "14603039 The Patriot: an epistle [in verse] to ... Phil... \n", "14603040 Vae Victis. Duty, and other poems \n", "14603041 Stockton: an historical, biographical and desc... \n", "14603042 A Few Verses \n", "14603043 Descriptive Poem. The Death of ... F. Burnaby ... \n", "\n", " Variant titles Series title Number within series ... \\\n", "BL record ID ... \n", "14603039 NaN NaN NaN ... \n", "14603040 NaN NaN NaN ... \n", "14603041 NaN NaN NaN ... \n", "14603042 NaN NaN NaN ... \n", "14603043 NaN NaN NaN ... \n", "\n", " Date of publication Edition Physical description \\\n", "BL record ID \n", "14603039 1733 NaN 8 pages (folio) \n", "14603040 1850 NaN NaN \n", "14603041 1830 NaN 40 pages (8°) \n", "14603042 1895 NaN 11 pages (8°) \n", "14603043 1885 NaN NaN \n", "\n", " Dewey classification BL shelfmark Topics \\\n", "BL record ID \n", "14603039 NaN Digital Store 11642.i.9. (2.) NaN \n", "14603040 NaN Digital Store 11645.g.45 NaN \n", "14603041 NaN Digital Store 11643.bbb.25. (4.) NaN \n", "14603042 NaN Digital Store 11601.f.36. (5.) NaN \n", "14603043 NaN Digital Store 11643.bbb.25. (9.) 
NaN \n", "\n", " Genre Languages Notes \\\n", "BL record ID \n", "14603039 NaN English NaN \n", "14603040 Poetry or verse English NaN \n", "14603041 NaN English NaN \n", "14603042 NaN English NaN \n", "14603043 NaN English NaN \n", "\n", " BL record ID for physical resource \n", "BL record ID \n", "14603039 3477622 \n", "14603040 3745155 \n", "14603041 3871500 \n", "14603042 3885435 \n", "14603043 3945042 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[200:205] # get records at positions 200 to 204 (end position excluded)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far, we accessed the content in the data frame by specifying the rows we wanted to retrieve. But Pandas data frames also enable you to retrieve items by column. For example, the line of code below returns the date of publication for each book in our corpus (the column named `'Date of publication'`)." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BL record ID\n", "14602826 1786\n", "14602830 1679\n", "14602831 1816\n", "14602832 1868\n", "14602833 1888\n", " ... \n", "16289058 NaN\n", "16289059 NaN\n", "16289060 1913\n", "16289061 1924\n", "16289062 1919\n", "Name: Date of publication, Length: 52695, dtype: object" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Date of publication']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that a single column belongs to a different data type, namely `Series`. While a DataFrame always has two dimensions (rows and columns), a Series object only has one." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(df['Date of publication'])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(52695,)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Date of publication'].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Returning the column itself:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BL record ID\n", "14602826 1786\n", "14602830 1679\n", "14602831 1816\n", "14602832 1868\n", "14602833 1888\n", " ... \n", "16289058 NaN\n", "16289059 NaN\n", "16289060 1913\n", "16289061 1924\n", "16289062 1919\n", "Name: Date of publication, Length: 52695, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Date of publication']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output shows the BL record identifier and the corresponding year of publication. Please notice the following:\n", "- Firstly, some records again have NaN (not a number) as their date. This points to missing data, i.e. the record lacks a date of publication, which can happen for many reasons. In Pandas, NaN is an instance of the float class. Run the code below to see it for yourself."
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nan\n" ] } ], "source": [ "n = df.loc[16289059,'Date of publication']\n", "print(n)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "float" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Secondly, each column belongs to a specific data type or `dtype`. In this case, the date of publication column has `object` as its data type, which often indicates that the column contains strings or values of different types. This may come as a surprise: we would expect dates or integers to appear here. If we look closer at the row with identifier `16289061`, we observe that the year is a string. `Date of publication` contains a mixture of string and float (`NaN`) objects. Later in this tutorial, we will show how to convert this column to an integer expressing the year of publication, which we can subsequently use to plot timelines." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1924 <class 'str'>\n" ] } ], "source": [ "n = df.loc[16289061,'Date of publication']\n", "print(n,type(n))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select more than one column, for example Date of publication and Genre, you have to pass a **list** of column names. The first statement below will fail, but the second one will work (notice the double square brackets in the second statement)."
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "ename": "KeyError", "evalue": "('Date of publication', 'Genre')", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3360\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3361\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: ('Date of publication', 'Genre')", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call 
last)", "\u001b[0;32m/var/folders/2_/fcdvqwzs6j75cfr97nggzfn499sjqf/T/ipykernel_39483/2653721281.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Date of publication'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m'Genre'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;31m# this will not work\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3453\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3454\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3455\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3456\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3457\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/usr/local/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in 
\u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3361\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3363\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3364\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3365\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_scalar\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhasnans\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: ('Date of publication', 'Genre')" ] } ], "source": [ "df['Date of publication','Genre'] # this will not work " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " Date of publication Genre\n", "BL record ID \n", "14602826 1786 NaN\n", "14602830 1679 NaN\n", "14602831 1816 NaN\n", "14602832 1868 NaN\n", "14602833 1888 NaN\n", "... ... ...\n", "16289058 NaN NaN\n", "16289059 NaN NaN\n", "16289060 1913 NaN\n", "16289061 1924 NaN\n", "16289062 1919 NaN\n", "\n", "[52695 rows x 2 columns]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['Date of publication','Genre']] # this works!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas provides valuable tools for exploring the values of a specific column. `.value_counts()` returns the frequency of each unique value in the column. For example, we can apply this method to the Genre column to assess which genres appear in the metadata and how often. Notice that, by default, it excludes `NaN` values from the count." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Poetry or verse 1002\n", "Drama 461\n", "Drama ; Poetry or verse 151\n", "Travel 77\n", "Periodical 39\n", " ... \n", "Drama ; Early works to 1800 ; Libretto 1\n", "Census data ; Gazetteer 1\n", "History 1\n", "Translations into Latin 1\n", "Biography ; Source 1\n", "Name: Genre, Length: 64, dtype: int64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Genre'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.unique()` method shows the **set** of values in a column (i.e. each unique value). This helps us to understand and explore the range of values in a column. For example, below we inspect all the unique genres present in the BL books corpus."
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([nan, 'Song', 'Poetry or verse', 'Music', 'Drama ; Poetry or verse',\n", " 'Drama', 'Periodical', 'Poetry or verse ; Song', 'Biography',\n", " 'Libretto', 'Drama ; Essay ; Poetry or verse', 'Diary',\n", " 'Pictorial work', 'Drama ; Libretto', 'Travel', 'Census data',\n", " 'Diary ; Travel', 'Lecture', 'Gazetteer', 'Guidebook',\n", " 'Compendium or compilation', 'Correspondence', 'Directory',\n", " 'Essay', 'Fiction', 'Review', 'Drawing', 'Drama ; Drawing',\n", " 'Novel', 'Ephemeris', 'Early works to 1800',\n", " 'Correspondence ; Travel', 'Drama ; Essay', \"Children's fiction\",\n", " 'Illustration', 'Correspondence ; Diary',\n", " 'Translations into Italian',\n", " 'Translations from English ; Translations into Russian', 'Hymnal',\n", " 'Calendar ; Poetry or verse',\n", " 'Drama ; Early works to 1800 ; Libretto', 'Handbook or manual',\n", " 'Atlas', 'Census data ; Gazetteer', 'History',\n", " 'Translations into Latin', 'Periodical ; Poetry or verse',\n", " 'Engraving ; Illustration ; Plan or view', 'Dictionary',\n", " 'Concordance', 'Short story', 'Guidebook ; Travel', 'Textbook',\n", " 'Register', 'Source', 'Statistics', 'Encyclopaedia', 'Lithograph',\n", " 'Biography ; Diary ; Personal narrative',\n", " 'Diary ; Personal narrative', 'Lithograph ; Plan or view',\n", " 'Handbook or manual ; Periodical', 'Humour or satire',\n", " 'Bibliography', 'Biography ; Source'], dtype=object)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Genre'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then use this information to select specific rows by Genre, e.g. return all books categorized as 'Travel' literature. To accomplish this, we need to construct a **mask**: an array of boolean (True, False) values that express which row matches a certain condition (i.e. 
Genre is equal to the string 'Travel'). Let's explore this with a toy example.\n", "\n", "First, we create a small data frame, which only records the date of publication and genre." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " Date of Publication Genre\n", "0 1944 Travel\n", "1 1943 Periodical\n", "2 1946 Travel\n", "3 1947 Biography" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_toy = pd.DataFrame([[1944,'Travel'],\n", " [1943,'Periodical'],\n", " [1946,'Travel'],\n", " [1947,'Biography']], columns= ['Date of Publication','Genre'])\n", "df_toy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we create a mask using the `==` (is equal to) operator. In this case, we test, for each row, whether the value in the Genre column is equal to `Travel`. This operation returns an array (more precisely a Pandas Series object) of boolean values: True when the recorded genre of a book matches the string `\"Travel\"`, False otherwise." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 True\n", "3 False\n", "Name: Genre, dtype: bool" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_toy['Genre'] == \"Travel\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can save this mask in the `mask` variable (notice the difference between `=`, the assignment operator, and `==`, the \"equal to\" operator)." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 True\n", "3 False\n", "Name: Genre, dtype: bool" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mask = df_toy['Genre'] == \"Travel\"\n", "mask" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We pass `mask` to `.loc[]`, which selects the rows in the toy dataframe that contain travel literature." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " Date of Publication Genre\n", "0 1944 Travel\n", "2 1946 Travel" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_toy.loc[mask]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will return to masking later in this course. For now, it suffices to say that Pandas provides some useful functions for selecting subsets of a data frame. `.isin()`, for example, is useful in scenarios where one wants to find multiple genres, for example `'Travel'` and `'Biography'`. This method takes a list of values as its argument and returns a boolean mask that is True for the rows whose values appear in this list." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 False\n", "2 True\n", "3 True\n", "Name: Genre, dtype: bool" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mask = df_toy['Genre'].isin([\"Travel\",\"Biography\"])\n", "mask" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " Date of Publication Genre\n", "0 1944 Travel\n", "2 1946 Travel\n", "3 1947 Biography" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_toy[mask]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, we could have repeatedly used the `==` operator and combined the results. However, `.isin()` provides a more elegant solution.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is one more symbol to use in this context: the tilde or `~` which serves as a negation. In the example below, we obtain all rows **except** those having 'Travel' as their Genre." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " Date of Publication Genre\n", "1 1943 Periodical\n", "3 1947 Biography" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mask = df_toy['Genre'].isin([\"Travel\"])\n", "df_toy[~mask]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's return to our main case study, the BLB corpus. The statements below demonstrate how masking enables you to explore these data by Genre. \n", "\n", "Note how we save the subsection of the original dataframe in a new variable `travel`." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
" ], "text/plain": [ " Type of resource \\\n", "BL record ID \n", "14750869 Monograph \n", "14756377 Monograph \n", "14804046 Monograph \n", "14804297 Monograph \n", "14804834 Monograph \n", "... ... \n", "14839906 Monograph \n", "14848081 Monograph \n", "14871557 Monograph \n", "14872822 Monograph \n", "15743152 Monograph \n", "\n", " Name \\\n", "BL record ID \n", "14750869 Grindlay & Co \n", "14756377 Stocqueler, J. H. (Joachim Hayward) \n", "14804046 Annesley, George, Earl of Mountnorris \n", "14804297 Atkinson, Thomas Witlam \n", "14804834 Baynes, C. R. (Charles Robert) \n", "... ... \n", "14839906 Elwood, Anne Katharine, Mrs \n", "14848081 Stocqueler, J. H. (Joachim Hayward) \n", "14871557 Demidov, Anatoly Nikolaevich, Prince di San Do... \n", "14872822 Sementovsky, Nikolai \n", "15743152 Wynne, Arthur Beevor \n", "\n", " Dates associated with name Type of name Role \\\n", "BL record ID \n", "14750869 NaN organisation NaN \n", "14756377 1800-1885 person NaN \n", "14804046 NaN person NaN \n", "14804297 NaN person NaN \n", "14804834 NaN person NaN \n", "... ... ... ... \n", "14839906 NaN person NaN \n", "14848081 1800-1885 person NaN \n", "14871557 NaN person NaN \n", "14872822 NaN person NaN \n", "15743152 NaN person NaN \n", "\n", " All names \\\n", "BL record ID \n", "14750869 Grindlay & Co [organisation] \n", "14756377 Stocqueler, J. H. (Joachim Hayward), 1800-1885... \n", "14804046 Annesley, George, Earl of Mountnorris [person]... \n", "14804297 Atkinson, Thomas Witlam [person] \n", "14804834 Baynes, C. R. (Charles Robert) [person] \n", "... ... \n", "14839906 Elwood, Anne Katharine, Mrs [person] \n", "14848081 Stocqueler, J. H. (Joachim Hayward), 1800-1885... \n", "14871557 Demidov, Anatoly Nikolaevich, Prince di San Do... \n", "14872822 Sementovsky, Nikolai [person] \n", "15743152 Wynne, Arthur Beevor [person] \n", "\n", " Title \\\n", "BL record ID \n", "14750869 Grindlay and Co.'s Overland Circular. Hints fo... 
\n", "14756377 The Overland Companion: being a guide for the ... \n", "14804046 Voyages and Travels to India, Ceylon, the Red ... \n", "14804297 Travels in the Regions of the Upper and Lower ... \n", "14804834 Notes and Reflections, during a ramble in the ... \n", "... ... \n", "14839906 Narrative of a journey overland from England, ... \n", "14848081 The Hand-Book of India, a guide to the strange... \n", "14871557 Esquisses d'un voyage dans la Russie méridiona... \n", "14872822 Кіевъ, его святиня, древности, достопамятности... \n", "15743152 On the connexion between Travelled Blocks in t... \n", "\n", " Variant titles Series title Number within series ... \\\n", "BL record ID ... \n", "14750869 NaN NaN NaN ... \n", "14756377 NaN NaN NaN ... \n", "14804046 NaN NaN NaN ... \n", "14804297 NaN NaN NaN ... \n", "14804834 NaN NaN NaN ... \n", "... ... ... ... ... \n", "14839906 NaN NaN NaN ... \n", "14848081 NaN NaN NaN ... \n", "14871557 NaN NaN NaN ... \n", "14872822 NaN NaN NaN ... \n", "15743152 NaN NaN NaN ... \n", "\n", " Date of publication Edition Physical description \\\n", "BL record ID \n", "14750869 1854 Third edition NaN \n", "14756377 1850 NaN (12°) \n", "14804046 1809 NaN 3 volumes (4°) \n", "14804297 1860 NaN xiii, 570 pages (8°) \n", "14804834 1843 NaN viii, 275 pages (12°) \n", "... ... ... ... \n", "14839906 1830 NaN 2 volumes (8°) \n", "14848081 1845 Second edition (12°) \n", "14871557 1838 NaN 102 pages (8°) \n", "14872822 1864 NaN NaN \n", "15743152 1881 NaN 3 pages (8°) \n", "\n", " Dewey classification BL shelfmark \\\n", "BL record ID \n", "14750869 NaN Digital Store 1298.h.25. (3.) \n", "14756377 NaN Digital Store 1298.h.14 \n", "14804046 NaN Digital Store 10058.l.13 \n", "14804297 NaN Digital Store 010058.ff.2 \n", "14804834 NaN Digital Store 1425.d.7 \n", "... ... ... 
\n", "14839906 NaN Digital Store 1046.d.6-7 \n", "14848081 NaN Digital Store 1298.h.8 \n", "14871557 NaN Digital Store 10291.e.2 \n", "14872822 NaN Digital Store 10291.e.21 \n", "15743152 NaN Digital Store 7106.f.14. (1.) \n", "\n", " Topics Genre \\\n", "BL record ID \n", "14750869 India Travel \n", "14756377 Asia--Description and travel ; India ; Egypt Travel \n", "14804046 India ; Sri Lanka ; Egypt Travel \n", "14804297 China ; India ; Russia Travel \n", "14804834 India Travel \n", "... ... ... \n", "14839906 Voyages and travels ; India ; India--Descripti... Travel \n", "14848081 India ; India--Guidebooks Travel \n", "14871557 Crimea (Ukraine)--19th century--Description an... Travel \n", "14872822 Kiev (Ukraine)--Description and travel Travel \n", "15743152 Geology ; India Travel \n", "\n", " Languages Notes BL record ID for physical resource \n", "BL record ID \n", "14750869 English NaN 1517704 \n", "14756377 English NaN 3510639 \n", "14804046 English NaN 91083 \n", "14804297 English NaN 136225 \n", "14804834 English NaN 236176 \n", "... ... ... ... \n", "14839906 English NaN 1063966 \n", "14848081 English NaN 3510627 \n", "14871557 French NaN 905050 \n", "14872822 Russian NaN 3334914 \n", "15743152 English NaN 3990134 \n", "\n", "[77 rows x 23 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mask = df['Genre'] == \"Travel\"\n", "travel = df[mask]\n", "travel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can work on a subset of rows and inspect the number of travel books in the collection with their titles." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(77, 23)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "travel.shape" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BL record ID\n", "14750869 Grindlay and Co.'s Overland Circular. Hints fo...\n", "14756377 The Overland Companion: being a guide for the ...\n", "14804046 Voyages and Travels to India, Ceylon, the Red ...\n", "14804297 Travels in the Regions of the Upper and Lower ...\n", "14804834 Notes and Reflections, during a ramble in the ...\n", " ... \n", "14839906 Narrative of a journey overland from England, ...\n", "14848081 The Hand-Book of India, a guide to the strange...\n", "14871557 Esquisses d'un voyage dans la Russie méridiona...\n", "14872822 Кіевъ, его святиня, древности, достопамятности...\n", "15743152 On the connexion between Travelled Blocks in t...\n", "Name: Title, Length: 77, dtype: object" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "travel['Title']" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Voyages and Travels to India, Ceylon, the Red Sea, Abyssinia, and Egypt. 1802, 1803, 1804, 1805 and 1806 [With plates by Henry Salt.]'" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "travel['Title'].iloc[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8.3.2 Manipulating dataframes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have already covered quite some ground in this notebook. At this point, you should understand how to open and explore Pandas DataFrames. In what follows, we demonstrate how to change and manipulate information in a data frame. 
We focus on processing the dates of publication and convert the strings to integers that indicate the year of publication. This will help us later on with plotting and investigating trends over time.\n", "\n", "First, let us inspect the values in the column in more detail." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['1786', '1679', '1816', '1868', '1888', '1720', '1893', '1815',\n", " '1848', '1889', '1710', '1886', '1887', '1896', '1688', '1899',\n", " '1791', '1805', '1691', '1818', '1897', '1858', '1890', '1814',\n", " '1882', '1885', '1840', '1894', '1809', '1875', '1819', '1774',\n", " '1898', '1769', '1856', '1857', '1775', '1773', '1776', '1866',\n", " '1799', '1841', '1828', '1765', '1807', '1845', '1872', '1823',\n", " '1759', '1768', '1780', '1806', '1705', '1821', '1800', '1734',\n", " '1767', '1817', '1792', '1892', '1801', '1756', '1824', '1762',\n", " '1793', '1770', '1690', '1761', '1785', '1810', '1764', '1757',\n", " '1797', '1689', '1883', '1884', '1766', '1833', '1798', '1803',\n", " '1891', '1820', '1750', '1850', '1777', '1842', '1813', '1830',\n", " '1787', '1853', '1733', '1895', '1879', '1742', '1754', '1812',\n", " '1861', '1796', '1825', '1746', '1832', '1755', '1852', '1871',\n", " '1795', '1790', '1788', '1741', '1709', '1789', '1880', '1782',\n", " '1715', '1827', '1829', '1727', '1772', '1802', '1752', '1837',\n", " '1836', '1834', '1732', '1847', '1863', '1694', '1804', '1783',\n", " '1822', '1649', '1651', '1670', '1663', '1839', '1704', '1835',\n", " '1838', '1859', '1843', '1778', '1855', '1851', '1779', '1849',\n", " '1707', '1722', '1831', '1862', '1878', '1708', '1865', '1676',\n", " '1695', '1684', '1844', '1681', '1869', '1747', '1860', '1717',\n", " '1811', '1874', '1718', '1826', '1706', '1662', '1744', '1794',\n", " '1846', '1652', '1771', '1728', '1808', '1714', '1716', '1703',\n", " '1881', '1745', '1641', '1730', '1700', '1736', '1661', '1696',\n", 
" '1784', '1739', '1738', '1635', '1737', '1713', '1686', '1623',\n", " '1877', '1697', '1682', '1760', '1616', nan, '1665', '1655',\n", " '1687', '1711', '1701', '1854', '1729', '1867', '1927', '1870',\n", " '1596', '1873', '1735', '1698', '1685', '1719', '1660', '1633',\n", " '1644', '1666', '1528', '1639', '1556', '1667', '1637', '1653',\n", " '1693', '1668', '1900', '1876', '1781', '1763', '1864', '1674',\n", " '1743', '1731', '1647', '1712', '1821-', '1672', '1751', '1917',\n", " '1726', '1724', '1680', '1758', '1678', '1749', '1702', '1753',\n", " '1748', '1723', '1725', '1721', '1886-', '1896-', '1909', '1874-',\n", " '1829-', '1918', '1830-', '1683', '1878-1879', '1872-1873',\n", " '1892-1925', '1905', '1936', '1692', '1922', '1640', '1631',\n", " '1650', '1634', '1625', '1648', '1906', '1671', '1849-1850',\n", " '1890-1891', '1914', '1957', '1904', '1837-1839', '1677', '1657',\n", " '1846-', '1902', '1658', '1612', '1846-1847', '1646', '1912',\n", " '1780-1790', '1605', '1630', '1675', '1908', '1599', '1656',\n", " '1636', '1606', '1659', '1673', '1699', '1669', '1861-1862',\n", " '1626', '1857-1863', '1876-1869', '1887-1878', '1819-',\n", " '1811-1817', '1615', '1592', '1540', '1886-1887', '1921', '1638',\n", " '1617', '1880-1989', '1809-1811', '1654', '1932', '1642', '1611',\n", " '1937', '1907', '1781-', '1607', '1629', '1664', '1643', '1924',\n", " '1632', '1618', '1584', '1916', '1910', '1903', '1622', '1602',\n", " '1613', '1898-1902', '1871-1872', '1882-1884', '1881-1886',\n", " '1808-', '1872-', '1795-', '1911', '1740', '1870-1883',\n", " '1894-1898', '1879-1885', '1940', '1834-', '1926', '1864-', '1610',\n", " '1859-1862', '1593', '1817-', '1843-', '1839-', '1838-', '1855-',\n", " '1876-', '1893-', '1802-', '1887-', '1822-', '1824-', '1844-1847',\n", " '1510', '1825-', '1628', '1925', '1608', '1917-', '1883-1884',\n", " '1866-', '1888-1894', '1853-', '1845-', '1858-', '1862-1863',\n", " '1913', '1901', '1938', '1885-1908', '1823-', '1949', 
'1923',\n", " '1915', '1805-', '1878-1885', '1939', '1886-1980', '1874-1888',\n", " '1811-1813', '1892-1912', '1851-1873', '1865-1869', '1811-1812',\n", " '1897-', '1829-1830', '1887-1889', '1882-1894', '1882-1893',\n", " '1974', '1893-1895', '1848-1849', '1871-', '1876-1879',\n", " '1885-1895', '1852-1855', '1884-1909', '1873-1887', '1979',\n", " '1852-1941', '1951', '1862-1871', '1871-1873', '1885-1886',\n", " '1886-1893', '1919', '1920', '1946', '1888-1889', '1898-1901',\n", " '1928', '1897-1901', '1872-1878', '1851-1871', '1870-1895',\n", " '1867-1904', '1888-1890', '1884-', '1882-1890', '1955',\n", " '1898-1912', '1892-1901', '1835-1836', '1869-1872', '1885-1900',\n", " '1860-1863', '1931', '1933', '1953', '1952', '1929', '1943',\n", " '1954', '1935', '1945', '1934', '1930', '1944', '1942', '1947',\n", " '1950'], dtype=object)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Date of publication'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the varying ways in which the date of publication is recorded. Mostly, the value is just a single year, but sometimes it is a date range, such as '1884-1909'. In some cases, the range lacks an end year (e.g. '1884-').\n", "\n", "The messiness of these data is fairly typical of heritage collections. The data are often entered manually and the conventions may not always be obvious. Even though the data are structured, they still require some processing to be usable.\n", "\n", "`dtype=object` (at the end of the output returned by `.unique()`) indicates that the values are stored as generic Python objects (here, strings) rather than as numbers or dates. Selecting rows with the masking technique we introduced earlier won't work in this case. For example, using the greater-than (`>`) operator to obtain all books published after 1850 will raise a `TypeError`, telling us that strings and numbers cannot be compared in this way."
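To get a feel for how many shapes these values take, one can sort them into a few rough patterns. The snippet below is a minimal sketch on a handful of values copied from the output above; the helper `classify` and its category labels are our own illustrative names and are not part of Pandas or of the BLB metadata.

```python
import re
from collections import Counter

# A few values copied from the output of .unique() above
samples = ['1786', '1884-1909', '1884-', '1821-', '1878-1879']

def classify(value):
    """Return a rough label for the shape of a date string (illustrative helper)."""
    if re.fullmatch(r'\d{4}', value):
        return 'single year'
    if re.fullmatch(r'\d{4}-', value):
        return 'open range'
    if re.fullmatch(r'\d{4}-\d{4}', value):
        return 'closed range'
    return 'other'

# Count how often each shape occurs in the sample
pattern_counts = Counter(classify(value) for value in samples)
print(pattern_counts)
```

Running the same helper over the whole column would show how much of the data needs special treatment before it can be converted to numbers.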
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "'>' not supported between instances of 'str' and 'int'", "output_type": "error", "traceback": [
"---------------------------------------------------------------------------",
"TypeError Traceback (most recent call last)",
"/var/folders/2_/fcdvqwzs6j75cfr97nggzfn499sjqf/T/ipykernel_39483/2626304343.py in <module>\n----> 1 df['Date of publication'] > 1850 # this raises an error\n",
"/usr/local/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)\n 67 other = item_from_zerodim(other)\n 68 \n---> 69 return method(self, other)\n 70 \n 71 return new_method\n",
"/usr/local/lib/python3.7/site-packages/pandas/core/arraylike.py in __gt__(self, other)\n 46 @unpack_zerodim_and_defer(\"__gt__\")\n 47 def __gt__(self, other):\n---> 48 return self._cmp_method(other, operator.gt)\n 49 \n 50 @unpack_zerodim_and_defer(\"__ge__\")\n",
"/usr/local/lib/python3.7/site-packages/pandas/core/series.py in _cmp_method(self, other, op)\n 5499 \n 5500 with np.errstate(all=\"ignore\"):\n-> 5501 res_values = ops.comparison_op(lvalues, rvalues, op)\n 5502 \n 5503 return self._construct_result(res_values, name=res_name)\n",
"/usr/local/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in comparison_op(left, right, op)\n 282 \n 283 elif is_object_dtype(lvalues.dtype) or isinstance(rvalues, str):\n--> 284 res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)\n 285 \n 286 else:\n",
"/usr/local/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in comp_method_OBJECT_ARRAY(op, x, y)\n 71 result = libops.vec_compare(x.ravel(), y.ravel(), op)\n 72 else:\n---> 73 result = libops.scalar_compare(x.ravel(), y, op)\n 74 return result.reshape(x.shape)\n 75 \n",
"/usr/local/lib/python3.7/site-packages/pandas/_libs/ops.pyx in pandas._libs.ops.scalar_compare()\n",
"TypeError: '>' not supported between instances of 'str' and 'int'" ] } ], "source": [ "df['Date of publication'] > 1850 # this raises an error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So let's convert the date of publication to integers. First, we discard all rows with missing information. When applied to a column, the `.isnull()` method returns `True` for every row that has a NaN value in that column."
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BL record ID\n", "14602826 False\n", "14602830 False\n", "14602831 False\n", "14602832 False\n", "14602833 False\n", " ... \n", "16289058 True\n", "16289059 True\n", "16289060 False\n", "16289061 False\n", "16289062 False\n", "Name: Date of publication, Length: 52695, dtype: bool" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Date of publication'].isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But we want the opposite, i.e. `True` for the rows where we **have** date information. For this we can use the tilde (`~`), the negation operator, which flips every `True` to `False` and vice versa." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BL record ID\n", "14602826 True\n", "14602830 True\n", "14602831 True\n", "14602832 True\n", "14602833 True\n", " ... \n", "16289058 False\n", "16289059 False\n", "16289060 True\n", "16289061 True\n", "16289062 True\n", "Name: Date of publication, Length: 52695, dtype: bool" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "~df['Date of publication'].isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can save the result of this operation as a new mask..." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "has_date_mask = ~df['Date of publication'].isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... and use the mask variable to select all rows with date information. We save the resulting data frame in a new variable called `df_s`. To keep track of the information we are throwing away, we print the `.shape` attribute of both the original data frame `df` and the new `df_s`. As you'll notice, we are not discarding many rows: only 178 of the 52,695 records (52,695 minus 52,517) lack a value in \"Date of publication\"."
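The same selection can be tried out on a tiny, made-up table. This is only a sketch under our own assumptions: the data frame `toy` and its values are invented for illustration. It uses `.notna()`, the Pandas counterpart of `~ ... .isnull()`, and `.copy()`, which turns the selection into an independent data frame.

```python
import pandas as pd

# A made-up table with one missing date (not taken from the BLB metadata)
toy = pd.DataFrame({
    'Title': ['Book A', 'Book B', 'Book C'],
    'Date of publication': ['1786', None, '1884-'],
})

# True where a date is present; equivalent to ~toy['Date of publication'].isnull()
has_date = toy['Date of publication'].notna()

# Keep only the rows with a date, as an independent copy
toy_s = toy[has_date].copy()

# Number of rows discarded by the selection
n_dropped = len(toy) - len(toy_s)
print(toy.shape, toy_s.shape, n_dropped)
```

Comparing the two shapes, as in the notebook, is a quick check on how much information a filtering step removes.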
] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(52695, 23) (52517, 23)\n" ] } ], "source": [ "df_s = df[has_date_mask]\n", "print(df.shape, df_s.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need a way to convert a string to an integer. Python's built-in typecasting function `int()` does exactly this." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2016 <class 'int'>\n" ] } ], "source": [ "year_as_int = int('2016')\n", "print(year_as_int, type(year_as_int))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is one more step left: handling the date ranges, in which a hyphen separates two years. To simplify matters, we only keep the first year mentioned in the date range. We obtain this year by splitting the string on the hyphen with `.split('-')`.\n", "\n", "Please remember that `.split()` always returns a list. We use index notation (`[0]`) to retrieve the first element of the list (in Python we count from zero!).\n", "\n", "Run the cells below to understand how this works."
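Splitting also works for the open-ended ranges we saw earlier, such as '1884-'. Here is a small sketch in plain Python (the variable names are ours and chosen only for this illustration):

```python
# Three shapes of date strings that occur in the column
examples = ['1786', '1884-1909', '1884-']

for text in examples:
    parts = text.split('-')  # always a list, even when there is no hyphen
    print(text, '->', parts, '-> first element:', parts[0])

# The first element of each split is the year we want to keep
first_parts = [text.split('-')[0] for text in examples]
```

An open-ended range yields an empty string as its second element, which is one more reason to rely only on the first element.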
] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['2005']" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'2005'.split('-')" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2005'" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'2005'.split('-')[0]" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['2000', '2019']" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'2000-2019'.split('-')" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2000'" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'2000-2019'.split('-')[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining these steps with `int()` will convert a date range to a number." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "int('2000-2019'.split('-')[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can package all these steps in one function. The function takes a string as input (`date_string`), strips any leading hyphen, splits the string on the hyphen, and returns the part before the first hyphen as an integer."
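It may help to see why the split has to come before `int()`: `int()` only accepts strings that look like a whole number and raises a `ValueError` otherwise. The helper `try_int` below is a hypothetical name used only for this illustration.

```python
def try_int(text):
    """Convert text to an integer, or return None if int() rejects it."""
    try:
        return int(text)
    except ValueError:
        return None

# A plain year converts; both kinds of range are rejected by int()
results = {text: try_int(text) for text in ['2005', '2000-2019', '1884-']}
print(results)
```

Only the plain year converts directly, so the ranges have to be reduced to their first year before the conversion.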
] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "def get_first_year(date_string):\n", "    # strip any leading hyphen, keep the text before the first hyphen, convert it to an integer\n", "    return int(date_string.lstrip('-').split('-')[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function produces the same output as the lines of code above." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2005" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_first_year('2005')" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "get_first_year('2000-2019')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To retrieve the first year of publication we have to apply `get_first_year()` to each value in the 'Date of publication' column. Luckily, this is very easy to do in Pandas with the `.apply()` method. `.apply()` takes a function as an argument (in this case, `get_first_year`) and will apply it (ha, what's in a name) to each value in the selected column. \n", "\n", "In other words, it returns a new column that contains a transformation of another column.\n", "\n", "For example, the code below returns a new column that contains the first year mentioned in the 'Date of publication' column. \n", "\n", "`df_s.loc[:,'Date of publication']` selects all rows of the 'Date of publication' column: the `:` means every row from the first to the last. The `,` in `.loc` separates the two dimensions (rows `,` columns)." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BL record ID\n", "14602826 1786\n", "14602830 1679\n", "14602831 1816\n", "14602832 1868\n", "14602833 1888\n", " ...
\n", "16289056 1936\n", "16289057 1922\n", "16289060 1913\n", "16289061 1924\n", "16289062 1919\n", "Name: Date of publication, Length: 52517, dtype: int64" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_s.loc[:,'Date of publication'].apply(get_first_year)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The transformation also changed the data type. Notice the `dtype: int64` at the end of the above output.\n", "\n", "We're almost there. As the last step, we want to attach the new column produced by `.apply()` to our data frame `df_s`. Again, the syntax is very convenient, as the example below shows." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py:1667: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " self.obj[key] = value\n" ] } ], "source": [ "df_s.loc[:,'First year of pulication'] = df_s.loc[:,'Date of publication'].apply(get_first_year)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `SettingWithCopyWarning` printed by the previous cell appears because `df_s` was created by selecting rows from `df`. The new column is still added to `df_s`; creating `df_s` with `df[has_date_mask].copy()` would avoid the warning.\n", "\n", "Inspect the new data frame `df_s` with `.head()`. At the right-hand side of the table you should see a new column with the first year of publication, now stored as an integer." ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| BL record ID | Type of resource | Name | Dates associated with name | Type of name | Role | All names | Title | Variant titles | Series title | Number within series | ... | Edition | Physical description | Dewey classification | BL shelfmark | Topics | Genre | Languages | Notes | BL record ID for physical resource | First year of pulication |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 14602826 | Monograph | Yearsley, Ann | 1753-1806 | person | NaN | More, Hannah, 1745-1833 [person] ; Yearsley, A... | Poems on several occasions [With a prefatory l... | NaN | NaN | NaN | ... | Fourth edition MANUSCRIPT note | NaN | NaN | Digital Store 11644.d.32 | NaN | NaN | English | NaN | 3996603 | 1786 |
| 14602830 | Monograph | A, T. | NaN | person | NaN | Oldham, John, 1653-1683 [person] ; A, T. [person] | A Satyr against Vertue. (A poem: supposed to b... | NaN | NaN | NaN | ... | NaN | 15 pages (4°) | NaN | Digital Store 11602.ee.10. (2.) | NaN | NaN | English | NaN | 1143 | 1679 |
| 14602831 | Monograph | NaN | NaN | NaN | NaN | NaN | The Aeronaut, a poem; founded almost entirely,... | NaN | NaN | NaN | ... | NaN | 17 pages (8°) | NaN | Digital Store 992.i.12. (3.) | Dublin (Ireland) | NaN | English | NaN | 22782 | 1816 |
| 14602832 | Monograph | Albert, Prince Consort, consort of Victoria, Q... | 1819-1861 | person | NaN | Plimsoll, Joseph [person] ; Albert, Prince Con... | The Prince Albert, a poem [By Joseph Plimsoll.] | Appendix | NaN | NaN | ... | NaN | 16 pages (8°) | NaN | Digital Store 11602.ee.17. (1.) | NaN | NaN | English | NaN | 39775 | 1868 |
| 14602833 | Monograph | Anslow, Robert | NaN | person | NaN | Anslow, Robert [person] | The Defeat of the Spanish Armada, A.D. 1588. A... | NaN | NaN | NaN | ... | NaN | 40 pages (8°) | NaN | Digital Store 11602.ee.17. (7.) | NaN | NaN | English | NaN | 92666 | 1888 |

5 rows × 24 columns
" ], "text/plain": [ " Type of resource \\\n", "BL record ID \n", "14602826 Monograph \n", "14602830 Monograph \n", "14602831 Monograph \n", "14602832 Monograph \n", "14602833 Monograph \n", "\n", " Name \\\n", "BL record ID \n", "14602826 Yearsley, Ann \n", "14602830 A, T. \n", "14602831 NaN \n", "14602832 Albert, Prince Consort, consort of Victoria, Q... \n", "14602833 Anslow, Robert \n", "\n", " Dates associated with name Type of name Role \\\n", "BL record ID \n", "14602826 1753-1806 person NaN \n", "14602830 NaN person NaN \n", "14602831 NaN NaN NaN \n", "14602832 1819-1861 person NaN \n", "14602833 NaN person NaN \n", "\n", " All names \\\n", "BL record ID \n", "14602826 More, Hannah, 1745-1833 [person] ; Yearsley, A... \n", "14602830 Oldham, John, 1653-1683 [person] ; A, T. [person] \n", "14602831 NaN \n", "14602832 Plimsoll, Joseph [person] ; Albert, Prince Con... \n", "14602833 Anslow, Robert [person] \n", "\n", " Title \\\n", "BL record ID \n", "14602826 Poems on several occasions [With a prefatory l... \n", "14602830 A Satyr against Vertue. (A poem: supposed to b... \n", "14602831 The Aeronaut, a poem; founded almost entirely,... \n", "14602832 The Prince Albert, a poem [By Joseph Plimsoll.] \n", "14602833 The Defeat of the Spanish Armada, A.D. 1588. A... \n", "\n", " Variant titles Series title Number within series ... \\\n", "BL record ID ... \n", "14602826 NaN NaN NaN ... \n", "14602830 NaN NaN NaN ... \n", "14602831 NaN NaN NaN ... \n", "14602832 Appendix NaN NaN ... \n", "14602833 NaN NaN NaN ... \n", "\n", " Edition Physical description \\\n", "BL record ID \n", "14602826 Fourth edition MANUSCRIPT note NaN \n", "14602830 NaN 15 pages (4°) \n", "14602831 NaN 17 pages (8°) \n", "14602832 NaN 16 pages (8°) \n", "14602833 NaN 40 pages (8°) \n", "\n", " Dewey classification BL shelfmark \\\n", "BL record ID \n", "14602826 NaN Digital Store 11644.d.32 \n", "14602830 NaN Digital Store 11602.ee.10. (2.) \n", "14602831 NaN Digital Store 992.i.12. (3.) 
\n", "14602832 NaN Digital Store 11602.ee.17. (1.) \n", "14602833 NaN Digital Store 11602.ee.17. (7.) \n", "\n", " Topics Genre Languages Notes \\\n", "BL record ID \n", "14602826 NaN NaN English NaN \n", "14602830 NaN NaN English NaN \n", "14602831 Dublin (Ireland) NaN English NaN \n", "14602832 NaN NaN English NaN \n", "14602833 NaN NaN English NaN \n", "\n", " BL record ID for physical resource First year of pulication \n", "BL record ID \n", "14602826 3996603 1786 \n", "14602830 1143 1679 \n", "14602831 22782 1816 \n", "14602832 39775 1868 \n", "14602833 92666 1888 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_s.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Converting the recorded date of publication to an integer makes other operations, such as sorting and plotting the data, much easier. Plotting the number of books per year takes the following steps:\n", "\n", "- Apply the `value_counts()` method to the 'First year of pulication' column (spelled this way when we created it) to count how often each year appears in the BLB metadata." ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1897 1480\n", "1896 1415\n", "1895 1277\n", "1893 1207\n", "1890 1185\n", " ... \n", "1602 1\n", "1608 1\n", "1613 1\n", "1628 1\n", "1950 1\n", "Name: First year of pulication, Length: 354, dtype: int64" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_s['First year of pulication'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- For each year, we see the corresponding number of books, with the most frequent year first. We can order these counts chronologically by sorting the values (counts) by their index (year). `sort_index()` does exactly this."
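The difference between the two orderings is easiest to see on a handful of values. This is a minimal sketch with invented years: the Series `years` is made up for illustration and is not taken from the BLB data.

```python
import pandas as pd

# Six made-up publication years
years = pd.Series([1897, 1896, 1897, 1510, 1896, 1897])

counts = years.value_counts()   # ordered by count, most frequent year first
by_year = counts.sort_index()   # ordered by the index, i.e. chronologically

print(counts)
print(by_year)
```

In `counts` the year 1897 comes first because it occurs most often; in `by_year` the year 1510 comes first because it is the earliest.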
] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1510 1\n", "1528 1\n", "1540 1\n", "1556 1\n", "1584 1\n", " ..\n", "1954 2\n", "1955 2\n", "1957 1\n", "1974 1\n", "1979 1\n", "Name: First year of pulication, Length: 354, dtype: int64" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_s['First year of pulication'].value_counts().sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The focus of the BLB corpus is largely on the 19th century. However, our dataset does contain some earlier and later works: the oldest book dates from 1510 and the most recent one from 1979. Let's focus on books published between 1800 and 1900. We can use `.loc[]` in combination with a slicing operation (from 1800 to 1900)." ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1800 120\n", "1801 90\n", "1802 108\n", "1803 100\n", "1804 117\n", " ... \n", "1896 1415\n", "1897 1480\n", "1898 1119\n", "1899 876\n", "1900 123\n", "Name: First year of pulication, Length: 101, dtype: int64" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_s['First year of pulication'].value_counts().sort_index().loc[1800:1900]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- These three steps count the number of books published each year and sort them in chronological order. After slicing, we can plot the number of books by year by appending the `.plot(figsize=(20,5))` method to this sequence of operations (`figsize=(20,5)` regulates the size of the figure)." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df_s['First year of pulication'].value_counts().sort_index().loc[1800:1900].plot(figsize=(20,5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8.4 Additional Examples\n", "\n", "An excellent and more complete introduction to Pandas is available online: [Python Data Science Handbook](\n", "https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas.\n", "\n", "We provide a few more examples of useful code for interrogating Pandas DataFrame. Please consult the book for more information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Select rows based on two or more conditions." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| BL record ID | Type of resource | Name | Dates associated with name | Type of name | Role | All names | Title | Variant titles | Series title | Number within series | ... | Edition | Physical description | Dewey classification | BL shelfmark | Topics | Genre | Languages | Notes | BL record ID for physical resource | First year of pulication |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 14804303 | Monograph | Aubert De La Faige, Geneste Émile | NaN | person | NaN | La Boutresse, Roger de [person] ; Tiersonnier,... | Les Fiefs du Bourbonnais. La Palisse, etc. (Mo... | NaN | NaN | NaN | ... | NaN | 2 tomes (folio) | NaN | Digital Store 10172.i.19 | NaN | NaN | French | NaN | 138105 | 1936 |
| 14804601 | Monograph | Barcelona, Concell de Cent | NaN | organisation | NaN | Barcelona, Concell de Cent [organisation] | Manual de novells ardits vulgarment apellat Di... | NaN | Colecció de documents histórichs inedits de Ar... | NaN | ... | NaN | 17 volumes (8°) | NaN | Digital Store 10161.eee.6 | NaN | NaN | Spanish | NaN | 196800 | 1922 |
| 14804602 | Monograph | Arxiu Municipal Históric (Barcelona) | NaN | organisation | NaN | Arxiu Municipal Históric (Barcelona) [organisa... | Colecció de documents histórichs inédits del A... | NaN | NaN | NaN | ... | NaN | NaN | NaN | Digital Store 10161.eee.6 | NaN | NaN | Spanish | Other volume are entered under the authers names | 196839 | 1922 |
| 14809508 | Monograph | Dubois, Marcel, Maître à l'Ecole normale supér... | NaN | person | NaN | Guy, Camille [person] ; Dubois, Marcel, Maître... | Album géographique [With illustrations.] | NaN | NaN | NaN | ... | NaN | 5 volumes (4°) | NaN | Digital Store 10002.i.5 | NaN | NaN | French | NaN | 991243 | 1906 |
| 14815227 | Monograph | Kirchhoff, Alfred | NaN | person | NaN | Kirchhoff, Alfred [person] | Unser Wissen von der Erde. Allgemeine Erdkunde... | NaN | NaN | NaN | ... | NaN | 4 Band (8°) | NaN | Digital Store 10001.g.6 | NaN | NaN | German | NaN | 1974431 | 1907 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 15742966 | Monograph | Milyukov, Pavel Nikolaevich | NaN | person | NaN | Milyukov, Pavel Nikolaevich [person] | Очерки по исторіи Русской Культуры ... 3-е изд... | NaN | NaN | NaN | ... | NaN | 3 част (8°) | NaN | Digital Store 9454.g.37 | NaN | NaN | Russian | Част. 2 is of the 2nd edition and част. 3 is i... | 2503123 | 1903 |
| 15742968 | Monograph | Morselli, Enrico | NaN | person | NaN | Vigo, G. B. [person] ; Raverdino, G. [person] ... | Antropologia generale. Lezioni su l'uomo secon... | NaN | NaN | NaN | ... | NaN | xxxi, 1395 pages (8°) | NaN | Digital Store 10007.v.5 | NaN | NaN | Italian | NaN | 2558177 | 1911 |
| 16285845 | Monograph | Rae, Milne, Mrs | 1844-1933 | person | NaN | Rae, Milne, Mrs, 1844-1933 [person] | Bride Lorraine | NaN | NaN | NaN | ... | NaN | 192 pages | NaN | NaN | NaN | NaN | NaN | NaN | 3028314 | 1912 |
| 16287370 | Monograph | Lima, Archer de | NaN | person | NaN | Lima, Archer de [person] ; Teisser, P. Carducc... | Due Vite. Poema doloroso. (Traduzione di P. Ca... | NaN | NaN | NaN | ... | NaN | 48 pages (8°) | NaN | Digital Store 011650.i.6. (2.) | NaN | NaN | Italian | NaN | 2170266 | 1909 |
| 16288441 | Monograph | Smith, D. | NaN | person | writer | Smith, D., writer [person] | Songs | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3414694 | 1909 |
\n", "

108 rows × 24 columns

\n", "
" ], "text/plain": [ " Type of resource \\\n", "BL record ID \n", "14804303 Monograph \n", "14804601 Monograph \n", "14804602 Monograph \n", "14809508 Monograph \n", "14815227 Monograph \n", "... ... \n", "15742966 Monograph \n", "15742968 Monograph \n", "16285845 Monograph \n", "16287370 Monograph \n", "16288441 Monograph \n", "\n", " Name \\\n", "BL record ID \n", "14804303 Aubert De La Faige, Geneste Émile \n", "14804601 Barcelona, Concell de Cent \n", "14804602 Arxiu Municipal Históric (Barcelona) \n", "14809508 Dubois, Marcel, Maître à l'Ecole normale supér... \n", "14815227 Kirchhoff, Alfred \n", "... ... \n", "15742966 Milyukov, Pavel Nikolaevich \n", "15742968 Morselli, Enrico \n", "16285845 Rae, Milne, Mrs \n", "16287370 Lima, Archer de \n", "16288441 Smith, D. \n", "\n", " Dates associated with name Type of name Role \\\n", "BL record ID \n", "14804303 NaN person NaN \n", "14804601 NaN organisation NaN \n", "14804602 NaN organisation NaN \n", "14809508 NaN person NaN \n", "14815227 NaN person NaN \n", "... ... ... ... \n", "15742966 NaN person NaN \n", "15742968 NaN person NaN \n", "16285845 1844-1933 person NaN \n", "16287370 NaN person NaN \n", "16288441 NaN person writer \n", "\n", " All names \\\n", "BL record ID \n", "14804303 La Boutresse, Roger de [person] ; Tiersonnier,... \n", "14804601 Barcelona, Concell de Cent [organisation] \n", "14804602 Arxiu Municipal Históric (Barcelona) [organisa... \n", "14809508 Guy, Camille [person] ; Dubois, Marcel, Maître... \n", "14815227 Kirchhoff, Alfred [person] \n", "... ... \n", "15742966 Milyukov, Pavel Nikolaevich [person] \n", "15742968 Vigo, G. B. [person] ; Raverdino, G. [person] ... \n", "16285845 Rae, Milne, Mrs, 1844-1933 [person] \n", "16287370 Lima, Archer de [person] ; Teisser, P. Carducc... \n", "16288441 Smith, D., writer [person] \n", "\n", " Title \\\n", "BL record ID \n", "14804303 Les Fiefs du Bourbonnais. La Palisse, etc. (Mo... \n", "14804601 Manual de novells ardits vulgarment apellat Di... 
\n", "14804602 Colecció de documents histórichs inédits del A... \n", "14809508 Album géographique [With illustrations.] \n", "14815227 Unser Wissen von der Erde. Allgemeine Erdkunde... \n", "... ... \n", "15742966 Очерки по исторіи Русской Культуры ... 3-е изд... \n", "15742968 Antropologia generale. Lezioni su l'uomo secon... \n", "16285845 Bride Lorraine \n", "16287370 Due Vite. Poema doloroso. (Traduzione di P. Ca... \n", "16288441 Songs \n", "\n", " Variant titles \\\n", "BL record ID \n", "14804303 NaN \n", "14804601 NaN \n", "14804602 NaN \n", "14809508 NaN \n", "14815227 NaN \n", "... ... \n", "15742966 NaN \n", "15742968 NaN \n", "16285845 NaN \n", "16287370 NaN \n", "16288441 NaN \n", "\n", " Series title \\\n", "BL record ID \n", "14804303 NaN \n", "14804601 Colecció de documents histórichs inedits de Ar... \n", "14804602 NaN \n", "14809508 NaN \n", "14815227 NaN \n", "... ... \n", "15742966 NaN \n", "15742968 NaN \n", "16285845 NaN \n", "16287370 NaN \n", "16288441 NaN \n", "\n", " Number within series ... Edition Physical description \\\n", "BL record ID ... \n", "14804303 NaN ... NaN 2 tomes (folio) \n", "14804601 NaN ... NaN 17 volumes (8°) \n", "14804602 NaN ... NaN NaN \n", "14809508 NaN ... NaN 5 volumes (4°) \n", "14815227 NaN ... NaN 4 Band (8°) \n", "... ... ... ... ... \n", "15742966 NaN ... NaN 3 част (8°) \n", "15742968 NaN ... NaN xxxi, 1395 pages (8°) \n", "16285845 NaN ... NaN 192 pages \n", "16287370 NaN ... NaN 48 pages (8°) \n", "16288441 NaN ... NaN NaN \n", "\n", " Dewey classification BL shelfmark Topics \\\n", "BL record ID \n", "14804303 NaN Digital Store 10172.i.19 NaN \n", "14804601 NaN Digital Store 10161.eee.6 NaN \n", "14804602 NaN Digital Store 10161.eee.6 NaN \n", "14809508 NaN Digital Store 10002.i.5 NaN \n", "14815227 NaN Digital Store 10001.g.6 NaN \n", "... ... ... ... 
\n", "15742966 NaN Digital Store 9454.g.37 NaN \n", "15742968 NaN Digital Store 10007.v.5 NaN \n", "16285845 NaN NaN NaN \n", "16287370 NaN Digital Store 011650.i.6. (2.) NaN \n", "16288441 NaN NaN NaN \n", "\n", " Genre Languages \\\n", "BL record ID \n", "14804303 NaN French \n", "14804601 NaN Spanish \n", "14804602 NaN Spanish \n", "14809508 NaN French \n", "14815227 NaN German \n", "... ... ... \n", "15742966 NaN Russian \n", "15742968 NaN Italian \n", "16285845 NaN NaN \n", "16287370 NaN Italian \n", "16288441 NaN NaN \n", "\n", " Notes \\\n", "BL record ID \n", "14804303 NaN \n", "14804601 NaN \n", "14804602 Other volume are entered under the authers names \n", "14809508 NaN \n", "14815227 NaN \n", "... ... \n", "15742966 Част. 2 is of the 2nd edition and част. 3 is i... \n", "15742968 NaN \n", "16285845 NaN \n", "16287370 NaN \n", "16288441 NaN \n", "\n", " BL record ID for physical resource First year of pulication \n", "BL record ID \n", "14804303 138105 1936 \n", "14804601 196800 1922 \n", "14804602 196839 1922 \n", "14809508 991243 1906 \n", "14815227 1974431 1907 \n", "... ... ... \n", "15742966 2503123 1903 \n", "15742968 2558177 1911 \n", "16285845 3028314 1912 \n", "16287370 2170266 1909 \n", "16288441 3414694 1909 \n", "\n", "[108 rows x 24 columns]" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_s[(df_s['First year of pulication'] > 1900) & (df_s['Languages'] != 'English')]" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BL record ID\n", "14815227 Unser Wissen von der Erde. Allgemeine Erdkunde...\n", "14823900 Russland in Asien. Bd. 1-11\n", "14847330 Versuch über die Ungleichheit der Menschenrace...\n", "14861142 Weltgeschichte ... Dritte verbesserte Auflage....\n", "14867365 Geschichte der Stadt Pressburg ... 
Herausgegeb...\n", "14867456 Die Geschichte Husums im Rahmen der Geschichte...\n", "14867692 Das Bisthum Augsburg, historisch und statistis...\n", "14867824 Die Gemeinde-Verwaltung der Reichshaupt und Re...\n", "14867867 Kulmbach und die Plassenburg in alter und neue...\n", "14871193 Nordische Fahrten. Skizzen und Studien\n", "14891317 Geschichte des Königreichs Hannover, etc. 2 Tl...\n", "14891669 Die Berner Chronik des Diebold Schilling, 1468...\n", "14896011 Oesterreichischer Erbfolge-Krieg 1740-1748. Na...\n", "14896217 Die Könige der Germanen. Das Wesen des älteste...\n", "14896882 Fürst Bismarck und der Bundesrat\n", "14897057 Monographien zur deutschen Kulturgeschichte, h...\n", "14897106 Geschichte der Siebenbürger Sachsen für das sä...\n", "14905437 Das Herzogtum Schleswig in seiner ethnographis...\n", "14905541 Geschichte der k. und k. Wehrmacht, etc\n", "14912765 Bibliothek deutscher Geschichte ... Herausgege...\n", "14939924 Bibliothek livländischer Geschichte herausgege...\n", "15105866 Acta Publica. Verhandlungen und Correspondenze...\n", "15115747 Englische Geschichte im achtzehnten Jahrhundert\n", "Name: Title, dtype: object" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_s[(df_s['First year of pulication'] > 1900) & (df_s['Languages'] == 'German')]['Title']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Compare subsections of the corpus and plot a timeline." ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "df_en = df_s[df_s['Languages'] == 'English']\n", "df_de = df_s[df_s['Languages'] == 'German']" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Books per year, 1800-1900; the labels tell the two lines apart in the legend\n", "df_en['First year of pulication'].value_counts().sort_index().loc[1800:1900].plot(label='English', legend=True)\n", "df_de['First year of pulication'].value_counts().sort_index().loc[1800:1900].plot(label='German', legend=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Plot the prevalence of words in titles over time." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "import re\n", "# Match 'woman' or 'women' as whole words, ignoring case\n", "pattern = re.compile(r'\\bwoman\\b|\\bwomen\\b', flags=re.I)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Women', 'woman']" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pattern.findall('Women woman bwomena')" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "def in_title(title, pattern):\n", "    # str() guards against missing titles, which Pandas stores as NaN\n", "    return bool(pattern.search(str(title)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Use `.apply()` with an additional keyword argument." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] } ], "source": [ "df_s['women'] = df_s.Title.apply(in_title, pattern=pattern)" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "df_s[df_s['women']]['First year of pulication'].value_counts().sort_index().plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fin." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }