---
title: Data Mining the Internet Archive Collection
layout: lesson
date: 2014-03-03
authors:
- Caleb McDaniel
reviewers:
- Adam Crymble
editors:
- William J. Turkel
- Adam Crymble
difficulty: 2
exclude_from_check:
- review-ticket
activity: acquiring
topics: [web-scraping]
abstract: "The collections of the Internet Archive include many digitized historical sources. Many contain rich bibliographic data in a format called MARC. In this lesson, you'll learn how to use Python to automate the downloading of large numbers of MARC files from the Internet Archive and the parsing of MARC records for specific information such as authors, places of publication, and dates. The lesson can be applied more generally to other Internet Archive files and to MARC records found elsewhere."
redirect_from: /lessons/data-mining-the-internet-archive
avatar_alt: Group of men working in a mine
doi: 10.46430/phen0035
---

{% include toc.html %}

Lesson Goals
------------

The collections of the [Internet Archive][] (IA) include many digitized sources of interest to historians, including [early JSTOR journal content][], [John Adams's personal library][], and the [Haiti collection][] at the John Carter Brown Library. In short, to quote Programming Historian [Ian Milligan][], "The Internet Archive rocks."

In this lesson, you'll learn how to download files from such collections using a Python module specifically designed for the Internet Archive. You will also learn how to use another Python module designed for parsing MARC XML records, a widely used standard for formatting bibliographic metadata.

For demonstration purposes, this lesson will focus on working with the digitized version of the [Anti-Slavery Collection][] at the Boston Public Library in Copley Square. We will first download a large collection of MARC records from this collection, and then use Python to retrieve and analyze bibliographic information about items in the collection. For example, by the end of this lesson, you will be able to create a list of every named place from which a letter in the antislavery collection was written, which you could then use for a mapping project or some other kind of analysis.

For Whom Is This Useful?
------------------------

This intermediate lesson is good for users of the Programming Historian who have completed general lessons on downloading files and performing text analysis on them, but would like an applied example of these principles. It will also be of interest to historians or archivists who work with the MARC format or the Internet Archive on a regular basis.

Before You Begin
----------------

To write scripts that interact with the Internet Archive, you will first need to [create an IA account](https://archive.org/account/login.createaccount.php). Follow the steps to confirm your account and carefully note down your email address and password.

We will be working with two Python modules that are not included in Python's standard library. The first, [internetarchive][], provides programmatic access to the Internet Archive. The second, [pymarc][], makes it easier to parse MARC records.

The easiest way to download both is to use pip, the Python package manager. Begin by installing `pip` using Fred Gibbs's lesson [Installing Python Modules with pip][]. Then issue these commands at the command line.

To install `internetarchive`:

``` bash
sudo pip install internetarchive
```

Now you will need to configure your computer so that the new package will work.
Type `ia configure` at the command line, and then enter the email address and password you used above to create your Internet Archive account.

To install `pymarc`:

``` bash
sudo pip install pymarc
```

Now you are ready to go to work!

The Antislavery Collection at the Internet Archive
--------------------------------------------------

The Boston Public Library's anti-slavery collection at Copley Square contains not only the letters of William Lloyd Garrison, one of the icons of the American abolitionist movement, but also large collections of letters by and to reformers somehow connected to him. And by "large collection," I mean large. According to the library's estimates, there are over 16,000 items at Copley. As of this writing, approximately 7,000 of those items have been digitized and uploaded to the [Internet Archive][].

This is good news, not only because the Archive is committed to making its considerable cultural resources free for download, but also because each uploaded item is paired with a wealth of metadata suitable for machine-reading. Take [this letter][] sent by Frederick Douglass to William Lloyd Garrison. Anyone can read the [original manuscript][] online, without making the trip to Boston, and that alone may be enough to revolutionize and democratize future abolitionist historiography. But you can also download [multiple files][] related to the letter that are rich in metadata, like a [Dublin Core][] record and a fuller [MARCXML][] record that uses the [Library of Congress's MARC 21 Format for Bibliographic Data][].

Stop and think about that for a moment: every item uploaded from the Collection contains these things. Right now, that means historians have access to rich metadata, full images, and partial descriptions for [thousands of antislavery letters, manuscripts, and publications][].

Accessing an IA Collection in Python
------------------------------------

Internet Archive (IA) collections and items all have a unique identifier, and URLs to collections and items all look like this:

```
http://archive.org/details/[IDENTIFIER]
```

So, for example, here is a URL to the Archive item discussed above, Douglass's letter to Garrison:

```
http://archive.org/details/lettertowilliaml00doug
```

And here is a URL to the entire antislavery collection at the Boston Public Library:

```
http://archive.org/details/bplscas/
```

Because the URLs are so similar, the only way to tell that you are looking at a collection page, instead of an individual item page, is to examine the page layout. An item page usually has a lefthand sidebar that says "View the Book" and lists links for reading the item online or accessing other file formats. A collection page will probably have a "Spotlight Item" in the lefthand sidebar instead. You can browse to different collections through the [eBook and Texts][] portal, and you may also want to read a little bit about [the way that items and item URLs are structured][].

Once you have a collection's identifier—in this case, `bplscas`—seeing all of the items in the collection is as easy as navigating to the Archive's [advanced search][] page, selecting the identifier from the drop-down menu next to "Collection," and hitting the search button. Performing that search with `bplscas` selected returns [this page][], which as of this writing showed 7,029 results. We can also [search the Archive using the Python module that we installed][], and doing so makes it easier to iterate over all the items in the collection for purposes of further inspection and downloading.
For example, let's modify the sample code from the module's documentation to see if we can tell, with Python, how many items are in the digital Antislavery Collection. The sample code looks something like what you see below. The only difference is that instead of importing only the `search_items` function from `internetarchive`, we are going to import the whole module.

``` python
import internetarchive
search = internetarchive.search_items('collection:nasa')
print search.num_found
```

All we should need to modify is the collection identifier, from `nasa` to `bplscas`. After starting your computer's Python interpreter, try entering each of the above lines, pressing Enter after each, but modify the collection identifier in the second command:

``` python
search = internetarchive.search_items('collection:bplscas')
```

After hitting enter on the print command, you should see a number that matches the number of results you saw when doing [the advanced search for the collection][] in the browser.

Accessing an IA Item in Python
------------------------------

The `internetarchive` module also allows you to access individual items using their identifiers. Let's try that using the [documentation's sample code][downloading], modifying it in order to get the Douglass letter we discussed earlier.

If you are still at your Python interpreter's command prompt, you don't need to `import internetarchive` again. Since we imported the whole module, we also need to modify the sample code so that our interpreter will know that `get_item` is from the `internetarchive` module. We also need to change the sample identifier `stairs` to our item identifier, *lettertowilliaml00doug* (note that the character before the two zeroes is a lowercase L, not the number 1):

``` python
item = internetarchive.get_item('lettertowilliaml00doug')
item.download()
```

Enter each of those lines in your interpreter, followed by enter. Depending on your Internet connection speed, it will now probably take a minute or two for the command prompt to return, because your computer is downloading all of the files associated with that item, including some pretty large images. But when it's done downloading, you should see a new directory on your computer whose name is the item identifier. To check, first exit your Python interpreter:

``` python
exit()
```

Then list the contents of the current directory to see if a folder now appears named `lettertowilliaml00doug`. If you list the contents of that folder, you should see a list of files similar to this:

```
39999066767938.djvu
39999066767938.epub
39999066767938.gif
39999066767938.pdf
39999066767938_abbyy.gz
39999066767938_djvu.txt
39999066767938_djvu.xml
39999066767938_images.zip
39999066767938_jp2.zip
39999066767938_scandata.xml
lettertowilliaml00doug_archive.torrent
lettertowilliaml00doug_dc.xml
lettertowilliaml00doug_files.xml
lettertowilliaml00doug_marc.xml
lettertowilliaml00doug_meta.mrc
lettertowilliaml00doug_meta.xml
lettertowilliaml00doug_metasource.xml
```

Now that we know how to use the Search and Item functions in the `internetarchive` module, we can turn to thinking about how to make this process more effective for downloading lots of information from the collection for further analysis.

Downloading MARC Records from a Collection
------------------------------------------

Downloading one item is nice, but what if we want to look at thousands of items in a collection? We're in luck, because the `internetarchive` module's Search function allows us to iterate over all the results in a search.
To see how, let's first start our Python interpreter again. We'll need to import our module again, and perform our search again:

``` python
import internetarchive
search = internetarchive.search_items('collection:bplscas')
```

Now let's enter the documentation's sample code for printing out the item identifier of every item returned by our search:

``` python
for result in search:
    print result['identifier']
```

Note that after entering the first line, your Python interpreter will automatically print an ellipsis on line two. This is because you have started a *for loop,* and Python is expecting there to be more. It wants to know what you want to do for each result in the search. That's also why, once you hit enter on the second line, you'll see a third line with another ellipsis, because Python doesn't know whether you are finished telling it what to do with each result. Hit enter again to end the for loop and execute the command.

You should now see your terminal begin to print out the identifiers for each result returned by our *bplscas* search—in this case, all 7,029 of them! You can interrupt the printout by hitting `Ctrl-C` on your keyboard, which will return you to the prompt.

If you didn't see identifiers printing out to your screen, but instead saw an error like this, you may have forgotten to enter a few spaces before your print command:

``` python
for result in search:
print result['identifier']
  File "<stdin>", line 2
    print result['identifier']
        ^
IndentationError: expected an indented block
```

Remember that whitespace matters in Python, and you need to indent the lines in a for loop so that Python can tell which command(s) to perform on each item in the loop.

Understanding the for loop
--------------------------

The *for loop,* expressed in plain English, tells Python to do something to each thing in a collection of things. In the above case, we printed the identifier for each result in the results of our collection search. Two additional points about the *for loop:*

First, the word we used after `for` is what's called a *local variable* in Python. It serves as a placeholder for whatever instance or item we are going to be working with inside the loop. Usually it makes sense to pick a name that describes what kind of thing we are working with—in this case, a search result—but we could have used other names in place of that one. For example, try running the above for loop again, but substitute a different name for the local variable, such as:

``` python
for item in search:
    print item['identifier']
```

You should get the same results.

The second thing to note about the *for loop* is that the indented block could have contained other commands. In this case, we printed each individual search result's identifier. But we could have chosen to do, for each result, anything that we could do to an individual Internet Archive item. For example, earlier we downloaded all the files associated with the item *lettertowilliaml00doug.* We could have done the same thing to each item returned by our search by replacing the `print result['identifier']` line in our *for loop* with lines that fetch each item and download its files, which is exactly what our script in the next section will do.

We probably want to think twice before doing that, though—downloading all the files for each of the 7,029 items in the bplscas collection is a lot of files. Fortunately, the download function in the `internetarchive` module also allows you to [download specific files associated with an item][downloading].
If we had only wanted to download the MARC XML record associated with a particular item, we could have instead done this:

``` python
item = internetarchive.get_item('lettertowilliaml00doug')
marc = item.get_file('lettertowilliaml00doug_marc.xml')
marc.download()
```

Because Internet Archive [item files are named according to specific rules][], we can also figure out the name of the MARC file we want just by knowing the item's unique identifier. And armed with that knowledge, we can proceed to …

Download All the MARC XML Files from a Collection
-------------------------------------------------

For the next section, we're going to move from using the Python shell to writing a Python script that downloads the MARC record from each item in the BPL Antislavery Collection. Try putting this script into Komodo or your preferred text editor:

``` python
#!/usr/bin/python

import internetarchive

search = internetarchive.search_items('collection:bplscas')

for result in search:
    itemid = result['identifier']
    item = internetarchive.get_item(itemid)
    marc = item.get_file(itemid + '_marc.xml')
    marc.download()
    print "Downloading " + itemid + " ..."
```

This script looks a lot like the experiments we have done above with the Frederick Douglass letter, but since we want to download the MARC record for each item returned by our collection search, we are using an `itemid` variable to account for the fact that the identifier and filename will be different for each result.

Before running this script (which, I should note, is going to download thousands of small XML files to your computer), make a directory where you want those MARC records to be stored and place the above script in that directory. Then run the script from within the directory so that the files will be downloaded in an easy-to-find place. (Note that if you receive what looks like a `ConnectionError` on your first attempt, check your Internet connection, wait a few minutes, and then try running the script again.)

If all goes well, when you run your script, you should see the program begin to print out status updates telling you that it is downloading MARC records. But allowing the script to run its full course will probably take a couple of hours, so let's stop the script and look a little more closely at ways to improve it. Pressing `Ctrl-C` while in your terminal window should make the script stop.

Building Error Reporting into the Script
----------------------------------------

Since downloading all of these records will take some time, we are probably going to want to walk away from our computer for a while. But the chances are high that during those two hours, something could go wrong that would prevent our script from working.

Let's say, for example, that we had forgotten that we already downloaded an individual file into this directory. Or maybe your computer briefly loses its Internet connection, or some sort of outage happens on the Internet Archive server that prevents the script from getting the file it wants.

In those and other error cases, Python will raise an "exception" telling you what the problem is. Unfortunately, an unhandled exception will also crash your script instead of letting it continue on to the next item. To prevent this, we can use what's called a *try statement* in Python, which does exactly what it sounds like: the statement will try to execute a certain snippet of code, and if that code raises an exception, you can give it some other code to execute instead.
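To make the pattern concrete before we touch our downloading script, here is a minimal, self-contained sketch of a *try statement*; the filename in it is made up purely for illustration:

``` python
try:
    # Attempt something that might fail, such as opening a file
    # that may or may not exist (this filename is hypothetical)
    f = open('some-file-that-may-not-exist.txt')
except Exception as e:
    # This block runs only if the try block raised an exception
    print 'Something went wrong: %s' % e
else:
    # This block runs only if the try block succeeded
    print 'Opened the file successfully'
    f.close()
```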
You can read more about [handling exceptions][] in the Python documentation, but for now let's just update our above script so that it looks like this:

``` python
#!/usr/bin/python

import internetarchive
import time

error_log = open('bpl-marcs-errors.log', 'a')

search = internetarchive.search_items('collection:bplscas')

for result in search:
    itemid = result['identifier']
    item = internetarchive.get_item(itemid)
    marc = item.get_file(itemid + '_marc.xml')
    try:
        marc.download()
    except Exception as e:
        error_log.write('Could not download ' + itemid + ' because of error: %s\n' % e)
        print "There was an error; writing to log."
    else:
        print "Downloading " + itemid + " ..."
        time.sleep(1)
```

The main thing we've added here, after our module import statements, is a line that opens a text file called `bpl-marcs-errors.log` and prepares it to have text appended to it. We are going to use this file to log exceptions that the script raises.

The *try statement* that we have added to our *for loop* will attempt to download the MARC record. If it can't, it will write a descriptive statement about what went wrong to our log file. That way we can go back to the file later and identify which items we will need to try to download again. If the try clause succeeds and can download the record, then the script will execute the code in the *else* clause.

One other thing we have added, upon successful download, is this line:

``` python
time.sleep(1)
```

This line uses the `time` module that we are now importing at the beginning to tell our script to pause for one second before proceeding, which is basically just a way for us to be nice to Internet Archive's servers by not clobbering them every millisecond or so with a request.

Try updating your script to look like the above lines, and run it again in the directory where you want to store your MARC files. Don't be surprised if you immediately encounter a string of error messages; that means the script is doing what it's supposed to do! Calmly go into your text editor, while leaving the script running, and open `bpl-marcs-errors.log` to see what exceptions have been recorded there. You'll probably see that the script raised the exception "File already exists" for each of the files that you had already downloaded when running our earlier, shorter program. If you leave the program running for a little while, the script will eventually get to items that you have not already downloaded and resume collecting your MARCs!

Scraping Information from a MARC Record
---------------------------------------

Once your download script has completed, you should find yourself in possession of nearly 7,000 detailed MARC XML records about items in the Anti-Slavery Collection (or whichever other collection you may have downloaded instead; the methods above should work on any collection whose items have MARC files attached to them). Now what?

The next step depends on what sort of questions about the collection you want to answer. The MARC format captures a wealth of data about an item, as you can see if you return to [the MARC XML record for the Frederick Douglass letter][MARCXML] mentioned at the outset.

Notice, for example, that the Douglass letter contains information about the place where the letter was written in the *datafield* that is tagged *260,* inside the subfield coded *a.* The person who prepared this MARC record knew to put place information in that specific field because of [rules specified for the 260 datafield][] by the [MARC standards][].
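If you open that MARC XML record in a text editor or browser, the portion we care about looks something like the excerpt below. This is a sketch rather than a verbatim copy: the subfield *a* value matches what we will scrape from this record, but the surrounding elements are abridged and the attribute details may differ slightly.

```
<datafield tag="260" ind1=" " ind2=" ">
  <subfield code="a">Belfast, [Northern Ireland],</subfield>
  <!-- other subfields, such as a date, may follow -->
</datafield>
```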
That means that it should be possible for us to look inside all of the MARC records we have downloaded, grab the information inside of datafield *260,* subfield *a,* and make a list of every place name where items in the collection were published.

To do this, we'll use the other helpful Python module that we downloaded with `pip` at the beginning: [pymarc][1]. That module makes it easy to get information out of subfields. Assuming that we have a MARC record prepared for parsing by the module and assigned to the variable `record`, we could get the information about publication place names this way:

``` python
place_of_pub = record['260']['a']
```

The documentation for `pymarc` is a little less complete than that for the Internet Archive, especially when it comes to parsing XML records. But a little rooting around in the source code for the module reveals some [functions that it provides for working with MARC XML records][]. One of these, called `map_xml()`, is described this way:

``` python
def map_xml(function, *files):
    """
    map a function onto the file, so that for each record that is
    parsed the function will get called with the extracted record

    def do_it(r):
        print r

    map_xml(do_it, 'marc.xml')
    """
```

Translated into plain English, this function means that we can take an XML file containing MARC data (like the nearly 7,000 we now have on our computer), pass it to the `map_xml` function in the `pymarc` module, and then specify another function (that we will write) telling our program what to do with the MARC data retrieved from the XML file. In rough outline, our code will look something like this:

``` python
import pymarc

def get_place_of_pub(record):
    place_of_pub = record['260']['a']
    print place_of_pub

pymarc.map_xml(get_place_of_pub, 'lettertowilliaml00doug_marc.xml')
```

Try saving that code to a script and running it from a directory where you already have the Douglass letter XML saved. If all goes well, the script should spit out this:

```
Belfast, [Northern Ireland],
```

Voila! Of course, this script would be much more useful if we scraped the place of publication from every letter in our collection of MARC records. Putting together what we've learned from earlier in the lesson, we can do that with a script that looks like this:

``` python
#!/usr/bin/python

import os
import pymarc

path = '/path/to/dir/with/xmlfiles/'

def get_place_of_pub(record):
    try:
        place_of_pub = record['260']['a']
        print place_of_pub
    except Exception as e:
        print e

for file in os.listdir(path):
    if file.endswith('.xml'):
        pymarc.map_xml(get_place_of_pub, path + file)
```

This script modifies our above code in several ways. First, it uses a *for loop* to iterate over each file in our directory. In place of the `internetarchive` search results that we iterated over in the first part of this lesson, we iterate over the files returned by `os.listdir(path)`, which uses the built-in Python module `os` to list the contents of the directory specified in the `path` variable. (You will need to modify `path` so that it matches the directory where you have downloaded all of your MARC files.)

We have also added some error handling to our `get_place_of_pub()` function to account for the fact that some records may (for whatever reason) not contain the information we are looking for. The function will try to print the place of publication, but if this raises an exception, it will print out the information returned by the exception instead. There are really two cases to watch for here: if a record has no *260* datafield at all, `pymarc` returns `None` for `record['260']`, and asking `None` for a subfield raises a `TypeError`; if the field exists but lacks a subfield *a,* no exception is raised and the script simply prints `None`.
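If you would like to see both of those behaviors for yourself, here is a minimal sketch, assuming only that `pymarc` is installed, that uses an empty record to trigger them:

``` python
import pymarc

record = pymarc.Record()   # an empty MARC record, with no datafields at all

print record['260']        # pymarc returns None for a missing field

try:
    print record['260']['a']
except Exception as e:
    print e                # subscripting None raises a TypeError
```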
Either way, the printout is descriptive enough of what happened to be useful to us.

Try running this script. If all goes well, your screen should fill with a list of the places where these letters were written. If that works, try modifying your script so that it saves the place names to a text file instead of printing them to your screen (one way of doing this is sketched at the end of this lesson). You could then use the [Counting Frequencies][] lesson to figure out which place names are most common in the collection. You could work with the place names to find coordinates that could be placed on a map using the [Google Maps lesson][]. Or, to get a very rough visual sense of the places where letters were written, you could do what I've done below and simply make a [Wordle word cloud][] of the text file.

{% include figure.html filename="bpl-wordle.png" caption="Wordle wordcloud of places of publication for abolitionist letters" %}

Of course, to make such techniques useful would require more [cleaning of your data][]. And other applications of this lesson may prove even more useful. For example, working with the MARC data fields for personal names, you could create a network of correspondents. Or you could analyze which subjects are common in the MARC records. Now that you have the MARC records downloaded and can use `pymarc` to extract information from the fields, the possibilities can multiply rapidly!
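As promised above, here is a minimal sketch of the file-saving modification. It assumes the same `path` variable as before, and the output filename `places-of-publication.txt` is an arbitrary choice of mine:

``` python
#!/usr/bin/python

import os
import pymarc

path = '/path/to/dir/with/xmlfiles/'
output = open('places-of-publication.txt', 'w')

def get_place_of_pub(record):
    try:
        place_of_pub = record['260']['a']
        # Write each place name on its own line instead of printing it
        output.write(place_of_pub + '\n')
    except Exception as e:
        print e

for file in os.listdir(path):
    if file.endswith('.xml'):
        pymarc.map_xml(get_place_of_pub, path + file)

output.close()
```

Records that lack a place of publication will still report errors to your screen, just as before, while the place names themselves accumulate in the text file.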
[Internet Archive]: http://archive.org/
[early JSTOR journal content]: https://archive.org/details/jstor_ejc
[John Adams's personal library]: https://archive.org/details/johnadamsBPL
[Haiti collection]: https://archive.org/details/jcbhaiti
[Ian Milligan]: http://activehistory.ca/2013/09/the-internet-archive-rocks-or-two-million-plus-free-sources-to-explore/
[Anti-Slavery Collection]: http://archive.org/details/bplscas
[internetarchive]: https://pypi.python.org/pypi/internetarchive
[pymarc]: https://pypi.python.org/pypi/pymarc/
[this letter]: http://archive.org/details/lettertowilliaml00doug
[original manuscript]: http://archive.org/stream/lettertowilliaml00doug/39999066767938#page/n0/mode/2up
[multiple files]: http://archive.org/download/lettertowilliaml00doug
[Dublin Core]: http://archive.org/download/lettertowilliaml00doug/lettertowilliaml00doug_dc.xml
[MARCXML]: http://archive.org/download/lettertowilliaml00doug/lettertowilliaml00doug_marc.xml
[Library of Congress's MARC 21 Format for Bibliographic Data]: http://www.loc.gov/marc/bibliographic/
[thousands of antislavery letters, manuscripts, and publications]: http://archive.org/search.php?query=collection%3Abplscas&sort=-publicdate
[eBook and Texts]: https://archive.org/details/texts
[the way that items and item URLs are structured]: http://blog.archive.org/2011/03/31/how-archive-org-items-are-structured/
[advanced search]: https://archive.org/advancedsearch.php
[this page]: https://archive.org/search.php?query=collection%3A%28bplscas%29
[search the Archive using the Python module that we installed]: http://internetarchive.readthedocs.io/en/latest/quickstart.html#searching
[the advanced search for the collection]: http://archive.org/search.php?query=collection%3Abplscas
[downloading]: http://internetarchive.readthedocs.io/en/latest/quickstart.html#downloading
[remember those?]: /lessons/code-reuse-and-modularity
[item files are named according to specific rules]: https://archive.org/about/faqs.php#140
[handling exceptions]: http://docs.python.org/2/tutorial/errors.html#handling-exceptions
[rules specified for the 260 datafield]: http://www.loc.gov/marc/bibliographic/bd260.html
[MARC standards]: http://www.loc.gov/marc/
[1]: https://github.com/edsu/pymarc
[functions that it provides for working with MARC XML records]: https://github.com/edsu/pymarc/blob/master/pymarc/marcxml.py
[Counting Frequencies]: /lessons/counting-frequencies
[Google Maps lesson]: /lessons/googlemaps-googleearth
[Wordle word cloud]: https://web.archive.org/web/20201202151557/http://www.wordle.net/
[cleaning of your data]: /lessons/cleaning-ocrd-text-with-regular-expressions
[Installing Python Modules with pip]: /lessons/installing-python-modules-pip