---
layout: default
title: "DSC Multilingual Mystery #3: Quinn and Lee Clean Up Ghost Cat Data-Hairballs"
booktitle: "DSC Multilingual Mystery #3: Quinn and Lee Clean Up Ghost Cat Data-Hairballs"
coverart: "dscm3_cover.jpg"
blurb: "Ghost Cat Data-Hairballs are a mess! Time for OpenRefine!"
bookseries: mystery
permalink: /dscm3/
---

# DSC Multilingual Mystery \#3: Quinn and Lee Clean Up Ghost Cat Data-Hairballs

```{index} single: *Book Topics ; Web scraping & OpenRefine (DSC M3)
```

by Lee Skallerup Bessette and Quinn Dombrowski

April 2, 2020

DOI: [https://doi.org/10.25740/yx951qc9382](https://doi.org/10.25740/yx951qc9382)
DSC M3 book cover
```{index} single: Covid ; March 2020
```

## Dear Reader

Dear Reader,

We've been writing this Multilingual Mystery on and off for the last month. During that time, Lee performed non-stop heroics for the emergency shift to online instruction at Georgetown. Quinn finished teaching her Dungeons-and-Dragons style DH role-playing game course virtually, watching as COVID-19 disrupted the lives of the characters -- and the students. For more than two weeks now, the Bay Area has been on lockdown, and Quinn has been running a preschool/kindergarten for three kids 6-and-under while trying to help non-English language and literature faculty at Stanford figure out what to do for spring quarter. Lee is also [working from home with older kids](https://readywriting.org/working-from-home/), and we're both grappling with how to [manage what the school districts are putting together for K-12 virtual education](http://quinndombrowski.com/blog/2020/03/27/pandemic-parenting-pedagogy).

```{index} single: Covid ; data-wrangling as a locus of control
```

Throughout all of this, we kept coming back to writing this Data-Sitters Club "book". We've had to go back and edit it in places (ballet classes are no longer happening), and we've felt moments of estrangement from words written less than a month before. There have been nights when we've been [too exhausted to do anything else](http://quinndombrowski.com/blog/2020/03/21/working-conditions), but this project has been a reassuring distraction. The topics -- data downloading, web scraping, and data cleaning -- may seem dry and technical, but in this strange time there's something comforting about writing up how to have control over _something_... even if it's just text on a screen.

We hope it brings you some comfort, too. If you can put it to use in some way that helps other people, so much the better -- whether it's gathering data about hiring freezes, dubious government statements, or 80's/90's cultural phenomena that might bring a smile to someone's face in a dark time. And if you're not in the mental space right now to tackle it, that's okay too. It'll be here for you when the time is right.

Take care,

Quinn & Lee

P.S. If you want a laugh, check out our ongoing series of [Important Public Health Messages from the Data-Sitters Club](http://datasittersclub.github.io/site/covid19).

## Recap

When we left off with [DSC Multilingual Mystery #2](https://datasittersclub.github.io/site/dscm2.html), we had realized that metadata was going to be essential for just about anything we do with computational text analysis at scale for these translations. And in DSC #4 (forthcoming soon!), Anouk comes to the same conclusions about the original corpus in English. It was time to investigate metadata acquisition and cleaning, and the DSC Multilingual Mystery crew was on the case!

```{index} single: Metadata ; national libraries
```

## Lee

As I mentioned in [DSC Multilingual Mystery #1](https://datasittersclub.github.io/site/dscm1.html), I already had a pretty good working idea about where to get the metadata we needed to identify translators and other publication data for the various foreign-language translations of the series: national libraries. And I wasn't wrong.
The [Bibliothèque et Archives nationales du Québec](https://banq.qc.ca/accueil/), the [Bibliothèque nationale de France](https://www.bnf.fr/fr), the [Biblioteca Nacional de España](http://www.bne.es/es/Inicio/index.html), the [Biblioteca Nazionale Centrale di Firenze](https://www.bncf.firenze.sbn.it/), and the [Koninklijke Bibliotheek](https://www.kb.nl/en) provided me with all of the information we would need. Oh, and the five lonely German translations came from the [Deutsche Nationalbibliothek](https://www.dnb.de/EN/Home/home_node.html).

My methodology was simple: Google the country name and "national library." Put Ann M. Martin in the search box (thank goodness for the universal language of user experience and expectations). If that didn't work, find the word that most closely resembles "catalogue" and click on it to find a different search box. And then, the final and least simple of the steps: export the metadata.

This is where the inconsistencies arose, with each library offering its own idiosyncratic way of getting the metadata out of the library catalogue and into the hands of users. Quebec would only let me export a certain number of results from my search at a time as a CSV file. I decided to limit my searches by time period, since that was the easiest way to narrow them in the interface, so I ended up with one CSV file for each year that the books were being published. France was the easiest to export, allowing me to download my entire search (which included the France and Belgium translations) as a giant CSV file. Spain let me export a page of results at a time as XML that displayed in my browser window, which I copied into a file. Shout-out to the Netherlands for having a functional English interface... and for exporting the search as a simple text file. And then Italy... Italy made me cut and paste every single entry individually as XML, since there was no batch export function. All I gotta say is that y'all are lucky I love to solve a good mystery, and that I only have the mental capacity for menial tasks while waiting during my daughter's ballet class, where I found myself sitting in the most uncomfortable seat imaginable while seven different dance studios practiced five different dance styles, playing seven different songs repeatedly. Sigh. I never thought I'd say this, but I miss those days.

```{index} single: Multilingual DH ; problems exporting accents from national libraries
```

There was often an option to export the results as a simple text file, but the challenge was that simple text often didn't preserve the accented letters, which turned into gibberish and made the data hard to read. So CSV and XML were the closest I could get to relatively universal readability, or at least the cleanest data I could get for Quinn. It wasn't perfect, but it was a good start.

Also, the metadata you extract is only as good as the metadata contained in the book's library entry, and apparently no one in either France or Belgium bothered to note who translated the first handful of Belgian-French translations of the books. So while I extracted all of the data that was available, there are still some mysteries that might only be solved by looking at a physical copy of the books or reaching out to the publisher to see if they have preserved the archives or kept the records somewhere. More mysteries that maybe we'll be able to solve someday, if we're ever allowed to leave our houses again (lol*sob*).
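(A quick aside for readers who end up with a similar pile of per-year CSV exports and are comfortable with a little Python: here's a minimal sketch of how you might stitch them together with pandas before cleaning. The filenames, the glob pattern, and the encoding choice are placeholders, not our actual workflow -- every library's export is a little different, and you may need to experiment with the `encoding` parameter to keep the accents from turning into gibberish.)

```python
# Minimal sketch: combine per-year CSV exports into one file with pandas.
# Filenames here are hypothetical placeholders for whatever your library's
# catalogue gave you.
import glob

import pandas as pd

frames = []
for path in sorted(glob.glob("banq_export_*.csv")):
    # Reading with an explicit encoding is what preserves accented characters;
    # if utf-8 fails for a given export, latin-1 is a common fallback to try.
    frames.append(pd.read_csv(path, encoding="utf-8"))

combined = pd.concat(frames, ignore_index=True)
combined.to_csv("combined_export.csv", index=False, encoding="utf-8")
print(f"Combined {len(frames)} files into {len(combined)} rows")
```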
I was also curious to see if there were other translations in common Romance languages, so I looked in national libraries in South America for other Spanish translations, as well as Brazil and Portugal for Portuguese translations. I came up empty-handed, but it was worth looking to see if there were any other translations in a language that I could recognize. Which brings up the next limitation of my search: I know English and French fluently, have six credits of university Spanish, and feel pretty comfortable navigating an Italian-language library website given my knowledge of French and Spanish. So while there are more translations out there, I didn't feel comfortable enough to rummage around in the national libraries of other languages.

## Quinn

I was amazed at what Lee managed to find in those national library catalog records. There it was: the answers to our question about the number of translations, and who most of the translators were!

One beautiful thing about most national libraries, many museums, and various other online digital collections is that there's often an option for just downloading your search or browsing results, which saves you a _ton of work_ relative to trying to do web scraping to collect the same information. Before you embark on web scraping from any cultural heritage website, be sure to read the documentation and FAQs, Google around, and be extra, extra sure that there's not an easier route to get the data directly. I'm not kidding: about half my "web scraping" consults at work have been resolved by me finding some kind of "download" button.

Downloading the data isn't everything, though: I knew that if we didn't spend the time now to reorganize it into a consistent format that we could easily use to look things up and check for correlations (e.g. between translators and certain character-name inconsistencies), we'd definitely regret it later.

```{index} single: Data ; decision-making around cleaning
```

Now, it's easy to go overboard with data cleaning. Sometimes my library degree gets the better of me, and I start organizing All The Things. (This is in contrast to my usual inclination to just pile things in places and assume it'll all sort itself out when the time is right.) With DH in particular, there's a temptation to clean more than you need, and produce a beautiful data set that has all the information perfectly structured. It's usually driven by altruism towards your future self or other scholars. "Someday, _someone might want this._" And if the data is easy and quick to clean, there's not much lost in that investment. But if it's a gnarly problem, and your future use case is hypothetical, maybe it's worth reconsidering whether it's the best use of your time _right now_. If someone wants a clean version of that data in the future, it's okay to let it be their problem.

In the national library records that Lee found, there's a lot of data like that. Would it be interesting to see the publication cities for all the translations? Sure! Would it be fun to look at differences in subject metadata? Absolutely! (I'm really tempted by that one, truth be told.) But the data for both of those is kind of annoying to clean, and not actually what we need to answer the questions we're working on. Sometimes the hardest part of DH is setting aside all the things you _could do_, to finish the thing that you're _actually doing_.

What I was _actually doing_ was trying to get five pieces of information out of the data that Lee downloaded, for each book:

1. The translated title of the book
2. A unique identifier to connect back to the original book (book number, title, etc.)
3. Which translation series it appears in (e.g. for French: Quebec, Belgium, or France)
4. The date it was published
5. The translator

Anything else in these files -- regardless of how many great ideas I had for what I might do with it -- was a distraction.

But wait! We had already started wondering whether certain things we were finding were quirks of particular ghostwriters. This was the perfect time to scrape and clean a data set that could help us answer those kinds of questions, too. So before we get to cleaning, let's see what else the Ghost Cat of Data-Sitters Club Multilingual Mystery #3 might grace us with, all over the carpet! (Sorry, readers -- what can I say, we're [committed to fidelity to the original book titles](https://babysittersclub.fandom.com/wiki/Mallory_and_the_Ghost_Cat), however tortured it gets.)

If you want to follow along with some of the steps, but not the whole thing, you can check out the [GitHub repo for this book](https://github.com/datasittersclub/dscm3) for the data and configuration files we're working with here.

```{index} single: Web scraping ; with Webscraper.io
```

```{index} single: Tools ; Webscraper.io
```

```{index} single: Webscraper.io
```

### Web scraping

I do a lot of web scraping. When I need to do something simple, quick, and relatively small-scale, I go with Webscraper.io, even though it gives you less flexibility in structuring and exporting your results compared to Python. I knew the [Baby-Sitters Club Wiki](https://babysittersclub.fandom.com/wiki/The_Baby-Sitters_Club_Wiki) on Fandom.com had the data I needed, presented as well-structured metadata on each book page. Webscraper.io was going to be a good tool for this job.

Webscraper.io is a plugin for the Chrome browser, so first you need to [install it from the Chrome store](https://webscraper.io/documentation/installation). Next, you need to 1) access it in your browser by opening the _Developer Tools_ panel, then 2) choose _Web Scraper_ from the tabs at the top of the panel.

![Launching the webscraper plugin](_static/images/dscm3_launchwebscraper.jpg)

```{index} single: Webscraper.io ; creating a sitemap
```

#### Creating a sitemap

Using Webscraper.io's menu, go to _Create new sitemap > Create sitemap_, as shown by arrow 3, above. (Note: if you want to import an existing sitemap, like one of the complete ones that we've posted at the [Data-Sitters Club GitHub repo for this book](https://github.com/datasittersclub/dscm3), you can choose "Import Sitemap" and copy and paste the text from the appropriate sitemap text file into the text field. If you run into trouble with the instructions here, importing one of our sitemaps and playing around with it might help you understand it better.)

The first page is where you put in the start URL of the page you want to scrape. If you're trying to scrape multiple pages of data, and the site uses pagination that involves appending some page number to the end of the URL (e.g. if you go to the second page of results and you see the URL is the same except for something like `?page=2` on the end), you can set a range here, e.g. `http://www.somesite.com/results?page=[1-5]` if there are 5 pages of results. In our case, though, we have eight _different_ web pages we want to scrape for URLs, one for each book (sub-)series.

First, we'll give our scraper a name (it can be almost anything; I went with _bsc_fan_wiki_link_scraper_).
For the URL, I just put in the URL for the main book series. If you have multiple pages (and especially if you're putting in a range), it's often best to start by putting in a single page, setting up the scraper, checking the results, and seeing if you need to make any modifications before you scrape hundreds of pages without capturing what you need. (Trust me -- it's a mistake I've made!)

```{index} single: Webscraper.io ; simple use of selectors
```

#### Handling selectors: easy mode

Web scraping is _a lot easier_ if you know at least a little bit about HTML, how it's structured, and some common elements. Especially once you get into Python-based web scraping, CSS becomes important, too. I was lucky enough to learn HTML in elementary school (my final project in the 5th grade was a Sailor Moon fan site, complete with starry page background and at least one element), so it's hard for me to remember not knowing this stuff. But there are lots of tutorials out there, like this one from [W3Schools](https://www.w3schools.com/html/html_intro.asp), that can get you up to speed on the basics and let you play around with HTML if you're not really comfortable with it yet.

The way we set up a web scraper is related to the way HTML is structured: HTML is made up of nested elements, and if we're doing something complex with our scraper, it'll have a nested structure, too.

For our first scraper, we're _just_ trying to get the URLs of all the links on each page. We'll start by hitting the "Add new selector" button, which takes us to a different interface for choosing the selector. You have to give it a unique ID (I chose _pagelink_), then choose a type from the dropdown menu. For now, we just want _Link_ for the type. Under the _Selector_ header in this interface, check the "multiple" box (there are multiple links on the page), then click the "Select" button.

Now the fun begins! Move your mouse back up to the page, and start clicking on the things you want to capture. After you've clicked on 2-3 of them, the scraper usually gets the idea and highlights the rest. In this case, I started clicking in the "A's", and after two clicks it picked up the rest of the entries filed under that letter, but I had to click in the results for a different letter before it picked up _everything_.

![Selecting links for the scraper](_static/images/dscm3_selecting_links.jpg)

When everything you want is highlighted in red, click the blue "Done selecting" button. What I then had was `ul:nth-of-type(n+3) a.category-page__member-link`. If you're not comfortable with HTML, you might just be relieved that the computer sorted out all _that_ mess. But if you know how to read this, you should be concerned.
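(If you'd like to sanity-check a selector like this outside the browser -- or you're curious what the Python route looks like -- here's a minimal sketch using `requests` and BeautifulSoup to print whatever that CSS selector actually matches. The URL is a placeholder you'd swap for the category page you're scraping; this is just an inspection aid, not our actual workflow.)

```python
# Minimal sketch: see what a CSS selector from Webscraper.io actually grabs.
# Assumes `pip install requests beautifulsoup4`; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup

url = "https://babysittersclub.fandom.com/wiki/SOME_CATEGORY_PAGE"  # placeholder
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# The same selector the point-and-click tool generated for the link list.
for link in soup.select("ul:nth-of-type(n+3) a.category-page__member-link"):
    print(link.get_text(strip=True), link.get("href"))
```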