{ "cells": [ { "cell_type": "markdown", "id": "a235c0fa", "metadata": {}, "source": [ "# DNBLab Tutorial: Cleaning and Merging Data with OpenRefine" ] }, { "cell_type": "markdown", "id": "7efab486", "metadata": {}, "source": [ "## Part 1: Retrieving Data via the SRU Interface" ] }, { "cell_type": "markdown", "id": "a835ec16", "metadata": {}, "source": [ "The data basis is the metadata of the digitization project \"100 Bände Klassik\". It contains well-known classic works by, among others, Theodor Fontane, J.W. von Goethe, and Rainer Maria Rilke, and is therefore particularly well suited as a first introduction to data enrichment, because the authors already have comprehensive entries in the GND. " ] }, { "cell_type": "markdown", "id": "7bd4b171", "metadata": {}, "source": [ "The data is retrieved via the SRU interface and saved to a CSV file for further processing. " ] }, { "cell_type": "markdown", "id": "c3c1bf1f", "metadata": {}, "source": [ "## Setting Up the Working Environment" ] }, { "cell_type": "markdown", "id": "f3d66639", "metadata": {}, "source": [ "To set up the working environment for the following steps, first import the required Python libraries. `Requests` https://docs.python-requests.org/en/latest/ is used for queries via the SRU interface. The XML responses are parsed with `BeautifulSoup`, and individual records are processed with `etree` (imported from `lxml` in the code below; its API largely follows the standard library's ElementTree, https://docs.python.org/3/library/xml.etree.elementtree.html). `Pandas` https://pandas.pydata.org/ is used to hold the elements read from the MARC21 records as a table."
] }, { "cell_type": "code", "execution_count": 25, "id": "17e5d771", "metadata": { "tags": [] }, "outputs": [], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "import unicodedata\n", "from lxml import etree\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "681af64f", "metadata": {}, "source": [ "## SRU Query with Output in MARC21-xml" ] }, { "cell_type": "markdown", "id": "780d8207", "metadata": {}, "source": [ "The function `dnb_sru` takes the \"query\" parameter of the SRU request and returns all results as a list of records. Each request returns at most 100 records; if there are more, further records are fetched with \"&startRecord=101\", then 201, and so on (possible values 1 to 99,000). Further information on the SRU interface and its functions is available at https://www.dnb.de/sru." ] }, { "cell_type": "code", "execution_count": 26, "id": "201fc688", "metadata": { "tags": [] }, "outputs": [], "source": [ "def dnb_sru(query):\n", "    base_url = \"https://services.dnb.de/sru/dnb\"\n", "    params = {\n", "        'recordSchema': 'MARC21-xml',\n", "        'operation': 'searchRetrieve',\n", "        'version': '1.1',\n", "        'maximumRecords': '100',\n", "        'query': query\n", "    }\n", "\n", "    r = requests.get(base_url, params=params)\n", "    # Use the XML parser\n", "    xml = BeautifulSoup(r.content, features=\"xml\")\n", "    records = xml.find_all('record', {'type': 'Bibliographic'})\n", "\n", "    if len(records) < 100:\n", "        return records\n", "    else:\n", "        num_records_fetched = 100  # Number of records fetched by the last request\n", "        start_record = 101  # Start index for the next request\n", "        while num_records_fetched == 100:\n", "            params.update({'startRecord': start_record})\n", "            r = requests.get(base_url, params=params)\n", "            # Use the XML parser\n", "            xml = BeautifulSoup(r.content, features=\"xml\")\n", "            new_records = xml.find_all('record', {'type': 'Bibliographic'})\n", "            records += new_records\n", "            start_record += 100\n", "            
num_records_fetched = len(new_records)\n", "\n", "        return records" ] }, { "cell_type": "markdown", "id": "47b0ba64", "metadata": {}, "source": [ "## Searching a MARC Field" ] }, { "cell_type": "markdown", "id": "008c6382", "metadata": {}, "source": [ "The function `parse_record` takes a single record as its parameter, uses XPath to extract the desired information, and returns it as a dictionary. Fields that are missing from a record are returned as \"unknown\". The key-value pairs can be adapted and extended as needed. \n", "\n", "Tip: Have a look at our field overview for MARC21. \n", "https://www.dnb.de/SharedDocs/Downloads/DE/Professionell/Services/efa2023HandoutInhalteInMarc.pdf?__blob=publicationFile&v=2 " ] }, { "cell_type": "code", "execution_count": 27, "id": "e5a5d20a", "metadata": { "tags": [] }, "outputs": [], "source": [ "def parse_record(record):\n", "    ns = {\"marc\": \"http://www.loc.gov/MARC21/slim\"}\n", "    xml = etree.fromstring(unicodedata.normalize(\"NFC\", str(record)))\n", "\n", "    def safe_xpath(xpath_expr):\n", "        elements = xml.xpath(xpath_expr, namespaces=ns)\n", "        return elements[0].text if elements else \"unknown\"\n", "\n", "    # IDN\n", "    idn = safe_xpath(\"marc:controlfield[@tag = '001']\")\n", "\n", "    # Title\n", "    titel = safe_xpath(\"marc:datafield[@tag = '245']/marc:subfield[@code = 'a']\")\n", "\n", "    # Year of publication\n", "    jahr = safe_xpath(\"marc:datafield[@tag = '264']/marc:subfield[@code = 'c']\")\n", "\n", "    # Author\n", "    verfasser = safe_xpath(\"marc:datafield[@tag = '100']/marc:subfield[@code = 'a']\")\n", "\n", "    # GND-ID\n", "    gnd_id = safe_xpath(\"marc:datafield[@tag = '100']/marc:subfield[@code = '0']\")\n", "\n", "    # URN\n", "    urn = safe_xpath(\"marc:datafield[@tag = '856']/marc:subfield[@code = 'u']\")\n", "\n", "    # Publisher\n", "    verlag = safe_xpath(\"marc:datafield[@tag = '264']/marc:subfield[@code = 'b']\")\n", "\n", "    # Place of publication\n", "    verlagsort = safe_xpath(\"marc:datafield[@tag = '264']/marc:subfield[@code = 
'a']\")\n", "\n", "    meta_dict = {\n", "        \"idn\": idn,\n", "        \"titel\": titel,\n", "        \"jahr\": jahr,\n", "        \"verfasser\": verfasser,\n", "        \"gnd_id\": gnd_id,\n", "        \"urn\": urn,\n", "        \"verlag\": verlag,\n", "        \"verlagsort\": verlagsort\n", "    }\n", "\n", "    return meta_dict\n" ] }, { "cell_type": "markdown", "id": "a74578c3", "metadata": {}, "source": [ "The digitization project \"100 Bände Klassik\" can be queried as a dataset using the project code \"d002\", and the size of the result set can be printed." ] }, { "cell_type": "code", "execution_count": 28, "id": "5bcba3f8", "metadata": { "scrolled": false, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "108 Ergebnisse\n" ] } ], "source": [ "records = dnb_sru('cod=d002')\n", "print(len(records), 'Ergebnisse')" ] }, { "cell_type": "markdown", "id": "da801857", "metadata": {}, "source": [ "## CSV Download" ] }, { "cell_type": "markdown", "id": "ed6c3f37", "metadata": {}, "source": [ "For data cleaning and data enrichment, working in CSV format is recommended. To this end, the search results are displayed below in a DataFrame (table) and then downloaded for further processing. " ] }, { "cell_type": "code", "execution_count": 29, "id": "684a40c6", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
|  | idn | titel | jahr | verfasser | gnd_id | urn | verlag | verlagsort |
|---|---|---|---|---|---|---|---|---|
| 0 | 1003104487 | Egmont | [1946] | Goethe, Johann Wolfgang von | (DE-588)118540238 | https://nbn-resolving.org/urn:nbn:de:101:2-201... | Schöningh | Paderborn |
| 1 | 999490184 | Das Amulett | [1939] | Meyer, Conrad Ferdinand | (DE-588)118581775 | https://nbn-resolving.org/urn:nbn:de:101:2-201... | Verl. Dt. Volksbücher | Wiesbaden |
| 2 | 1000047377 | Der Struwwelpeter oder lustige Geschichten u... | [1939] | Hoffmann, Heinrich | (DE-588)11855249X | https://nbn-resolving.org/urn:nbn:de:101:2-201... | [Loewe] | [Stuttgart] |
| 3 | 1000290328 | Der Zweikampf | 1939 | Kleist, Heinrich von | (DE-588)118563076 | https://nbn-resolving.org/urn:nbn:de:101:2-201... | Kohlhammer | Stuttgart |
| 4 | 99962461X | Heidi | 1939 | Spyri, Johanna | (DE-588)118616455 | https://nbn-resolving.org/urn:nbn:de:101:2-201... | Rascher | Zürich |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 103 | 1000746348 | Leyer und Schwerdt | 1913 | Körner, Theodor | (DE-588)118713507 | https://nbn-resolving.org/urn:nbn:de:101:2-201... | Morawe & Scheffelt | Berlin |
| 104 | 100003917X | Schillers Wallenstein | [1913] | Schiller, Friedrich | (DE-588)118607626 | https://nbn-resolving.org/urn:nbn:de:101:2-201... | Dt. Bibliothek | Berlin |
| 105 | 1000062104 | Vor dem Sturm | 1913 | Fontane, Theodor | (DE-588)118534262 | https://nbn-resolving.org/urn:nbn:de:101:2-201... | Cotta | Stuttgart |
| 106 | 1000775615 | Der Tod des Tizian | [1912] | Hofmannsthal, Hugo von | (DE-588)118552759 | https://nbn-resolving.org/urn:nbn:de:101:2-201... | Insel-Verl. | Leipzig |
| 107 | 1000778517 | Die drei gerechten Kammacher | [1903] | Keller, Gottfried | (DE-588)11856109X | https://nbn-resolving.org/urn:nbn:de:101:2-201... | Cotta | Stuttgart |

108 rows × 8 columns
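
The code that builds this table and writes the CSV file is not shown in this section. As a rough illustration only, the following sketch turns a list of dictionaries shaped like the return value of `parse_record` into a DataFrame and writes it out as CSV. The helper name `rows_to_csv` and the two sample rows are assumptions made for this example; the values are copied from the table above, and `urn` is left out because the table only shows truncated URLs.

```python
import io

import pandas as pd

# Sample dictionaries shaped like the output of parse_record.
# Values are copied from the table above; "urn" is omitted because
# the table only shows truncated URLs.
sample_rows = [
    {
        "idn": "1003104487",
        "titel": "Egmont",
        "jahr": "[1946]",
        "verfasser": "Goethe, Johann Wolfgang von",
        "gnd_id": "(DE-588)118540238",
        "verlag": "Schöningh",
        "verlagsort": "Paderborn",
    },
    {
        "idn": "99962461X",
        "titel": "Heidi",
        "jahr": "1939",
        "verfasser": "Spyri, Johanna",
        "gnd_id": "(DE-588)118616455",
        "verlag": "Rascher",
        "verlagsort": "Zürich",
    },
]


def rows_to_csv(rows, target):
    """Build a DataFrame from a list of dicts and write it as CSV.

    rows_to_csv is a hypothetical helper name. target may be a file
    path or any text buffer accepted by DataFrame.to_csv.
    """
    df = pd.DataFrame(rows)
    df.to_csv(target, index=False)  # index=False omits the row numbers
    return df


# Write to an in-memory buffer here so the sketch runs without touching disk.
buffer = io.StringIO()
df = rows_to_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(df.shape)
print(csv_text)
```

In the notebook itself, the same helper could presumably be fed with `[parse_record(record) for record in records]` and a file name of your choice instead of the in-memory buffer.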
|  | idn | titel | jahr | verfasser | gnd_id | Geburtsdatum | Geburtsort | Beruf oder Beschäftigung | Sterbeort | urn | verlag | verlagsort |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1003104487 | Egmont | 1946.0 | Goethe, Johann Wolfgang von | 118540238 | 1749-08-28 | Frankfurt am Main | Schriftsteller | Weimar | https://nbn-resolving.org/urn:nbn:de:101:2-201... | Schöningh | Paderborn |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Publizist | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Politiker | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Jurist | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Naturwissenschaftler | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 618 | 1000778517 | Die drei gerechten Kammacher | 1903.0 | Keller, Gottfried | 11856109X | 1819-07-19 | Zürich | Schriftsteller | Zürich | https://nbn-resolving.org/urn:nbn:de:101:2-201... | Cotta | Stuttgart |
| 619 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Librettist | NaN | NaN | NaN | NaN |
| 620 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Maler | NaN | NaN | NaN | NaN |
| 621 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Lyriker | NaN | NaN | NaN | NaN |
| 622 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Lyriker | NaN | NaN | NaN | NaN |

623 rows × 12 columns
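
Comparing the two tables, `jahr` changes from values such as `[1946]` to `1946.0`, and `gnd_id` loses its `(DE-588)` prefix. This section does not show how that cleaning was carried out, so the following is only a sketch of an equivalent normalization in plain Python. The helper names `clean_year` and `strip_gnd_prefix` are hypothetical and not part of the tutorial.

```python
import re


def clean_year(raw):
    """Return the first four-digit number in raw as an int, or None.

    Handles bracketed values such as "[1946]" as well as plain "1939".
    Hypothetical helper, not part of the tutorial code.
    """
    match = re.search(r"\d{4}", raw)
    return int(match.group()) if match else None


def strip_gnd_prefix(raw):
    """Remove a leading "(DE-588)" from a GND identifier, if present.

    Hypothetical helper, not part of the tutorial code.
    """
    prefix = "(DE-588)"
    return raw[len(prefix):] if raw.startswith(prefix) else raw


# Sample values copied from the tables above.
for year in ["[1946]", "1939", "unknown"]:
    print(repr(year), "->", clean_year(year))

for gnd in ["(DE-588)118540238", "(DE-588)11855249X", "unknown"]:
    print(repr(gnd), "->", strip_gnd_prefix(gnd))
```

Returning `None` for values without a year keeps the placeholder "unknown" from `parse_record` from being mistaken for a date; identifiers without the prefix are passed through unchanged.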