{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data extraction example\n",
    "\n",
    "This notebook uses a digital collection described in MARCXML files containing descriptive metadata from the [Moving Image Archive](https://data.nls.uk/data/metadata-collections/moving-image-archive/) catalogue of the National Library of Scotland."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Importing the libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://pypi.org/project/pymarc/\n",
    "import pymarc, re, csv\n",
    "import pandas as pd\n",
    "from pymarc import parse_xml_to_array"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generating a CSV output file with the content processed from the original files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# newline='' is what the csv module recommends for files passed to csv.writer\n",
    "with open('registros_marc.csv', 'w', newline='', encoding='utf-8') as csv_fichero:\n",
    "    csv_salida = csv.writer(csv_fichero, delimiter=',', quotechar='\"', quoting=csv.QUOTE_MINIMAL)\n",
    "    csv_salida.writerow(['titulo', 'autor', 'lugar_produccion', 'fecha', 'extension', 'creditos', 'materias', 'resumen', 'detalles', 'enlace'])\n",
    "\n",
    "    # parse_xml_to_array accepts a file path, so no file handle is left open\n",
    "    registros = parse_xml_to_array('Moving-Image-Archive/Moving-Image-Archive-dataset-MARC.xml')\n",
    "\n",
    "    for registro in registros:\n",
    "\n",
    "        titulo = autor = lugar_produccion = fecha = extension = creditos = materias = resumen = detalles = enlace = ''\n",
    "\n",
    "        # The registro['245'] lookups below assume a pymarc version in which a\n",
    "        # missing field or subfield returns None (newer releases may raise KeyError instead).\n",
    "\n",
    "        # title\n",
    "        if registro['245'] is not None:\n",
    "            titulo = registro['245']['a']\n",
    "            if registro['245']['b'] is not None:\n",
    "                titulo = titulo + \" \" + registro['245']['b']\n",
    "\n",
    "        # author: first match among personal and corporate names\n",
    "        if registro['100'] is not None:\n",
    "            autor = registro['100']['a']\n",
    "        elif registro['110'] is not None:\n",
    "            autor = registro['110']['a']\n",
    "        elif registro['700'] is not None:\n",
    "            autor = registro['700']['a']\n",
    "        elif registro['710'] is not None:\n",
    "            autor = registro['710']['a']\n",
    "\n",
    "        # place of production\n",
    "        if registro['264'] is not None:\n",
    "            lugar_produccion = registro['264']['a']\n",
    "\n",
    "        # date, without the trailing full stop\n",
    "        for f in registro.get_fields('264'):\n",
    "            fechas = f.get_subfields('c')\n",
    "            if len(fechas):\n",
    "                fecha = fechas[0]\n",
    "                if fecha.endswith('.'):\n",
    "                    fecha = fecha[:-1]\n",
    "\n",
    "        # physical description: extent ($a) and other details ($b)\n",
    "        for f in registro.get_fields('300'):\n",
    "            extensiones = f.get_subfields('a')\n",
    "            if len(extensiones):\n",
    "                extension = extensiones[0]\n",
    "                # TODO cleaning\n",
    "            detalles_fisicos = f.get_subfields('b')\n",
    "            if len(detalles_fisicos):\n",
    "                detalles = detalles_fisicos[0]\n",
    "\n",
    "        # credits\n",
    "        if registro['508'] is not None:\n",
    "            creditos = registro['508']['a']\n",
    "\n",
    "        # summary\n",
    "        if registro['520'] is not None:\n",
    "            resumen = registro['520']['a']\n",
    "\n",
    "        # subjects: first $a of each 653 field, joined with ' -- '\n",
    "        materias = ' -- '.join(s for f in registro.get_fields('653') for s in f.get_subfields('a')[:1])\n",
    "\n",
    "        # link\n",
    "        if registro['856'] is not None:\n",
    "            enlace = registro['856']['u']\n",
    "\n",
    "        csv_salida.writerow([titulo, autor, lugar_produccion, fecha, extension, creditos, materias, resumen, detalles, enlace])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading the CSV file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the contents of the file into a pandas DataFrame\n",
    "df = pd.read_csv('registros_marc.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying the content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [ "
| | titulo | autor | lugar_produccion | fecha | extension | creditos | materias | resumen | detalles | enlace |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (GLASGOW TRAMS AND BOTANIC GARDENS). | RUSSELL, Stanley Livingstone | [Place of production not identified] : | 1950.0 | (2.00 mins) : | Director, [filmed by Stanley L. Russell, Thame... | Bus Stations and Depots -- Buses and Coaches, ... | The Botanic Gardens, Glasgow with shots of the... | mute, colour | http://movingimage.nls.uk/film/0001 |
| 1 | (LAST DAY OF THE TRAMS, GLASGOW). | NaN | [Place of production not identified] : | 1962.0 | (28.00 mins) : | Director, [filmed by SAAC]. | Transport -- Glasgow -- documentary -- amateur | Footage of the last trams to run in Glasgow, a... | silent, colour | http://movingimage.nls.uk/film/0002 |
| 2 | INTO THE MISTS. | NaN | [Place of production not identified] : | 1956.0 | (10.04 mins) : | Director, [filmed by W.S. Dobson]. | Ceremonies -- Emotions, Attitudes and Behaviou... | The story of the last Edinburgh tram. Shots o... | silent, colour | http://movingimage.nls.uk/film/0004 |
| 3 | PASSING OF THE TRAMCAR, the. | NaN | [Place of production not identified] : | 1962.0 | (63.36 mins) : | NaN | Ceremonies -- Transport -- Glasgow | Footage of the last tram to run in Glasgow. Th... | silent, colour | http://movingimage.nls.uk/film/0005 |
| 4 | SCOTS OF TOMORROW. | Campbell Harper Productions | [Place of production not identified] : | 1959.0 | (13.00 mins) : | Producer, Campbell Harper Films Ltd.. | Art and Artists, general -- Education -- edu... | Scottish school pupils studying scientific and... | sound, black and white | http://movingimage.nls.uk/film/0007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6012 | CITY OF BIRMINGHAM . | NaN | [Place of production not identified] : | 1948.0 | (6.11 mins) : | NaN | Ceremonies -- Construction and Engineering -- ... | Built and engined by John Brown & Co. Ltd. S... | silent, colour | http://movingimage.nls.uk/film/UCS0195 |
| 6013 | BUILDING THE BIG DREDGE - STAGE 1. | NaN | [Place of production not identified] : | 1964.0 | (8min20sec) : | Producer, Stephen Group Film Unit. | Construction and Engineering -- Ships and Ship... | Shots of Indonesian Sea Dredge No. 1, under co... | silent, colour | http://movingimage.nls.uk/film/UCS0204 |
| 6014 | ALEXANDER STEPHEN'S YARD. | NaN | [Place of production not identified] : | 1964.0 | (11.57 mins) : | Producer, . | Employment, Industry and Industrial Relations ... | Shots of the Alexander Stephen's yard, and the... | silent, colour | http://movingimage.nls.uk/film/UCS0207 |
| 6015 | QUEEN ELIZABETH Ship No. 552. | NaN | [Place of production not identified] : | 1940.0 | (5min24sec) : | NaN | Employment, Industry and Industrial Relations ... | Built and engineered by John Brown & Co. Ltd. ... | silent, black and white | http://movingimage.nls.uk/film/UCS0213 |
| 6016 | RUAHINE. | NaN | [Place of production not identified] : | 1951.0 | (12.26 mins) : | NaN | Carriages -- Ceremonies -- Ships and Shipping ... | Footage of "Ruahine" ship being launched and t... | silent, black and white/colour | http://movingimage.nls.uk/film/UCS0214 |

6017 rows × 10 columns
| | materia | contador |
|---|---|---|
| 0 | amateur | 2023 |
| 1 | Leisure and Recreation | 813 |
| 2 | Glasgow | 797 |
| 3 | documentary | 707 |
| 4 | Transport | 674 |
| 5 | Employment, Industry and Industrial Relations | 632 |
| 6 | television news | 542 |
| 7 | Edinburgh | 538 |
| 8 | Sporting Activities | 525 |
| 9 | Celebrations, Traditions and Customs | 453 |
| 10 | Ships and Shipping | 444 |
| 11 | local topical | 411 |
| 12 | Children and Infants | 407 |
| 13 | Media, Communication and the Creative Industries | 399 |
| 14 | educational | 379 |
| 15 | Ceremonies | 359 |
| 16 | Education | 356 |
| 17 | Arts and Crafts | 353 |
| 18 | Tourism and Travel | 352 |
| 19 | Construction and Engineering | 303 |
| 20 | Agriculture | 299 |
| 21 | Fish and Fishing | 272 |
| 22 | sponsored | 270 |
| 23 | Emotions, Attitudes and Behaviour | 267 |
| 24 | promotional | 264 |
| 25 | Food and Drink | 262 |
| 26 | newsreel | 256 |
| 27 | Landscapes and Seascapes | 245 |
| 28 | Art and Artists, general | 237 |
| 29 | television documentary | 224 |
| 30 | Lanarkshire | 222 |
| 31 | Ayrshire | 219 |
| 32 | Fife | 218 |
| 33 | Aberdeen | 214 |
| 34 | home movies and videos | 203 |
| 35 | Military, the | 198 |
| 36 | War | 197 |
| 37 | Animals | 192 |
| 38 | Renfrewshire | 188 |
| 39 | Home Life | 183 |
| 40 | Politics | 182 |
| 41 | Power Resources | 180 |
| 42 | Environment | 180 |
| 43 | Science and Technology | 180 |
| 44 | Water and Waterways | 176 |
| 45 | Argyllshire | 171 |
| 46 | Architecture and Buildings | 167 |
| 47 | Religion | 167 |
| 48 | Forth River | 166 |
| 49 | Dunbartonshire | 166 |
| 50 | Aberdeenshire | 166 |
| 51 | Perth | 161 |
| 52 | Healthcare | 159 |
| 53 | women film makers | 158 |
| 54 | Birds | 148 |
| 55 | advertising | 134 |
| 56 | Fishing Boats | 128 |
| 57 | Housing and Living Conditions | 127 |
| 58 | comedy | 125 |
| 59 | Highlands, the | 124 |
| 60 | Royalty | 122 |
| 61 | West Lothian | 119 |
| 62 | Buses and Coaches, general | 118 |
| 63 | animation | 118 |
| 64 | Dundee | 117 |
| 65 | industrial | 114 |
| 66 | Invernesshire | 111 |
| 67 | Dumfriesshire | 107 |
| 68 | Music | 105 |
| 69 | Borders | 104 |
| 70 | Carriages | 103 |
| 71 | Inner Hebrides | 102 |
| 72 | Outer Hebrides | 101 |
| 73 | Ferries | 93 |
| 74 | Stirlingshire | 91 |
| 75 | Orkney Islands | 90 |
| 76 | technical | 88 |
| 77 | Local Government | 87 |
| 78 | sports | 78 |
| 79 | Shetland Islands | 76 |
| 80 | experimental | 72 |
| 81 | Crime, Punishment and Law Enforcement | 67 |
| 82 | East Lothian | 65 |
| 83 | travelogue | 62 |
| 84 | British Empire, the | 61 |
| 85 | Bute | 61 |
| 86 | Institutional Care | 59 |
| 87 | Ross-shire | 58 |
| 88 | Paddle Steamers | 58 |
| 89 | instructional | 57 |
| 90 | Stirling | 57 |
| 91 | Midlothian | 55 |
| 92 | Roxburghshire | 54 |
| 93 | Celts and Celtic Culture | 53 |
| 94 | Morayshire | 47 |
| 95 | Berwickshire | 47 |
| 96 | Peat and Peat Cutting | 47 |
| 97 | Caithness | 47 |
| 98 | Angus | 41 |
| 99 | Selkirkshire | 41 |
| 100 | Spinning | 40 |
| 101 | music | 40 |
| 102 | propaganda | 37 |
| 103 | television sport | 37 |
| 104 | Highland Games | 37 |
| 105 | Camping | 36 |
| 106 | Fish Markets | 35 |
| 107 | Aircraft see also Helicopters | 34 |
| 108 | biographical | 33 |
| 109 | Cafeterias and Canteens | 33 |
| 110 | Canals | 32 |
| 111 | Banff | 31 |
| 112 | Riding of the Marches | 31 |
| 113 | Fish Gutting | 31 |
| 114 | Christmas see also New Year | 30 |
| 115 | television arts | 30 |
| 116 | Sutherland | 30 |
| 117 | Bus Stations and Depots | 29 |
| 118 | television educational | 28 |
| 119 | Restaurants | 28 |
| 120 | Fishwives | 27 |
| 121 | religion | 25 |
| 122 | Wigtownshire | 24 |
| 123 | music video | 23 |
| 124 | children's | 20 |
| 125 | Canoeing | 19 |
| 126 | romance | 19 |
| 127 | Air displays and shows | 19 |
| 128 | medical | 18 |
| 129 | Peebles- shire | 18 |
| 130 | Gorbals, the | 18 |
| 131 | Butchers and Butcher Shops | 18 |
| 132 | Disillusionment | 18 |
| 133 | Reservoirs | 16 |
| 134 | Lobster Fishing | 16 |
| 135 | television entertainment | 16 |
| 136 | Airports | 16 |
| 137 | scientific | 15 |
| 138 | fantasy | 15 |
| 139 | dance | 15 |
| 140 | Special Needs Education | 14 |
| 141 | Bulldozers | 13 |
| 142 | public information | 13 |
| 143 | historical | 13 |
| 144 | Kincardineshire | 13 |
| 145 | Kirkudbrightshire | 12 |
| 146 | Lifeboats | 12 |
| 147 | crime | 12 |
| 148 | ethnographic | 12 |
| 149 | cine mag | 11 |
| 150 | Loch Ness Monster, the | 11 |
| 151 | Holiday Camps | 11 |
| 152 | Rodents | 11 |
| 153 | Home Guard | 10 |
| 154 | Clackmannanshire | 10 |
| 155 | training | 10 |
| 156 | horror | 9 |
| 157 | Residential Homes for the Elderly | 8 |
| 158 | science fiction | 8 |
| 159 | Dentistry | 8 |
| 160 | Nairn | 8 |
| 161 | Fire Service | 7 |
| 162 | Music Hall | 7 |
| 163 | parody | 7 |
| 164 | Revenge | 5 |
| 165 | Depression, the | 5 |
| 166 | Air Raids | 5 |
| 167 | Reptiles | 4 |
| 168 | Kinrosshire | 4 |
| 169 | Spring | 4 |
| 170 | Broadcasting, general | 3 |
| 171 | Hogmanay | 3 |
| 172 | Buddhism | 2 |
| 173 | Cheese and Cheese Making | 2 |
| 174 | War Crimes | 1 |
| 175 | Easter | 1 |
| 176 | Stained Glass | 1 |