{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Biblioteca Virtual Miguel de Cervantes LOD & the journal Doxa\n", "\n", "This notebook uses the Linked Open Data repository of the Biblioteca Virtual Miguel de Cervantes.\n", "\n", "This example is based on the journal [*Doxa. Cuadernos de Filosofía del Derecho*](http://data.cervantesvirtual.com/manifestation/237680) that is a periodical publication issued every year since 1984 to promote the interchange between philosophers of law from Latin America and Latin Europe. The information regarding this publication has been published as LOD in the repository, including metadata and text, and is accessible by means of the [SPARQL](http://data.cervantesvirtual.com/sparql) endpoint.\n", "\n", "As an introduction, we provide this [notebook](introduction_to_text_analysis.ipynb) to introduce the concepts that we use in this example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting things up" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "import pandas as pd\n", "import pickle\n", "import re\n", "import os\n", "from pathlib import Path\n", "import requests\n", "from collections import Counter\n", "import matplotlib.pyplot as plt\n", "from numpy import mean, ones\n", "from scipy.sparse import csr_matrix\n", "from nltk.corpus import stopwords\n", "import nltk\n", "nltk.download('stopwords')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Relationships between the resources are described in RDA. Manifestations representing journals, volumes and articles are linked by means of the property *rdam:wholePartManifestationRelationship*\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The journal *Doxa. Cuadernos de Filosofía del Derecho* is a periodical publication issued every year since 1984 to promote the interchange between philosophers of law from Latin America and Latin Europe. \n", "\n", "The information regarding this publication has been published as LOD in the repository, including metadata and text, and is accessible by means of the SPARQL endpoint." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's retrieve the results!\n", "\n", "We will create a CSV file containing the results. By using the instruction *VALUES* we can configure the SPARQL query to filter the results using particular years, as is shown below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = 'http://data.cervantesvirtual.com/bvmc-lod/repositories/data'\n", "query = \"\"\"\n", "PREFIX owl: \n", "PREFIX rdf: \n", "PREFIX rdaa: \n", "PREFIX rdfs: \n", "PREFIX rdam: \n", "PREFIX rda: \n", "PREFIX rdai: \n", "\n", "select ?num ?numTitle ?article ?articleTitle ?date ?noteEdition ?carrierCharacteristic ?pdf\n", "where {\n", " VALUES ?date { \n", " }\n", " ?num rdam:wholePartManifestationRelationship .\n", " ?num rdam:title ?numTitle .\n", " ?num rdam:dateOfPublication ?date .\n", " ?article rdam:wholePartManifestationRelationship ?num .\n", " ?article rdam:title ?articleTitle .\n", " ?article rdam:exemplarOfManifestation ?item .\n", " ?article rdam:noteOnEditionStatement ?noteEdition .\n", " ?item rdai:identifierForTheItem ?pdf .\n", " ?item rdai:itemSpecificCarrierCharacteristic ?carrierCharacteristic;\n", "}\n", "\"\"\"\n", "r = requests.get(url, params = {'format': 'text/plain', 'query': query})\n", "\n", "# save the result\n", "f = open(\"results-doxa-dates.csv\", \"w\")\n", "f.write(r.text)\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also retrieve all the articles without specifying any date." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = 'http://data.cervantesvirtual.com/bvmc-lod/repositories/data'\n", "query = \"\"\"\n", "PREFIX owl: \n", "PREFIX rdf: \n", "PREFIX rdaa: \n", "PREFIX rdfs: \n", "PREFIX rdam: \n", "PREFIX rda: \n", "PREFIX rdai: \n", "\n", "select ?num ?numTitle ?article ?articleTitle ?date ?noteEdition ?carrierCharacteristic ?pdf\n", "where {\n", " ?num rdam:wholePartManifestationRelationship .\n", " ?num rdam:title ?numTitle .\n", " ?num rdam:dateOfPublication ?date .\n", " ?article rdam:wholePartManifestationRelationship ?num .\n", " ?article rdam:title ?articleTitle .\n", " ?article rdam:exemplarOfManifestation ?item .\n", " ?article rdam:noteOnEditionStatement ?noteEdition .\n", " ?item rdai:identifierForTheItem ?pdf .\n", " ?item rdai:itemSpecificCarrierCharacteristic ?carrierCharacteristic;\n", "}\n", "\"\"\"\n", "r = requests.get(url, params = {'format': 'text/plain', 'query': query})\n", "\n", "# save the result\n", "f = open(\"results-doxa.csv\", \"w\")\n", "f.write(r.text)\n", "f.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading the data \n", "\n", "This puts the data in a Pandas DataFrame" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "df = pd.read_csv('results-doxa-dates.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrieving PDF files from Biblioteca Virtual Miguel de Cervantes\n", "\n", "**Note:** This step may take a while to process due to the size of the PDF files." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for index, row in df.iterrows():\n", " print(index, row['pdf'])\n", " response = requests.get(row['pdf'])\n", " filename = Path('doxa/{}.pdf'.format(row['pdf'].replace('http://www.cervantesvirtual.com/descargaPdf/','').replace('/', '')))\n", " filename.write_bytes(response.content) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting text from the pdf file\n", "\n", "We use the library [Tika](https://pypi.org/project/tika/) to extract the text from the PDF files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tika import parser\n", "\n", "raw = parser.from_file('doxa/agustin-squella-valparaiso.pdf')\n", "print(raw['content'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading the text files and extracting the text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for index,row in df.iterrows():\n", " \n", " file = 'doxa/{}.pdf'.format(row['pdf'].replace('http://www.cervantesvirtual.com/descargaPdf/','').replace('/', ''));\n", " print(file)\n", " raw = parser.from_file(file)\n", "\n", " df.loc[index, 'original_text'] = raw['content'].replace('\\n','')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting the years from the LOD URLs\n", "\n", "Dates are defined in the LOD repository using URLs such as http://data.cervantesvirtual.com/date/2000. Let's extract the year." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for index,row in df.iterrows():\n", " \n", " try:\n", " df.loc[index, 'year'] = int(row['date'].replace('http://data.cervantesvirtual.com/date/','').replace('/', ''))\n", " except:\n", " #print(\"An exception occurred\", sys.exc_info()[0]) \n", " df.loc[index, 'year'] = ''" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create an auxiliar class to store the terms and the codes\n", "\n", "A minimal perfect hash is a birectional mapping between objects and consecutive integers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class MPHash(object):\n", " # create from iterable \n", " def __init__(self, terms):\n", " self.term = list(terms)\n", " self.code = {t:n for n, t in enumerate(self.term)}\n", " \n", " def __len__(self):\n", " return len(self.term)\n", " \n", " def get_code(self, term):\n", " return self.code.get(term)\n", " \n", " def get_term(self, code):\n", " return self.term[code]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting emergent topics\n", "\n", "This class recibes the texts to extract the emergent topics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A sample is a collection of texts and publication dates \n", "# For each text, the sample stores its year and word counts. 
\n", "class Sample(object):\n", " pattern = pattern = r\"(?:\\w+[-])*\\w*[^\\W\\d_]\\w*(?:[-'’`]\\w+)*\"\n", " # Create Sample from data stored in a DataFrame with at least columns \n", " # TEXT, YEAR\n", " # n = maximal ngram size \n", " def __init__(self, data, ngram_length):\n", " self.size = len(data)\n", " self.year = data.year.tolist()\n", " \n", " texts = tuple(data.original_text)\n", " vectorizer = CountVectorizer(token_pattern = Sample.pattern, \n", " #stop_words='spanish',\n", " stop_words=stopwords.words('spanish'),\n", " max_df=0.1,\n", " ngram_range=(1, ngram_length))\n", " matrix = vectorizer.fit_transform(texts).transpose() \n", " # remove all hapax legomena to save space\n", " terms = vectorizer.get_feature_names()\n", " frequencies = matrix.sum(axis=1).A1\n", " selected = [m for m, f in enumerate(frequencies) if f > 1]\n", " hapax_rate = 1 - len(selected) / len(frequencies)\n", " print('Removing hapax legomena ({:.1f}%)'.format(100 * hapax_rate))\n", " self.matrix = matrix[selected, :] \n", " self.term_codes = MPHash([terms[m] for m in selected])\n", " \n", " # store array with global term frequencies\n", " self.term_frequencies = self.matrix.sum(axis=1).A1\n", " # store doc frequencies\n", " self.doc_frequencies = self.matrix.getnnz(axis=1)\n", " # store most common capitalization of terms\n", " print('Obtaining most common capitalizations')\n", " vectorizer.lowercase = False\n", " matrix = vectorizer.fit_transform(texts).transpose()\n", " terms = vectorizer.get_feature_names()\n", " frequencies = matrix.sum(axis=1).A1 \n", " forms = dict()\n", " for t, f in zip(terms, frequencies):\n", " low = t.lower()\n", " if forms.get(low, (None, 0))[1] < f:\n", " forms[low] = (t, f)\n", " self.capitals = {k:v[0] for k, v in forms.items()}\n", " \n", " print('Computed stats for', len(self.term_codes), 'terms')\n", " \n", " # return the number of texts stored in this Sample\n", " def __len__(self):\n", " return self.size\n", " \n", " # return term frequency of the specified term\n", " def get_tf(self, term):\n", " code = self.term_codes.get_code(term.lower())\n", " \n", " return self.term_frequencies[code]\n", " \n", " # return document frequency of the specified term\n", " def get_df(self, term):\n", " code = self.term_codes.get_code(term.lower())\n", " \n", " return self.doc_frequencies[code]\n", " \n", " # return the most frequent capitalization form\n", " # (also for stopwords not in dictionary)\n", " def most_frequent_capitalization(self, term):\n", " return self.capitals.get(term.lower(), term)\n", " \n", " # return the average submission year of texts containing every term\n", " def average_year(self, period, tf_threshold=20, df_threshold=3):\n", " docs = [n for n, y in enumerate(self.year)\\\n", " if period[0] <= y <= period[1]]\n", " tf_matrix = self.matrix[:, docs]\n", " tf_sum = tf_matrix.sum(axis=1).A1\n", " df_sum = tf_matrix.getnnz(axis=1)\n", " terms = [m for m, tf in enumerate(tf_sum)\\\n", " if tf >= tf_threshold and df_sum[m] >= df_threshold]\n", " tf_matrix = tf_matrix[terms, :] \n", " rows, cols = tf_matrix.nonzero()\n", " df_matrix = csr_matrix((ones(len(rows)), (rows, cols)))\n", " year = [self.year[n] for n in docs]\n", " \n", " res = df_matrix @ year / df_matrix.getnnz(axis=1) # @ operator = matrix multiplication\n", " \n", " return {self.term_codes.get_term(terms[m]):res[m] for m in range(len(res))}\n", "\n", " \n", " # return the number of occurrences (doc frequency) for every term \n", " def get_df_per_year(self, term):\n", " m = 
"        row = self.matrix.getrow(m)\n", "        _, docs = row.nonzero()\n", "        c = Counter(map(self.year.__getitem__, docs))\n", "\n", "        return c\n", "\n", "    # return, for every term, the number of occurrences (term frequency) per year\n", "    def tf_per_year(self, period=None):\n", "        rows, cols = self.matrix.nonzero()\n", "        res = {m: Counter() for m in rows}\n", "        for m, n in zip(rows, cols):\n", "            year = self.year[n]\n", "            if period is None or period[0] <= year <= period[1]:\n", "                res[m][year] += self.matrix[m, n]\n", "\n", "        return res\n", "\n", "    def plot_tf_series(self, term, period, relative=False):\n", "        m = self.term_codes.get_code(term)\n", "        if relative:\n", "            norm = Counter(self.year)\n", "        else:\n", "            norm = Counter(set(self.year))\n", "\n", "        if m is not None:\n", "            row = self.matrix.getrow(m)\n", "            _, cols = row.nonzero()\n", "            c = Counter()\n", "            for n in cols:\n", "                year = self.year[n]\n", "                if period is None or period[0] <= year <= period[1]:\n", "                    c[year] += row[0, n]\n", "\n", "            X = sorted(c.keys())\n", "            Y = [c[x] / norm[x] for x in X]\n", "            plt.plot(X, Y, 'o-')\n", "            plt.ylim(0, 1.2 * max(Y))\n", "            plt.title(term)\n", "        else:\n", "            raise ValueError('{} is not in store'.format(term))\n", "\n", "    # return a dictionary with a list of text-years per term\n", "    # period = pair of years (min_year, max_year), inclusive\n", "    # keep_all = True if unlisted texts are not ignored\n", "    def document_years(self, period=None, keep_all=True):\n", "        rows, cols = self.matrix.nonzero()\n", "        res = {m: list() for m in rows}\n", "        for m, n in zip(rows, cols):\n", "            # per-text 'listed' flags are not stored for this corpus, so keep_all=True retains every text\n", "            if keep_all:\n", "                year = self.year[n]\n", "                if period is None or period[0] <= year <= period[1]:\n", "                    res[m].append(year)\n", "\n", "        return res\n", "\n", "    # return a dictionary with a Counter of document-years per term\n", "    def df_per_year(self, period=None, keep_all=True):\n", "        doc_years = self.document_years(period, keep_all)\n", "\n", "        return {m: Counter(v) for m, v in doc_years.items()}\n", "\n", "    # create and save a plot with the document frequency of each term\n", "    def plot_df(self, terms, period, keep_all=True):\n", "        dfs = self.df_per_year(period, keep_all)\n", "        for term in terms:\n", "            m = self.term_codes.get_code(term.lower())\n", "            df = dfs[m]\n", "            X = range(period[0], period[1] + 1)\n", "            Y = [df.get(x, 0) for x in X]\n", "            plt.clf()\n", "            plt.plot(X, Y)\n", "            plt.title(term)\n", "            filename = 'plots/{}.png'.format(term)\n", "            print('Saving', filename)\n", "            plt.savefig(filename, dpi=200)\n", "\n", "    # compute the average age in the specified period of documents containing\n", "    # each term with global term frequency above tf_threshold\n", "    # and annual document frequency above df_threshold (one year at least)\n", "    # period = optional pair of years (min_year, max_year), inclusive\n", "    def get_ages(self, period=None,\n", "                 tf_threshold=20, df_threshold=3, keep_all=True):\n", "        res = dict()\n", "        doc_years = self.document_years(period, keep_all)\n", "        for m, values in doc_years.items():\n", "            term = self.term_codes.get_term(m)\n", "            if len(values) > 0:\n", "                df = Counter(values).most_common(1)[0][1]\n", "                tf = self.term_frequencies[m]\n", "                if df >= df_threshold and tf >= tf_threshold:\n", "                    res[term] = mean(values)\n", "        return res\n", "\n", "    # return the document numbers containing any term in this set of terms\n", "    def docs_with_term(self, terms, period=None):\n", "        rows, cols = self.matrix.nonzero()\n", "        res = set()\n", "        for m, n in zip(rows, cols):\n", "            term = self.term_codes.get_term(m)\n",
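"            # keep the document index when the term is one of the requested terms and its year falls inside the period\n",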
"            if terms is None or term in terms:\n", "                year = self.year[n]\n", "                if period is None or period[0] <= year <= period[1]:\n", "                    res.add(n)\n", "\n", "        return res\n", "\n", "    # return the publication years of the documents containing the given term\n", "    def search(self, term):\n", "        m = self.term_codes.get_code(term)\n", "        docs = self.matrix.getrow(m).nonzero()[1]\n", "\n", "        return [self.year[n] for n in docs]" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Using the Sample class to extract the emergent topics" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = df" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = data[data.original_text.str.len() > 40]\n", "\n", "print('Processing', len(data), 'texts')\n", "\n", "s = Sample(data, 2)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Saving the pickle object\n", "\n", "The [pickle](https://docs.python.org/3/library/pickle.html) module implements binary protocols for serializing and de-serializing a Python object structure." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('sample-doxa.pkl', 'wb') as f:\n", "    pickle.dump(s, f)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('sample-doxa.pkl', 'rb') as f:\n", "    s = pickle.load(f)\n", "print('Loaded stats for', len(s), 'texts')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Setting a period" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "period = (2010, 2018)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ages = s.get_ages(period)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "top = pd.DataFrame.from_dict(ages, orient='index').reset_index()\n", "print(top)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "top.columns = ['TERM', 'AGE']\n", "top = top.sort_values('AGE', ascending=False)  # optionally add .head(250) to keep only the top 250 terms\n", "top['DOC FREQ'] = top.TERM.apply(s.get_df)\n", "top['TERM FREQ'] = top.TERM.apply(s.get_tf)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# prepare to export\n", "top['TERM'] = top.TERM.apply(s.most_frequent_capitalization)\n", "print(top.set_index('TERM').head())" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make sure the output folder exists\n", "os.makedirs('output', exist_ok=True)\n", "\n", "ts = pd.Timestamp.now().strftime(\"%Y-%m-%d_%H.%M\")\n", "filename = 'output/vocabulary_{}.xlsx'.format(ts)\n", "with pd.ExcelWriter(filename) as writer:\n", "    top.set_index('TERM').to_excel(writer, sheet_name='terms')\n", "\n", "print('vocabulary saved to', filename)" ] }
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 4 }