## Biblioteca Virtual Miguel de Cervantes LOD & the journal Doxa

This notebook uses the Linked Open Data repository of the Biblioteca Virtual Miguel de Cervantes.

This example is based on the journal [*Doxa. Cuadernos de Filosofía del Derecho*](http://data.cervantesvirtual.com/manifestation/237680), a periodical published every year since 1984 to promote exchange between philosophers of law from Latin America and Latin Europe. Its metadata and text have been published as LOD in the repository and are accessible through the [SPARQL](http://data.cervantesvirtual.com/sparql) endpoint.

For an introduction to the concepts used in this example, see this [notebook](introduction_to_text_analysis.ipynb).

### Setting things up

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import pickle
import re
import os
from pathlib import Path
import requests
from collections import Counter
import matplotlib.pyplot as plt
from numpy import mean, ones
from scipy.sparse import csr_matrix
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

Relationships between the resources are described using RDA. Manifestations representing journals, volumes and articles are linked by means of the property *rdam:wholePartManifestationRelationship*.



### Let's retrieve the results!

We will create a CSV file containing the results. With the *VALUES* clause we can restrict the SPARQL query to particular years, as shown below.

In [None]:
url = 'http://data.cervantesvirtual.com/bvmc-lod/repositories/data'
query = """
PREFIX owl: 
PREFIX rdf: 
PREFIX rdaa: 
PREFIX rdfs: 
PREFIX rdam: 
PREFIX rda: 
PREFIX rdai: 

select ?num ?numTitle ?article ?articleTitle ?date ?noteEdition ?carrierCharacteristic ?pdf
where {
 VALUES ?date { 
 }
 ?num rdam:wholePartManifestationRelationship .
 ?num rdam:title ?numTitle .
 ?num rdam:dateOfPublication ?date .
 ?article rdam:wholePartManifestationRelationship ?num .
 ?article rdam:title ?articleTitle .
 ?article rdam:exemplarOfManifestation ?item .
 ?article rdam:noteOnEditionStatement ?noteEdition .
 ?item rdai:identifierForTheItem ?pdf .
 ?item rdai:itemSpecificCarrierCharacteristic ?carrierCharacteristic;
}
"""
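r = requests.get(url, params={'format': 'text/plain', 'query': query})
# stop here if the endpoint reports an error
r.raise_for_status()

# save the result; the context manager closes the file even if writing fails
with open("results-doxa-dates.csv", "w", encoding="utf-8") as f:
    f.write(r.text)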
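
The `VALUES` block in the query above takes one date IRI per year. The following is a minimal sketch of how such a block could be generated, assuming that dates follow the `http://data.cervantesvirtual.com/date/<year>` pattern used by the repository. The helper name `build_values_clause` is hypothetical and is not part of the original notebook.

```python
def build_values_clause(years, variable="date"):
    """Return a SPARQL VALUES block binding ?variable to one date IRI per year."""
    # assumed IRI pattern for dates in the LOD repository
    base = "http://data.cervantesvirtual.com/date/"
    iris = " ".join("<{}{}>".format(base, int(y)) for y in years)
    return "VALUES ?{} {{ {} }}".format(variable, iris)


print(build_values_clause([2017, 2018]))
```

The resulting string can be pasted into the `where` clause of the query, or interpolated into it before sending the request.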

We can also retrieve all the articles without specifying any date.

In [None]:
url = 'http://data.cervantesvirtual.com/bvmc-lod/repositories/data'
query = """
PREFIX owl: 
PREFIX rdf: 
PREFIX rdaa: 
PREFIX rdfs: 
PREFIX rdam: 
PREFIX rda: 
PREFIX rdai: 

select ?num ?numTitle ?article ?articleTitle ?date ?noteEdition ?carrierCharacteristic ?pdf
where {
 ?num rdam:wholePartManifestationRelationship .
 ?num rdam:title ?numTitle .
 ?num rdam:dateOfPublication ?date .
 ?article rdam:wholePartManifestationRelationship ?num .
 ?article rdam:title ?articleTitle .
 ?article rdam:exemplarOfManifestation ?item .
 ?article rdam:noteOnEditionStatement ?noteEdition .
 ?item rdai:identifierForTheItem ?pdf .
 ?item rdai:itemSpecificCarrierCharacteristic ?carrierCharacteristic;
}
"""
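r = requests.get(url, params={'format': 'text/plain', 'query': query})
# stop here if the endpoint reports an error
r.raise_for_status()

# save the result; the context manager closes the file even if writing fails
with open("results-doxa.csv", "w", encoding="utf-8") as f:
    f.write(r.text)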

### Loading the data 

This loads the date-filtered results into a pandas DataFrame.

In [None]:

df = pd.read_csv('results-doxa-dates.csv')

In [None]:
df

## Retrieving PDF files from Biblioteca Virtual Miguel de Cervantes

**Note:** This step may take a while to process due to the size of the PDF files.

In [None]:
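# make sure the target folder exists before downloading
os.makedirs('doxa', exist_ok=True)

for index, row in df.iterrows():
    print(index, row['pdf'])
    response = requests.get(row['pdf'])
    # name the local file after the identifier at the end of the download URL
    filename = Path('doxa/{}.pdf'.format(row['pdf'].replace('http://www.cervantesvirtual.com/descargaPdf/','').replace('/', '')))
    filename.write_bytes(response.content)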
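
The local file name is derived from the download URL by dropping the common prefix and any slashes. The sketch below isolates that step in a small function so it can be checked on its own. The function name `pdf_path` and the example URL (built from the file name used in the Tika example) are assumptions for illustration.

```python
from pathlib import Path

# common prefix of the download URLs used in this notebook
PREFIX = 'http://www.cervantesvirtual.com/descargaPdf/'


def pdf_path(url, folder='doxa'):
    """Map a download URL to the local path where the PDF is stored."""
    slug = url.replace(PREFIX, '').replace('/', '')
    return Path(folder) / '{}.pdf'.format(slug)


# hypothetical URL, used only to show the mapping
print(pdf_path(PREFIX + 'agustin-squella-valparaiso/'))
```

Keeping the mapping in one place means the download loop and the text-extraction loop cannot drift apart.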

## Extracting text from the pdf file

We use the library [Tika](https://pypi.org/project/tika/) to extract the text from the PDF files.

In [None]:
from tika import parser

raw = parser.from_file('doxa/agustin-squella-valparaiso.pdf')
print(raw['content'])

## Reading the PDF files and extracting the text

In [None]:
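for index, row in df.iterrows():
    # rebuild the local path used when the PDF was downloaded
    file = 'doxa/{}.pdf'.format(row['pdf'].replace('http://www.cervantesvirtual.com/descargaPdf/','').replace('/', ''))
    print(file)
    raw = parser.from_file(file)

    # the content may be None when no text could be extracted
    content = raw['content'] or ''
    df.loc[index, 'original_text'] = content.replace('\n','')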

In [None]:
df.head(3)

## Extracting the years from the LOD URLs

Dates are defined in the LOD repository using URLs such as http://data.cervantesvirtual.com/date/2000. Let's extract the year.

In [None]:
for index, row in df.iterrows():
    try:
        df.loc[index, 'year'] = int(row['date'].replace('http://data.cervantesvirtual.com/date/','').replace('/', ''))
    except (AttributeError, ValueError):
        # the date is missing or is not a plain year: leave the year empty
        # (an empty string would break the numeric comparisons made later)
        df.loc[index, 'year'] = None
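
As an alternative illustration of the same step, the year can be captured with a regular expression. This is only a sketch: the function name `year_from_date_url` is hypothetical, and it assumes that years always have four digits.

```python
import re


def year_from_date_url(url):
    """Return the year at the end of a date URL, or None if there is none."""
    match = re.search(r'/date/(\d{4})/?$', str(url))
    return int(match.group(1)) if match else None


print(year_from_date_url('http://data.cervantesvirtual.com/date/2000'))
```

Returning `None` for unexpected values keeps the column free of strings, so later comparisons against a period of years do not fail.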

In [None]:
df.head(3)

We create an auxiliary class to store the terms and their codes.

A minimal perfect hash is a bidirectional mapping between objects and consecutive integers.

In [None]:
class MPHash(object):
    # create from iterable
    def __init__(self, terms):
        self.term = list(terms)
        self.code = {t:n for n, t in enumerate(self.term)}

    def __len__(self):
        return len(self.term)

    # return the integer code of a term (None if the term is unknown)
    def get_code(self, term):
        return self.code.get(term)

    # return the term stored under an integer code
    def get_term(self, code):
        return self.term[code]
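
The bidirectional mapping can be illustrated with a plain list and dictionary, which are the two structures the class wraps. The sample terms below are made up for the example.

```python
# made-up sample terms
terms = ['derecho', 'justicia', 'norma']

# code -> term: the position in the list is the code
term_of = list(terms)
# term -> code: the inverse lookup
code_of = {t: n for n, t in enumerate(term_of)}

code = code_of['justicia']
print(code, term_of[code])

# unknown terms have no code
print(code_of.get('ley'))
```

Because codes are consecutive integers starting at 0, they can be used directly as row indices of the term-document matrix.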

## Extracting emergent topics

This class receives the texts and extracts the emergent topics.

In [None]:
# A sample is a collection of texts and publication dates
# For each text, the sample stores its year and word counts.
class Sample(object):
    # a token must contain at least one letter; hyphens and apostrophes may join its parts
    pattern = r"(?:\w+[-])*\w*[^\W\d_]\w*(?:[-'’`]\w+)*"

    # Create Sample from data stored in a DataFrame with at least the columns
    # original_text, year
    # ngram_length = maximal ngram size
    def __init__(self, data, ngram_length):
        self.size = len(data)
        self.year = data.year.tolist()

        texts = tuple(data.original_text)
        vectorizer = CountVectorizer(token_pattern=Sample.pattern,
                                     stop_words=stopwords.words('spanish'),
                                     max_df=0.1,
                                     ngram_range=(1, ngram_length))
        # rows = terms, columns = texts
        matrix = vectorizer.fit_transform(texts).transpose()
        # remove all hapax legomena to save space
        # (get_feature_names_out replaces get_feature_names in scikit-learn >= 1.0)
        terms = vectorizer.get_feature_names_out()
        frequencies = matrix.sum(axis=1).A1
        selected = [m for m, f in enumerate(frequencies) if f > 1]
        hapax_rate = 1 - len(selected) / len(frequencies)
        print('Removing hapax legomena ({:.1f}%)'.format(100 * hapax_rate))
        self.matrix = matrix[selected, :]
        self.term_codes = MPHash([terms[m] for m in selected])

        # store array with global term frequencies
        self.term_frequencies = self.matrix.sum(axis=1).A1
        # store doc frequencies
        self.doc_frequencies = self.matrix.getnnz(axis=1)
        # store most common capitalization of terms
        print('Obtaining most common capitalizations')
        vectorizer.lowercase = False
        matrix = vectorizer.fit_transform(texts).transpose()
        terms = vectorizer.get_feature_names_out()
        frequencies = matrix.sum(axis=1).A1
        forms = dict()
        for t, f in zip(terms, frequencies):
            low = t.lower()
            if forms.get(low, (None, 0))[1] < f:
                forms[low] = (t, f)
        self.capitals = {k:v[0] for k, v in forms.items()}

        print('Computed stats for', len(self.term_codes), 'terms')
 
    # return the number of texts stored in this Sample
    def __len__(self):
        return self.size

    # return term frequency of the specified term
    def get_tf(self, term):
        code = self.term_codes.get_code(term.lower())

        return self.term_frequencies[code]

    # return document frequency of the specified term
    def get_df(self, term):
        code = self.term_codes.get_code(term.lower())

        return self.doc_frequencies[code]

    # return the most frequent capitalization form
    # (also for stopwords not in dictionary)
    def most_frequent_capitalization(self, term):
        return self.capitals.get(term.lower(), term)
 
    # return, for every term, the average publication year of the texts containing it
    # period = pair of years (min_year, max_year) inclusive
    def average_year(self, period, tf_threshold=20, df_threshold=3):
        docs = [n for n, y in enumerate(self.year)
                if period[0] <= y <= period[1]]
        tf_matrix = self.matrix[:, docs]
        tf_sum = tf_matrix.sum(axis=1).A1
        df_sum = tf_matrix.getnnz(axis=1)
        terms = [m for m, tf in enumerate(tf_sum)
                 if tf >= tf_threshold and df_sum[m] >= df_threshold]
        tf_matrix = tf_matrix[terms, :]
        rows, cols = tf_matrix.nonzero()
        # binary term-document matrix; the explicit shape keeps trailing empty columns
        df_matrix = csr_matrix((ones(len(rows)), (rows, cols)), shape=tf_matrix.shape)
        year = [self.year[n] for n in docs]

        res = df_matrix @ year / df_matrix.getnnz(axis=1) # @ operator = matrix multiplication

        return {self.term_codes.get_term(terms[m]):res[m] for m in range(len(res))}

 
    # return the number of texts per year (doc frequency) containing the specified term
    def get_df_per_year(self, term):
        m = self.term_codes.get_code(term)
        row = self.matrix.getrow(m)
        _, docs = row.nonzero()
        c = Counter(map(self.year.__getitem__, docs))

        return c

    # return the number of occurrences per year (term frequency) for every term
    def tf_per_year(self, period=None):
        rows, cols = self.matrix.nonzero()
        res = {m:Counter() for m in rows}
        for m, n in zip(rows, cols):
            year = self.year[n]
            if period is None or period[0] <= year <= period[1]:
                res[m][year] += self.matrix[m, n]

        return res
 
    # plot the yearly term frequency of a term
    # relative = True divides each value by the number of texts in that year
    def plot_tf_series(self, term, period, relative=False):
        m = self.term_codes.get_code(term)
        # code 0 is a valid term, so test against None explicitly
        if m is None:
            raise ValueError('{} is not in store'.format(term))

        if relative:
            norm = Counter(self.year)
        else:
            norm = Counter(set(self.year))

        row = self.matrix.getrow(m)
        _, cols = row.nonzero()
        c = Counter()
        for n in cols:
            year = self.year[n]
            if period is None or period[0] <= year <= period[1]:
                c[year] += row[0, n]

        X = sorted(c.keys())
        Y = [c[x] / norm[x] for x in X]
        plt.plot(X, Y, 'o-')
        plt.ylim(0, 1.2 * max(Y))
        plt.title(term)
 
    # return dictionary with a list of text-years per term
    # period = pair of years (min_year, max_year) inclusive
    # keep_all = True if unlisted texts are not ignored
    # note: __init__ does not set self.listed, so keep_all=False is not supported here
    def document_years(self, period=None, keep_all=True):
        rows, cols = self.matrix.nonzero()
        res = {m:list() for m in rows}
        for m, n in zip(rows, cols):
            if keep_all or self.listed[n]:
                year = self.year[n]
                if period is None or period[0] <= year <= period[1]:
                    res[m].append(year)

        return res

    # return dictionary with Counter of text-years per term
    def df_per_year(self, period=None, keep_all=True):
        doc_years = self.document_years(period, keep_all)

        return {m:Counter(v) for m, v in doc_years.items()}
 
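    # create a plot with the document frequency of each term and save it as PNG
    # period = pair of years (min_year, max_year) inclusive
    def plot_df(self, terms, period, keep_all=True):
        dfs = self.df_per_year(period, keep_all)
        # make sure the output folder exists
        os.makedirs('plots', exist_ok=True)
        for term in terms:
            m = self.term_codes.get_code(term.lower())
            df = dfs[m]
            # include the last year of the period
            X = range(period[0], period[1] + 1)
            Y = [df.get(x, 0) for x in X]
            plt.clf()
            plt.plot(X, Y)
            plt.title(term)
            filename = 'plots/{}.png'.format(term)
            print('Saving', filename)
            plt.savefig(filename, dpi=200)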
 
    # compute the average year, within the specified period, of the documents
    # containing each term with global term frequency above tf_threshold
    # and annual document frequency above df_threshold (in one year at least)
    # period = optional pair of years (min_year, max_year) inclusive
    def get_ages(self, period=None,
                 tf_threshold=20, df_threshold=3, keep_all=True):
        res = dict()
        doc_years = self.document_years(period, keep_all)
        for m, values in doc_years.items():
            term = self.term_codes.get_term(m)
            if len(values) > 0:
                # highest number of documents containing the term in a single year
                df = Counter(values).most_common(1)[0][1]
                tf = self.term_frequencies[m]
                if df >= df_threshold and tf >= tf_threshold:
                    res[term] = mean(values)
        return res
 
    # return the indices of the texts containing any term in this set of terms
    def docs_with_term(self, terms, period=None):
        rows, cols = self.matrix.nonzero()
        res = set()
        for m, n in zip(rows, cols):
            term = self.term_codes.get_term(m)
            if terms is None or term in terms:
                year = self.year[n]
                if period is None or period[0] <= year <= period[1]:
                    res.add(n)

        return res

    # return the years of the texts containing the term
    # (this Sample only stores years; it has no type or panel attributes)
    def search(self, term):
        m = self.term_codes.get_code(term)
        docs = self.matrix.getrow(m).nonzero()[1]

        return [self.year[n] for n in docs]
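
The idea behind `get_ages` can be shown on a toy corpus: for each term, collect the years of the texts that contain it and take the mean. Terms with a recent average year are candidates for emergent topics. The sketch below uses made-up data and leaves out the frequency thresholds. The function name `average_years` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# made-up corpus: (publication year, set of terms in the text)
docs = [
    (2010, {'derecho', 'norma'}),
    (2016, {'derecho', 'algoritmo'}),
    (2018, {'algoritmo'}),
]


def average_years(docs):
    """Return the mean publication year of the texts containing each term."""
    years = defaultdict(list)
    for year, terms in docs:
        for term in terms:
            years[term].append(year)
    return {term: mean(values) for term, values in years.items()}


ages = average_years(docs)
# most recent terms first
for term, age in sorted(ages.items(), key=lambda kv: kv[1], reverse=True):
    print(term, age)
```

In the real class the thresholds on term frequency and document frequency filter out rare terms, whose average year would otherwise be dominated by one or two texts.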

### Using the Sample class to extract the emergent topics

In [None]:
data = df

In [None]:
data = data[data.original_text.str.len() > 40] 

print('Processing', len(data), 'texts')

s = Sample(data, 2)
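
When the `Sample` is built, the texts are split into tokens with the class pattern and terms that occur only once (hapax legomena) are dropped. The sketch below reproduces those two steps with the standard library only, on a made-up sentence. It does not use `CountVectorizer`, so stop words, n-grams and `max_df` are ignored here.

```python
import re
from collections import Counter

# same token pattern as in the Sample class: a token needs at least one letter
pattern = r"(?:\w+[-])*\w*[^\W\d_]\w*(?:[-'’`]\w+)*"

# made-up text
text = "Doxa 1984 bien-estar doxa"

# purely numeric strings such as "1984" are not tokens
tokens = re.findall(pattern, text.lower())
counts = Counter(tokens)

# keep only terms that occur more than once
kept = {term: n for term, n in counts.items() if n > 1}
print(tokens)
print(kept)
```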

## Saving the pickle object

The [pickle](https://docs.python.org/3/library/pickle.html) module implements binary protocols for serializing and de-serializing a Python object structure.

In [None]:
with open('sample-doxa.pkl', 'wb') as f:
    pickle.dump(s, f)

In [None]:
with open('sample-doxa.pkl', 'rb') as f:
    s = pickle.load(f)
print('Loaded stats for', len(s), 'texts')
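
A round trip with a small object shows what saving and loading does. This sketch uses a made-up dictionary and a temporary folder instead of the `Sample` object.

```python
import os
import pickle
import tempfile

# made-up object standing in for the Sample
stats = {'terms': ['derecho', 'norma'], 'years': [2010, 2018]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.pkl')
    # serialize the object to a binary file
    with open(path, 'wb') as f:
        pickle.dump(stats, f)
    # read it back into a new object
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == stats)
```

The restored object is an equal copy, not the same object. Pickle files should only be loaded from trusted sources, because loading can execute arbitrary code.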

## Setting a period

In [None]:
period = (2010, 2018)

In [None]:
ages = s.get_ages(period)

In [None]:
top = pd.DataFrame.from_dict(ages, orient='index').reset_index()
print(top)

In [None]:
top.columns = ['TERM', 'AGE']
#top = top.sort_values('AGE', ascending=False).head(250) 
top = top.sort_values('AGE', ascending=False)#.head(250) 
top['DOC FREQ'] = top.TERM.apply(s.get_df)
top['TERM FREQ'] = top.TERM.apply(s.get_tf)

In [None]:
# prepare to export
top['TERM'] = top.TERM.apply(s.most_frequent_capitalization)
print(top.set_index('TERM').head())

In [None]:
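# make sure the output folder exists
os.makedirs('output', exist_ok=True)
# pd.datetime is not available in recent pandas versions; use pd.Timestamp instead
ts = pd.Timestamp.now().strftime("%Y-%m-%d_%H.%M")
filename = 'output/vocabulary_{}.xlsx'.format(ts)
with pd.ExcelWriter(filename) as writer:
    top.set_index('TERM').to_excel(writer, sheet_name='terms')

print('vocabulary saved to', filename)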