Converted from an OASIS Open Document
Summit work of the Spanish Golden Age and forefather of the so called picaresque novel, The Life of Lazarillo de Tormes and of His Fortunes and Adversities (henceforth: the Lazarillo still remains an anonymous text (Rico, 2011). The 400 years of attributions have left us an enormous, nearly intractable, amount of bibliography that must be reviewed and studied. Paradoxically, scholars, instead of shying away from this mystery, are still adding new proposals to the pool of candidate authors, although some of them use modern and less explored methods (mostly computational) that were not available a decade ago.
Chronologically, the Hieronymite Friar José de Sigüenza was the first to propose a possible author: Friar Juan de Ortega. Father Sigüenza’s Historia de la Orden de San Jerónimo gathers his finding of a manuscript of the Lazarillo in the cell of Juan de Ortega. Although a draft was indeed found in the Friar’s cell, the circulation of handwritten copies was a common practice during the Spanish Golden Age (Botrel and Salaün, 1974). The claim that Father Ortega was the author is hard to sustain as the draft does not seem to be enough proof. Diego Hurtado de Mendoza was proposed a couple of years later, in 1607. His candidacy as the author of the Lazarillo was proposed by Valerio Andrés Taxandro. Over the years other scholars contributed to the diffusion of Mendoza being the author, and the attribution proved to be extremely popular. For almost three centuries book catalogues all over Europe recorded Hurtado de Mendoza as the author of theLazarillo. In 2010 Mercedes Agulló provided documentary proof to support the authorship, although some dispute the validity of such evidence (Agulló y Cobo, 2011). Other humanists were also proposed. The reformist Juan de Valdés was defended by Morel-Fatio and Manuel J. Asensio, but in view of the lack of solid evidence, this candidacy was abandoned in favor of his brother Alfonso. Rosa Navarro Durán supported Alfonso de Valdés, as she found connections between the works that influenced both Valdés’ work and the Lazarillo. The lack of direct comparisons has been strongly criticized.
In the last decade, a couple of names have taken center stage. Juan Luis Vives was proposed and devotedly defended by Francisco Calero. Following the same precepts as for Alfonso de Valdés’ candidacy, the fact that all Vives’ known works were written in Latin makes the attribution lose some credibility. In 2008, after abandoning the authorship of Francisco Cervantes de Salazar, José Luis Madrigal proposed Juan Arce de Otálora. Using existing corpora such as Google Books and CORDE, Madrigal employed basic computational analysis based on the counting of words in order to find correlations between the style of authors in the corpora and the style of the author of the Lazarillo.
Computational approaches to authorship attribution are usually not considered to be enough proof to state the final truth in the disputed authorship of an anonymous text. However, the use of modern authorship attribution techniques might help to reduce the pool of candidates and contribute with evidence to support a specific possible author. After building the pool of candidates, we collected each candidate’s available texts. Half of them lacked modern editions, which we solved building and using our own crowdsourcing OCR reviewing tool. The corpus we created counts a total of 50 works in a 90 year period surrounding the publication of the first known edition of the Lazarillo in 1554. All the major candidates for the authorship of this book were included, as well as some authors who had not been considered previously to add robustness to our analysis. The rules we followed for regularizing the spelling of old Spanish were borrowed from Ocasar’s system (Ocasar, 2014). Features from the text were then extracted following the winners of several editions of the authorship attribution competition known as PAN at CLEF, which establishes the state-of-the-art in attribution techniques (Stamatatos et al., 2015). The final set of features were composed by several distributions: functions words, the 300 most common words (BOW), the 3000 most common character 3-grams (CNG), punctuation signs, the 30 most common parts of speech; tf-idf for a maximum of 1000 word bi- and tri-grams, and for a maximum of 1000 character {2,4}-grams; and average sentence length, sentence length variation, and sentence lexical diversity. A combined feature vector with all the abovementioned features was also included.
Once the texts were in digital format, we explored the dataset through distance-based methods, such as Burrows’ Delta and its variations (Burrows, 2003), which outperformed any other. For compression-based methods we applied Cilibrasi and Vitanyi’s NCD with BZIP2, RAR and PPM (Cilibrasi and Vitanyi, 2005). We tested PCA and Linear Discriminant Analysis with different settings for number of chunks per work, components and features (Burrows and Hassall, 2002). Our final approach was comprised of three steps: first we used unsupervised learning to reduce the pool of candidates. Then, applying supervised learning, we ranked the possible authors. Finally, only six of these candidates were fed into an ensemble algorithm for “unmasking” the most likely author. As for unsupervised learning techniques, we obtained that Ridge, Bernoulli, multinomial, and nearest centroid had the best performance for our total feature vector, BOW, and CNG. In supervised learning the best results were provided by SVM and maximum entropy models, for the same feature sets. This allowed us to reduce the pool of candidates for the unmasking method proposed by Moshe Koppel and Jonathan Schler since it is a computationally expensive method (Koppel and Schler, 2004). The results were consistent for all methods and in line to what we first obtained applying Burrows’ Delta. We found that the most likely author seems to be Juan Arce de Otálora, closely followed by Alfonso de Valdés. Unfortunately, although supporting previous hypotheses about the authorship of the Lazarillo and providing with evidence in the case of Valdés, the method also stated that no certain attribution could be made with the given corpus.