Word Embeddings for Processing Historical Texts

Rachele Sprugnoli, Fondazione Bruno Kessler, Italy (rachele.sprugnoli@unitn.it)
Giovanni Moretti, Fondazione Bruno Kessler, Italy (moretti@fbk.eu)

Keywords: word embeddings, historical texts, named entity recognition, natural language processing, linguistics

In recent years, word embeddings have become important resources for many Natural Language Processing (NLP) tasks (Collobert et al., 2011; Maas et al., 2011; Lample et al., 2016). Several sets of pre-trained word vectors have been released, generated with different algorithms but all based on huge amounts of contemporary text: mainly news and Wikipedia pages, but also Twitter posts and crawled web pages.

Interest in this type of distributional approach has recently emerged in the Digital Humanities community as well, as shown by the organization of dedicated workshops (e.g., http://dariah-tda.github.io/meeting/activity/workshop/2017/12/20/CfP-Workshop-on-Embeddings.html) and the publication of scientific articles on vectors built from historical or literary texts for tracking semantic shifts (Hamilton et al., 2016; Wohlgenannt et al., 2018; Leavy et al., 2018).

This submission aims to expand current research on historical word embeddings by presenting a set of English vectors pre-trained on a sub-part of the Corpus of Historical American English (COHA) (Davies, 2012) with three different algorithms. The subset of COHA we have chosen includes 36,856 texts from all four available genres (fiction, newspaper, magazine, non-fiction) published between 1860 and 1939, for a total of more than 198 million words. We chose this specific time frame because we have a collection of travel writings with the same period of publication, on which we plan to perform several NLP tasks such as the one presented in the "Application" section below. In particular, this collection (https://sites.google.com/view/travelwritingsonitaly) contains both travel reports and guides published in a period of radical transformation of travel habits, when several technological, economic and sociological factors led to the decline of the Grand Tour and the emergence of leisure-oriented travel. As for the applied models, we used the GloVe, fastText and Levy & Goldberg approaches (Pennington et al., 2014; Levy & Goldberg, 2014; Grave et al., 2018). By adopting these three models, we cover different types of word representation: GloVe is based on linear bag-of-words contexts, fastText on bags of character n-grams, and Levy & Goldberg's model on dependency parse trees.
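As a simplified illustration of the dependency-based representation (this is our sketch, not the code used to build the vectors; details such as the preposition collapsing of Levy & Goldberg, 2014 are omitted, and the triple format is a hypothetical parser output), the following Python fragment derives (word, context) pairs from a dependency parse:

```python
# Minimal sketch of Levy & Goldberg-style dependency contexts.
# Input: (head, dependent, relation) triples with 0-based token indices,
# a hypothetical simplification of real parser output.

def dependency_contexts(tokens, triples):
    """Yield (word, context) pairs from (head, dep, rel) triples."""
    for head, dep, rel in triples:
        # the dependent sees its head through the inverse relation ...
        yield tokens[dep], f"{tokens[head]}/{rel}-1"
        # ... and the head sees the dependent through the relation itself
        yield tokens[head], f"{tokens[dep]}/{rel}"

tokens = ["the", "traveller", "admires", "Vesuvius"]
triples = [(1, 0, "det"), (2, 1, "nsubj"), (2, 3, "dobj")]
for word, ctx in dependency_contexts(tokens, triples):
    print(word, ctx)
# "admires" is paired with traveller/nsubj and Vesuvius/dobj but never with
# "the", whereas a linear bag-of-words window would pair it with all nearby
# words regardless of their syntactic role.
```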

Before applying these models, we lower-cased all the texts; tokenisation and dependency parsing (the latter required by the Levy & Goldberg approach) were then performed with Stanford CoreNLP (Manning et al., 2014). Training considered all words appearing at least 10 times in the COHA sub-corpus, with a context window of size 10. In a first phase, words are mapped to their frequency counts; a context vocabulary is then created on the basis of the context window. Our pre-trained word embeddings (called HistoGlove, HistoFast and HistoLevy) have 300 dimensions and are publicly available online (http://dh.fbk.eu/technologies/histo).
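For illustration, the sketch below reproduces these hyperparameters with gensim's fastText implementation; the released vectors were actually built with the original toolkits, so the training pipeline may differ in detail, and the corpus file name is hypothetical:

```python
# Illustrative reconstruction of the training setup (not the actual pipeline):
# "coha_subset.txt" is assumed to hold one lower-cased, tokenised sentence
# per line.
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

sentences = LineSentence("coha_subset.txt")
model = FastText(
    sentences,
    vector_size=300,  # 300-dimensional embeddings
    window=10,        # context window of size 10
    min_count=10,     # keep only words appearing at least 10 times
)
model.wv.save_word2vec_format("HistoFast.vec")
```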

Table 1. The 7 most similar words as induced by the different embeddings.

Word: gay
  HistoFast:  merry, gayest, joyous, gaiety, gayly, gayety, light-hearted
  HistoGlove: merry, bright, joyous, cheerful, brillant, happy, flowers
  HistoLevy:  merry, gorgeous, joyous, rosy, lively, bright, cheerful
  Word2Vec:   lesbian, bisexual, lgbt, lesbians, women, sexual, gays, homosexual

Word: dancing
  HistoFast:  dance, danced, dances, dancers, dancer, walzing, dancin
  HistoGlove: dance, playing, danced, singing, music, dances, dancers
  HistoLevy:  singing, bathing, skating, swimming, feasting, wrestling, chattering
  Word2Vec:   dance, singing, dances, songs, dancers, ballroom, featuring

Word: woman
  HistoFast:  girl, madwoman, lady, irishwoman, husband, maid, she
  HistoGlove: girl, man, women, wife, she, husband, mother
  HistoLevy:  girl, man, damsel, gentlewoman, englishwoman, youngster, creature
  Word2Vec:   man, person, girl, child, women, children, men

Table 1 shows the 7 most similar words, in terms of cosine similarity, for a set of target words ("gay", "dancing", "woman") as found in our three historical word embeddings and in Word2Vec trained on contemporary data (Mikolov et al., 2013). Among the words reported in Table 1, the clearest meaning shift is observed for "gay", whose reference to homosexuality is absent from the historical vectors. As for "dancing", it is worth noticing that the historical vectors bring out terms typical of the period under consideration (walzing) and that the dependency-based approach induces similar words with the same syntactic role (other gerunds: singing, bathing, skating, swimming, feasting, wrestling, chattering): instead of finding words with high domain similarity, the Levy & Goldberg model finds words with high functional similarity (Turney, 2012), that is, words that behave like the target word. Terms rarely used in contemporary texts are also retrieved for the target word "woman" (see the visualization of the corresponding HistoGlove embeddings in Figure 1): e.g., madwoman, damsel, gentlewoman. Social roles such as maid and wife do not appear among the most similar Word2Vec words, replaced by the neutral term person.
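Neighbour lists like those in Table 1 can be queried with standard libraries. A sketch with gensim, assuming the released vectors are stored in the textual word2vec format (for a raw GloVe file without the leading header line, add no_header=True):

```python
# Sketch of querying nearest neighbours as in Table 1.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("HistoGlove.vec", binary=False)
for target in ("gay", "dancing", "woman"):
    # cosine similarity against the whole vocabulary, top 7 kept
    print(target, [word for word, _ in vectors.most_similar(target, topn=7)])
```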

Figure 1. Visualization of the embeddings of "woman". Image created with the Embedding Projector (https://projector.tensorflow.org/).
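The Embedding Projector loads embeddings from two tab-separated files, one of vector values and one of labels. A sketch of exporting a few vectors (reusing the `vectors` object from the previous sketch; the word list is ours, chosen for illustration):

```python
# Export a handful of embeddings for https://projector.tensorflow.org/
# ("Load" panel): vectors.tsv holds one tab-separated vector per line,
# metadata.tsv the matching labels.
words = ["woman", "girl", "lady", "wife", "maid", "madwoman", "damsel"]
with open("vectors.tsv", "w") as vf, open("metadata.tsv", "w") as mf:
    for word in words:
        vf.write("\t".join(str(x) for x in vectors[word]) + "\n")
        mf.write(word + "\n")
```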

Application

Figure 2. Place names automatically detected and then visualized using Carto (https://carto.com/).

Our embeddings can be useful resources for the development of NLP tools that process historical texts with neural architectures (Sprugnoli and Tonelli, 2019). For example, we applied them to the recognition of place names of different types (e.g., "Vesuvius", "Venice", "Forum Romanum") in English historical travel writings about Italy (Sprugnoli, 2018). The deep learning architecture we adopted (Reimers and Gurevych, 2017), trained with a small set of in-domain data (100,000 tokens), the HistoGlove embeddings and no feature engineering, outperformed both the CoreNLP Conditional Random Fields (CRF) model retrained on the same dataset and the same neural architecture employing bigger vector spaces pre-trained on contemporary texts. Our best model achieves a precision of 86.4, a recall of 88.5 and an F-measure of 87.5. Figure 2 displays the place names, related to the center of Florence, automatically detected in the tenth chapter of "Florence and Northern Tuscany with Genoa" (Hutton, 1908).
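As a simplified sketch of how pre-trained vectors enter such a tagger (the actual system is the BiLSTM-CRF of Reimers and Gurevych, 2017; here the CRF output layer is replaced by a per-token softmax for brevity, and `vocab` and `vectors` are assumed to come from the training data and the gensim object above):

```python
# Simplified sketch: plugging pre-trained embeddings into a BiLSTM tagger.
import numpy as np
from tensorflow.keras import initializers, layers, models

def build_tagger(vocab, vectors, n_labels, dim=300):
    # Row i of the embedding matrix holds the HistoGlove vector of word i;
    # index 0 is reserved for padding, and out-of-vocabulary rows stay zero.
    emb = np.zeros((len(vocab) + 1, dim))
    for i, word in enumerate(vocab, start=1):
        if word in vectors:  # gensim KeyedVectors membership test
            emb[i] = vectors[word]
    model = models.Sequential([
        layers.Embedding(len(vocab) + 1, dim,
                         embeddings_initializer=initializers.Constant(emb),
                         mask_zero=True),
        layers.Bidirectional(layers.LSTM(100, return_sequences=True)),
        layers.TimeDistributed(layers.Dense(n_labels, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model
```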

Bibliography

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K. and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), pp. 2493-2537.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y. and Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (pp. 142-150). Association for Computational Linguistics.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K. and Dyer, C. (2016). Neural Architectures for Named Entity Recognition. In Proceedings of NAACL-HLT (pp. 260-270).
Pennington, J., Socher, R. and Manning, C. (2014). GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
Grave, E., Bojanowski, P., Gupta, P., Joulin, A. and Mikolov, T. (2018). Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893.
Hamilton, W. L., Leskovec, J. and Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1489-1501).
Wohlgenannt, G., Chernyak, E., Ilvovsky, D., Barinova, A. and Mouromtsev, D. (2018). Relation Extraction Datasets in the Digital Humanities Domain and their Evaluation with Word Embeddings. In Proceedings of CICLING 2018, Hanoi, Vietnam.
Leavy, S., Wade, K., Meaney, G. and Greene, D. (2018). Navigating Literary Text with Word Embeddings and Semantic Lexicons. In Proceedings of the Workshop on Computational Methods in the Humanities (COMHUM 2018).
Davies, M. (2012). Expanding horizons in historical linguistics with the 400-million word Corpus of Historical American English. Corpora, 7(2), pp. 121-157.
Levy, O. and Goldberg, Y. (2014). Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 302-308).
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J. and McClosky, D. (2014). The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55-60).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
Turney, P. D. (2012). Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44, pp. 533-585.
Reimers, N. and Gurevych, I. (2017). Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 338-348).
Hutton, E. (1908). Florence and Northern Tuscany with Genoa. Methuen.
Sprugnoli, R. and Tonelli, S. (2019). Novel Event Detection and Classification for Historical Texts. Computational Linguistics, 45(2), pp. 1-38.
Sprugnoli, R. (2018). Arretium or Arezzo? A Neural Approach to the Identification of Place Names in Historical Texts. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10-11, 2018.