The Seven Words of the Virgin: Identifying change in the discourse context of the concept of virginity in Early Modern English

Susan Fitzmaurice (s.fitzmaurice@sheffield.ac.uk), University of Sheffield
Justyna Robinson (justyna.robinson@sussex.ac.uk), University of Sussex
Iona Hine (i.hine@sheffield.ac.uk), University of Sheffield
Fraser Dallachy (fraser.dallachy@glasgow.ac.uk), University of Glasgow
Kathryn Rogers (k.m.rogers@sheffield.ac.uk), University of Sheffield
Marc Alexander (marc.alexander@glasgow.ac.uk), University of Glasgow
Michael Pidd (m.pidd@sheffield.ac.uk), University of Sheffield
Seth Mehl (seth.mehl.10@ucl.ac.uk), University of Sheffield
Matthew Groves (m.i.groves@sheffield.ac.uk), University of Sheffield
Brian Aitken (brianaAitken@glasgow.ac.uk), University of Glasgow

Introduction

The Linguistic DNA project (LDNA) is an AHRC-funded collaborative project (AHRC grant AH/M00614X/1) between the universities of Sheffield, Glasgow, and Sussex which is designing automatic processes to investigate the emergence and development of concepts in pre-1800 CE print. Employing Early English Books Online, manually transcribed through the Text Creation Partnership (EEBO-TCP), as its primary dataset, supplemented by Eighteenth Century Collections Online (ECCO-TCP) and other high-quality 18th-century text collections, the project is developing and refining a processing pipeline which assembles groupings of words bound together by their contextual use in printed discourse. The project is charting the development of these discourse-embedded word groups across time, investigating how they are shaped by historical and literary contexts, the boundaries and overlap between the groupings, and the interaction of ‘encyclopedic’ groupings with more traditional ‘thesaurus’-style semantic fields.

This paper discusses results from a branch of the project which is investigating instances of rapid change in the size of semantic categories as represented in The Historical Thesaurus of English (Kay et al., 2016).
The development of concepts as reflected in the size of Thesaurus categories has been investigated previously (cf. Alexander and Struan, 2013; Jurgen-Diller, 2014). The extra dimension provided by the outputs of the LDNA processor, however, allows a marked advance in such research: it enables the identification of instances in which change in discourse-embedded word groupings acts as a catalyst for corresponding rapid change in the semantic fields of English.

The present case study investigates words relating to the concept of ‘virginity’, using processed time-slice subsets of EEBO-TCP as snapshots of the discourse context for these words in Early Modern English print. By building sample word groupings for each of the subsets, it establishes the discourse context of ‘virginity’ words at different points in the timespan covered by EEBO-TCP. (The term ‘cluster’ is avoided here to prevent confusion with the word clusters of cluster or network analysis.) Comparison of these groupings points to a change in the focus of language users: a largely religious context of use opens out to a secular and then a poetic literary context, suggesting that society's consciousness of this concept, and the scale on which it was discussed, enlarged dramatically in the period covered by the sub-corpora.

Methodology

In order to select a Historical Thesaurus category for analysis, an average pattern of change over time was established. The Thesaurus arranges the words of the English language into a semantic hierarchy that is seven category levels deep, with the potential for up to four further sub-category levels within any given category. Owing to the extremely fine-grained nature of the sense categorisation in the Thesaurus, it was necessary to ‘cut’ the hierarchy at human scale using a thematic category set, developed during the AHRC- and ESRC-funded SAMUELS project (Grant AH/L010062/1), which is intended to allow Thesaurus users to find information at a level that is salient to human beings, i.e.
neither too general nor too detailed.

Figure 1: Sparkline showing growth of the ‘Virginity’ category from 1000 CE to 2000 CE in the context of surrounding thematic categories (which are themselves unusual as they are lexicalised only in later periods of English)

The number of lexemes within each category level was counted, and lexemes were filtered to include only those active within the approximate time range of the EEBO-TCP collection, i.e. 1475-1700 CE. This data was aggregated so that the change in the mean contents of a category could be viewed across time, and decade-to-decade percentage changes were calculated. Individual categories were then compared to this average category change, and a deviation of more than 5% from the average change was considered to be significant. Of the categories marked as statistically unusual by this process, category ‘AI09g Virginity’ was selected as promising because the items in its lexis had a relatively low number of homographs that could skew the results towards irrelevant information.

Testing of the LDNA processor outputs is being conducted on select subsets extracted to provide snapshots across the EEBO time period. The subsets used for this paper cover the periods 1520-39, 1550-59, 1610-11, and 1649. They are designed to contain a similar number of tokens; the progressively contracting timespans reflect the concomitant growth of printed material throughout the 15th to 17th centuries. Each token in the text is regularised, lemmatised, and tagged with a NUPOS part-of-speech tag via the MorphAdorner pipeline developed by Martin Müller and Philip Burns (Burns, 2013). Data is then gathered by the LDNA processor for the token's co-occurrences within 100- and 200-word bi-directional windows, which are intended to simulate paragraph-like sections of the proximate discourse (cf. Fitzmaurice et al. forthcoming).
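The windowed counting described above can be illustrated with a minimal sketch. It is not the LDNA processor itself: the function names, the toy token list, and the way probabilities are estimated from pair counts are all assumptions made for illustration. The sketch counts every (node, collocate) pair that falls within a fixed number of tokens either side of the node, then scores a pair with pointwise mutual information, the statistic the project uses to rank co-occurrences.

```python
from collections import Counter
from math import log2


def window_cooccurrences(tokens, window):
    """Count (node, collocate) pairs within `window` tokens either side of each node."""
    pairs = Counter()
    for i, node in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own collocate
                pairs[(node, tokens[j])] += 1
    return pairs


def pmi(pairs, node, collocate):
    """PMI = log2(p(node, collocate) / (p(node) * p(collocate))).

    Assumption: probabilities are estimated from the pair counts themselves,
    which is one common choice and not necessarily the project's.
    """
    joint = pairs[(node, collocate)]
    if joint == 0:
        return float("-inf")  # pair never observed
    total = sum(pairs.values())
    node_total = sum(c for (n, _), c in pairs.items() if n == node)
    coll_total = sum(c for (_, k), c in pairs.items() if k == collocate)
    return log2(joint * total / (node_total * coll_total))


# Invented toy input (already lemmatised); the project uses 100- and 200-word windows.
toy_tokens = ["virgin", "mary", "bless", "virgin", "mother"]
toy_pairs = window_cooccurrences(toy_tokens, window=1)
print(toy_pairs[("virgin", "mary")])                 # 1
print(round(pmi(toy_pairs, "virgin", "mary"), 3))    # 0.415
```

On real data the `window` argument would be set to 100 or 200, and the input would be the regularised, lemmatised token stream rather than a hand-written list.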
Pointwise Mutual Information (PMI) is used to provide a statistic for the likelihood of word co-occurrences; a minimum PMI value of 0.5 was arrived at experimentally for identifying node-collocate pairs to be considered interesting in the initial stages of investigation. Seven items in the ‘Virginity’ category were found to be present consistently across the subsets: ‘maid,’ ‘maiden,’ ‘maidenhead,’ ‘undefiled,’ ‘vestal,’ ‘virgin,’ and ‘virginity’ (although an eighth, ‘virginal,’ was present in the 1520-39 and 1610-11 subsets). The co-occurrences were then processed to identify those which occurred with multiple items in this list. Words which co-occurred with four or more items were investigated further.

Results

Comparison of the co-occurrence results across the text subsets shows a consistent shift in the patterns of word association with ‘Virginity’ category items. The words ‘woman’ and ‘widow’ remain strongly associated with the terms across all the subsets, demonstrating a societal preoccupation with female rather than male virginity.

The most evident change in the grouping is a movement from a predominantly religious discourse context into the secular world. In the 1520-39 subset, the Virgin Mary is intimately related to discussion of virginity. In the shared collocates listing, ‘mother’ collocates with all seven of the ‘Virginity’ lexemes, ‘mary’ with six, and ‘angel,’ ‘bless,’ ‘hymn,’ ‘nativity,’ and ‘nazareth’ with five each. Of these, only ‘mother’ maintains a strong association with ‘Virginity’ words throughout the EEBO period, appearing with four items in the 1610-11 text set and five in 1649.

The secularisation of the term is suggested by the prevalence in later subsets of words relating to marriage, reflecting what appears to be a growing focus on wedlock being preceded by virginity. ‘Marry’ gradually increases its association with the node items, collocating with four, then five, then seven from 1520 to 1649.
‘Marriage’ and ‘wife’ both enter the shared collocate group in 1550 and remain there through to 1649, whilst ‘matrimony’ is present in 1550, drops out in 1610, and returns in 1649. The extensive list of shared collocates in the 1649 sub-corpus strongly reflects the greater prevalence of literary fiction and poetry in printers’ output, and reinforces the view that virginity is a topic for which the discourse context is expanding: whereas ‘marry,’ ‘marriage,’ and ‘wife’ could intuitively be grouped together, the 1649 collocates do not form easily identifiable groupings.

Discussion

The consistency of the core items found in the subsets contrasts with the Thesaurus data, where the increase in the number of terms in the ‘Virginity’ category suggests that an expanding number of items should be found across these subsets. The most likely explanation is the loss of low-frequency information through a combination of cut-off values intended to reduce noise for later clustering experiments and the difficulty of normalising and lemmatising low-frequency items.

A clear outcome of the analysis is the confirmation that the category of ‘Virginity’ contains a core vocabulary which remains almost unchanged for over a century (i.e. 1520-1649): primarily, a consistent group of seven items which co-occur with ‘Virginity’ category words. This study demonstrates that the understanding of semantic development can be enriched by such cross-analysis of discursive-concept word groups with Thesaurus semantic fields and the word groups which travel through time with multiple items of Thesaurus categories.
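The shared-collocate step described in the Methodology (keeping node-collocate pairs at or above the PMI threshold of 0.5, then retaining collocates that appear with four or more ‘Virginity’ items) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the project's code: the function name, the input layout (a mapping from each node item to its collocates' PMI scores), and the scores themselves are invented for the example.

```python
from collections import Counter


def shared_collocates(pmi_by_node, min_pmi=0.5, min_nodes=4):
    """Return {collocate: number of node items it co-occurs with}.

    A collocate counts for a node only if its PMI score meets `min_pmi`;
    it is kept only if it does so for at least `min_nodes` node items.
    """
    counts = Counter()
    for scores in pmi_by_node.values():
        # Each node contributes at most one count per collocate.
        counts.update({word for word, score in scores.items() if score >= min_pmi})
    return {word: n for word, n in counts.items() if n >= min_nodes}


# Invented toy scores, not project data.
toy_scores = {
    "maid":   {"mother": 1.2, "marry": 0.9, "the": 0.1},
    "virgin": {"mother": 2.0, "marry": 0.4},
    "maiden": {"mother": 0.7},
    "vestal": {"mother": 0.5, "marry": 1.1},
}
print(shared_collocates(toy_scores))  # {'mother': 4}
```

Running the same function on each time-slice subset and comparing the returned counts would give the kind of per-period figures reported in the Results, such as the number of node items with which ‘marry’ collocates.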