Big Data: Complex Systems and Text Analysis

William Kretzschmar, Department of English, University of Georgia, United States of America, kretzsch@uga.edu
Allison Burkette, Department of Modern Languages, University of Mississippi, United States of America, burkette@olemiss.edu
Jacqueline Hettel, Nexus Lab, Arizona State University, United States of America, jacqueline.hettel@asu.edu

2016-03-15T15:10:00Z

Maciej Eder, Pedagogical University in Krakow
Jan Rybicki, Jagiellonian University
Institute of Polish Studies, Pedagogical University, ul. Podchorazych 2, 30-084 Krakow, Poland
maciej.eder@ijp-pan.krakow.pl


Paper: Pre-Conference Workshop and Tutorial

Keywords: complex systems; big data; text mining

Topics: corpora and corpus activities; data modeling and architecture including hypothesis-driven modeling; stylistics and stylometry; text analysis; linguistics; data mining / text mining; English

Half-day workshop (July 11, 9:30am-1pm)

Participants should bring their own laptops

A complex system (CS) is a system in which large networks of components with no central control and simple rules of operation give rise to complex collective behavior, sophisticated information processing, and adaptation via learning or evolution. The order that emerges in human language is simply the configuration of components, whether particular words, pronunciations, or constructions, that comes to occur in our communities and occasions for speech and writing. Nonlinear frequency profiles (A-curves) always emerge for linguistic features at every level of scale.

Three recent books have embraced CS and developed these ideas much more fully. In The Linguistics of Speech (2009), Kretzschmar demonstrates how complex systems constitute speech, focusing on nonlinear distributions and scaling properties. Kretzschmar's Language and Complex Systems (2015) applies CS to a number of fields in linguistics, including a long chapter on sociolinguistics. Finally, Burkette's Language and Material Culture: Complex Systems in Human Behavior (2016) applies CS to both the study of language and the anthropological study of materiality.

In this workshop we will introduce some basic ideas about complex systems, including A-curves and scaling; discuss corpus creation with either a whole population or random sampling; and discuss quantitative methods: why "normal" statistics do not work, and how the assumption of A-curves supports document identification and the comparison of language in whole-to-whole or part-to-whole situations, such as authors or text types. Knowledge of the emergent patterns in the CS of a language can cut through the problem of "noise" currently faced in NLP experiments whose findings are restricted to probabilities little better than chance.
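As a rough illustration of what a nonlinear frequency profile looks like, the following Python sketch ranks the variants of a single linguistic feature by frequency and reports how much of the data the top-ranked variants cover. The word counts are toy data invented for this example, and the helper names `frequency_profile` and `top_share` are hypothetical; this is not the organizers' workshop data or software.

```python
from collections import Counter


def frequency_profile(tokens):
    """Return (type, count) pairs sorted from most to least frequent."""
    counts = Counter(tokens)
    # Ties in frequency are broken alphabetically so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def top_share(profile, k):
    """Share of all tokens accounted for by the k most frequent types."""
    total = sum(count for _, count in profile)
    return sum(count for _, count in profile[:k]) / total


# Toy data, invented for illustration: 100 responses naming one item.
tokens = (
    ["sofa"] * 50 + ["couch"] * 25 + ["davenport"] * 10 + ["settee"] * 5
    + ["lounge"] * 3 + ["divan"] * 2
    + ["chesterfield", "loveseat", "daybed", "bench", "futon"]
)

profile = frequency_profile(tokens)
for rank, (word, count) in enumerate(profile, start=1):
    print(rank, word, count)

# A few variants dominate, followed by a long tail of rare ones.
print("share of top 2 variants:", top_share(profile, 2))
```

Plotting count against rank for such a profile gives the steep drop and long tail that the abstract calls an A-curve. The same ranking can be done in a spreadsheet by sorting a column of counts in descending order.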

We will start the workshop with a 60-minute general introduction by Burkette (40 minutes of explanation and demonstration, 20 minutes of experiential learning) to basic terms in CS such as "states" and "emergence," and will apply those principles to language in the form of nonlinear frequency distributions and scale-free networks (as in Kretzschmar 2009). The introductory section will acquaint the audience with how the operation of a CS leaves characteristic patterns in language as people use it. One feature of the introduction will be a computer simulation (Kretzschmar), so that the audience can see the process in action and not just observe its end products.

We will then organize the workshop in two additional parts: 1) Hettel, CS and Corpus Creation; 2) Kretzschmar, CS and Quantitative Measurement. In each part, we will offer explanation and demonstrations for 40 minutes and allow 20 minutes for experiential learning.

Hettel will present a rationale for corpus creation using methods of random sampling. For work on language in texts, this means using either an entire population of texts (such as all the novels by one author) or, more usually, a rigorously sampled selection of texts from a population. One or the other is required in order to avoid undue influence from any subsection of texts, since CS distributions emerge in every subgroup of texts. Using the example of documents from the nuclear power industry, Hettel will illustrate how a random sample can be created using quotas for each variable to be investigated.

Kretzschmar will discuss the problem that frequency patterns that emerge from a CS are always nonlinear, never "normal" in the sense required for the use of typical Gaussian statistics. He will present a method to assess just how nonlinear a frequency profile is, so that A-curves can be distinguished from normal distributions, and will then discuss how emergent frequency profiles from a CS can be usefully described and differentiated. The discussion will conclude with examples of whole-to-whole and part-to-whole comparisons that take advantage of emergent A-curve patterns.
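The two procedures described above, sampling with quotas and assessing how nonlinear a frequency profile is, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the presenters' own method. The document records, the `doc_type` field, and the helper `quota_sample` are hypothetical, and the sketch applies a quota to one variable only. The abstract does not name its nonlinearity measure; the Gini coefficient is swapped in here as one common measure of how unevenly frequencies are distributed.

```python
import random
from collections import Counter, defaultdict


def quota_sample(documents, key, quota, seed=0):
    """Draw up to `quota` documents at random from each category of `key`."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    groups = defaultdict(list)
    for doc in documents:
        groups[doc[key]].append(doc)
    sample = []
    for category in sorted(groups):
        members = groups[category]
        # Take the whole category when it is smaller than the quota.
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


def gini(counts):
    """Gini coefficient of non-negative frequencies (0 means perfectly even)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical population: 12 documents of three invented types.
doc_types = ["report"] * 6 + ["letter"] * 4 + ["memo"] * 2
documents = [{"id": i, "doc_type": t} for i, t in enumerate(doc_types)]

sample = quota_sample(documents, key="doc_type", quota=3)
print(Counter(doc["doc_type"] for doc in sample))

# Invented frequency profiles: one steeply skewed, one perfectly even.
skewed = [50, 25, 10, 5, 3, 2, 1, 1, 1, 1, 1]
even = [9] * 11
print("skewed:", gini(skewed), "even:", gini(even))
```

A coefficient near zero indicates frequencies spread evenly across variants, while a skewed profile with a few dominant variants and a long tail of rare ones scores much higher. Quotas on several variables at once would require grouping on a combination of fields rather than a single key.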

The organizers will provide data for participants to use on their own laptops. Participants should have a spreadsheet program (Excel, or something that reads Excel files) in order to process the data.

Bibliography

Burkette, A. (2016). Language and material culture: Complex systems in human behavior. Amsterdam: John Benjamins.
Kretzschmar, W. A., Jr. (2009). The linguistics of speech. Cambridge: Cambridge University Press.
Kretzschmar, W. A., Jr. (2015). Language and complex systems. Cambridge: Cambridge University Press.