New Analytical Approaches to the Corpus
The third and final week will take up the digital texts established through digital surrogates and electronic editions and consider new ways of analyzing them. Coordinated by Director Jonathan Hope, the sessions in week three will look forward to new tools, new methods, and new opportunities, as well as to the new problems they introduce. Against the backdrop of scholarly articles on corpus linguistic analysis, visiting faculty will guide discussion of these overarching questions: how will the availability of massive corpora of historical English change the subject? What tools are being developed to enable new kinds of searching (and at what cost)? How can scholars use DH in a way that is genuinely transformative of the subject? How do they bring their literary knowledge of texts (their genres, their relationships within literary history as it is currently understood) into a meaningful relationship with the vectors that can be drawn to visualize statistical relationships between those texts?
Monday’s visitor will be Mark Davies (Brigham Young University), who has pioneered the use of “mega-corpora” for the lexical analysis of English. In an initial morning session, Professor Davies will provide a hands-on demonstration of how his NEH-funded, 400-million-word Corpus of Historical American English (COHA) and his Google Books Corpus can be used to study lexical change in English. The possibilities for uncovering useful data about meaning and usage include the frequency of any word, semantically related word, or phrase across time; searches conducted by part of speech or lemma (i.e., the headword as it would appear in a dictionary); comparisons of the English language’s word-stock in contrasting time periods; and the discovery of collocates (i.e., words which co-occur more often than would be expected by chance). Participants will experiment with guided corpus searches. When they are comfortable using the interface to make queries of the dataset, they will divide into small groups in the afternoon to perform experimental searches related to their fields of interest.
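The logic behind two of these query types can be sketched in a few lines of Python. The toy corpus, the fifty-year bucketing, and the raw co-occurrence counts below are simplifications for illustration only: Davies’s corpora are queried through a web interface, and real collocate rankings use statistical measures such as mutual information rather than raw counts.

```python
from collections import Counter

# Toy corpus of (year, text) pairs standing in for a historical mega-corpus.
corpus = [
    (1820, "the steam engine drew the carriage along the iron road"),
    (1880, "the railway engine and the telegraph changed the nation"),
    (1920, "the motor car engine replaced the horse on the road"),
]

def frequency_by_period(word, corpus, period=50):
    """Count occurrences of `word` per time bucket (default: half-century)."""
    counts = Counter()
    for year, text in corpus:
        bucket = (year // period) * period
        counts[bucket] += text.split().count(word)
    return dict(sorted(counts.items()))

def collocates(word, corpus, window=3):
    """Count words co-occurring with `word` within +/- `window` tokens."""
    co = Counter()
    for _, text in corpus:
        tokens = text.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                co.update(tokens[max(0, i - window):i])   # left context
                co.update(tokens[i + 1:i + 1 + window])   # right context
    return co.most_common(5)

print(frequency_by_period("engine", corpus))  # {1800: 1, 1850: 1, 1900: 1}
print(collocates("engine", corpus))
```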
On Tuesday morning, participants will reconvene to discuss and analyze their searches and discoveries with Professor Davies. Two questions will focus the discussion: what chronological and word-type restrictions do scholars of early modern literature face, and how does the modernization of early modern orthography currently reduce the usefulness of such corpora?
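To make the orthography problem concrete, here is a deliberately naive sketch of rule-based spelling regularization. The handful of rules below (long s, u/v, i/j) are invented simplifications; production tools such as VARD rely on dictionaries and statistical models, and these regexes will misfire on words like “quick.”

```python
import re

def regularize(token):
    """Naively modernize a few common early modern spelling conventions."""
    t = token.lower()
    t = t.replace("ſ", "s")                        # long s -> s: "ſhall" -> "shall"
    t = re.sub(r"^v(?=[a-z])", "u", t)             # initial v: "vnto" -> "unto"
    t = re.sub(r"(?<=[a-z])u(?=[aeiou])", "v", t)  # medial u before a vowel: "haue" -> "have"
    t = re.sub(r"^i(?=[aeou])", "j", t)            # consonantal i: "ioy" -> "joy"
    return t

for w in ["Vnto", "haue", "loue", "ioy", "ſhall"]:
    print(w, "->", regularize(w))
```

Without some such regularization, a string search for “have” silently misses every instance of “haue,” which is one reason raw counts from unmodernized early modern corpora understate frequency.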
On Tuesday afternoon, Dr. Marc Alexander (University of Glasgow) will lead discussion. His work on semantic searching and the Historical Thesaurus of the Oxford English Dictionary (HTOED) builds on Davies’s in interesting ways. While simple string-searching for word-forms can be productive, making the theoretical leap from word-form to meaning and semantic relationship is not straightforward. Dr. Alexander will discuss the thorny problem of integrating meaning into the digital study of English texts. Current practice focuses on words and those things which can be identified from words (such as grammatical classes); investigating meaning has proven a much harder task. Of the various resources which aim to provide semantic “gateways” into texts, Dr. Alexander will introduce participants to HTOED and demonstrate its usefulness. The session will explore the possibilities HTOED provides, both by looking at the English language as a whole and through a narrower exploration of how early modern semantic fields change: for instance, the fields of words meaning man and woman. Once again, participants will divide into small groups to conduct searches relevant to their interests.
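As a rough illustration of what a semantic “gateway” adds over string search, the sketch below maps word forms to semantic categories and tallies how often each field is realized in a passage. The index entries and category codes are invented stand-ins; HTOED’s actual hierarchy numbers hundreds of thousands of senses and its codes differ.

```python
from collections import Counter

# A hand-made fragment of a semantic hierarchy, standing in for HTOED's
# categories. Every entry here is invented for illustration.
semantic_index = {
    "man":    "humankind > male",
    "fellow": "humankind > male",
    "wight":  "humankind > male",
    "woman":  "humankind > female",
    "wench":  "humankind > female",
    "dame":   "humankind > female",
}

def field_profile(tokens, index):
    """Tally how often each semantic category is realized in a token list."""
    tally = Counter()
    for tok in tokens:
        cat = index.get(tok.lower())
        if cat:
            tally[cat] += 1
    return tally

text = "the wench and the dame spoke while the fellow listened"
print(field_profile(text.split(), semantic_index))
# Counter({'humankind > female': 2, 'humankind > male': 1})
```

The payoff is that “wench,” “dame,” and “woman” now count toward the same field, so a query can track the semantic territory rather than any single word-form.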
A semantic arrangement of information about text (rather than, say, an alphabetical organization) lends itself to techniques of displaying and clustering data visually. On Wednesday morning, Dr. Alexander will shift discussion to visualization methods and their appropriateness to particular types of projects. Participants will compare ways of visualizing data drawn from HTOED using the University of Maryland’s Treemap software and will discuss the USAS semantic tagger available from Lancaster University. Dr. Alexander will invite discussion of how these applications may be useful for the participants’ own research.
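For a sense of how such a display is computed, the following is a minimal “slice-and-dice” treemap layout, the simplest of the treemap algorithms: each item receives a rectangle whose area is proportional to its count. This is not the UMD Treemap tool itself, which offers squarified and other more sophisticated layouts, and the field sizes are invented.

```python
def slice_and_dice(items, x, y, w, h, vertical=True):
    """Lay out (label, size) pairs as rectangles filling a (w x h) region."""
    total = sum(size for _, size in items)
    rects = []
    offset = 0.0
    for label, size in items:
        share = size / total
        if vertical:  # split the region into left-to-right strips
            rects.append((label, x + offset, y, w * share, h))
            offset += w * share
        else:         # split the region into top-to-bottom strips
            rects.append((label, x, y + offset, w, h * share))
            offset += h * share
    return rects

# Invented counts: imagine words falling into three semantic fields.
fields = [("male", 120), ("female", 80), ("child", 40)]
for label, rx, ry, rw, rh in slice_and_dice(fields, 0, 0, 100, 60):
    print(f"{label:7s} rect at ({rx:5.1f},{ry:4.1f}) size {rw:5.1f} x {rh:4.1f}")
```

A full treemap applies this split recursively down the hierarchy, so a category’s rectangle is subdivided among its subcategories, which is what makes the technique apt for a nested semantic arrangement like HTOED’s.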
How does visualization offer serious inroads into scholarly data using new techniques? How can visualizations allow scholars to investigate rather than simply view data? On Wednesday afternoon, NEH Institute Director Jonathan Hope will take up these major questions. His case study will be the work of the “Visualizing English Print” (VEP) project, a major Mellon-funded initiative coordinated by scholars at the Folger, the University of Wisconsin-Madison, and Strathclyde University. Its team seeks to develop tools and protocols that enable researchers to analyze and visualize the data being made available through the Text Creation Partnership, EEBO, and other archives. The VEP project addresses the possibilities, and problems, of dealing with mega-datasets. One of the most striking methodological issues facing researchers is the vast quantity of data becoming available as corpora shift from 40 texts to 400, and on to 400,000. If scholars are focused on a history of words, then such data sets are an advantage. But when scholarship seeks to move beyond words to study the development of genres, for example, then the quantities of data pose significant problems for the researcher. After introducing participants to some of the problems of dealing with such data sets, Dr. Hope will demonstrate the analysis and visualization tools being developed by the project team. In addition to lexical and semantic searching, participants will consider comparative rhetorical analysis using Docuscope, which allows scholars to trace the development of genres and modes of discourse through time. His presentation will culminate in a discussion of the mathematics of comparison: the “spaces” into which scholars project texts in order to compare them.
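The “spaces” in question can be made concrete with a small sketch: each text is reduced to a vector of feature frequencies, and similarity between texts becomes a geometric question. The three features and the frequency values below are invented stand-ins for the kinds of rhetorical categories Docuscope counts; real analyses work in dozens or hundreds of dimensions.

```python
import math

# Each text becomes a point in feature space. The three coordinates are
# invented relative frequencies of hypothetical rhetorical categories.
texts = {
    "comedy_a":  [0.12, 0.05, 0.30],
    "comedy_b":  [0.11, 0.06, 0.28],
    "history_a": [0.04, 0.14, 0.09],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means identical direction in feature space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

names = list(texts)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a} vs {b}: {cosine(texts[a], texts[b]):.3f}")
```

On these invented numbers the two comedies sit much closer together than either does to the history, which is the geometric intuition behind the visualizations of genre the session will examine.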
On Thursday morning, participants will reconvene to discuss how the methods and tools used by the VEP project might be amenable to their own work. There will be an opportunity to run Docuscope and the tools developed by the VEP team. Discussion will focus on how scholars develop the ability to read, interpret, and evaluate visualizations, and on the importance of understanding the statistical procedures that lie behind visual representations.
In the final three sessions, on Thursday afternoon, Friday morning, and Friday afternoon, participants will respond to the themes of the institute and lay out plans and issues for their future research. They will discuss what they have learned, speculate on what needs to be done or made available to researchers in the field, and describe what they have been inspired to investigate. They will also indicate what their continuing contribution to the Institute’s digital footprint will be. These sessions are the culmination of the three-week program, but they also mark the beginning of the work participants will continue after the institute.