Teaching Text Analysis for Historical Language Research

By Patrick J. Burns

What am I working on at my desk on the third floor of ISAW on any given day? 

My position broadly covers COMPUTERS + ANTIQUITY, but a core part of my research agenda entails the large-scale mining and analysis of ancient-language text. I tend to use the term “computational philologist” these days and think of this role as asking something like “How can I leverage the speed, power, and endurance of computers to build meaningful arguments about Latin texts?” And while my work primarily concerns Latin (I mention the language here only because my doctorate is in Latin literature and that has been the main focus of my recent work), we can insert any historical language. I have long been involved, for example, with the Classical Language Toolkit, a natural language processing framework specifically dedicated to multilingual work. It is in this spirit that I am offering the “Text Analysis for Historical Language Research” course at ISAW this fall. In this post, I describe in some more detail what I tend to work on and how I plan to support the ISAW graduate students and research community in using computational methods for their research questions involving textual data, no matter which language or languages they work on.

So, what text projects have I been working on recently? A few examples:

  • I trained a new “pipeline” for processing Latin texts on roughly 1 million words of annotated textual data in order to return with high accuracy, among other things, the correct lemma, part-of-speech (POS) label, morphological information, and role in a sentence (e.g. subject or object) for each word. The pipeline shows, for example, ~97% accuracy on POS labeling and ~94% accuracy on lemmatization.

  • I presented a paper at the “Historical Psychology” preconference workshop at the annual meeting of the Society for Personality and Social Psychology that looks at emotion expressions in hundreds of years of Latin text. I created a dataset of ~13,000 sentences, automatically labeled as either being about “love” (amor) or “hate” (odium) and trained a machine learning-based classifier to construct a lexicon of the words most associated with either class.

  • On a lark, I used ChatGPT to build a reading comprehension app for Latin texts. The app takes a random paragraph from the Fabulae of the mythographer Hyginus and automatically generates basically an infinite number of reading comprehension questions and answers based on the text. And it does so in Latin!
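To make the lexicon-building idea in the second project a little more concrete, here is a minimal sketch in Python. It is not the method used in the paper: instead of a trained classifier it uses a simple smoothed log-odds score, and the four "sentences" are invented placeholders rather than real data. The function name `log_odds_lexicon` and the `smoothing` parameter are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

# Invented toy sentences standing in for a labeled dataset (not real data).
LABELED = [
    ("amor", "amor vincit omnia"),
    ("amor", "amor et gaudium"),
    ("odium", "odium et ira"),
    ("odium", "odium parit bellum"),
]


def tokenize(text):
    """Lowercase and split on whitespace; real Latin text needs more care."""
    return text.lower().split()


def log_odds_lexicon(labeled, class_a, class_b, smoothing=1.0):
    """Rank words by how strongly they lean toward class_a over class_b.

    Positive scores lean toward class_a, negative scores toward class_b.
    This is a stand-in for a classifier, assumed here for illustration.
    """
    counts = {class_a: Counter(), class_b: Counter()}
    for label, sentence in labeled:
        if label in counts:
            counts[label].update(tokenize(sentence))

    vocab = set(counts[class_a]) | set(counts[class_b])
    # Add-one style smoothing so unseen words do not produce log(0).
    total_a = sum(counts[class_a].values()) + smoothing * len(vocab)
    total_b = sum(counts[class_b].values()) + smoothing * len(vocab)

    scores = {}
    for word in vocab:
        p_a = (counts[class_a][word] + smoothing) / total_a
        p_b = (counts[class_b][word] + smoothing) / total_b
        scores[word] = math.log(p_a / p_b)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


lexicon = log_odds_lexicon(LABELED, "amor", "odium")
for word, score in lexicon:
    print(f"{word:10s} {score:+.2f}")
```

Words at the top of the printed list lean toward “love,” words at the bottom toward “hate,” and words shared evenly between the two classes score near zero. In a real project one would likely remove the seed words themselves (here amor and odium) before interpreting the list.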

What do all of these research projects have in common? The need to work with hundreds of thousands of data points related to a historical-language text in a systematic fashion. It is these (computational) strategies, approaches, and methods that I want to share with my ISAW colleagues, helping our research community build the skills necessary to design similar projects in their research areas and with attention to the wide range of historical-language studies here. This is what “Text Analysis for Historical Language Research” can contribute to the ISAW curriculum.

Here are some of the key questions organizing the course: 

  • When presented with a lot of textual data (think thousands, if not millions, of words), how do we organize such a collection? How do we sift through its contents rapidly and efficiently?

  • How do we extract the information we want?

  • How do we format what we find so that it can be further classified, clustered, visualized, and so on? 

With this course, students (not to mention research scholars, associates, and faculty) will get an introduction to the methods most relevant for ancient world study from text analysis, corpus linguistics, natural language processing, and information retrieval, among others, learning to think in terms of the core tasks of computational text work.

The five tasks that we will cover in detail in this course are: text classification, text clustering, sequence labeling (e.g., given a sentence, being able to label each word in the sentence with a POS label), entity extraction (e.g., given a sentence, being able to identify all of the named people or geographic locations), and text visualization. In addition, we will start the course with some basics of text data management: when you start working with hundreds or thousands of text files for your research, any task you want to undertake will be made that much easier with attention to file structure, data format, and so on.
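As a small illustration of why file structure matters, here is a sketch in Python that turns a directory of text files into a simple index. The layout `<corpus root>/<author>/<work>.txt` and the function name `build_index` are assumptions made for this example, not a convention prescribed by the course, and the file contents are placeholders.

```python
import tempfile
from pathlib import Path


def build_index(corpus_root):
    """Return one record per file laid out as <corpus_root>/<author>/<work>.txt."""
    records = []
    for path in sorted(Path(corpus_root).glob("*/*.txt")):
        text = path.read_text(encoding="utf-8")
        records.append(
            {
                "author": path.parent.name,  # directory name doubles as metadata
                "work": path.stem,           # file name without the extension
                "n_tokens": len(text.split()),  # crude whitespace token count
            }
        )
    return records


# Demo with a throwaway directory and placeholder contents (not real texts).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for author, work, content in [
        ("hyginus", "fabulae", "alpha beta gamma"),
        ("vergilius", "aeneid", "alpha beta gamma delta"),
    ]:
        (root / author).mkdir(exist_ok=True)
        (root / author / f"{work}.txt").write_text(content, encoding="utf-8")
    index = build_index(root)

for record in index:
    print(record)
```

Because the author and work are encoded in the path, every later task (classification, clustering, visualization) can filter or group texts without opening a single file.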

Perhaps the most exciting part of designing this course is the way it expands the curricular offerings at ISAW. The Digital Humanities intro course has become a cornerstone of the graduate curriculum here. I have co-taught this course with Sebastian Heath, Tom Elliott, and David Ratzan, specifically teaching the text analysis weeks on the syllabus. Yet language work is such a core part of ancient world study that it deserves an entire semester, a timeframe in which I can provide deeper training in the methods and give participants time to develop the kinds of new research questions these methods enable. 

Ted Underwood speaks in his book Distant Horizons of a “missing curricular foundation” that serves as a practical impediment to the widespread adoption of large-scale text analysis in literary studies: “If we literary scholars also want to make a home for large-scale quantitative research in our departments, we will need to create a curricular path for it” (163). Mutatis mutandis, ancient world scholars. The alternative is what Adam Crymble has dubbed the “invisible college,” i.e. the patchwork of tutorials, workshops, and other “self-learning resources” that can introduce students to methods helpful for their research but not necessarily in a systematic way helpful for their careers. Expanding the DH curriculum at ISAW provides a foundation where cutting-edge methods can be productively applied to our disciplinary strengths, expanding the range of research questions that can be asked and the scale at which they can be pursued. ISAW has long offered this kind of curricular continuation for DH topics largely in archaeology and material culture (e.g. “Data Modelling and Querying”, “QGIS for Archaeologists and Historians”, “3D Modelling and Related Technologies”). With “Text Analysis for Historical Language Research,” we expand our beyond-the-intro offerings into philology and other text-focused areas of ancient world study.