The Classical Language Toolkit: Natural Language Processing for Historical Languages

By Patrick J. Burns
06/24/2016

How should I catalog recent acquisitions in Chinese? How would I go about scanning documents in Russian so that they can be better searched? How many ancient languages are represented at ISAW? These are just a few of the conversations I have found myself involved in during my first week as the new Assistant Research Scholar for Digital and Special Projects at the ISAW Library. As someone obsessed with languages, and the languages of antiquity in particular (my graduate training is in Classical philology with a focus on Latin literature), I find the Library an exciting place to work. David Ratzan, the Head Librarian, asked me to write a blog post about my research, and I thought it would be a good idea to introduce myself through a project I will be working on this summer that combines my philological interests with the focus on digital projects in my new role at the ISAW Library: the Classical Language Toolkit, a collection of resources and tools for digital research on historical languages.

But first, a bit about me: My graduate training in Classics at Fordham University was largely a blend of traditional philology and literary criticism, resulting in a dissertation on intertextuality and experimentation with genre in Latin poetry of the 1st century CE. Towards the end of my time at Fordham, however, my research took an increasingly digital direction. After participating in an NEH Institute for Advanced Technologies in the Digital Humanities at Tufts University's Perseus Project called Working with Text in the Digital Age, I recognized that my background in text processing and database development could be applied to the ancient world.

The digital project to which I have devoted the most time over the past year is the Classical Language Toolkit. The CLTK aims to extend text analysis and natural language processing methods to historical languages. By natural language processing, I mean the application of computer-assisted methods for describing, interpreting, and even generating human language. The project also aims to make available open-access corpora specifically for historical languages and to develop the tools necessary to pursue reproducible, scientific research that advances the study of the languages and literatures of the ancient world.

This summer, I am working on improving the lemmatizers for both Latin and Greek in the CLTK, a project which has been included in Google Summer of Code (GSoC). Lemmatization is a basic NLP task that returns the dictionary headword for a given word. So, for example, you would lemmatize "errare humanum est" (Latin for "to err is human") as "erro humanus sum"; that is, "erro" is the word you would need to look up in a Latin dictionary to learn the lexical meaning or definition of the Latin infinitive "errare" (to wander, or make a mistake). (Verbs in Latin dictionaries are conventionally listed under the first person singular, present active indicative form.) The process continues with the adjective "humanus" (the headword for "humanum") and the verb "sum" (for "est"). The lemmatizers that are currently available rely for the most part on matching words against a database of known forms. But because of unrecognized words and ambiguous forms (for example, the homonym "cum" is used in Latin for both the conjunction "when" and the preposition "with"), the accuracy of the current lemmatizing tools tops out at around 80-85%. My GSoC project aims to increase accuracy significantly by employing rule-based prediction of forms and training the lemmatizers to make better use of word context in making decisions between ambiguous forms. You can read more about the summer project here and here.
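To make the idea concrete, here is a minimal sketch in Python of the three strategies described above: lookup against known forms, a rule-based guess for unrecognized words, and a context check for ambiguous forms. It is an illustration only and not the CLTK's implementation. The table `KNOWN_FORMS`, the function names, the "-are" rule, and the ablative-ending heuristic for "cum" are all hypothetical and far cruder than anything a real lemmatizer would use.

```python
# Toy table of known forms (hypothetical). A real lemmatizer would draw on a
# large database; ambiguous forms map to more than one candidate headword.
KNOWN_FORMS = {
    "errare": ["erro"],
    "humanum": ["humanus"],
    "est": ["sum"],
    "amicis": ["amicus"],
    "venisset": ["venio"],
    "cum": ["cum (prep.)", "cum (conj.)"],
}

# Crude stand-in for "the next word looks ablative" (assumption for this sketch).
ABLATIVE_ENDINGS = ("is", "ibus", "o", "a")


def guess_by_rule(form):
    """Rule-based guess for an unrecognized form.

    Only one toy rule: treat a word in -are as a first-conjugation
    infinitive and return the first person singular (amare -> amo).
    Anything else is returned unchanged.
    """
    if form.endswith("are") and len(form) > 3:
        return form[:-3] + "o"
    return form


def disambiguate(form, candidates, next_form):
    """Choose between candidate headwords using the following word."""
    if form == "cum":
        # Preposition if the next word looks ablative, otherwise conjunction.
        if next_form.endswith(ABLATIVE_ENDINGS):
            return "cum (prep.)"
        return "cum (conj.)"
    # No context rule for this form: fall back to the first candidate.
    return candidates[0]


def lemmatize(tokens):
    """Return one headword per token: lookup first, then rules or context."""
    lemmas = []
    for i, token in enumerate(tokens):
        form = token.lower()
        candidates = KNOWN_FORMS.get(form)
        if candidates is None:
            lemmas.append(guess_by_rule(form))
        elif len(candidates) == 1:
            lemmas.append(candidates[0])
        else:
            next_form = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
            lemmas.append(disambiguate(form, candidates, next_form))
    return lemmas


print(lemmatize("errare humanum est".split()))
print(lemmatize("cum amicis".split()))
print(lemmatize("cum venisset".split()))
```

In this toy version, "cum amicis" (with friends) resolves to the preposition and "cum venisset" (when he had come) to the conjunction, while a word missing from the table, such as "amare", falls through to the rule-based guess. The sketch also shows why lookup alone hits a ceiling: every unknown or ambiguous form needs extra information that a simple table cannot supply.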

The ISAW Library is a particularly inspiring place to pursue this kind of language work. There is a depth and breadth of language study that suffuses all of the scholarship at the Institute. As the website notes, the Library's collections span "the sweep of the ancient Eurasian world, from the western Mediterranean, across the Near East and Central Asia, to northern China." With this impressive geographical range of course comes an equally impressive linguistic range. Natural language processing has made great strides in modern languages, but the work in historical languages has been slower to develop and there are enormous opportunities still to pursue. While my focus on the CLTK team is on Latin and Greek corpora, resources, and tools, the project has seen recent development in Classical Chinese, Coptic, Sanskrit, Hindi, and Pali, among many others. The application of NLP to such a broad range of historical languages seems to me in great sympathy with ISAW's mission of study focused on an "unshuttered view of antiquity across vast stretches of time and place." Accordingly, one of my academic goals during my time at ISAW is to learn more about the languages scholars and students are working with here and to see how computational approaches to these languages can help them better explore their materials, suggest new research questions, and open up avenues of inquiry. I look forward to discussing these ideas over morning coffee, at upcoming talks, and in casual conversation around the building.

Dictionaries in the ISAW Library reference section.