Current Projects

Summer 2025

This summer we have two volunteer research opportunities for high school, college, or graduate students:

Representing Women Authorship in the Latin Treebanks (RWALT)
Post-OCR Correction Using Latin Annotations

We are now fully committed for summer internships. Please write to David Ratzan at dr128@nyu.edu if you are interested in applying for Fall 2025.

Representing Women Authorship in the Latin Treebanks

RWALT is a text data curation and annotation project led by Patrick J. Burns and David M. Ratzan at NYU’s Institute for the Study of the Ancient World Library. The project seeks to annotate syntactic and morphological features of Latin texts written by women, with the goal of collecting these texts into a Universal Dependencies-compatible treebank. While the existing Latin UD treebanks cover a range of time periods, geographic areas, authors, works, and genres, women remain significantly underrepresented among the authors of the sentences in these treebanks. RWALT aims to redress this imbalance in the data. The project uses the LatinCy pretrained NLP pipelines developed by Burns to annotate Latin written by women authors. The project began in October 2024. So far, our annotation efforts have included such authors as Sulpicia, Proba, Luisa Sigea, and Anna Maria van Schurman, with more to come.
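For readers unfamiliar with the treebank format, the sketch below shows roughly what a Universal Dependencies-style record looks like when written out as CoNLL-U. It is a minimal illustration only: the `Token` class, the `to_conllu` helper, and the sentence ID are hypothetical names, and the annotations for the opening words of Sulpicia's "Tandem venit amor" were written by hand for this example (with a partial feature set). They are not output from LatinCy or from the RWALT data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One token row, covering a subset of the ten CoNLL-U columns."""
    idx: int      # 1-based position in the sentence
    form: str     # word as it appears in the text
    lemma: str    # dictionary form
    upos: str     # universal part-of-speech tag
    feats: str    # morphological features, "_" if none
    head: int     # index of the syntactic head; 0 marks the root
    deprel: str   # dependency relation to the head


def to_conllu(sent_id: str, text: str, tokens: List[Token]) -> str:
    """Serialize one sentence; unused columns (XPOS, DEPS, MISC) are '_'."""
    lines = [f"# sent_id = {sent_id}", f"# text = {text}"]
    for t in tokens:
        # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        lines.append("\t".join([
            str(t.idx), t.form, t.lemma, t.upos, "_",
            t.feats, str(t.head), t.deprel, "_", "_",
        ]))
    # A blank line terminates each sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"


# Hand-written, illustrative annotation (an assumption, not project data).
tokens = [
    Token(1, "Tandem", "tandem", "ADV", "_", 2, "advmod"),
    Token(2, "venit", "venio", "VERB",
          "Mood=Ind|Number=Sing|Person=3|VerbForm=Fin|Voice=Act", 0, "root"),
    Token(3, "amor", "amor", "NOUN",
          "Case=Nom|Gender=Masc|Number=Sing", 2, "nsubj"),
]
record = to_conllu("example-1", "Tandem venit amor", tokens)
print(record, end="")
```

In practice an NLP pipeline proposes annotations of this kind and contributors review and correct them; the sketch only shows the shape of the result.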

Volunteer contributors to the RWALT project have the opportunity to learn:

fundamental skills in reading and translating Latin texts;
about women writers of Latin from the Classical world to the 20th century;
about the curation and annotation of ancient-world datasets; and
digital skills around processing, analyzing, and modeling Latin language data.

Contributors will help ISAW Library build this exciting new corpus of women writers of Latin by:

collecting authors that are in scope for the corpus;
conducting basic biographical and bibliographic research about these authors and their Latin works;
authoring translations in English of the Latin texts;
collecting and editing metadata about the authors and their works;
correcting scanned text in order to produce accurate digital texts of Latin works;
creating systematic annotations and tags in the digital texts in order to make the texts computationally tractable; and
(at later stages) training and testing language models on the growing corpus of women’s Latin.

Post-OCR Correction Using Latin Annotations

Training-data repositories for large language models, such as common_corpus (CC) from Pleias, ostensibly contain large amounts of digitized Latin text: at an advertised 34B tokens, CC is orders of magnitude larger than discipline-specific curated online repositories like the Perseus Digital Library (6.5M) or the Patrologia Latina (29.3M). We say "ostensibly" because much of the CC text has been corrupted by the scanning and optical character recognition (OCR) processes during digitization. As rescanning (and re-OCRing) the physical books would be a massive undertaking (expensive, labor-intensive, and logistically impractical), the best path forward for remediating textual corruption in these Latin texts is the natural language processing task of post-OCR correction.

Post-OCR correction (Nguyen et al. 2021; Ramirez-Orta et al. 2022; Guan and Greene 2024) uses machine learning to train a model to recognize corruption patterns in digitized text and restore better readings. In this respect, the process is a computational enactment of philology’s defining scholarly activity, namely correcting and establishing historical-language texts (West 1973; Tarrant 2016). Although there has been good progress in historical-language correction (Thomas, Gaizauskas, and Lu 2024; Smith and Cordell 2023), no post-OCR correction dataset for Latin currently exists. Pleias itself has released a large multilingual post-OCR correction dataset in its repository, covering four languages (English, French, German, and Italian).

The Post-OCR Correction Using Latin Annotations project (POCULA) aims to supplement this work by compiling Latin sentences that can be used for developing post-OCR correction models for Latin. The initial dataset will consist of 10,000 pairs of Latin sentences with metadata; annotators on this project will be responsible for adding a column to the dataset with the corrected version of each sentence, referring to the scanned document when available. This manual transcription and correction process will be aided by lexical and morphosyntactic annotations from Latin NLP pipelines and language models. To give a sense of the payoff of the POCULA dataset: if we can correct even 1% of CC at scale, the resulting 340M tokens would be among the largest available collections of digitized Latin and would, in turn, provide a substantial foundation for training updated and improved language models.
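The scale comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above and assumes that all of them are token counts; the variable names are ours.

```python
# Figures as quoted in the text (assumed to be token counts).
CC_TOKENS = 34_000_000_000
PERSEUS_TOKENS = 6_500_000
PATROLOGIA_TOKENS = 29_300_000

# Correcting 1% of CC.
corrected = CC_TOKENS // 100
print(corrected)  # 340000000

# How many times larger that 1% would be than each curated collection.
vs_perseus = round(corrected / PERSEUS_TOKENS, 1)
vs_patrologia = round(corrected / PATROLOGIA_TOKENS, 1)
print(vs_perseus)     # 52.3
print(vs_patrologia)  # 11.6
```

On these numbers, a corrected 1% of CC would be roughly fifty times the size of the Perseus figure and more than ten times the Patrologia Latina figure.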

The project will consist of the following initial phases:

correction of 10,000 sentences chosen from existing documents in the Latin subcollection of the CC text repository;
matching, where possible, the CC documents to their original scans; and
training interim correction models at sentence milestones (1,000; 2,500; and 5,000) to assist the transcription process.
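To make the workflow more concrete, here is a minimal sketch of how a dataset row, a simple measure of OCR corruption, and the milestone check might fit together. Everything in it is an assumption for illustration: the `Row` fields, the function names, and the example sentence (with a long s misread as f) are invented here and do not describe the actual POCULA schema or tooling.

```python
from dataclasses import dataclass
from typing import List, Optional

# Sentence counts named in the phases above.
MILESTONES = (1_000, 2_500, 5_000)


@dataclass
class Row:
    """Hypothetical row layout; the real column names may differ."""
    doc_id: str
    ocr_text: str
    corrected_text: Optional[str] = None  # added by an annotator
    scan_ref: Optional[str] = None        # pointer to the scan, if matched


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(row: Row) -> Optional[float]:
    """Edits per character of the corrected text; None if not yet corrected."""
    if not row.corrected_text:
        return None
    return edit_distance(row.ocr_text, row.corrected_text) / len(row.corrected_text)


def milestones_reached(rows: List[Row]) -> List[int]:
    """Milestones at or below the number of corrected rows so far."""
    done = sum(1 for r in rows if r.corrected_text is not None)
    return [m for m in MILESTONES if done >= m]


# Invented example of OCR damage.
row = Row(doc_id="example-doc",
          ocr_text="Gallia eft omnis divifa in partes tres",
          corrected_text="Gallia est omnis divisa in partes tres")
print(edit_distance(row.ocr_text, row.corrected_text))  # 2
print(milestones_reached([row]))                         # []
```

A character-level measure like this is one common way to summarize how damaged a sentence is; the interim models mentioned above would be trained on the accumulated pairs of OCR and corrected text.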

This project is led by Patrick J. Burns, who has extensive research experience in developing Latin language-analysis pipelines (e.g. LatinCy) and Latin language models (e.g. Latin BERT). The work on Latin BERT suggested the effectiveness of using large amounts of Latin text (~561M tokens from Internet Archive texts) despite a significant amount of “noisy,” that is, OCR-corrupted, training data. The ultimate goal of having a functional, high-performing post-OCR correction model for Latin is to train the next generation of large Latin language models on increasing amounts of Latin text that contain correspondingly decreasing amounts of textual corruption.

Internship details

The internship is virtual: all meetings are handled via Zoom.

Interns or volunteers are expected to work independently for 4 to 8 hours a week (actual time will vary depending on a variety of factors), with one or two hour-long meetings each week with collaborators and the internship mentors.

The internship period is seven weeks long, with the earliest start date on Tuesday, May 27, 2025 and the closing date on Friday, July 11, 2025. Interns may participate for the entire internship period, but they must commit to at least four weeks.

ISAW and NYU cannot offer credit or certificates for internships or to volunteers, but can acknowledge the successful completion of the internship.

While volunteer contributors may join the project at any time, those under 18 need to be registered with NYU’s Office of Youth Programs Compliance in advance of the volunteer’s start date. Such unpaid volunteers can work for up to three months at a time, after which they need to renew the volunteer agreement.

All interns and volunteers are required to submit a final report, due by the last day of the internship or volunteer period.

Dates

May 1, 2025: Application
May 27, 2025: Earliest internship start date
July 11, 2025: End of internship period

Requirements

Three years of Latin or equivalent reading proficiency
There are no specific technical prerequisites

Those interested should be able to demonstrate solid intermediate-level Latin reading skills, as they will be necessary to complete the correction, translation, and syntactic annotation work. In our experience, success correlates directly with the contributor’s interest in, and ability to invest time in, learning new digital techniques, tools, and methods, as well as an openness to combining traditional reading techniques with novel digital approaches.