Interning in the ISAW Library: Developing an Error Correction Dataset for Latin Texts

By Patrick J. Burns
01/22/2026

Scanned text collections like those found in the Internet Archive and large language model repositories are a treasure trove of Latin texts, something which I have discussed in recent research presentations at U. Cincinnati and U. Aarhus and at this year’s Congress of the International Association of Neo-Latin Studies. That said, the quality of the texts can often be quite poor due to errors introduced during optical character recognition (OCR), that is, the computational “transcription” of scanned images to digitized text. What we see is something akin to a computational-scale version of scribal error: various textual defects introduced during the copying process become the responsibility of future readers to deal with and correct. Correcting such defects is a core activity of philologists, and of textual critics in particular.

James Zetzel has written that the role of the philologist is “reconstructing what [was] written rather than enshrining or embalming the errors.” We might then see the work of computational philologists as developing strategies to take on this act of reconstruction when the extent of the textual corruption is well beyond human scale. With this in mind, I started the Post-OCR Correction Using Latin Annotations (POCULA) project. Training computational models to correct scanning errors appears to be one way forward. Unfortunately, few models exist at present for our task, and there are also only a few post-OCR correction datasets for Latin upon which such a model could be trained. This summer I worked with Patrick Liu, a Latin student at the Dalton School, on developing such a dataset.

In the following blog post, Patrick reports on his summer internship with the ISAW Library, where he helped to develop a post-OCR correction dataset for Latin. [PJB]

Over the summer, I was fortunate enough to work on an internship at the Institute for the Study of the Ancient World Library under the supervision of Patrick J. Burns and David M. Ratzan, contributing to the research project Post-OCR Correction Using Latin Annotations (POCULA). This project is dedicated to post-OCR correction, that is, creating clean, annotated Latin textual data that can be used to train a language model capable of correcting OCR errors in digitized texts. My role was to carefully review and correct corrupted passages so that the model can learn from correct human input.

To do this, I worked line-by-line through paired text files: a scanned document of the book on one side and the raw OCR output on the other. My job was to identify every error, whether a missing letter, a corrupted ligature, or a scrambled word, and annotate the exact correction. As I progressed, the process of correction revealed fascinating patterns in the initial OCR of the Latin text. Certain OCR errors appeared again and again, most commonly the mix-up between the letters f (but really long s, or ſ) and s, the ampersand (&) in place of et, and the letter v instead of u. In some more extreme cases, entire words were garbled into nonsense. Still, in these cases, with careful comparison to the scanned document, I could reconstruct what the text was supposed to say. Over time I became more familiar with these mistakes, almost as if I had learned the dialect of OCR errors. Seeing how letters and words were so easily bent or altered made me further appreciate the fragility of digital transference.
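Recurring patterns like these can in principle be counted automatically once OCR lines have been paired with their corrections. The following is only a minimal sketch in Python, not the POCULA project's actual tooling: the function names `edit_pairs` and `tally_errors` and the sample lines are hypothetical, invented for illustration. It uses the standard library's `difflib.SequenceMatcher` to find the spans where an OCR line and its corrected version differ, then tallies them.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_pairs(ocr_line: str, corrected_line: str) -> list[tuple[str, str]]:
    """Return (ocr_fragment, corrected_fragment) for each span that differs."""
    matcher = SequenceMatcher(None, ocr_line, corrected_line, autojunk=False)
    return [
        (ocr_line[i1:i2], corrected_line[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replacements, insertions, deletions
    ]


def tally_errors(line_pairs) -> Counter:
    """Count how often each (ocr, corrected) substitution occurs."""
    counts = Counter()
    for ocr_line, corrected_line in line_pairs:
        counts.update(edit_pairs(ocr_line, corrected_line))
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, made up for this illustration
    # (not taken from the dataset).
    samples = [
        ("fed funt", "sed sunt"),  # long s misread as f
        ("vt", "ut"),              # v in place of u
    ]
    for (ocr, fixed), n in tally_errors(samples).most_common():
        print(f"{ocr!r} -> {fixed!r}: {n}")
```

A tally of this kind would only be a starting point. It treats every difference as a plain string substitution and knows nothing about Latin, so a human reader comparing against the scan remains essential.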

One of the most exciting parts of the project was the chance to read selections from Conrad Gessner’s Historia Animalium, a kind of Renaissance zoological encyclopedia. This is a Latin text that one would rarely encounter in a high school curriculum. Working with these pages gave me an opportunity to read about crabs (cancri), parrots (psittaci), and even mythical creatures like unicorns (monocerotes)—all fascinating reading with their vivid descriptions, finely worked woodcuts, and bursts of Ancient Greek embedded within the Latin.

Another aspect that struck me was the multilingual nature of Gessner’s encyclopedia. Latin is the primary language, but Greek words and references to Aristotle, Aelian, and others appear throughout the text. This reminded me of the interconnectedness of scholarship between the classical and Renaissance worlds, where classical languages (both Latin and Greek) were crucial to the intellectual conversation.

I am grateful not only for the technical skills I gained but also for the opportunity to contribute to the digital future of Latin. I am thankful to Patrick and David for their guidance, and I look forward to continuing work on this project and seeing how this language model develops as we annotate more and more data.—Patrick Liu