ISAW Library Internship Report: Developing a Linguistic Dataset of Latin Works by Women

By Patrick J. Burns
02/03/2025

In fall 2024, I launched a pilot data curation and annotation project called “Representing Women Authorship in the Latin Treebanks” (RWALT). While working on the training of the LatinCy natural language processing pipelines, I noticed that the datasets used as the basis for this activity, namely the Universal Dependencies Latin treebanks, contained no material that could be attributed to a woman. Substantial amounts of Caesar, Thomas Aquinas, Dante and many other authors appear among the 58,000 sentences of the treebanks, but no Sulpicia, no Hrotswitha, no Hildegard, nor any of the other women writers of Latin from the last two millennia. These women have received increasing attention in scholarship, as in the works of Jane Stevenson, as well as in pedagogy, as in the work of Skye Shirley and Lupercal, but have not yet received similar attention with respect to computational resources for Latin. What we see is, to use the words of Carolina Criado Perez, a “gender data gap” in these important computational-philological datasets. This fall, Head Librarian David Ratzan and I worked with Lily Hegener, a sophomore at the Hackley School in Tarrytown with interests in Latin and math, on our first RWALT student internship. Through weekly meetings, Lily read and translated Latin selections from Anna Maria van Schurman, Luisa Sigea, and Proba with us and—an important milestone for the project—contributed the first annotations for what will eventually become the RWALT treebank. In the following blog post, Lily recounts her fall internship experience with the ISAW Library. [PJB]

I have just finished my first semester working on an internship at the Institute for the Study of the Ancient World Library under the tutelage of David M. Ratzan and Patrick J. Burns, contributing to the research project “Representing Women Authorship in the Latin Treebanks.” I feel extremely lucky to be part of such a rewarding project that is bringing unheard voices of the past to light.

As a high school sophomore studying the AP Latin curriculum this year, I had only ever read Vergil’s Aeneid, Caesar’s Gallic War, Ovid’s Metamorphoses, and poems written by Catullus. Aside from these authors being some of the most recognized Latin writers, they also all happen to be male. I had yet to be introduced to works written by women authors of Latin for two main reasons: there were very few women writers of Latin in antiquity (and the work of even fewer has survived), and women writers from later periods have only rarely been studied, much less translated. These factors made access to their works difficult. When I was initially introduced to this project by David and Patrick, I was drawn to the idea of translating works that are less frequently read today, especially in high school courses, and making them more available to future Latin scholars. When they further explained that the end goal was to digitally tag these texts in order to make them easier for students to work with online, I felt that this would be a first step not only toward bringing more female representation to the study of the Latin language, but also toward giving these women the recognition that they deserve.

During this semester’s internship, I read and translated Luisa Sigea’s Epistula 16 [PJB: this edition was published just this last October by the Dickinson College Commentaries series] along with the praefatio to Proba’s Cento Vergilianus. I also collected metadata for both writers, that is, general information such as their dates and places of birth and specific information such as their Wikidata identifiers, in order to get a better sense of the contexts in which they wrote. With the help of my translations, I also collected linguistic data points, including the lemma, part-of-speech tag, and morphological tags, for roughly 750 words. For this work, I used a Latin text analyzer that Patrick created, which uses his LatinCy models. This analyzer can annotate words with good accuracy (for example, 95% for lemmata and 97.5% for POS tags); however, we are working to improve these figures by using my annotations from this project.
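As a rough illustration of how analyzer output can be checked against hand-corrected annotations, here is a minimal Python sketch. The `Token` class, the `field_accuracy` function, and the three-word example sentence are all hypothetical; none of them is drawn from the RWALT data or from LatinCy itself, and the deliberate lemma error is included only to show how a mismatch lowers the score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word: surface form, lemma, POS tag, morphological tags."""
    form: str
    lemma: str
    upos: str
    feats: str


def field_accuracy(predicted, gold, field):
    """Share of tokens whose `field` value matches the hand-corrected one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy for an empty sentence")
    matches = sum(getattr(p, field) == getattr(g, field)
                  for p, g in zip(predicted, gold))
    return matches / len(gold)


# Hypothetical hand-corrected annotations for "puella librum legit".
gold = [
    Token("puella", "puella", "NOUN", "Case=Nom|Gender=Fem|Number=Sing"),
    Token("librum", "liber", "NOUN", "Case=Acc|Gender=Masc|Number=Sing"),
    Token("legit", "lego", "VERB", "Mood=Ind|Number=Sing|Person=3"),
]

# Hypothetical analyzer output with one invented lemma error on "librum".
predicted = [
    Token("puella", "puella", "NOUN", "Case=Nom|Gender=Fem|Number=Sing"),
    Token("librum", "librum", "NOUN", "Case=Acc|Gender=Masc|Number=Sing"),
    Token("legit", "lego", "VERB", "Mood=Ind|Number=Sing|Person=3"),
]

for field in ("lemma", "upos", "feats"):
    print(f"{field}: {field_accuracy(predicted, gold, field):.2f}")
```

Scoring each field separately mirrors the way the figures above are reported, with one number for lemmata and another for POS tags.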

The work that I have done has been rewarding to me not only as a Latin student, but also as a person, because it has allowed me to contribute to work that I believe will have a lasting impact on the study of the Latin language. I want to thank Patrick and David for giving me this opportunity, and I am extremely excited to continue this work over the spring semester.