Trip Report: Tom Elliott on the "Named Entity Hackathon" at Tufts University

By Tom Elliott
03/20/2014

The workshop was billed as a "hackathon"; its goal was to bring together scholars and technologists for a week of hands-on digital humanities programming. Perseids is a project of the Perseus Digital Library aimed at developing a collaborative online environment in which users can edit, translate, and produce commentaries on a variety of ancient source documents. A key consideration in many types of commentaries is the identification and explication of "named entities", things like persons and places that appear as proper nouns in an ancient text or modern scholarly work. The Perseids team therefore invited several outside projects with complementary interests to come to Tufts and spend the week working on shareable datasets and software for handling named entities. The Duke Collaboratory for Classics Computing (DC3), which partners with both ISAW and Perseus on a variety of projects including Papyri.info, sent a contingent, as did the Pelagios Project, which has focused on techniques for cross-project sharing of geographic information and the computational parsing of ancient geographic texts.

I represented ISAW and our extramural partners with regard to several ongoing projects, including Pleiades (a digital gazetteer of the ancient world co-managed by ISAW and the Ancient World Mapping Center at the University of North Carolina at Chapel Hill), the Digital Latin Library (being organized by Samuel Huskey at the University of Oklahoma on behalf of three learned societies), the EpiDoc community (which develops guidelines and tools for encoding ancient documents), the Syriac Gazetteer (due for release this month as part of the Syriac Reference Portal project), and the Digital Corpus of Literary Papyrology (a joint project with the Institut für Papyrologie at Heidelberg University, the Trismegistos Project, DC3, and others).

Outputs of the hackathon are many and varied. Announcements of new tools and resources stemming from this meeting are expected from many of the participants in coming months, including several joint initiatives with ISAW. I am happy to be able to report on the first fruits here.

Thanks to the efforts of Maxim Romanov and his students at Tufts, Pleiades will see a significant increase in content related to the Islamic world. This content will take several forms, including new Arabic-language placename records for sites already cataloged in Pleiades, as well as many entirely new entries for sites important to the study of the Islamic and Persian socio-cultural spheres. We got a significant boost at the hackathon from the Pelagios and Perseids teams, who collaborated with us on software that will facilitate Maxim's efforts to capture and catalog data from multiple sources. Maxim has dubbed the collaboration "al-Thurayya" (Arabic for Pleiades). You can read more about it on his blog.

I also had the opportunity to work with Leif Isaksen of Pelagios and Ryan Baumann from DC3 to produce a new geographic dataset that makes Pleiades more broadly and immediately useful to other projects. In 2011, while working on a project called Google Ancient Places, Leif had prototyped a dataset called Pleiades Plus (or Pleiades+). The dataset was assembled by a computer program that attempted to match entries in the Pleiades gazetteer with entries in GeoNames.org, the largest open digital gazetteer in the world. The results, based on a combination of name similarities and distance measurements, had the effect of annotating Pleiades entries with additional modern names from GeoNames, thus making it easier for third-party projects to match up their contents with Pleiades. Pleiades Plus was subsequently used by many of the Pelagios Project partners (including the German Archaeological Institute and Perseus) for this purpose. Unfortunately, the original prototype code proved difficult to operate and maintain, so the original Pleiades Plus dataset remained static even as the content in Pleiades and GeoNames continued to improve and expand. While at Tufts, Ryan rewrote the code from scratch, and he has now set it up to run nightly in order to capture and align new changes in both Pleiades and GeoNames. The code is freely available online for inspection and reuse, and the results of the nightly runs are published on the Pleiades Downloads page.
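For readers curious about what matching on "name similarities and distance measurements" can look like in practice, here is a minimal Python sketch of the general idea. It is not Ryan's code or Leif's prototype, and it does not reflect how Pleiades Plus is actually computed. The record layout, the identifiers, the thresholds (`max_km`, `min_similarity`), and the choice of `difflib` for string comparison are all assumptions made for illustration, and the coordinates are approximate.

```python
import math
from difflib import SequenceMatcher


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_places(ancient, modern, max_km=10.0, min_similarity=0.8):
    """Pair records that are both close together and similarly named.

    The thresholds are arbitrary illustrative values, not those of any
    real alignment tool.
    """
    matches = []
    for a in ancient:
        for m in modern:
            # Distance filter: skip candidates that are too far apart.
            if haversine_km(a["lat"], a["lon"], m["lat"], m["lon"]) > max_km:
                continue
            # Name filter: take the best score over all name pairs.
            score = max(
                (name_similarity(an, mn) for an in a["names"] for mn in m["names"]),
                default=0.0,
            )
            if score >= min_similarity:
                matches.append((a["id"], m["id"], round(score, 2)))
    return matches


# Hypothetical records; the ids are placeholders, not real gazetteer identifiers.
ANCIENT = [{"id": "ancient-1", "names": ["Roma"], "lat": 41.89, "lon": 12.49}]
MODERN = [
    {"id": "modern-1", "names": ["Rome", "Roma"], "lat": 41.90, "lon": 12.48},
    {"id": "modern-2", "names": ["Roma"], "lat": 26.41, "lon": -99.02},  # a distant namesake
]

print(match_places(ANCIENT, MODERN))
```

In this toy example the distant namesake is dropped by the distance filter even though its name matches exactly, which is the reason for combining the two measures rather than relying on names alone.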

I am grateful to Bridget and the entire Perseus/Perseids team for the invitation and the experience of the Named Entity Hackathon, and to the Andrew W. Mellon Foundation, which partially funded the event. I look forward to continued collaboration with the rich and growing social/scholarly network that has emerged around linked data and named entity analysis for ancient studies.