Preservation for the (digital) ages

Developed incrementally as a personal research tool over a long period of time and eventually made public, the companion website and database for Speech Presentation in Homeric Epic has been accessed by more than 5,000 researchers, but faced the threat of obsolescence. Credit: Deborah Beck, The University of Texas at Austin

When Deborah Beck was preparing her book, Speech Presentation in Homeric Epic, her publisher suggested she make the database she had started in 2008—a searchable catalogue of features from every speech in the Homeric poems—available to the public as a web application and companion resource.

Since the application went live in 2013, more than 5,000 researchers have used it to parse the thousands of speeches found in the Iliad and the Odyssey and to explore different connections from those Beck investigated in her book.

"I get emails from people around the world expressing their appreciation for the database," said Beck, an associate professor of Classics at The University of Texas at Austin. "I heard in June from a student in Mexico who used the application to write his bachelor's thesis."

However, as new web and database capabilities became available, Beck found it increasingly challenging to update the application, which was built with technologies from the 2000s.

Perhaps more worryingly, as browsers changed and university web servers were retired, there was a chance that the database would eventually be lost to the sands of time.

"As a classicist, the very long-term accessibility of texts is a fundamental prerequisite of our entire discipline," Beck explained. "I can pick up a manuscript from 1,000 years ago and if I know how to read the handwriting, that resource is still available to me. However, I don't have the slightest idea what the availability of resources that are currently digital will look like in 100 years."

Papers she wrote as an undergraduate are inaccessible because the writing programs and file formats she used are now obsolete. "I don't want that to happen to projects that I'm connected to."

She asked for assistance from the University's General Libraries, who suggested she talk to researchers from the Texas Advanced Computing Center (TACC) with expertise in digital archiving and preservation. Together, they set about developing a new way to preserve digital humanities databases.

At the 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Beck, along with Weijia Xu, a research scientist at TACC; Maria Esteva, a digital archivist at TACC; and Yi-Hsuan Hsieh, a Ph.D. computer science student at UT Austin supported by TACC's Science & Technology Affiliates for Research (STAR) Scholars Program, presented a solution that preserves Beck's database of Homeric speeches, including the multivariate connections among the texts and the insights Beck developed over years of study.

"The value of research data not only resides in its content but in how it is made available to users," said Esteva. "Research data is often presented interactively through a web application, the design of which is often the result of years of work by researchers. Therefore, preserving the data and the application's functionalities becomes equally important."

The preservation strategy they developed allows scholars to re-launch the database application in a variety of environments—from individual computers, to virtual machines, to future web servers—without compromising its interactive features. It preserves the data separately from the interactive application, so scholars can reuse it in other technical and functional contexts.

The process exploits aspects of emulation and virtualization, techniques applied in business and technology, but goes beyond these approaches.

An overview of the preservation workflow developed by digital archivists at the Texas Advanced Computing Center and classicists and computer scientists from The University of Texas at Austin. Credit: Weijia Xu; Maria Esteva; Deborah Beck; Yi-Hsuan Hsieh, The University of Texas at Austin

It dissociates the web code from the data and re-deploys the entire application on different platforms, including virtual machines. The process has four stages:

1. extracting the data and application code;

2. identifying dependencies (where one object relies on a function of another object) and decoupling the application from the data;

3. redeploying the application and validating its results; and

4. distributing the application to end users.
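As a rough illustration of what decoupling the data from the application (stage 2) can look like, the sketch below exports every table of a database to plain CSV text, a format that does not depend on the original web code. It is not the team's tooling: the use of SQLite, the `speeches` table, its columns, and the two demo rows are all assumptions made up for the example.

```python
import csv
import io
import sqlite3


def export_tables(conn):
    """Return {table_name: csv_text} for every user table in the database."""
    exported = {}
    table_names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    for table in table_names:
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        # First row of the CSV holds the column names.
        writer.writerow([column[0] for column in cursor.description])
        writer.writerows(cursor)
        exported[table] = buffer.getvalue()
    return exported


def build_demo_database():
    """Create a tiny in-memory database (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE speeches (id INTEGER PRIMARY KEY, poem TEXT, speaker TEXT)")
    conn.executemany(
        "INSERT INTO speeches (poem, speaker) VALUES (?, ?)",
        [("Iliad", "Achilles"), ("Odyssey", "Penelope")])
    conn.commit()
    return conn


if __name__ == "__main__":
    for name, text in export_tables(build_demo_database()).items():
        print(f"--- {name}.csv ---")
        print(text, end="")
```

Once the data lives in a neutral format like this, it can be archived on its own and loaded into whatever database a future deployment of the application uses.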

Using this method, a researcher can reboot the application at a later date by starting up a virtual machine image that contains the fully functional application. This approach fits well with the evolving nature of digital preservation and with the requirements for data reuse, the researchers say.

For Beck, the project provided an avenue to preserve the research she had done over many years.

For Yi-Hsuan Hsieh, it presented an opportunity to apply the computer science principles she is learning in her graduate program to a mature project of value to the classics community.

Her main task on the project was to test a dependency detection algorithm that identifies the relations between the web code and the libraries required to redeploy and run the application.
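The article does not describe the algorithm itself, but the general idea of dependency detection can be sketched as follows: scan each source file for references to other files, follow those references from an entry point, and report which dependencies are present and which are missing. The reference patterns (HTML `script`/`link` tags and PHP-style `include`/`require`), the file names, and the function names below are all assumptions chosen for illustration, not the method the team tested.

```python
import re

# Hypothetical patterns for references found in web code.
REFERENCE_PATTERNS = [
    re.compile(r'<script[^>]*\bsrc=["\']([^"\']+)["\']', re.IGNORECASE),
    re.compile(r'<link[^>]*\bhref=["\']([^"\']+)["\']', re.IGNORECASE),
    re.compile(r'\b(?:include|require)(?:_once)?\s*\(?\s*["\']([^"\']+)["\']'),
]

# Made-up miniature application: path -> source text.
DEMO_FILES = {
    "index.php": '<?php require_once("db.php"); ?>\n'
                 '<script src="js/search.js"></script>\n'
                 '<link rel="stylesheet" href="style.css">',
    "db.php": '<?php include "config.php"; ?>',
    "config.php": "<?php $dsn = 'example'; ?>",
    "js/search.js": "console.log('search');",
}


def direct_dependencies(source):
    """Return the set of paths referenced directly in one source text."""
    found = set()
    for pattern in REFERENCE_PATTERNS:
        found.update(pattern.findall(source))
    return found


def collect_dependencies(entry_point, files):
    """Follow references from entry_point; return (resolved, missing) sets."""
    resolved, missing = set(), set()
    pending = [entry_point]
    while pending:
        current = pending.pop()
        for dependency in direct_dependencies(files[current]):
            if dependency in resolved or dependency in missing:
                continue  # already handled
            if dependency in files:
                resolved.add(dependency)
                pending.append(dependency)  # scan it for further references
            else:
                missing.add(dependency)
    return resolved, missing


if __name__ == "__main__":
    resolved, missing = collect_dependencies("index.php", DEMO_FILES)
    print("resolved:", sorted(resolved))
    print("missing:", sorted(missing))
```

In this toy example the missing `style.css` is exactly the kind of gap that would need to be closed before the application could be redeployed elsewhere.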

"It was exciting to gather experts' ideas from different fields," Hsieh said. "Dr. Beck gave us the motivation to preserve humanity digital projects. Dr. Esteva provided the requirements and goals of digital preservation and Dr. Xu gave ideas about automating the process of identifying dependencies from the web code to significantly reduce human efforts in preserving a web application."

The team is currently working on further automating the stages of dependency detection to make the strategy generalizable for other projects and hosting environments.

As with any digital preservation method, one must still monitor and update the project occasionally. However, the risk of incompatibility is lower: because the application code and the data are preserved separately, migration to new web technologies or hosting services can be carried out from them at any point in a project's lifecycle.

"I come at this project from the perspective of long-term preservation, but the main thing that I came to understand over the course of the work is that having an interactive, accessible digital component to your research means that it reaches more people and it reaches them in different ways," Beck said. "That to me is really important and having a preservation strategy in place that makes it achievable over a longer period of time and with a wider variety of users is critical."


More information: Weijia Xu et al, A Portable Strategy for Preserving Web Applications Functionality, 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL) (2017). DOI: 10.1109/JCDL.2017.7991578