Leveraging Distributed Proofreaders for Open Content Enhancement (and some unexpected consequences)

Oct 24, 2016

For many years, Villanova University’s Digital Library has been providing open access to out-of-copyright books, journals, newspapers, and other content. From the beginning, this content was provided in the form of scanned images of the original materials, sometimes augmented by computer-generated OCR. While this approach provides an informative view of the historical artifacts and allows some degree of searchability, there are some barriers to discovery and consumption of the content.

Early in 2012, the Digital Library team began to work with the Distributed Proofreaders project,which uses crowdsourcing to proofread and format OCRed text into Project Gutenberg eBooks. While a slow and labor-intensive process, the Distributed Proofreaders workflow results in documents that fulfill some of the needs that scans and OCR alone cannot: it is fully searchable, it is conveniently and comfortably readable on a wide range of platforms, and it can be easily reused and reformatted for any number of purposes.

Since beginning the Distributed Proofreaders experiment, more than 100 texts from the Villanova University Digital Library have been converted into Project Gutenberg eBooks. Most of these projects where shepherded through by Villanova staff in one form or another, but some were initiated and completed entirely through the efforts of Distributed Proofreaders volunteers. While the work involved in producing these editions necessarily means that only a small percentage of the total collection can receive the Distributed Proofreaders/Project Gutenberg treatment, this has proven to be an excellent way to draw attention to key titles and to inform the wider public about the availability of our digital collections.

An interesting and unintended side effect of releasing texts into Project Gutenberg was a flurry of commercial activity around the public domain texts. Nearly every title we release quickly appears on Amazon and in other venues, sometimes in a variety of editions, sometimes with amusingly incongruous cover artwork. Many of the1se products seem to be programmatically generated with the intent of exploiting unsuspecting consumers – probably through a process akin to that which has led to the sale of bundles of Wikipedia articles in some markets – but a few show greater care or offer added value. Some of our titles have been translated into other languages; some have received annotated, critical editions. For better or worse, it is unlikely that any of these things would have happened if we had continued to make all of our content available only as raw images and OCR.

PICTURED ABOVE: the original cover of ‘They Looked and Loved’ from Villanova’s Digital Library

PICTURED BELOW: a copy posted on Amazon with a bizarre new cover

If you are interested in replicating this experiment with your own content, it could be as simple as reaching out to the Distributed Proofreaders community and registering your library as a content provider. A bit of community outreach could then inspire some of the existing project managers to select your content and begin feeding it into Project Gutenberg. Realistically, though, you will get more work done if someone on your staff takes some initiative. This requires a significant investment of time, since a new Project Manager first needs to learn some basic skills by proofreading and formatting hundreds of pages and then participating in a mentoring process. However, if you have an existing DP participant on your staff (as was the case for Villanova University), it is a resource that you can take advantage of to everyone’s benefit.

Blog articles about all of Villanova’s Project Gutenberg releases can be found at https://blog.library.villanova.edu/tag/project-gutenberg/. Demian Katz, the coordinator of Villanova’s Distributed Proofreaders efforts, can be reached at demian.katz@villanova.edu and is always happy to answer questions about the process.