e-Research opens up nineteenth-century texts
ENGAGE funds HiTHeR project
‘An examination of Mr Hume’s Objection to Miracles’: one small excerpt from the 430,000 articles held in the Nineteenth-Century Serials Edition.
The Monthly Repository of Theology and General Literature might not immediately spring to mind when one thinks of e-Research. Yet, through the Engage-funded HiTHeR project (High Throughput Humanities e-Research), the latest e-Research techniques are being applied to help scholars find materials within the Nineteenth-Century Serials Edition (NCSE). The NCSE is a free, online scholarly edition of nineteenth-century periodicals and newspapers. The corpus contains about 430,000 articles that originally appeared in roughly 3,500 issues of six different periodicals.
Currently, the corpus is explored using a keyword classification derived through a combination of manual and automated techniques. Full-text searches are useful for 'known-item' searches, while searching metadata (e.g. bibliographic descriptions and keyword assignments) improves both the precision and the recall of the returned items. Because adding metadata to material in digital archives requires a great deal of human effort and skill, the Centre for Computing in the Humanities (CCH) has investigated computational linguistics techniques for extracting keywords from full-text content. One goal is to create a ‘semantic view’ that will allow users of the resource to find information more intuitively. Automated methods that could help create such a view require processing power that is currently not available to CCH researchers.
Gerhard Brey, Lead Analyst and Research Fellow at the CCH, has implemented a simple document-similarity index that finds articles with similar content. The program uses the LingPipe software to calculate similarity measures based on the intermediate n-gram shapes produced during the digitisation of the texts. A test based on 1,350 articles, requiring a total of 910,575 separate pairwise comparisons, took two days to execute on a desktop computer.
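The figure of 910,575 comparisons is simply the number of unordered pairs among 1,350 articles, n(n-1)/2, which is why the cost grows so quickly with the size of the corpus. The sketch below illustrates the idea in plain Python. It does not use LingPipe and is not the CCH implementation: the choice of character trigrams, the cosine measure, and all function names are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def ngram_profile(text, n=3):
    """Count overlapping character n-grams in lower-cased, whitespace-normalised text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity of two n-gram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_count(n_docs):
    """Number of unordered document pairs: n * (n - 1) / 2."""
    return n_docs * (n_docs - 1) // 2


def all_pairs_similarity(docs, n=3):
    """Compare every document with every other one (quadratic in len(docs))."""
    profiles = {doc_id: ngram_profile(text, n) for doc_id, text in docs.items()}
    return {
        (a, b): cosine_similarity(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    # Hypothetical snippets standing in for digitised articles.
    docs = {
        "a1": "An examination of Mr Hume's objection to miracles",
        "a2": "A reply to the objection of Mr Hume concerning miracles",
        "a3": "Prices of corn and cattle at the London markets",
    }
    for pair, score in all_pairs_similarity(docs).items():
        print(pair, round(score, 3))
    print(pair_count(1350))     # 910575, the size of the test run
    print(pair_count(430000))   # 92449785000 pairs for the whole corpus
```

Because every pair is independent of every other, the work divides naturally into batches that can run on separate machines, which is what makes the problem a good fit for grid-style high-throughput computing.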
At this rate, it would take a single machine 1,000 years to run a complete set of comparisons for the corpus! The HiTHeR project aims to use document-similarity processing as an opportunity to start building the e-Infrastructure required to support advanced research in the (digital) humanities, which will tie in with campus grids at King’s College London (KCL) and the National Grid Service.
The support from the Engage project will help KCL to develop a software solution for a genuine problem in humanities research, which can then be re-purposed for similar projects. As a result, more humanities researchers should become aware of the potential advantages of e-Infrastructure for e-Research, and it will become easier for everyone to find out more about the responses to the writings of, say, David Hume.
The NCSE is a collaboration between Birkbeck, University of London; King's College London; the British Library; and Olive Software. It was funded from January 2005 to December 2007 by the Arts and Humanities Research Council.
The JISC-funded Engage project promotes greater engagement of UK academic researchers with the UK's e-Infrastructure. Through a series of interviews with researchers, it has identified a number of development projects that will involve the creation and deployment of a small set of domain-focused systems. These systems will enable a new set of academics to use e-Infrastructure in pursuit of their research results.
Neil Chue Hong
This article originally appeared in the September 2008 release of the OMII-UK Newsletter.


