According to the British Library, the average life expectancy of a Web site is between 44 and 75 days, and every six months 10% of .uk Web pages vanish or are replaced by new material.
"With so much material now published online, and considering the growing influence of the Internet on British culture and society, the Web is now a key part of the nation's memory," said Margaret Hodge, the U.K.'s Minister of Culture and Tourism, in a statement. "A failure to record and preserve the UK domain would not just be detrimental to future research but leave a significant gap in our digital heritage."
The .uk Internet domain currently consists of about 8 million Web pages and is expected to reach 11 million by 2011. The British Library has 10 people manually archiving the 5 terabytes of U.K. Web page data.
IBM's contribution to the archiving project, BigSheets, is built atop the Apache Hadoop framework, a system for distributed data processing inspired by Google's MapReduce and Google File System, and developed in recent years by Yahoo and others.
"We think of these as big worksheets," said Rod Smith, VP of emerging Internet technologies at IBM, who stresses that the project goes beyond archiving. "You'd like to be more valuable to people than just an archive. In the British Library's case, you'd like to be known as the accurate holder of historical information."
BigSheets will allow British Library researchers, and eventually library patrons, to access Web archive data, conduct queries and visualize the results in forms such as a tag cloud or pie chart.
It's about ways to explore and sift data, says Smith.
Smith says it's still too early in the project's evolution to determine whether BigSheets will be adopted by other archiving organizations, such as the Internet Archive.