Digital preservation is hard when older content can fall through cracks

At The Atlantic, Meredith Broussard was inspired by an earlier Atlantic article on the difficulty of preserving online news content to look at how hard it is to preserve the specific news content that article discussed. (David Rothman covered the earlier article for TeleRead last month.) Given that the content in question was a news series on a 46-year-old event that had itself been forgotten, the whole issue seems to partake of several layers of self-reference. But that only serves to underscore its severity.

I was just starting out at college when the World Wide Web entered use, and I still remember the days when web pages consisted of simple HTML. (Indeed, my own fossilized ’90s-vintage homepage dates directly back to my first Lynx bookmarks file, which I simply hand-edited into an index.html file.) But compared to the modern web, that’s effectively the equivalent of stone tablets. Modern web sites are governed by one or more content management systems (CMSes), which weave together multiple sets of source files to create slick-looking sites that catch the eye.

The problem is that the complicated nature of this process makes archiving extremely difficult. Many news sites change their CMSes for new ones over time, which can change the site’s entire URL schema in one fell swoop—and suddenly extant links to older stories no longer work. Sometimes even the sites themselves are no longer able to find them. Broussard points out that the Internet Archive’s Wayback Machine does a great job in preserving older content, but it doesn’t have the kind of indexing and search function needed to make it useful if you don’t know exactly where or when the article you want was published.

Anyone who’s ever had to search for older stories on the web has learned how this works. Many a time I’ve found an older story by googling it, clicked a link that no longer works, and then copied the link’s URL over to the Wayback Machine to see if it works there. But even that only works if there are still enough links to the older content out there to give you a handy waymarker.
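For the curious, that copy-the-dead-link routine can be roughly sketched in Python against the Wayback Machine's public availability endpoint. This is only a minimal sketch, not how the Archive recommends doing it: the function names are my own invention for illustration, the example URL is hypothetical, and the response layout is assumed to be the `archived_snapshots` / `closest` shape the endpoint has returned.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public availability endpoint of the Wayback Machine.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(dead_url, timestamp=None):
    """Build the lookup URL for a link that no longer resolves.

    timestamp is optional (YYYYMMDD); the service looks for the snapshot
    closest to that date.
    """
    params = {"url": dead_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Pull the archived URL out of a parsed response, or return None.

    Assumes the response layout described in the lead-in.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_archived_copy(dead_url, timestamp=None):
    """Ask the Wayback Machine (over the network) for an archived copy."""
    with urlopen(availability_query(dead_url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `find_archived_copy("http://example.com/old-story", "20060101")` would return the address of the nearest saved snapshot if one exists, or `None` if it doesn't. Note that this still only helps when you already have the dead URL in hand, which is exactly the limitation described above.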

And this is a problem that we’re no strangers to here at TeleRead. The very earliest articles in our present database go back to 2002, leading me to misestimate how old TeleRead was in an article I wrote about another early e-book site. As David pointed out in response, earlier versions are available in the Internet Archive, but how would I have known that without going to look? And if I wanted to find some specific post from that era, how could I do so, since that version of the site no longer even exists to site-search?

Apart from that, some things got shuffled around or lost in our move to NAPCO’s servers and back. In researching older TeleRead articles for backlinks in current stories, I still run across situations where I know I wrote some article about a particular subject, but a Google search on a keyword that should find it doesn’t pull it up, and I have to use WordPress’s internal post search function to locate it instead. Even in cases where older articles can be found, the images that used to go with them are often missing without a trace. If that sort of problem can affect as relatively simple a site as TeleRead, it’s easy to see how badly it could hit far more complex news sites, especially ones with decades of history to keep track of. And who knows how it will affect TeleRead if we should make any other site changes in the future?

This is an important issue for future historians and researchers. After all, our culture is by and large a digital one now, and many of our most important day-to-day news sources don’t even have print versions to archive anymore. If we lose track of that aspect of our culture, how will we get it back?

One bright spot is that, unless blocked by robots.txt files, the data is still being stored on the Internet Archive in some form. That means that if the Archive is eventually able to implement better search and indexing, historians will get far more benefit out of all those saved pages. But that may be small consolation for people who want to dig up ancient Internet history now.

Editor’s note – republished with permission of the author – first published on TeleRead.
