After Three Years of Increases, a Link Rot “Plateau” in 2011?
Link Rot and Top-Level Domains: New Patterns Emerge
Link Rot and the Digital Archive Today
A New Look at Link Rot and Top-Level Domains
The Chesapeake Digital Preservation Group has completed its fourth annual investigation of link rot among the original URLs for online law and policy-related materials archived through the group’s efforts.
Originally launched as a Web-preservation pilot project in 2007, the Chesapeake Group is today part of the Legal Information Archive. Group participants include two academic law libraries, the Georgetown Law and Harvard Law School Libraries, and the State Law Libraries of Maryland and Virginia.
The Chesapeake Group focuses primarily on the preservation of Web-published legal materials, which often disappear as Web site content is rearranged or deleted over time. In the four years since the program began, the Chesapeake Group has built a digital archive collection comprising more than 7,400 digital items and 3,200 titles, all of which were originally posted to the Web.
For this study, the term “link rot” is used to describe a URL that no longer provides direct access to files matching the content originally harvested from the URL and currently preserved in the Chesapeake Group’s digital archive. In some instances, a 404 or “not found” message indicates link rot at a URL. In other cases, the URL may direct to a site hosted by the original publishing organization or entity, but the specific resource has been removed or relocated from the original or previous URL.
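The definition above can be expressed as a small check. This is a minimal sketch only: the report does not say how URLs were tested, so the function names `classify` and `check_url`, the use of a SHA-256 digest of the archived file, and the 10-second timeout are all assumptions made for illustration. A byte-for-byte digest comparison is also stricter than the study's "matching the content" wording, so a real check would likely need human review of mismatches.

```python
import hashlib
import urllib.error
import urllib.request
from typing import Optional


def classify(status: int, body: Optional[bytes], archived_sha256: str) -> str:
    """Label one URL check as 'working' or 'link rot' (hypothetical helper)."""
    # A 404 or any other non-200 response means the resource is not directly accessible.
    if status != 200 or body is None:
        return "link rot"
    # The page loads, but the file may have been removed or replaced.
    # Assumption: the archived copy is identified by its SHA-256 digest.
    digest = hashlib.sha256(body).hexdigest()
    return "working" if digest == archived_sha256 else "link rot"


def check_url(url: str, archived_sha256: str, timeout: float = 10.0) -> str:
    """Fetch a URL and classify the result (requires network access)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status, resp.read(), archived_sha256)
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses such as 404.
        return classify(err.code, None, archived_sha256)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar errors.
        return classify(0, None, archived_sha256)
```

Keeping `classify` separate from the network call lets the decision rule be tested without fetching anything.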
All of the Web resources described in this report that have disappeared from their original locations on the Web remain accessible via permanent archive URLs here at legalinfoarchive.org, thanks to the Chesapeake Group’s efforts.
Although link rot continued to increase through 2010 and early 2011, the rise in lost content was much less dramatic in comparison to previous years.
The Chesapeake Group conducted its first link rot assessment at the program’s one-year mark in 2008. During the program’s first year, 1,266 online titles were harvested and preserved within the digital archive. A random sample of 579 titles from the archive was generated for the link rot study, ensuring results at a 95 percent confidence level and confidence interval of +/- 3. When this sample was first analyzed in March 2008, link rot was found to be present in 48 of 579 URLs, or 8.3 percent.
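The sample size of 579 is consistent with a standard calculation, sketched below. The report does not state which formula was used; this assumes Cochran's formula with a finite population correction, a z-score of 1.96 for 95 percent confidence, a margin of 0.03, and the conservative proportion p = 0.5. The function name `sample_size` is hypothetical.

```python
def sample_size(population: int, margin: float = 0.03,
                z: float = 1.96, p: float = 0.5) -> int:
    """Sample size for estimating a proportion in a finite population."""
    # Cochran's formula for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction shrinks the sample for small archives.
    n = n0 / (1 + (n0 - 1) / population)
    # Assumption: round to the nearest whole title (some calculators round up).
    return round(n)


print(sample_size(1266))  # 579
```

Under these assumptions the result matches the 579 titles sampled from the 1,266 harvested in the first year.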
One year later, in 2009, the sample was analyzed a second time as part of the program’s second-year evaluation. The second analysis demonstrated that link rot was present in 83 out of the original sample of 579 URLs. In other words, 14.3 percent of the archived titles had disappeared from their original URLs within 12 to 24 months of harvest.
By March 2010, the prevalence of link rot had increased to 160 out of 579 URLs. Within two to three years of harvest, link rot among the sample URLs had increased to 27.9 percent, compared to 14.3 percent in 2009 and 8.3 percent in 2008.
The current March 2011 analysis shows that 176 URLs have succumbed to link rot within a period of 12 to 48 months. This means that 30.4 percent, or nearly one-third, of the archived titles have disappeared from their original URLs. Although this figure is significant, it represents an increase of only 2.5 percentage points in URLs lost to link rot within the past year.
Whereas the prevalence of link rot among URLs in the sample nearly doubled every year during the first three years of the study, it slowed significantly in the fourth year.
The ratio of URLs with link rot to working URLs as of 2008, 2009, 2010, and 2011 is illustrated in the figures below.
More than 90 percent of the top-level domains in the sample were state-government (state.[state code].us), organization (.org), and government (.gov) URLs, representing approximately 41 percent, 32 percent, and 17 percent of the sample, respectively. Other top-level domains, which accounted for approximately 7 percent of the sample, combined, were .edu, .com, and .net, which respectively represented 2.9, 2.2, and 1.9 percent of the sample. Less than 3 percent of the sample was represented by a combination of .mil, .us, .info, .uk, .au, .ca, and .int top-level domains. The sample also included one IP address.
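Grouping URLs into these categories can be automated. The sketch below is illustrative only: the report does not describe how URLs were categorized, the function name `tld_category` and the regular expression are assumptions, and the example URLs in the usage are made up.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Assumption: state-government hosts end in "state.<two-letter code>.us".
STATE_PATTERN = re.compile(r"(^|\.)state\.[a-z]{2}\.us$")


def tld_category(url: str) -> str:
    """Return the category label for a URL (hypothetical helper)."""
    host = (urlparse(url).hostname or "").lower()
    # Bare IP addresses have no top-level domain at all.
    try:
        ipaddress.ip_address(host)
        return "[IP address]"
    except ValueError:
        pass
    # State-government hosts are counted separately from plain .us hosts.
    if STATE_PATTERN.search(host):
        return ".state.__.us"
    # Otherwise use the last dot-separated label, e.g. ".org" or ".gov".
    if "." in host:
        return "." + host.rsplit(".", 1)[-1]
    return host


print(tld_category("http://www.example.state.md.us/report.pdf"))  # .state.__.us
print(tld_category("http://www.example.org/policy.pdf"))          # .org
```

Checking the state pattern before the generic rule matters, because a state-government host would otherwise be counted as a plain .us URL.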
In the original 2008 analysis, link rot was present in 10.8 percent of URLs with state top-level domains, 10 percent of URLs with government top-level domains, and 8.3 percent of URLs with organization top-level domains. Education (.edu) and commercial (.com) URLs were found to have relatively high inactivity levels of 11.8 and 15.4 percent in 2008, respectively.
In 2009, the prevalence of link rot increased among URLs with state, government, organization, education, network (.net), military (.mil), and information-oriented (.info) top-level domains. Link rot among URLs with education top-level domains increased significantly in 2009, to 35.3 percent from 11.8 percent in 2008, while no increase in link rot among commercial URLs was observed.
The 2010 analysis of the sample showed link rot to be present in more than 32 percent, nearly one-third, of the URLs with a state-government top-level domain. Link rot was found in more than 22 percent of URLs with an organization top-level domain and in 25 percent of government URLs. Commercial and network URLs both experienced a jump in link rot, to nearly 31 percent among .com domains and to more than 27 percent among .net domains. The single IP address and one of the two .uk URLs in the sample also succumbed to link rot in 2010.
New and interesting patterns among top-level domains emerged in 2011. While .org and .gov URLs continued to demonstrate an increase in link rot, link rot among state government and academic URLs actually began to reverse.
Four state government URLs that were inaccessible in 2010 were once again accessible when re-checked for the 2011 analysis. (It is worth noting that all four of these URLs are from the same domain.) Likewise, three .edu URLs that were observed to have link rot in 2009 and 2010 had become accessible in 2011. (Again, two of these three URLs were from the same institution.) With these new trends among top-level domains, link rot among organization URLs has surpassed that of state government URLs for the first time since the sample analyses began in 2008.
A list of all top-level domains found in the sample, along with the link rot detected in 2008, 2009, 2010, and 2011, is available in the table below.
| Top-Level Domain | Total in Sample | Link Rot 2008 | Link Rot 2009 | Link Rot 2010 | Link Rot 2011 |
| --- | --- | --- | --- | --- | --- |
| .state.__.us | 240 | 26 (10.8%) | 38 (15.8%) | 77 (32.1%) | 73 (30.4%) |
| .org | 184 | 7 (8.3%) | 21 (11.4%) | 41 (22.3%) | 57 (31%) |
| .gov | 100 | 10 (10%) | 13 (13%) | 25 (25%) | 31 (31%) |
| .edu | 17 | 2 (11.8%) | 6 (35.3%) | 6 (35.3%) | 3 (17.6%) |
| .com | 13 | 2 (15.4%) | 2 (15.4%) | 4 (30.8%) | 4 (30.8%) |
| .net | 11 | 0 | 1 (9.1%) | 3 (27.3%) | 3 (27.3%) |
| .mil | 3 | 0 | 1 (33.3%) | 1 (33.3%) | 1 (33.3%) |
| .info | 2 | 1 (50%) | 1 (50%) | 1 (50%) | 2 (100%) |
| .uk | 2 | 0 | 0 | 1 (50%) | 1 (50%) |
| [IP address] | 1 | 0 | 0 | 1 (100%) | 1 (100%) |
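The 2011 percentages and the reversals among state-government and .edu URLs can be recomputed from the counts in the table. The following is a small sketch: the counts for the six largest categories are copied from the 2010 and 2011 columns, while the variable and function names are illustrative.

```python
# (total in sample, link rot 2010, link rot 2011), copied from the table above.
counts = {
    ".state.__.us": (240, 77, 73),
    ".org": (184, 41, 57),
    ".gov": (100, 25, 31),
    ".edu": (17, 6, 3),
    ".com": (13, 4, 4),
    ".net": (11, 3, 3),
}


def rate(rotted: int, total: int) -> float:
    """Link rot as a percentage of the category, to one decimal place."""
    return round(100 * rotted / total, 1)


# 2011 link rot rate for each top-level domain category.
rates_2011 = {tld: rate(r11, total) for tld, (total, _, r11) in counts.items()}

# Categories where fewer URLs had link rot in 2011 than in 2010.
reversed_tlds = [tld for tld, (_, r10, r11) in counts.items() if r11 < r10]

print(rates_2011)
print(reversed_tlds)  # ['.state.__.us', '.edu']
```

With these counts, the .org rate (57 of 184) comes out slightly above the state-government rate (73 of 240), which is the crossover described in the text.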
For the present analysis, a new, separate sample was generated representing all of the content in the archive at the time of the program’s fourth anniversary. In the four years since the program began, 3,246 born-digital online titles were harvested from the Web and preserved within the digital archive. A random sample of 803 titles was selected for the link rot study, ensuring results at a 95 percent confidence level and confidence interval of +/- 3.
Out of these 803 titles, link rot was found to be present in 157 URLs. In other words, 19.6 percent of the original URLs for titles harvested and archived over the previous four years had succumbed to link rot by March 2011. The ratio of working URLs to those with link rot is illustrated below.
In 2011, the number of titles in the archive with URLs from organization (.org) top-level domains surpassed those from state government (state.[state code].us) domains for the first time. Roughly 85 percent of the top-level domains in the sample were state-government, organization, and government (.gov) URLs, which represented 27.9 percent, 36.1 percent, and 20.8 percent of the sample, respectively. Of these three top-level domains, link rot was present in 25.4 percent of URLs with state top-level domains, 15.5 percent of URLs with organization top-level domains, and 19.2 percent of URLs with government top-level domains.
URLs with .com, .edu, and .us top-level domains were found to have inactivity levels of 16.7, 17.5, and 13.3 percent, respectively, while .net URLs, which represented a smaller portion of the sample, were found to have a higher inactivity rate of 30 percent. A list of all top-level domains found in the 2011 sample, along with their inactivity rates, is available in the table below.
| Top-Level Domain | Total in Sample (2007-2011) | Link Rot Frequency 2011 |
| --- | --- | --- |
| .state.__.us | 224 | 57 (25.4%) |
First published by the Chesapeake Digital Preservation Group