link rot

New Link Rot report from Chesapeake

For the past five years, the Georgetown Law Library and the Chesapeake Digital Preservation Group have been conducting studies on "link rot." This year, they discovered that "link rot has increased to 37.7 percent within five years."

The Chesapeake group gathers information from the web and preserves it for their users. Each year they study how many of the URLs from which they originally gathered information "no longer provide access to the content that was originally selected, captured, and archived by the Chesapeake Group."

This study is particularly relevant to government information specialists because more than 90% of their sample URLs were from the state government (state.[state code].us), organization (.org), and government (.gov) top-level domains.

For "dot-gov" domains (URLs ending in ".gov") the studies have shown cumulative link rot of:

10% in 2008
13% in 2009
25% in 2010
31% in 2011
36% in 2012

Cumulative link rot of state government URLs (.state.__.us) was almost as bad: 10.8% in 2008, 15.8% in 2009, 32.1% in 2010, 30.4% in 2011, and 33.8% in 2012.

The total cumulative link rot for all URLs was 37.7% in 2012. Another way of looking at this is that, of the documents the Chesapeake Project has preserved, only 62.3% were still available at their original URL as of the 2012 study.
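The arithmetic behind these figures is simple enough to sketch. The snippet below is only an illustration, not anything published by the Chesapeake Group: the function and variable names are hypothetical, and the only data used are the .gov percentages and the 37.7% total quoted above.

```python
# Cumulative .gov link rot percentages as quoted in this post.
GOV_CUMULATIVE = {2008: 10.0, 2009: 13.0, 2010: 25.0, 2011: 31.0, 2012: 36.0}


def surviving_share(cumulative_rot_percent):
    """Percentage of URLs still working, given cumulative link rot."""
    return round(100.0 - cumulative_rot_percent, 1)


def yearly_increase(cumulative_by_year):
    """Percentage points of new link rot added in each year after the first."""
    years = sorted(cumulative_by_year)
    return {
        later: round(cumulative_by_year[later] - cumulative_by_year[earlier], 1)
        for earlier, later in zip(years, years[1:])
    }


if __name__ == "__main__":
    print(surviving_share(37.7))            # share of all URLs still live in 2012
    print(yearly_increase(GOV_CUMULATIVE))  # new .gov rot per year, in points
```

Reading the cumulative series as yearly differences makes it easier to see in which years the loss was concentrated.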

This year's report includes two samples of URLs. The first sample includes 579 URLs that Chesapeake captured during 2007 and 2008. They use this sample to examine how link rot changes over time.

The second sample is new and represents the full content of the Chesapeake archive at the time the study was conducted. Using this second, broader sample, the study reports a link rot rate of 25.9%.

For libraries that rely on pointing to URLs rather than preserving information in their own digital libraries, the new report from the Chesapeake Project provides sobering, factual data on the reliability of that strategy.

Links to USGS Publications Changing

Richard Huffine, Director of USGS Libraries Program, announced on govdoc-l this week that direct links to USGS publications will be changing by September 1, 2011.

  • Direct Links to USGS Publications Changing by September 1, 2011, Richard L Huffine, Discussion of Government Document Issues, (11 Aug 2011).

    The U.S. Geological Survey's Publications Warehouse (pubs.usgs.gov) will complete a process to migrate all of its' on-line publications into Portable Document Format (PDF) files by September 1, 2011. At that time, the USGS will no longer support the previous DJVU format for its on-line publications. Libraries and Web site managers should link to the publications citation page for USGS publications. At sometime after September 1, 2011, direct links to DJVU files will stop working and there will be no automatic redirect to the PDF version of those materials.

    A direct link to a USGS DVJU file currently looks like:
    http://pubs.er.usgs.gov/djvu/B/bull_1967.djvu
    Once loaded in PDF, individual publications will have a link like this:
    http://pubs.usgs.gov/bul/1967/report.pdf

    However, the preferred link to this publication is:
    http://pubs.er.usgs.gov/publication/b1967

    The citation link is the preferred link because it may include links to plates, maps, appendices, etc. as well as links to the USGS Store to purchase paper copies if they are available. This migration has been sought by members of the research community for some time. The DJVU format offered many benefits at a time when bandwidth was a challenge. The PDF format offers a consistent format for both historical and current publications and it allows users to download and use information from USGS publications in the same way that they use research journal articles and other scientific research products.

    Over 70% of all USGS-published reports are available in an on-line format from the USGS Publications Warehouse. The system currently includes citations to over 100,000 research articles, reports, and products produced by the USGS over the last 130 years. The system also offers an RSS feed to keep users of earth and natural science research informed about the products of the USGS.

    Richard Huffine, Director
    USGS Libraries Program

From Link Rot to Web Sanctuary

Here is an interesting story about preserving British government information.

Bernard M. Scaife, Technical Services Librarian at the University of London Institute of Education, writes about dealing with the broken links in their catalogue. Finding that ten percent of the links to external resources in their bibliographic records referred to documents which no longer existed and that many of those were official publications from government departments, he started looking for a way to eradicate their link rot problem. Since they already had Eprints software running on campus, they decided to use it:

It occurred to us that this software could enable us to eradicate our link rot problem, whilst building in a core level of digital preservation and increasing the discoverability of these documents. We were convinced that a citation which linked to a record in a Web archive was far more likely to survive than one which did not.

They knew that government budget cuts were increasing the risk of losing content from government departments. The article describes their experiences and summarizes what they learned:

  • Placing files in a repository gives digital preservation to key documents in the subject field and eradicates the link rot problem.
  • Adding high-quality metadata enhances the resource and allows it to hold its head high and become an integral part of a library's collection.
  • A specialist library can play an important role in preserving domain-specific government content as part of its long-term strategy and ensure high-quality resources remain available.
  • Provided you are prepared to get to grips with its complexity, the EPrints software is well suited to the task and provides good interoperability with other legacy systems for importing metadata.
  • The added value of being able to search the full text provides a potentially very rich resource for data mining whether by current or future researchers of educational history.

Sometimes even the live links are dead (or languishing)

Readers of FGI are well acquainted with link rot, where internet links break over time.

Today I'd like to talk about something more subtle with no obvious way to detect the problem.

On the Alaska page of the State Agency Databases Across the Fifty States project, I had a link to APOC InfoQuick, a database of disclosure information for public officials and lobbyists from the Alaska Public Offices Commission. Today I visited the link at https://webapp.state.ak.us/apoc/index.jsp and chose the "lobbyist reporting" menu item because I thought it would be fun to list BP lobbyists in a personal blog entry I was drafting.

The lobbyist reporting section had a Search Lobbyist Registrations link. I clicked on it, searched for BP and got some listings. But only from 2007, the first year that Sarah Palin was Governor.

Searches in other parts of the lobbyist reporting system confirmed that NO information was available after 2007. I started to wonder if I'd missed the session law that repealed lobbying reporting requirements.

Then I noticed that the URL started with "webapp" and thought that it might be good to see if this database was still linked from the APOC home page.

It wasn't. Now they had a link called "search reports" at http://doa.alaska.gov/apoc/SearchReports/index.html. The page features two reporting systems for public officials: an "interim reporting system" for reports filed 2010 and later, and "searchable campaign reporting," which is the public official/candidate portion of APOC InfoQuick. This explains why APOC InfoQuick wasn't taken off the live web.

Current information on lobbyists in Alaska is still available, just not database searchable. You can access various PDF lobbyist reports from 2005 forward at http://doa.alaska.gov/apoc/TrainingReports/lobbyist.html.

I have no information on why lobbyist information is no longer database searchable and speculating why would take me out of my comfort zone of not discussing policy choices made by the level of government I work for.

The main point I'm making is that most librarians and other information specialists are pretty comfortable with link checking and fixing broken links when we find them. But what can we do when a site remains on the web but has stopped being updated? Especially when there's no note on the old site about the change?

Pinboard report on link rot

A new report on link rot on the blog of the social bookmarking service Pinboard:

Using a random sample of 300 URLs stored in Pinboard for each year 1997-2011 (based on the year the bookmark was created), Ceglowski says "you can expect to lose about a quarter of them every seven years."

Ceglowski also shows year by year lists and the raw data.

In my quick look: of the 47 .gov web sites checked over the entire period, 7 were not found (status code 404) and 7 had internal server errors (status code 500), for a 29.8% link rot rate (14 of 47).
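A tally like this can be reproduced with a few lines of Python. What follows is a minimal sketch, not Pinboard's method: the function names are hypothetical, the rule that any status of 400 or above (or no response at all) counts as rot is an assumption, and the 33 working sites are assumed to have returned status 200.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return None


def is_rotten(status):
    """Assumed rule: no response, or any 4xx/5xx status, counts as link rot."""
    return status is None or status >= 400


def link_rot_rate(statuses):
    """Percentage of the given status codes that count as link rot."""
    statuses = list(statuses)
    if not statuses:
        raise ValueError("need at least one status code")
    rotten = sum(1 for status in statuses if is_rotten(status))
    return 100.0 * rotten / len(statuses)


if __name__ == "__main__":
    # The .gov tally above: 7 not found, 7 server errors, 33 assumed fine.
    sample = [404] * 7 + [500] * 7 + [200] * 33
    print(round(link_rot_rate(sample), 1))
```

Fourteen failures out of 47 works out to about 29.8% when rounded to one decimal place, or 29.7% if the figure is truncated instead. Note that a check like this only sees status codes, so it cannot tell whether a page that still loads has the content it used to.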

Hat tip to INFOdocket!

New Link Rot Report

For libraries that rely on pointing to URLs rather than preserving information in their own digital libraries, the new report from the Chesapeake Project provides sobering, factual data on the reliability of that strategy.

In an examination of "link rot" the project found that 30.4% of URLs examined no longer provide access to their original information.

This study is particularly relevant to government information specialists because more than 90% of their sample URLs were from the state government (state.[state code].us), organization (.org), and government (.gov) top-level domains.

The Chesapeake Project Legal Information Archive, which harvests and preserves relevant digital legal information from the web, has been producing reports on "link rot" for several years. They define link rot as "a URL that no longer provides direct access to files matching the content originally harvested from the URL and currently preserved in the Chesapeake Group's digital archive."

Their new report is now available:

In one interesting finding, the report says that the rate of loss of information slowed in the last year: "Whereas the prevalence of link rot among URLs in the sample nearly doubled every year during the first three years of the study, it slowed significantly in the fourth year." The report makes clear that although 30.4 percent, or nearly one-third, of the archived titles have disappeared from their original URLs since the beginning of the program in 2007, only 2.5 percent of URLs were lost to link rot within the past year.

Their data show that cumulative link rot frequency for .gov files was 10% in 2008, 13% in 2009, 25% in 2010, and 31% in 2011. There was an interesting development in that some state-level URLs that were inaccessible in 2010 were once again accessible when re-checked for the 2011 analysis. The cumulative link rot frequency for state level URLs was still almost as high as for the federal URLs: 10.8% in 2008, 15.8% in 2009, 32.1% in 2010, and 30.4% in 2011. Even with that slight improvement at the state level, the overall cumulative link rot percentage rose in 2011 (30.4%) over 2010 (27.9%). Another way of looking at this is that of the documents the Chesapeake Project has preserved, only 69.6% were still available at their original URL as of the 2011 study.

In an earlier study, the authors qualified their findings, noting that the findings are "not meant to be broadly applicable or to provide a representation of link rot throughout the universe of web resources" but only reports on those items in the Chesapeake Project archive. The studies do provide "insight into the vulnerability of law- and policy-related web resources selected by experienced law librarians from seemingly stable open-access web sites hosted by reputable organizations and state and federal governments."

Significantly, "All of the Web resources described in this report that have disappeared from their original locations on the Web remain accessible via permanent archive URLs here at legalinfoarchive.org, thanks to the Chesapeake Group's efforts."

2011 Report on Link Rot

How reliable are those URLs in your OPAC? The Chesapeake Project Legal Information Archive, which harvests and preserves relevant digital information from the web, has been producing reports on "link rot" for several years. They define link rot as "a URL that no longer provides direct access to files matching the content originally harvested from the URL and currently preserved in the Chesapeake Project's digital archive."

Their new report is now available:

This study is particularly relevant to government information specialists because more than 90% of their sample URLs were from the state government (state.[state code].us), organization (.org), and government (.gov) top-level domains.

Their data show that link rot frequency for .gov files was 10% in 2008, 13% in 2009, and 25% in 2010. State-level URL link rot was even worse: 10.8% in 2008, 15.8% in 2009, and 32.1% in 2010.

The authors qualify their findings, noting that the study is "not meant to be broadly applicable or to provide a representation of link rot throughout the universe of web resources" but only reports on those items in the Chesapeake Project archive. It also says, however, that the study provides "insight into the vulnerability of law- and policy-related web resources selected by experienced law librarians from seemingly stable open-access web sites hosted by reputable organizations and state and federal governments."

Significantly, "none of the content analyzed in this study has been truly lost; all of the content has been preserved in a digital archive" at the Chesapeake Project.

New report on link rot

The Chesapeake Project Legal Information Archive has released the results of its third annual analysis of link rot among the original URLs for law- and policy-related materials published to the Web and archived through the Chesapeake Project.

The 2010 analysis reveals that nearly 28 percent of the online publications archived between March 2007 and March 2008 have now disappeared from their original locations on the Web but, due to the project’s preservation efforts, remain accessible via permanent archive URLs....

These findings demonstrate a dramatic increase in link rot among archived Web content over time.

The Chesapeake Project was designed to preserve born-digital legal information published directly to the Web. It was launched in early 2007 by the Georgetown Law Library and the State Law Libraries of Maryland and Virginia.

The Chesapeake Project has expanded by adding a new law library to the project. It has also become the model for the Legal Information Preservation Alliance (LIPA), which has announced the formation of its Legal Information Archive, a collaborative digital preservation program for the law library community.

See also, an earlier report: Recent report on Link Rot.

Recent report on Link Rot

A recent report evaluating a two-year web harvesting project found that 14.3 percent of the original URLs of all titles harvested from the Web and archived during the first year of project had become inactive within at least one year of harvesting. (p.33)

The report is an evaluation of the Legal Information Archive of The Chesapeake Project, which was designed to preserve born-digital legal information published directly to the Web. The project was implemented in early 2007 by the Georgetown Law Library and the State Law Libraries of Maryland and Virginia.

The report notes that more than 95 percent of the titles in the sample were PDF files. Of these titles, 8.2 percent were found to have inactive original URLs in 2008 and 14.1 percent in 2009. (p.35)

Ten percent of government (.gov) URLs became inactive in the first year and an additional three percent became inactive in the second year. (p.34)

The report concludes:

More than 4,300 digital items, representing nearly 1,900 titles, have been harvested from the Web and archived, and roughly 14 percent of these titles have already been removed from their original locations on the Web, demonstrating the importance and effectiveness of the project’s efforts. Moreover, the project’s access figures demonstrate both the broad, international reach of the project’s efforts, as well as the successful selection of high-interest and high-use materials by project participants.
