The State of the Federal Web Report issued in late 2011 noted that Federal agencies planned to eliminate or merge several hundred domains as part of the President's Campaign to Cut Waste. The goal was to reduce outdated, redundant, and inactive domains. As part of this work, the .gov Task Force overseeing the process asked members of the National Digital Stewardship Alliance (NDSA) to archive and preserve all .gov Executive branch domains slated to be decommissioned or merged. NDSA members immediately agreed that an important step in this process was to preserve the content of these sites as part of our national digital heritage, rather than simply eliminating them.
Rather than start a separate, standalone project, we chose to launch a collaborative crawl under the auspices of the End of Term Web Archive project (EOT). Although the EOT project has primarily focused on transitions occurring at the end of administrative terms, part of its goal is to document changes in all online presences of the US Federal government during key periods of transition, regardless of when or under what circumstances they occur. A comprehensive harvest, using a targeted list of domains supplied by the .gov Task Force and a general list of all Executive branch domains downloaded from data.gov, began on Saturday, October 8, 2011. The crawl concluded on November 5, 2011, and encompassed 46,278,384 captures and ~13 TB of compressed data.
Here's a general outline of the sequence of events of the Fall 2011 crawl:
- Agencies identified recommended actions for domains in their Interim Progress Reports and Web Inventory
- The .gov Task Force collected a list of outgoing .gov domains and shared those with the NDSA
- Internet Archive crawled outgoing sites and the full suite of Executive branch domains (note: for some resources it took several weeks to crawl sites in their entirety)
- GSA eliminated domains after they were archived
The End of Term Web Archive project, including the archival capture of Executive Branch domains in Fall 2011, is not meant in any way to satisfy agency records management obligations. The domains are archived solely for the purpose of preservation and posterity. Agencies separately discuss records management obligations and handle those processes independently. However, we do make every effort to replicate resources in their entirety, or at least as much as can be supported by available tools, techniques and best practices. Some portion of every web site is housed server-side, and that subset of content and/or user experience cannot be archived and replicated using traditional web crawler/capture software that depends on files being downloaded to the client.
The biggest challenge of this project, however, was not Web 2.0/Web 3.0 server-side rendering or content serving. The biggest limiting factor was time. When we archive resources, there is a big difference between visiting and sampling a web resource using a set of scoping rules and guidelines and attempting to “drain” a site, i.e. replicate it soup to nuts as fast as the server can respond to our requests. Some of these resources house thousands to tens of thousands of PDF files, videos and/or other network-intensive resources. Most servers are also programmed to meter how fast they respond to requests from the same IP address or IP address range, so we have to wait appropriate intervals between requests in order to avoid being ignored or blacklisted by an automated process. There are ways to parallelize capture, but without dedicated funding, few institutions are able to marshal those kinds of resources on a volunteer basis.
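To illustrate why waiting between requests makes time the limiting factor, here is a minimal sketch of per-host politeness scheduling. It is not the crawler configuration used for this crawl: the `PoliteScheduler` class, the `plan_crawl` helper, the 5-second delay, and the `.example` hostnames are all assumptions made up for illustration, and the sketch only simulates request times rather than fetching anything.

```python
from urllib.parse import urlsplit


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent.

    Hypothetical helper for illustration; real crawlers apply richer rules.
    """

    def __init__(self, min_delay_seconds):
        self.min_delay = min_delay_seconds  # assumed fixed delay per host
        self.next_allowed = {}  # host -> earliest permitted request time

    def wait_time(self, url, now):
        """Return how long to wait before requesting url, and reserve the slot."""
        host = urlsplit(url).hostname
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start - now


def plan_crawl(urls, min_delay_seconds):
    """Simulate a sequential crawl and return (url, start_time) pairs."""
    scheduler = PoliteScheduler(min_delay_seconds)
    clock = 0.0
    plan = []
    for url in urls:
        clock += scheduler.wait_time(url, clock)  # idle until the host allows it
        plan.append((url, clock))
    return plan


if __name__ == "__main__":
    urls = [
        "http://agency-a.example/report1.pdf",
        "http://agency-a.example/report2.pdf",
        "http://agency-b.example/index.html",
        "http://agency-a.example/report3.pdf",
    ]
    for url, start in plan_crawl(urls, min_delay_seconds=5.0):
        print(f"{start:6.1f}s  {url}")
```

Under these assumptions, a single host holding 10,000 files and a 5-second delay would take roughly 14 hours of waiting alone, before any download time is counted. Crawling many hosts in parallel hides some of that delay, but each individual site can still only be drained at the pace its server tolerates.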
The End of Term project is built on the collaborative best efforts of a network of partners who share a passion for preservation of online government.
For more information on the End of Term Web Archive project, please visit http://eotarchive.cdlib.org, and follow us @eotarchive.
Kris Carpenter Negulescu
Director, Web Group
Here is an interesting organization that you might not be aware of: the Digital Government Society of North America.
The Digital Government Society of North America (DGSNA) is a global multi-disciplinary organization of scholars, researchers, educators, students, government professionals, and practitioners who are interested in the development and impact of digital government or e-government. DGSNA focuses on creating a support network of individuals interested in the linkages among the democratic process, government management, innovation, information, and technology.
Benefits of membership include:
- Opportunity to exchange knowledge and information with other members in North America and throughout the world
- (In development) access to a membership database to find others who share your interests or have special expertise
- Discounted registration fees for our annual conference
- Subscription to a monthly e-newsletter, dgOnline
- Access to a library of over 2,000 articles and papers
- Discounted access to scholarly and professional journals
To join, you are required to pay a membership fee. However, DGSNA does provide some excellent resources for free, including a nice collection of references and a very useful library of citations (2,000+ peer-reviewed articles).
* Although this is not intended to be a plug, in the interest of full disclosure, I am a current member of DGSNA.
The Center for Digital Government has posted survey results of the most tech-savvy counties, cities, and states in 2008. They are:
In addition, the Center also declared winners of the Best of the Web and 2008 Digital Government Achievement Awards (DGAA). “Best of the Web recognizes the most innovative, user-friendly state and local government portals while the Digital Government Achievement Award recognizes outstanding agency and department Web sites and applications.”