Web harvesting
It's Not Your Grandfather's Web Any Longer
Submitted by jajacobs on Thu, 2013-04-04 17:59.
David Rosenthal gave another fascinating talk about the state of the web and whether or not we can expect to preserve it by harvesting it. This talk was at the 2013 Spring CNI Membership Meeting in San Antonio, TX. David presents an edited text of his talk with links to the sources on his blog:
- Talk at Spring 2013 CNI, David Rosenthal, DSHR's Blog (April 4, 2013).
David and co-presenter Kris Carpenter Negulescu note, among other things, that the days of a document-centered web are long over and that today, what most web pages do "is download and run programs in the current Web's primary language, Javascript. Javascript is a programming language, not a document description language. Your browser is only incidentally a document rendering engine, its primary function is as a virtual machine."
This presents problems for those wishing to preserve information. Among these problems:
- Database driven features & functions
- Complex/variable URI formats & inconsistent/variable link implementations
- Dynamically generated, ever changing, URIs
- Rich Media
- Scripted, incremental display & page loading mechanisms
- Scripted, HTML forms
- Multi-sourced, embedded material
- Dynamic login/auth services: captchas, cross-site/social authentication, & user-sensitive embeds
- Alternate display based on user agent or other parameters
- Exclusions by convention
- Exclusions by design
- Server side scripts & remote procedure calls
- HTML5 "web sockets"
- Mobile publishing
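To make the "scripted, incremental display" problem a little more concrete, here is a minimal Python sketch of what a harvester sees when it parses HTML without running any scripts. It is our illustration, not something from the talk: the sample page, the `LinkCollector` class, and the `static_links` function are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page: one ordinary link, plus one link that only
# exists after a browser runs the embedded script.
SAMPLE_PAGE = """
<html><body>
<a href="/reports/2012.pdf">2012 report</a>
<div id="more"></div>
<script>
  var a = document.createElement("a");
  a.href = "/reports/" + (2012 + 1) + ".pdf";
  document.getElementById("more").appendChild(a);
</script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags present in the static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def static_links(html_text):
    """Return the links visible to a parser that never runs scripts."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


print(static_links(SAMPLE_PAGE))  # ['/reports/2012.pdf']
```

The link to `/reports/2013.pdf` is only ever assembled inside the browser, so a harvester that reads the HTML as a document never learns it exists.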
For more about these problems, see also: IIPC Future of the Web Workshop -- Introduction & Overview, International Internet Preservation Consortium (May 17, 2012).
Read David's complete post for a rich discussion of the issues.
Can we rely on trying to 'harvest' the web? part 2
Submitted by jajacobs on Sun, 2012-06-03 08:52.
Recently, we posted here a link to David Rosenthal's list of problems we have with harvesting and preserving the Web.
Here is more on the same topic.
- IIPC Future of the Web Workshop - Introduction & Overview (May 17, 2012)
It is a 22-page PDF that presents in some detail an overview of challenges to capturing web content. It was presented at The Future Web workshop, which was held in May as part of the 2012 International Internet Preservation Consortium General Assembly meeting (IIPC GA) hosted by the Library of Congress. The purpose of the paper was to provide a shared context for participants.
The problems:
- Database driven features and functions
- Complex/variable URI formats and inconsistent/variable link implementations
- Dynamically generated, ever changing, URIs
- Rich Media
- Scripted, incremental display and page loading mechanisms
- Scripted, HTML forms
- Multi-sourced, embedded material
- Dynamic login/auth services: captchas, cross-site/social authentication, & user-sensitive embeds
- Alternate display based on user agent or other parameters
- Exclusions by convention
- Exclusions by design
- Server side scripts & remote procedure calls
- HTML5 "web sockets"
- Mobile publishing
The paper also lists "Current Mitigation Strategies" but, as Rosenthal pointed out, all of these are aimed at capturing a "user experience" -- and our ability to meet even that goal is limited:
But the clear message from the workshop is that the old goal of preserving the user experience of the Web is no longer possible. The best we can aim for is to preserve a user experience, and even that may in many cases be out of reach.
A different question libraries should be asking is, How can libraries capture the content behind the user experience? The presentation is important, but, even more important is the raw data that sites use to provide those experiences. This kind of information used to be instantiated in books and magazines and maps and pamphlets and newspapers.
Today that "raw data" is stored in databases, XML files, GIS applications, and other data stores.
Web harvesting can do little more than capture a snapshot of how that information was presented at a given time in the past by a particular information provider.
Libraries should be capturing those raw data sources. By doing that, libraries will ensure that current and future users of libraries will be able to actually use, analyze, and mine the data in new and interesting ways. Seeing how a user in the past might have seen a web page at a particular point in time will be of interest to some cultural historians and is therefore certainly important. But it is only a very small part of what future users will expect from their libraries.
As the report says, in passing, "the classical model of web archiving is no longer sufficient for capturing preserving, and re-rendering all the bytes of interest we care about."
There's a quick overview of the workshop and lots more links here:
- Harvesting and Preserving the Future Web: Content Capture Challenges, by Nicholas Taylor, The Signal (June 1st, 2012).
Can we rely on trying to 'harvest' the web?
Submitted by jajacobs on Wed, 2012-05-09 06:17.
Dr. David S.H. Rosenthal, who is Chief Scientist at LOCKSS, and Kris Carpenter Negulescu of the Internet Archive recently organized a workshop on the problems of harvesting and preserving the Web as it evolves from a collection of linked HTML documents to a programming environment whose primary language is Javascript.
David and Kris, with help from staff at the Internet Archive, put together a list of 13 areas that are already causing problems for Web preservation:
- Database driven features
- Complex/variable URI formats
- Dynamically generated URIs
- Rich, streamed media
- Incremental display mechanisms
- Form-filling
- Multi-sourced, embedded content
- Dynamic login, user-sensitive embeds
- User agent adaptation
- Exclusions (robots.txt, user-agent, ...)
- Exclusion by design
- Server-side scripts, RPCs
- HTML5
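As an illustration of the "exclusions" item, here is a minimal Python sketch of how a well-behaved harvester might consult a site's robots.txt before fetching anything. It uses the standard library's `urllib.robotparser`. The robots.txt content, the agent names, and the `agency.example` URLs are made up for the example, and the helper names `build_parser` and `allowed` are ours.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary agency site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /databases/

User-agent: examplebot
Disallow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, user_agent, url):
    """True if the robots.txt rules permit this agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
for agent in ("archive-crawler", "examplebot"):
    for path in ("/reports/index.html", "/databases/species"):
        url = "http://agency.example" + path
        print(agent, path, allowed(parser, agent, url))
```

In this made-up case a polite crawler can collect the report pages but not the database area, and a crawler named in its own exclusion rule gets nothing at all. The file is only a convention, but it is one that archives may choose to honor, and that leaves gaps in what gets preserved.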
Read more about this on David's blog:
- Harvesting and Preserving the Future Web, by David Rosenthal, DSHR's Blog (May 7, 2012).
NBII goes dark. Libraries do what they do: harvest and preserve it for future access #opendata
Submitted by jrjacobs on Sun, 2012-01-15 12:45.
Many of us in the government documents world woke up to 2012 with the following message posted on the Web site of the National Biological Information Infrastructure (NBII) and distributed around to various library listservs:
In the 2012 President's Budget Request, the National Biological Information Infrastructure (NBII) is terminated. As a result, all resources, databases, tools, and applications within this web site will be removed on January 15, 2012.
NBII has been a critical program since 1994 (See Bill Clinton's Executive Order 12906 which created the "National Spatial Data Infrastructure" ("NSDI")). NBII was set up to coordinate a broad array of information at the federal level about biodiversity and ecosystems.
Todd Carpenter, director of the National Information Standards Organization (NISO), put it nicely and succinctly when he tweeted:
What is particularly sad about NBII shutting down is it's precisely the thing we need MORE of not less=>trusted data repositories #opendata
Well, have no fear: the Library of Congress, the Internet Archive, and Stanford Libraries have each harvested the NBII Web site separately -- Stanford harvested it twice between January 5 and January 13, 2012, for its Fugitive US Agencies collection.
Archiving .Gov: Your Help Requested!
Submitted by starr on Mon, 2009-01-19 21:18.
As the inauguration ceremony begins tomorrow, we can be assured that the Library of Congress and other partners in the End of Term Harvest project have captured much of the Bush administration's online presence. Many of these websites will be re-captured at later dates, providing an interesting look at how these websites will change over time, through different administrations.
On a related note, there will undoubtedly be changes in the coming days, weeks, months, that will eliminate some government agencies. We are trying to archive as many of these "dead" websites as possible in the CyberCemetery, to preserve them in their final form.
Please, if you know of a website that is disappearing, email or call me. I'm keeping my eyes and ears open, but there is a lot of content out there, and I welcome your help. After all, this information is for all of us!
Thanks, and I wish you all joy as we witness history tomorrow.
Harvesting .gov
Submitted by jajacobs on Tue, 2008-10-28 17:20.
Harvest time, by William Jackson, GCN, 10/27/08.
A nice article about the end-of-administration web harvest.
See also: Library Partnership Saves Government Sites.
We Want YOU... To Help With the Dot Gov Harvest!
Submitted by starr on Tue, 2008-09-09 15:10.
Hi to all you FGI readers! I'm thrilled to be this month's guest blogger.
As we all watch this historic presidential election unfold, there's another question going on in the back of our minds--how much of this online government information is going to change with the new administration, regardless of who's sworn in next January? As someone who works specifically with digital government collections, and whose primary job is capturing defunct government websites, this is of particular interest.
Most of you already know about the "Dot Gov Crawl" project that's been organized to address this issue. The project partners include the Library of Congress, the Internet Archive, the California Digital Library (CDL), the University of North Texas (UNT), and the U.S. Government Printing Office (GPO). We're working collaboratively to harvest and preserve government websites (primarily .gov and .mil domains), to form a snapshot of digital government information at the end of the current presidential administration.
The Internet Archive will be performing the comprehensive crawl, and Library of Congress is focusing on congressional materials. CDL and UNT will be performing in-depth harvests of specific government websites, gathering documents linked deep within the websites that may not be gathered in the Internet Archive crawl.
I encourage you to participate in the project. Communicate with the partner institution closest to you, and let them know if there are specific websites (or portions of websites) that are of particular interest to you.
At UNT, we're trying to focus on documents that support our regional interests, things that might be overlooked in the kinds of sweeping national topics that will be handled by the Internet Archive. We're requesting that librarians for the central United States send us things that you want captured--websites you use often, publications deep within websites that might not be captured in large crawls, topics of regional interest. Your requests will help us identify and prioritize the information that is preserved for future generations.
Please, submit any suggestions you have in the comments section below--I'll be monitoring them and adding them to our list. Thanks for your input!
EPA Tagging Results and Future Directions
Submitted by dcornwall on Wed, 2008-05-07 18:57.
Back in January we asked people to use del.icio.us to tag a sample of 32 documents taken from the 100 EPA documents posted by the Government Printing Office (GPO) to http://www.gpoaccess.gov/harvesting/index.html.
We asked people to tag documents from 1/18/2008 through 4/18/2008. A spreadsheet of the results is available at http://spreadsheets.google.com/pub?key=pybymZBlZ80PVat2ggty2GA.
This brief article informally discusses some of our results, offers some lessons learned, and offers suggestions for future projects. Finally, a short list of articles on other research relating to tagging is presented.
1) Findings
- Number of tagged documents - 31
- Average number of people tagging a given document - 2.5
- Highest number of taggers for a document - 8, for the document "Environmental Results Under EPA Assistance Agreements"
- Average number of deduplicated tags per document - 11.25
- Number of documents with descriptions - 31, with a majority of documents having more than one human generated description.
2) Some Promising Results
While we would have liked to have seen more participation (see below under "study limitations"), these initial results are somewhat positive. There is some interest in tagging. Tagged documents tended to receive meaningful descriptions beyond what a brief bibliographic record would provide. For example, for the document "Air Sealing: Building Envelope Improvements", we have the following descriptions from five users:
* Mount Desert Spring Water was able to win a bid to provide bottled water and water coolers to the University of Maine. Mount Desert Spring Water was successful because the water coolers it provided were energy efficient and the lowest cost to the Universi - samchap
* Describes the benefits of proper air sealing for homes. EPA awards the EnergyStar when legal minimum standards are exceeded. - mkvs
* Conserving energy in your house by having it sealed correctly - bookswoman
* "Air sealing the building envelope is one of the most critical features of an energy efficient home." "25-40% of energy" "ENERGY STAR qualified homes, constructed to exceed [building] codes with air sealing, can offer a better quality product." - keyvowel
* This Energy Star news release describes ways homeowners can reduce home heating and cooling costs by implementing air sealing techniques. - tadamich
Without question, the first description is problematic, but the other four descriptions are in agreement about what this document is about AND provide more relevant information than a brief bibliographic record.
For the most part, the tags we got were also meaningful and descriptive. Staying with the document "Air Sealing", we have the following tags:
Air, air-sealing, airsealing, building-insulation, efficient, energy, energy-efficiency, Energy-Star-Branding, energyconservation, energystar, epa, EPA-advertising, globalwarming, greenhousegases, home-building, home-building-techniques, home-construction, home-improvement, homes, hvac, indoor, leakage, money-saving, quality, sealing, ventilation
Contrast that with a brief bibliographic record that simply has title, agency, and URL. How would people know that this document is part of the EnergyStar initiative, or that it was related to home building or energy efficiency? In this instance, and in a number of other project documents, tagging clearly added value.
3) Limitations of current study
Our promising results were limited by three factors, the most important of which was the lack of participation. We estimate that about ten people participated in our tagging project. The available research on tagging is pretty firm in stating that good social tagging requires many users. Some say 100 or so is good, others suggest higher numbers. Our numbers are clearly too low. There were also too many instances (12) in which a document was tagged by a single user. This could greatly bias how a document gets tagged. Consider if the only description of "Air Sealing" had been the mistaken one about water coolers. That would have been worse than useless. But even in this instance, a user pulling up this document while searching for water coolers could have provided a more accurate description.
The low number of taggers also made it difficult to see how much tag agreement existed among the various taggers.
Another problem was self-inflicted. We forgot to instruct people on tag construction. These were our original instructions:
1) Visit http://www.archive.org/search.php?query=epapilotproject and go to a document on the list. Open the pdf file in a separate browser window.
2) In del.icio.us, tag the page for the Internet Archive record (i.e. not the PDF file) after examining the PDF file.
3) In the del.icio.us "notes" field, write a one or two sentence description of what the document is about.
4) In the tags field, please use epapilotproject, for:freegovinfo and then any tags that you feel describe this document.
del.icio.us uses a space-separated tag system. In other words, a space begins a new tag. So tagging something as "air quality" results in the two tags "air" and "quality" and not the more helpful tag "air quality". This resulted in some of the tagging becoming meaningless. If we had asked people to put dots or dashes in multiple-word tags, we would have gotten more meaningful tags. We still got some useful tags because some of our taggers were used to the del.icio.us system, but we shouldn't have assumed that everyone tagging would know how to construct multiword tags in del.icio.us. On the other hand, this problem might have been less noticeable if we had more taggers per document.
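For readers who have not used a space-separated tag field, here is a minimal Python sketch of the behavior described above. It only imitates the splitting rule; it is not del.icio.us code, and the function names `split_tags` and `join_words` are hypothetical.

```python
def split_tags(tag_field):
    """Split a tag field the way a space-separated tagging system would."""
    return tag_field.split()


def join_words(phrase, marker="-"):
    """Turn a multi-word phrase into a single tag using a marker."""
    return marker.join(phrase.split())


# Typed as-is, the phrase "air quality" becomes two unrelated tags.
print(split_tags("epapilotproject air quality"))
# ['epapilotproject', 'air', 'quality']

# With a dash between the words, the phrase survives as one tag.
print(split_tags("epapilotproject " + join_words("air quality")))
# ['epapilotproject', 'air-quality']
```

One line of instruction along the lines of "join the words of a multi-word tag with a dash" would likely have been enough to avoid the problem.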
Our final problem is one we think could be avoided in future projects: people tagging different files with the same document title. We asked people to bookmark the Internet Archive page for a given document, which has a link to the PDF file. We specifically asked people NOT to tag the PDF file because del.icio.us doesn't populate the title field of bookmarked PDFs. But one person in our project consistently bookmarked a document's PDF file instead of the Internet Archive page, and this separated that person's tagging from everyone else's and made it more difficult to compile tagging info for every document.
4) What next? Some suggestions
Our findings indicate that tagging does have potential to add value to web harvested documents that do not receive full cataloging, but for this benefit to be fully realized, there must be more taggers. When we realized we didn't have the number of taggers we wanted, we headed for the literature and found some articles listed below under "References Consulted." They offer some interesting guidance for other document tagging efforts.
While all of the papers below talked about user motivation, I think Tim Spalding said it best in a post titled "When tags work and when they don't: Amazon and LibraryThing":
"Something is going on here—something with broad implications for tagging, classification and "Web 2.0" commerce. There are a couple of lessons, but the most important is this: Tagging works well when people tag "their" stuff, but it fails when they're asked to do it to "someone else's" stuff. You can't get your customers to organize your products, unless you give them a very good incentive. We all make our beds, but nobody volunteers to fluff pillows at the local Sheraton."
The EPA documents are sort of like fluffing pillows at the local Sheraton, to me at least. My primary interest isn't environmental documents and EPA documents are not a major component of my library's depository collection. In addition our particular sample was unintentionally heavy on flyers, applications, and brochures. It could be that another agency's documents, say NASA or DoD might get more attention.
There's another angle too. In my anecdotal experience, librarians don't see web stuff as theirs, so they don't spend much processing time on it. Or if they are concerned about web documents, perhaps their administration is not. So how could we make them owners and think of web harvested materials as "their stuff" so they'll make their "documents beds"? A few suggestions follow:
1) For the EPA documents, GPO could partner with libraries that do have a strong environmental collection. Perhaps candidate libraries could be determined through item selection analysis.
2) GPO might wish to consider doing a depository survey to see what agency depositories would most like to see web-harvested. The survey could include a question asking libraries if they would tag if the desired content was harvested.
There wouldn't have to be a commitment to tag every document, but to tag some of the documents.
While GPO should continue with web harvesting no matter what, we wouldn't blame them for not moving forward with a documents tagging initiative if the depository community failed to register interest in such a project.
3) If GPO re-harvests EPA or moves on to another agency, it should consider setting up RSS feeds for newly harvested documents. Subject specialists from inside and outside the library community could take part in tagging. Again, GPO would need to start with some broadly popular agencies to have a chance of recruiting a significant number of taggers.
4) If GPO or another organization does a large scale tagging project, significant thought should go into tagging conventions. Not the vocabulary itself -- research seems to show that once an item reaches 100 tags or so, the proportion of tags stays constant. That is to say that agreed upon terms appear to predominate over idiosyncratic or spam tags (See Golder and Huberman below for details). What needs to be spelled out is how multi-word tags should be constructed -- is it air-quality, air.quality, or air_quality? They all mean the same thing, but del.icio.us and other tagging services treat them as different tags. A consistent new-word marker, or a choice of tagging site that supports spaces inside tags, would make any tagging project go more smoothly.
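If a project ends up with mixed conventions anyway, the variants can be folded together afterwards. Here is a minimal Python sketch of that kind of cleanup. The function names and the choice of a dash as the canonical marker are our assumptions, not a del.icio.us feature.

```python
import re
from collections import Counter


def canonical_tag(tag, marker="-"):
    """Lower-case a tag and rewrite runs of '.', '_' or '-' as one marker."""
    return re.sub(r"[._-]+", marker, tag.strip().lower()).strip(marker)


def count_canonical(tags):
    """Count tags after folding spelling variants together."""
    return Counter(canonical_tag(t) for t in tags)


sample = ["air-quality", "air.quality", "air_quality", "Air-Quality", "epa"]
print(count_canonical(sample))
# Counter({'air-quality': 4, 'epa': 1})
```

A cleanup like this cannot tell that "airsealing" and "air-sealing" are the same tag, and it would also rewrite a tag such as "2.5" as "2-5", so it is a complement to a stated convention rather than a substitute for one.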
These are our thoughts. What are yours? Look at our spreadsheet. Check out the item pages on del.icio.us and read the articles below. Then let us know what you think about the future of social tagging for government documents.
References Consulted
- "HT06, Tagging Paper, Taxonomy, Flickr, Academic Article, ToRead" by Cameron Marlow, Mor Naaman, danah boyd, Marc Davis http://www.danah.org/papers/Hypertext2006.pdf
- "The Structure of Collaborative Tagging Systems" by Scott A. Golder and Bernardo A. Huberman
http://www.hpl.hp.com/research/idl/papers/tags/
http://www.hpl.hp.com/research/idl/papers/tags/tags.pdf
- "Can Social Bookmarking Improve Web Search?" by Paul Heymann, Georgia Koutrika, and Hector Garcia-Molina
http://heymann.stanford.edu/improvewebsearch.html
http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=2008-2&format=pdf&compression=&name=2008-2.pdf
- "When tags work and when they don't: Amazon and LibraryThing"
Thingology Blog, posted by Tim Spalding Tuesday, February 20, 2007
http://www.librarything.com/thingology/2007/02/when-tags-works-and-when-...
GPO/GODORT Conf Call Minutes Posted
Submitted by dcornwall on Sat, 2008-03-29 11:20.
Bill Sleeman, chair of the ALA Government Documents Roundtable (GODORT), recently posted the minutes to the GPO / GODORT Steering Conference Call of March 12, 2008. These conference calls take place from time to time and often have news of value. The minutes can be found (may have to scroll) at http://wikis.ala.org/godort/index.php/GODORT_Chair and covered the following topics, among others:
- Request for Information for Mass Digitization Opportunities
- Status of EPA Web Harvesting
- Status of the Federal Digital System (FDSys)
- Addition of pre-1976 cataloging to the Catalog of Government Publications - in progress.
- Continued distance ed through OPAL
- Current stats on the newish Government Information Online reference service.
If I were you, I'd look over the entire set of minutes as it was all interesting. I'd like to highlight two issues, both of which cry out for the documents community to do more to support GPO in some of its efforts:
EPA Web Harvest Project
Here are the notes on this subject (full names available from minutes page):
LH & RHM: Status of EPA harvesting project: GPO worked through 300 of the documents to gather information on what it will take for GPO to provide access to harvested materials (process, workflow and staffing implications). So far: the back end automation of meta-data extraction is not ready; parameters for metadata that accompanies the files needs improvement to automate de-duping; and the rules, methods and mechanisms for harvesting need to be refined (approximately 28% of material was not in scope). Basically, it is still taking more staff time to make these available than GPO can afford. BS asked about the FGI taxonomy experiment and if GPO would be investigating the results of that effort. GPO may incorporate that information into the project as the project moves forward.
GPO's results (automated harvesting that turns up a lot of out-of-scope material, and metadata that is difficult to extract automatically) are about what I expected based on my own experience and my reading of the literature. Whether or not GPO builds on our modest taxonomy experiment (Thanks Bill!), I think that a GPO - community/citizen collaboration will be needed to begin getting a handle on web-based agency documents. They could start simply by publishing their spidering logs and see what happens. Or perhaps they can obtain some of the $2 Billion/week currently being spent elsewhere. If GPO chooses to take the mass collaboration route, I hope the documents community is in the forefront of helping them.
If you're interested in taking part in our tagging experiment, please see http://freegovinfo.info/epatagging. We will be running the project through April 18, 2008. To see what has been tagged so far, please visit http://del.icio.us/tag/epapilotproject.
OPAL Training
Here are the notes on this subject:
LC: OPAL, GPO continues to use OPAL for online training and demos. At present, technical capabilities limit presentations to slide shows, such as PowerPoint presentations. Interactive web functions will be added in the future. January call for participation in creation of tutorials netted one submission; hoping to generate interest at DLC.
The FDLP has over 1200 libraries and GPO got ONE SUBMISSION? A majority of FDLP libraries are teaching oriented academic libraries and GPO got ONE SUBMISSION?
Hello! I know I'm not the only one who has insisted that GPO provide training between conferences for those of us who don't get out much. The documents community has a great reservoir of government information expertise. We should be actively aiding GPO in their efforts to spread that expertise.
I admit that GPO's one submission wasn't from my library. I have a pretty new docs staff that's still getting up to speed. But that can't be the case everywhere. If only 10% of FDLP libraries could step up with a program, that would still be 120 programs -- twice a week for a whole year.
Just so I can at least pretend to put my money (or staff time) where my mouth is, I will spend some time next month looking at our library's government information strengths, our customer needs and patron interests. And then sometime during the summer I or someone else from our library will submit a program. If you run a depository, will you commit to doing the same? Not only does GPO need our help, so do our colleagues.
FGI thanks the GODORT and GPO personnel who participated, Jill Vassilakos-Long for taking the minutes and Bill for posting them to the ALA GODORT Wiki.
Help Us Explore Findability Through Tagging!
Submitted by dcornwall on Mon, 2008-01-21 20:52.
Free Government Information is investigating the usefulness of tagging government documents that do not receive traditional cataloging and needs your help! We've taken 32 documents that the Government Printing Office (GPO) harvested from the EPA web site and posted them to the Internet Archive. Over the next three months, we'd like to see as many people as possible tag and describe these documents using the del.icio.us bookmarking service. For a full project description and instructions on how to participate, please visit http://freegovinfo.info/epatagging. We'd like to thank GPO for posting a sample of their harvested EPA documents that made this project possible.
This project got its inspiration from Galaxy Zoo (http://www.galaxyzoo.org), an astronomy project with a database of 1 million galaxies that researchers asked regular folks to classify as elliptical, clockwise spiral, or anticlockwise spiral. They aimed for and got at least 20 classifications per galaxy. If a particular galaxy was classified a certain way by 80% of users who assigned a classification to that galaxy, that classification was accepted. This "person on the street" data was compared with a small subset (50,000) of galaxies that professional astronomers had managed to classify on their own. The researchers found that there was pretty much total agreement between the professional and amateur assessments. Documents are more complex than galaxies :-), but if 9 out of 10 people tag an EPA document as air quality, then it's probably about air quality.
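The acceptance rule described above is simple enough to sketch in a few lines of Python. This is our reading of it, not Galaxy Zoo's actual code: the function name `consensus` is hypothetical, and treating the "at least 20 classifications" target as a hard minimum is our assumption.

```python
from collections import Counter


def consensus(labels, threshold=0.8, min_votes=20):
    """Return the winning label if enough voters agree on it, else None."""
    if len(labels) < min_votes:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) >= threshold else None


# 17 of 20 volunteers (85%) agree, so the classification is accepted.
galaxy = ["clockwise spiral"] * 17 + ["elliptical"] * 3
print(consensus(galaxy))  # clockwise spiral

# 15 of 20 (75%) falls short of the 80% bar, so nothing is accepted.
disputed = ["clockwise spiral"] * 15 + ["elliptical"] * 5
print(consensus(disputed))  # None

# The "9 out of 10" document example, with the minimum lowered to 10 votes.
document = ["air-quality"] * 9 + ["energy"]
print(consensus(document, min_votes=10))  # air-quality
```

The same function applies to tags once each tagger's vote on a document is reduced to a single label. That is a simplification, since real taggers assign several tags at once.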
So please visit http://freegovinfo.info/epatagging and get started. And tell your friends, coworkers and especially any environmental professionals that you know to get involved. Also, if you have a network in del.icio.us, we'd appreciate you putting on a "for:[friend name]" tag for every member of your del.icio.us network.
UPDATE 1/25/2008 Forgive my overzealousness with the above suggestion to tag every person in your del.icio.us network. I should never advocate spam. BUT, if there are people in your network interested in the environment or government documents, please consider sharing our project page with them.
The more people involved with this project, the better the descriptions and the more robust the subject access provided by the tagging will be. At least that's our hope.
We are going to run this project for three months, then the FGI volunteers will compile data on the following:
A) How many people participated in the project.
B) How many documents were tagged.
C) How many documents were described.
D) The average number of tags per document.
We will also examine how much agreement on tags exists for a given document. We will make our compilations publicly available along with any analysis we have.
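For anyone curious how figures A through D might be compiled, here is a minimal Python sketch that works from a list of bookmark records. The record layout, the sample data, and the `summarize` function are hypothetical; the real compilation may well be done by hand in a spreadsheet.

```python
from collections import defaultdict

# Hypothetical bookmark records: (user, document, tags, description)
BOOKMARKS = [
    ("user1", "doc-a", ["epa", "air-sealing"], "About air sealing."),
    ("user2", "doc-a", ["epa", "energy"], ""),
    ("user1", "doc-b", ["epa", "water"], "About water."),
]


def summarize(bookmarks):
    """Compile participant, document, and tag counts from bookmarks."""
    users = set()
    tags_by_doc = defaultdict(set)  # deduplicated tags per document
    described = set()
    for user, doc, tags, description in bookmarks:
        users.add(user)
        if tags:
            tags_by_doc[doc].update(tags)
        if description.strip():
            described.add(doc)
    total_tags = sum(len(tags) for tags in tags_by_doc.values())
    average = total_tags / len(tags_by_doc) if tags_by_doc else 0.0
    return {
        "participants": len(users),
        "documents_tagged": len(tags_by_doc),
        "documents_described": len(described),
        "avg_tags_per_document": average,
    }


print(summarize(BOOKMARKS))
```

The sketch counts each distinct tag once per document, so the average reflects deduplicated tags. Project-wide housekeeping tags such as epapilotproject would need to be filtered out first if they are not meant to count.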
Hope to see you on del.icio.us soon making environmental documents easier to find and easier to digest!