How Much Digital Information?
Since 2007, on behalf of EMC Corporation, IDC has been sizing what it calls the Digital Universe, or the amount of digital information created and replicated in a year. The newest report is now available:
- The Digital Universe Decade – Are You Ready? By John Gantz and David Reinsel, IDC, sponsored by EMC Corporation (May 2010) [PDF, 16 pp., excerpted from the IDC multimedia presentation, "The Digital Universe Decade – Are You Ready?" (May 2010)]
- The Digital Universe Decade – Are You Ready? (The multimedia content)
These reports estimate the size of everything digital. IDC looks at the installed base of devices and applications that could capture or create digital information and estimates (based on its own research and "other sources") how much information was created in a year. It also estimates the number of times a unit of information is replicated. The devices and applications include mobile phones, bar code readers, and video games as well as cameras, scanners, email, office applications, databases, GPS, medical imaging, and lots more. Much of this is estimation, and I found it hard to tell how much was gathered evidence and how much was speculation (see their methodology in the first IDC Digital Universe paper, published in 2007). To me this means that the figures they come up with may not be very accurate. Predictions of the future based on these estimates are, I think, very speculative.
Nevertheless, I've been following these reports ever since I noticed that Fran Berman quoted an earlier one and referred to 2007 as the "cross-over" year: the year in which more digital data was created than there was data storage to host it. (Berman, Francine, Got data?: a guide to data preservation in the information age, Commun. ACM, 51 (2008), 50-56.)
Even if you don't believe that the IDC numbers are 100% accurate, the general ideas that they promote are probably not that far off the mark. Some of those ideas:
- Last year, despite the global recession, the Digital Universe set a record, growing by 62% to nearly 800,000 petabytes.
- The average file size is getting smaller. The number of things to be managed is growing twice as fast as the total number of gigabytes.
- The growth of the Digital Universe is like a perpetual tsunami. How will we find the information we need when we need it?
- How will we know what information we need to keep, and how will we keep it?
That last item is my favorite. Regardless of exactly how much digital information is created each year, regardless of how much storage space we have, regardless of the fact that a lot of the "digital universe" that IDC describes is throw-away information that no one would think is worth keeping, we are still faced with Lots of Stuff and we need to figure out What to Preserve. That, I believe, is the next big challenge for digital preservation.
One way to face that challenge is to rely on producers to decide what to save. If a government agency produces something digital, allow that agency (or GPO, or LoC, or NARA, or OMB or OPM, or your favorite TLA) to decide for you if that information is worth saving.
Another way to face the challenge is to rely on a few big organizations. That is: pool our resources and outsource preservation to a few big organizations that will do this for us. Some of the same players pop up here: LoC and NARA, for example, but there are also organizations like Portico, and the Internet Archive, and ICPSR.
Both of the above solutions hope that someone else will take into account the needs of all possible users and make the right decisions. That model can work for some classes of information with appropriate governance and decision-making structures in place.
But, I believe, the lesson from the IDC report is that the "digital universe" is so large that we should not assume that any single solution will be enough. There is just too much information and there are too many decisions to make about what is worth saving. While information producers and a few big preservation organizations can do a lot, they cannot do everything. And their size alone will constrain their decisions: it will be harder for big organizations to respond to the needs of smaller communities of interest.
What is the alternative? I think that we need (what shall we call them...?) Libraries. Public Libraries, Special Libraries, College and University Libraries, and School Libraries. These can work together or independently. They can address the needs of their particular communities of interest. This will accomplish three things:
- It will aid preservation by making the preservation community bigger. This will not only increase redundancy, but will also help ensure that there is less chance that a single system or financial or governance failure will mean a loss of all information.
- It will help deal with the scale of the preservation problem (as identified by the IDC report). With more players and more stakeholders, there will be more voices and more variety in the decision-making process when we collectively decide what to save. This will mean, for example, that a group of School Libraries working together on digital preservation could ensure that an item of essential use to K-12 will be saved even if no university saves it. And vice versa.
- It will help users find and use the information they need. Today, it seems that everyone understands what librarians have always known: that there is a lot of information in the world. It used to take a library degree to get an appreciation of all the sources of information in the world. Today, everyone who uses the Web has that same appreciation. It seems like every day there is another newspaper article or blog posting about how great it is to have access to "everything." But the "everything" people see on the Web is really only a subset of everything, and it only appears to be "everything" because there is so much in this very large subset. And when your only option is to search "everything," you quickly discover that that is not always the best way to find just what you want. (Even Google has segmented information into categories like movies, blogs, books, and scholarly information.) Having community-of-interest collections will enable libraries to build user interfaces that work best for those communities and that provide access to the information those communities most want.
Libraries won't replace "everything" collections. They will complement each other and unfocused "everything" collections. They will enrich us all and help ensure that we will preserve what needs to be preserved as the "digital universe" expands more rapidly than we could otherwise deal with.