Data

What do we mean by "effective" access to data? (Part II)

In my last post, I described the possibility of a systematic approach to data validation. A key feature of such an approach must be its availability to all who are responsible for data – and, of special importance, its capacity to support efficient and timely use by creators or managers of data. Bill Michener (UNM), leader of one of the currently funded DataNet projects, has published a chart describing the problem of “information entropy”. [SEE: WK Michener “Meta-information concepts for ecological data management,” Ecological Informatics 1 (2006): 4 ] Within recent memory, I have heard an ecologist say that were it not possible to generate minimally necessary metadata “in 8 minutes,” he would not do it. Leaving aside -- for now -- the possibility of applying sticks and/or carrots (i.e. law and regulations, norms and incentives), it seems clear that a goal of applications development should be simplicity and ease of use.

[ Within the realm of ecology, a good set of guidelines for making data effectively available was recently published. These guidelines are well worth reviewing and make specific reference to the importance of using "scripted" statistical applications (i.e. applications that generate records of the full sequence of transformations performed on any given data). This recommendation complements the broader notion -- mentioned in my last post -- of using work flow mechanisms like Kepler to document the full process and context of a scientific investigation. SEE “Emerging Technologies: Some Simple Guidelines for Effective Data Management” Bulletin of the Ecological Society of America, April 2009, 205-214. http://www.nceas.ucsb.edu/files/computing/EffectiveDataMgmt.pdf ]

As a sidebar, it is worth noting that virtually all data are “dynamic” in the sense that they may be and are extended, revised, reduced etc. For purposes of publication – or for purposes of consistent citation and coherent argument in public discourse – it is essential that the referent instance of data or “version” of a data set be exactly specified and preserved. (This is analogous to the practice of "time-stamping" the citation of a Wikipedia article...)
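
One way to make the "referent instance" of a data set concrete is to record a content fingerprint together with the retrieval time, much like the time stamp on a Wikipedia citation. What follows is only a minimal sketch in Python, not an established citation standard; the function name fingerprint_version and the field names in the record are my own inventions for illustration.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint_version(data: bytes, retrieved_at: datetime) -> dict:
    """Return a small record identifying one exact version of a data set.

    Hypothetical helper: the field names are illustrative, not a standard.
    """
    return {
        # The SHA-256 digest changes if even one byte of the data changes.
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        # Store the retrieval time in UTC so citations are comparable.
        "retrieved_at": retrieved_at.astimezone(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # Two made-up versions of the same tiny data set, differing in one value.
    original = b"site,year,count\nA,2008,14\n"
    revised = b"site,year,count\nA,2008,15\n"
    when = datetime(2010, 1, 15, 12, 0, tzinfo=timezone.utc)
    print(fingerprint_version(original, when))
    print(fingerprint_version(revised, when))
```

Two versions that differ by a single value yield different digests, so a citation that carries the digest points at exactly one version and no other.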

Lest we be distracted by the brightest lights of technology, we should acknowledge that we now have available to us, on our desktops, powerful visualization tools. The development of Geographic Information Systems (GIS) has made it possible to present any and all forms of geo-referenced data as maps. Digital imaging and animation tools give us tremendous expressive power – which can greatly increase the persuasive, polemical effects of any data. (For just two instances among many possible, have a look at presentations at the TED meetings [SEE: http://www.ted.com/ ] or have a look at Many Eyes [SEE: http://manyeyes.alphaworks.ibm.com/manyeyes/ ].) But, these tools notwithstanding, there is always a fundamental obligation to provide for full, rigorous and public validation of data. That is, data must be fit for confident use.

+++++++++++++++

Unanticipated uses of resources are one of the most interesting aspects of resource sharing on the Web. (At the American Museum of Natural History, we made a major investment in developing a comprehensive presentation of the American Museum Congo Expedition (1909-1915) – our site included 3-D presentation of stereopticon slides and one of the first documented uses of the site was by a teacher in Amarillo, Texas who was teaching Joseph Conrad – we received a picture of her entire class wearing our 3-D glasses.) It seems highly unlikely to me that we can anticipate or even should try to anticipate all such uses.

In the early 1980’s, I taught Boolean searching to students at the University of Washington and I routinely advised against attempts to be overly precise in search formulation – my advice was – and is – to allow the user to be the last term in the search argument.

An important corollary to this concept is the notion that metadata creation is a process, not an event – and by “process” I mean an iterative, learning process. Clearly some minimally adequate set of descriptive metadata is essential for discovery of data, but our applications must also support continuing development of metadata. Social, collaborative tools are ideal for this purpose. (I will not pursue this point here but I believe that a combination of open social tagging and tagging by “qualified” users – perhaps using applications that can invoke well-formed ontologies – holds our best hope for comprehensive metadata development.)

What do we mean by “effective” access to data?

As previously discussed, “free” and “open” dissemination of data are primary values and fundamental premises for democracy. Data buried behind money walls, or impeded or denied to users by any of a variety of obstacles or “modalities of constraint” (Lawrence Lessig’s phrase), cannot be “effective”. But even when freely and/or openly available, data can be essentially useless.

So what do we mean by “effective”? One possible definition of “statistics” is: “technology for extracting meaning from data in the context of uncertainty”. In the scientific context – and I have been arguing that all data are or should be treated as “scientific” – if data are to be considered valid, they must be subject to a series of tests respecting the means by which meaning is extracted...

By my estimation, these tests in logical order are:
-- Are the data well defined and logically valid within some reasoned context (for example, a scientific investigation – or as evidentiary support for some proposition)?
-- Is the methodology for collecting the data well formed (this may include selection of appropriate equipment, apparatus, recording devices, software)?
-- Is the prescribed methodology competently executed? Are the captured data integral and is their integrity well specified?
-- To what transformations have primary data been subject?
-- Can each stage of transformation be justified in terms of logic, method, competence and integrity?
-- Can the lineages and provenances of original data be traced back from a data set in hand?
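
The last three tests (transformations, their justification, and traceable lineage) lend themselves to a small illustration. The sketch below, in Python, keeps an ordered log of every transformation alongside the values themselves. It is a toy under stated assumptions: the class name TrackedData, the sample readings and the -999 missing-value code are all hypothetical, and a real work flow system such as Kepler records far more than this.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrackedData:
    """Values plus an ordered log of every transformation applied to them.

    Hypothetical illustration only; not the data model of any real tool.
    """
    values: List[float]
    lineage: List[str] = field(default_factory=list)

    def apply(self, description: str,
              func: Callable[[List[float]], List[float]]) -> "TrackedData":
        # Return a new object so that earlier stages remain available
        # for inspection and the original data are never overwritten.
        return TrackedData(func(self.values), self.lineage + [description])


if __name__ == "__main__":
    # Made-up readings; -999.0 stands in for a missing-value code.
    raw = TrackedData([12.0, -999.0, 15.5],
                      ["field readings, hypothetical site A"])
    cleaned = raw.apply("drop -999 missing-value codes",
                        lambda v: [x for x in v if x != -999.0])
    converted = cleaned.apply("convert inches to centimetres",
                              lambda v: [round(x * 2.54, 2) for x in v])
    # Walk the lineage back from the data set in hand.
    for step in converted.lineage:
        print(step)
    print(converted.values)
```

Because each stage returns a new object, anyone holding the final data set can read back the full sequence of steps and ask of each one whether it was justified.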

The Science Commons [SEE: “Protocol for Implementing Open Access Data” http://www.sciencecommons.org/projects/publishing/open-access-data-protocol/] envisions a time when “in 20 years, a complex semantic query across tens of thousands of data records across the web might return a result which itself populates a new database” and, later in the protocol, imagines a compilation involving 40,000 data sets. Just the prospect of proper citation for the future “meta-analyst” researcher suggests an overwhelming burden.

So, even assuming that individual data sets can be validated in terms of the tests I mention above, how are we to manage this problem of confidence/assurance of validity in this prospectively super-data-rich environment?

(Before proceeding to this question, let’s parenthetically ask how these tests are being performed today. I believe that they are accomplished through a less than completely rigorous series of “certifications” – most basically, various aspects of the peer review process assure that the suggested tests are satisfied. Within most scientific contexts, research groups or teams of scientists develop research directions and focus on promising problems. The logic of investigation, methodology and competence are scrutinized by team members, academic committees, institutional colleagues (hiring, promotion, and tenure processes), by panels of reviewers – grant review groups, independent review boards, editorial boards – and ultimately by the scientific community at large after publication. Reviews and citation are the ultimate validations of scientific research. In government, data are to some extent or other "certified" by the body or agency responsible.)

If we assume a future in which tens of thousands of data sets are available for review and use, how can any scientist proceed with confidence? (My best assumption, at this point, is that such work will proceed with a presumption of confidence – perhaps little else?)

Jumping ahead, even in a world where confidence in the validity of data can be assured, how can we best assure that valid data are effectively useful?

A year ago in Science, a group of bio-medical researchers raised the problem of adequate contextualization of data. [SEE: I Sim, et al. “Keeping Raw Data in Context” [letter] Science v 323 6 Feb 2009, p713] Specifically, they suggested:
“a logical model of clinical study characteristics in which all the data elements are standardized to controlled vocabularies and common ontologies to facilitate cross-study comparison and synthesis.” While their focus was on clinical studies in the bio-medical realm, the logic of their argument extends to all data. We already have tools available to us that can specify scientific work flows to a very precise degree. [SEE for example: https://kepler-project.org/ ] It seems entirely possible to me that such tools can be used – in combination with well-formed ontologies built by consensus within disciplinary communities – to systematize the descriptions of scientific investigation and data transformation and, moreover, in combination with socially collaborative applications, to support a systematic process of peer review and evaluation of such work flows.

OK -- so WHAT ABOUT GOVERNMENT INFORMATION??? We’re just government document librarians or just plain citizens trying to make well-informed decisions about policy? Stay tuned…

What is NOT “science”? Why we have a right to “data” as “evidence”…

Most of us accept a priori the institutionalized distinction between the sciences and the humanities. If asked, we can tick off the names of “disciplines” that are “scientific” and those that constitute “the humanities”… (The “social sciences” are somehow less centrally – more vaguely? -- “scientific” -- but what do we mean essentially by these distinctions?) [It's worth noting that novelist CP Snow famously posited this distinction in his Cambridge lecture and subsequent book "The Two Cultures" -- ca. 1959.]

We might say that science is “empirical” meaning that it is based upon real, physical evidence? Or perhaps that it’s “inductive” – its theories or “laws” flowing from observations of facts… Or perhaps that it is “quantitative” or "technical" – its conclusions determined by the use of sometimes very complex mathematical logic or by complex apparatus. We might also say that it employs a rigorous methodology that includes exact logical provisions for “falsifiability” [SEE: Karl Popper, The Logic of Scientific Discovery – and elsewhere], for open peer review – including test by replication – and for validation by demonstration of predictive power… Science also is systematically accretive and depends on careful citation and documentation, building upon itself like a coral reef…

But it strikes me that any humanist should feel uncomfortable at the assumption that the humanities do not – or are incapable of – meeting these standards, or at least most of them in most cases. (I'll leave it to the reader to assess what is most essentially “humanist” – but I often have the uncomfortable sense that the humanities may too often depend for their esoteric authority upon the incoherence of their evidentiary base or upon the imprecision of language or between languages…?)

I attribute "beauty" as a primary motive/value to “the arts”… (The American poet Randall Jarrell once said: “Criticism is the poetry of the prosaic.”) And I heard, anecdotally, a few years ago that the performance artist, Laurie Anderson, was invited to a discussion about “the arts” and “the sciences” and before too long was asking “What are we doing here?” I understood this to be an intuitive recognition that the arts and the sciences are on very similar tracks… I believe that artists are able to operate more spontaneously, intuitively and imaginatively -- perhaps more "aesthetically"? but less "systematically" ? Scientists often operate on that same frontier but with the requirement that they test their intuitions using the scientific method and then publicly disclose their “tests”.

"Belief" is ultimately the subjective preserve of the individual -- and the institutional preserve of religion. Maintaining the distinction between "belief" and reason (or logic) is a fundamental value of the Enlightenment -- particularly in public discourse.

OK so what am I getting at here? And why?

Ultimately all policy -- whether "scientific" or not -- and all human decisions should be based on logical analysis and on evidence. Both evidence and analysis are susceptible to testing, to evaluation and thus to reasoned discussion. Our civil discourse will always be improved by clear specification of analytical logic and by free, open and effective disclosure of empirical evidence or DATA.

Respecting data, there is a series of fundamental criteria that must be satisfied to validate its “authenticity” and its probative value (its effectiveness as evidence). As citizens, we have the right to demand that public policy and public decisions be based on well-formed logic and on valid evidence… Discussion that occurs in our public fora should always distinguish between matters of logic and fact and matters of belief.

We’ll pursue these notions – in the context of free, open effective access to data and in the context of science literacy – in future posts…

Data.gov.uk Launches Soon!

Looks like the UK version of data.gov, developed by Sir Tim Berners-Lee, is going to be released soon. It is "language-based" where "linkages are based on human language, rather than hard-coded hyperlinks", a.k.a. the Semantic Web concept that Berners-Lee has been touting for years.

I like the way Nancy Scola of Personal Democracy Forum describes the Semantic Web:

[Berners-Lee's] vision is of a web that understands the connections between disparate bits of information in a way similar to how the human mind might effortlessly connect an address on London's Whitehall with the events of World War II that Winston Churchill directed from an underground bunker there. Data woven through with more human ways of interpretation might, just might, make the gap between making government information public and making it useful a little smaller.

The BBC reports that "Data.gov.uk is built with semantic web technology, which will enable the data it offers to be drawn together into links and threads as the user searches...we will also be able to look for patterns...visitors to data.gov.uk will want to make their own mash-ups from the information available."

Yes, and we should be making mashups from our country's data.gov for our library patrons too! Let's get to it! I'll be working on mine and will show you how it can be done.

DataTO.org

Check out DataTO.org, which is similar to data.gov, but users request data sets from the Toronto municipal government. The first phase of the website will allow one to:

...publish a request for data to the community, where members can comment and rate the request. In future iterations of this site, publishers and others will be able to post details of known and existing data sources so that community members can rate them for prioritization. Users will then be able to find data sources that have been published.

Kudos to O'Reilly Radar.

FEC makes data available in multiple formats

Disclosure Data Catalog, Federal Election Commission

"Each of the files listed here can be downloaded in either csv or xml formats. Each also has a metadata page that describes the information included and the structure of the file itself. There is a pdf version of each file if you need to print the information. You can also subscribe to RSS feeds for each of the files so you're notified whenever new data is available or a change is made."

Also see the Commission's Disclosure Data Blog where the FEC will post information about the files and its future plans. And: they say that "you can get help with any questions about the data we're providing here."

Free ICPSR Data Conference on the Web

ICPSR 2009: Real Data in a Virtual World

ICPSR (the Inter-University Consortium for Political and Social Research) is the large social science data archive at the University of Michigan. Every second year, ICPSR hosts a meeting (in Ann Arbor, Michigan) for its "Official Representatives" -- one person at each ICPSR member institution. This year, the meeting (October 5-9) is open to all and will be held on the web instead of in Ann Arbor.

At the link above, you can find the week-long program and sign up for individual sessions.

Government information specialists, particularly those with responsibilities for data and statistics, should set aside time for this! Sessions are interesting and informative. Some examples from the program:

  • Census 2010 & American Community Survey
  • Delivering Research Opportunities to Undergraduates
  • Online Data Analysis Tools

US Federal Government IT Dashboard: Data and Visualization

If you enjoy interactive data visualizations, make sure you visit the US Federal IT Dashboard. This site is meant for use by both the public and the staff of federal agencies. The goal is to make it easy to explore how the federal government is spending its IT dollars.

The major components of the dashboard are:

  • Performance Dashboard: supports viewing of major IT investments, filterable by agency
  • Data Feeds: will let you select data to download or create a "dynamically XML feed"
  • Analysis Visualizations: lets you chart and animate any combination of 15 IT spending statistics from 2002 to present

This blurb off the FAQ page gives a great overview of what this site is all about:

"The IT Dashboard provides the public with an online window into the details of Federal information technology investments and provides users with the ability to track the progress of investments over time. The IT Dashboard displays data received from agency reports to the Office of Management and Budget (OMB), including general information on over 7,000 Federal IT investments and detailed data for nearly 800 of those investments that agencies classify as "major." The performance data used to track the 800 major IT investments is based on milestone information displayed in agency reports to OMB called "Exhibit 300s." Agency CIOs are responsible for evaluating and updating select data on a monthly basis, which is accomplished through interfaces provided on the website."

What features do you want for your catalog of govt data?

Data is definitely getting sexy. Jonathan Gray of the Open Knowledge Foundation asks "What features should be included in a catalogue of open government data?" and points to a few other data repositories being built on the state and country level (like my own city of SF's CivicDB!). He also mentions the Sunlight Foundation's plan to build on and expand data.gov with a national data catalog that I had meant to write about a couple of weeks ago (go Sunlight!). So I'm putting this question to you, FGI's faithful readers. I'm sure you'll have a thing or three to add to the list of requirements for open data catalogs.


Here are a few suggestions for those building catalogues for (open) government data based on our experience developing CKAN:

  • Make the catalogue itself open!
  • Let others download the catalogue data in bulk (not just via an API)
  • Include information on how to get the data, and how it can be used
  • Make it versioned!

[originally tweeted by @EllnMllr. Go ahead and follow her!]

ALA Annual GODORT Update Meeting: "Need Data – but don’t know where to go?"

Death and Taxes

Death and Taxes - A Graphical Visualization of the Federal Budget

Death and Taxes is a large representational graph and poster of the federal budget. It contains over 500 programs and departments and almost every program that receives over 200 million dollars annually. The data is straight from the president's 2009 budget request and will be debated, amended, and approved by Congress to begin the fiscal year. All of the item circles are proportional in size to their spending totals and the percentage change from 2008 is included to spot trends and disproportion.

US Census Bureau's DataFerrett

DataFerrett (Federated Electronic Research, Review, Extraction, and Tabulation Tool) is a free data mining and extraction tool developed by the U.S. Census Bureau that allows users to search, browse, combine, tabulate, recode, and analyze statistical data from a network of online data libraries. The DataFerrett software can be downloaded from the website or run in the browser via a Java applet.

Some material to read before getting started:

  1. DataFerrett Brochure
  2. Getting Started with DataFerrett Tour
  3. DataFerrett User Guide

Available data sets include:

  • American Community Survey (ACS)
  • American Housing Survey (AHS)
  • Behavioral Risk Factor Surveillance System (BRFSS)
  • Consumer Expenditure Survey (CES)
  • County Business Patterns (CBP)
  • Current Population Survey (CPS)
  • Decennial Census of Population and Housing
  • Harvard-MIT Data Center Collection
  • Home Mortgage Disclosure Act (HMDA)
  • Local Employment Dynamics (LED)
  • National Ambulatory Medical Care Survey (NAMCS)
  • National Center for Health Statistics Mortality (MORT)
  • National Health and Nutrition Examination Survey (HANES)
  • National Health Interview Survey (NHIS)
  • National Hospital Ambulatory Medical Care Survey (NHAMCS)
  • National Survey of Fishing, Hunting, and Wildlife (FHWAR)
  • Small Area Income and Poverty Estimates (SAIPE)
  • Social Security Administration (SSA)
  • Survey of Income and Program Participation (SIPP)
  • Survey of Program Dynamics (SPD)

DataFerrett is a wonderful tool for exploring and analyzing data. Enjoy!

(found via Open Access News)

Data.gov Goes Live!

Data.gov is now live and ready for you to explore!

The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government.

You can have a say in the future of Data.gov by suggesting datasets to include and improvements/enhancements to the website.

Data.gov has a searchable data catalog that gives access to data through the "raw" data catalog and by using tools. "The Raw Data Catalog provides an instant download of machine readable, platform-independent datasets while the Tools Catalog provides hyperlinks to tools that allow you to mine datasets."

Please note that by accessing datasets or tools offered on Data.gov, you agree to the Data Policy, which you should read before accessing any dataset or tool.

Here is an excerpt from the policy that we need to read closely:

Secondary Use
Data accessed through Data.gov do not, and should not, include controls over its end use. However, as the data owner or authoritative source for the data, the submitting Department or Agency must retain version control of datasets accessed. Once the data have been downloaded from the agency's site, the government cannot vouch for their quality and timeliness. Furthermore, the US Government cannot vouch for any analyses conducted with data retrieved from Data.gov.

Citing Data
The agency's preferred citation for each dataset is included in its metadata. Users should also cite the date that data were accessed or retrieved from Data.gov. Finally, users must clearly state that "Data.gov and the Federal Government cannot vouch for the data or analyses derived from these data after the data have been retrieved from Data.gov."
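
To see what the citation requirements add up to in practice, here is a minimal Python sketch that assembles the three pieces the policy names: the agency's preferred citation, the access date, and the required disclaimer. The function name cite_dataset and the sample citation text are hypothetical, and the layout of the resulting string is my own guess, since the policy does not prescribe one.

```python
from datetime import date

# Disclaimer text quoted from the Data.gov data policy excerpt above.
DISCLAIMER = (
    "Data.gov and the Federal Government cannot vouch for the data or "
    "analyses derived from these data after the data have been "
    "retrieved from Data.gov."
)


def cite_dataset(preferred_citation: str, accessed: date) -> str:
    """Combine citation, access date and disclaimer into one string.

    Hypothetical helper: the ordering and wording of the joined string
    are assumptions, not a format required by Data.gov.
    """
    return (
        f"{preferred_citation} "
        f"Accessed {accessed.isoformat()} via Data.gov. "
        f"{DISCLAIMER}"
    )


if __name__ == "__main__":
    # The citation below is invented purely for demonstration.
    print(cite_dataset("Example Agency. Example Dataset, 2009.",
                       date(2009, 5, 21)))
```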

What do you think? Is the policy fair? Any suggestions for improvement we could make to Data.gov?

For more information, visit their FAQ and Tutorial.

Also, check out Sunlight Labs' "Apps for America 2: The Data.gov Challenge"!

Just as the federal government begins to provide data in Web developer-friendly formats, we're organizing Apps for America 2: The Data.gov Challenge to demonstrate that when government makes data available it makes itself more accountable and creates more trust and opportunity in its actions. The contest submissions will also show the creativity of developers in designing compelling applications that provide easy access and understanding for the public while also showing how open data can save the government tens of millions of dollars by engaging the development community in application development at far cheaper rates than traditional government contractors.

Now, let's go play around with this new site and make suggestions, shall we?

OpenSecrets.org Goes OpenData

How cool is this? Today the Center for Responsive Politics has announced that it's putting 200 million data records from its archive directly into the hands of citizens, activists, journalists and anyone else interested in following the money in U.S. politics. The data are available through the site's Action Center. Thanks OpenSecrets!


The following data sets, along with a user guide, resource tables and other documentation, are now available in CSV format (comma-separated values, for easy importing) through OpenSecrets.org's Action Center at http://www.opensecrets.org/action/data.php:

CAMPAIGN FINANCE: 195 million records dating to the 1989-1990 election cycle, tracking campaign fundraising and spending by candidates for federal office, as well as political parties and political action committees. CRP's researchers add value to Federal Election Commission data by cleaning up and categorizing contribution records. This allows for easier totaling by industry and company or organization, to measure special-interest influence.

LOBBYING: 3.5 million records on federal lobbyists, their clients, their fees and the issues they reported working on, dating to 1998. Industry codes have been applied to this data, as well.

PERSONAL FINANCES: Reports from members of Congress and the executive branch that detail their personal assets, liabilities and transactions in 2004 through 2007. The reports covering 2008 will become available to the public in June, and the data will be available for download once CRP has keyed those reports.

527 ORGANIZATIONS: Electronically filed financial records beginning in the 2004 election cycle for the shadowy issue-advocacy groups known as 527s, which can raise unlimited sums of money from corporations, labor unions and individuals.

To download bulk data from OpenSecrets.org, users must register on the site and agree to prominently credit the Center for Responsive Politics, along with other terms of service. CRP is making its data available through a Creative Commons Attribution-Noncommercial-Share Alike license, which allows users to remix, tweak, build upon and share the Center's work non-commercially. CRP will continue to offer its data to commercial users for a negotiable fee.

OpenSecrets.org also offers a number of APIs (Application Programming Interfaces) to give users direct access via web programming to data displayed on OpenSecrets.org. Web developers are already using these APIs to display OpenSecrets data on their web pages and create mashups using live, up-to-date data.

Users can also share CRP data using OpenSecrets.org's widgets, which can be placed easily on any website or blog. New widgets for the 2010 election cycle are in development.
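
For readers wondering what "totaling by industry" might look like once the CSV files are downloaded, here is a minimal Python sketch. It is a guess at the general idea only: the column names industry and amount and the sample rows are invented for illustration, so consult the user guide that accompanies the download for the actual file layout.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def total_by_industry(csv_text: str) -> dict:
    """Sum contribution amounts per industry from CSV text.

    Assumes a header row with hypothetical columns "industry" and
    "amount"; the real CRP files may be laid out quite differently.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Decimal avoids floating-point drift when adding money.
        totals[row["industry"]] += Decimal(row["amount"])
    return dict(totals)


if __name__ == "__main__":
    # Made-up rows, not real contribution records.
    sample = (
        "industry,amount\n"
        "Energy,100.50\n"
        "Health,20\n"
        "Energy,0.50\n"
    )
    for industry, total in sorted(total_by_industry(sample).items()):
        print(industry, total)
```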

data.gov coming to an internet near you!

As we noted last month, the Federal government is moving forward with their plans for a May launch of data.gov. US CIO Vivek Kundra has said that this is an attempt to ensure that all government data 'that is not restricted for national security reasons can be made public' through data feeds. And just to remind everyone, Wired has launched a wiki for calling attention to datasets that should be shared as part of the Data.gov plan. Please go to the wiki and add those data sets that are near and dear to your hearts!

that is all.
