Data
Govistics and the need for library data microservices
Submitted by jrjacobs on Mon, 2010-08-16 11:01.
Several of us here at Stanford library who deal with data and/or govt information have recently received emails asking if we'd be interested in a free trial of the Pro level of subscription to the Govistics Government Spending Database built by the Center for Governmental Research (CGR). I'm a sucker for free trials, so took them up on their offer. Here's what I found -- and please take it with an FGI grain of salt ;-)
The interface is easy for quick results and high-level comparisons, but I found it lacking for any kind of in-depth scholarly pursuits -- the researchers and students I work with would most likely be interested in historic data for all counties or all municipalities in a state or region or ALL states; and they'd probably want the data exportable so they could do further analysis with a statistical package (SPSS etc) or GIS software. I also didn't find the maps or charts particularly compelling. $50/year for an individual subscription (I didn't ask about an institutional subscription) seems like too steep a price to pay when there are other *free* tools out there -- my personal favorite is Many Eyes (also check out their new project Many Bills visual bill explorer!). Many Eyes allows a person to upload datasets, share them, run a variety of visualizations (charts, graphs, maps, clouds etc), and most importantly embed those visualizations in other Web pages. Govistics doesn't do any of that.
And what about the underlying data you say? Govistics is basically US census of govts which is available for free on factfinder.census.gov (although only in PDF with no data export :-|). Many of the same variables are also available via the Census' County and City Data Book (again only PDF :-|). Govistics only offers data export with the pro version and the data only goes back to 2007.
I don't begrudge the govistics folks trying to make a quick buck on public domain data that's already available online for free (well maybe a little). Perhaps for the casual user, this service will work well. But what I'd love to see is libraries creating interfaces like this *for free*. There need to be free tools that include access + visualization + preservation. UVA has gotten off to a great start with their historical county and city data books 1944-2000(!). This is especially cool because it not only gives access to historical data back to 1944 (no visualization yet, but users can use Many Eyes!) and allows for export of data for reuse, but it provides a preservation model as well. And THAT'S why I'd love to see more libraries doing this sort of thing. This is an increasingly data driven world and it would behoove libraries to combine these kinds of access/visualization services with libraries' traditional strength in long-term preservation.
--that is all.
New maps, charts, tables from BLS's Quarterly Census of Employment and Wages
Submitted by jajacobs on Thu, 2010-07-01 08:25.
A very nice new online application from the Bureau of Labor Statistics:
- Introducing the QCEW State and County Map Application
The Bureau of Labor Statistics (BLS) has developed an interactive state and county map application available at http://beta.bls.gov/. The application displays geographic economic data through maps, charts, and tables, allowing users to explore employment and wage data of private industry at the National, State, and county level. Throughout this application, URLs are specific to the data displayed, so links can be bookmarked, reused, and shared. The application includes maps, charts, tables, and a link to standard BLS data tables and graphs.
- QCEW State and County Map
hat tip to Sabrina I. Pacifici!
Changes at Data.gov
Submitted by jajacobs on Tue, 2010-05-25 11:05.
If you haven't looked at data.gov lately, you should. It launched one year ago, recently had a bit of a makeover, and has added lots of new data.
OMB Watch has a quick overview and comment about the current state of data.gov (Data.gov Celebrates First Birthday with a Makeover, by Roger Strother, OMB Watch. 05/24/10).
Check out these highlights:
- Apps where developers are creating a wide variety of applications, mashups, and visualizations. From crime statistics by neighborhood to the best towns to find a job to seeing the environmental health of your community...
- Semantic Web where they highlight a set of data.gov resources reformatted into Resource Description Framework (RDF) format. These allow new kinds of rich interaction with the data. See, for example, the White House Visitor Search. Also see the Tetherless World Weblog from the Rensselaer Polytechnic Institute where some of this work is being done.
And don't forget, at Data.gov, "data" can mean just about anything, even the Foreign Relations of the U.S.
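For readers wondering what "reformatted into RDF" buys you, here is a rough sketch of the core idea: every fact becomes a subject-predicate-object statement, and questions become pattern matches across those statements. The identifiers (`ex:visit1`, `ex:officeX` and so on) and the `match` helper are made up for illustration; this is plain Python, not the data.gov data or any real RDF library.

```python
# A minimal sketch of RDF's core idea: every statement is a
# (subject, predicate, object) triple. All identifiers are hypothetical.
triples = [
    ("ex:visit1", "ex:visitor", "ex:personA"),
    ("ex:visit1", "ex:visited", "ex:officeX"),
    ("ex:visit2", "ex:visitor", "ex:personB"),
    ("ex:visit2", "ex:visited", "ex:officeX"),
]

def match(pattern, store):
    """Return the triples that fit the pattern; None is a wildcard."""
    return [
        triple for triple in store
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]

# Which visits went to officeX, and who made them?
visits = [s for s, _, _ in match((None, "ex:visited", "ex:officeX"), triples)]
visitors = [o for v in visits for _, _, o in match((v, "ex:visitor", None), triples)]
print(visitors)
```

Because every data set shares the same three-part shape, statements from different sources can be pooled into one list and queried together, which is roughly what makes mashups easier.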
Workshop: Providing Social Science Data Services: Strategies for Design and Operation
Submitted by jajacobs on Sun, 2010-03-21 06:31.
Announcement of Workshop:
Providing Social Science Data Services: Strategies for Design and Operation
http://www.icpsr.umich.edu/icpsrweb/sumprog/courses/0041
August 9-13, 2010
Ann Arbor Michigan
Instructors:
Chuck Humphrey, Head of the Data Library, University of Alberta
Jim Jacobs, Data Services Librarian Emeritus, University of California San Diego
This five-day workshop is being offered for individuals who manage or provide local support services for ICPSR and other numeric data for quantitative research.
Providing access to data has taken on greater prominence over this past decade with the emergence of several significant developments, including e-Science infrastructure funding, the open data movement, national and institutional digital preservation strategies & services, data enclaves for confidential data, lifecycle data management planning, and data mash-up technologies on the Internet. Given these major environmental changes, how does one plan and design appropriate levels of data service in her or his local institution?
This workshop is structured around a five-stage data lifecycle model that focuses on data production, data dissemination, data repositories, data discovery and data repurposing. A day is dedicated to each stage in this model during which discussions address issues for local data services and computer exercises demonstrate service activities. In this context, fundamental data topics are covered, including understanding the data reference interview, working with variables, interpreting data documentation, coping with various dissemination formats, accessing different online services (e.g., SDA and Nesstar), searching for social science data, subsetting data using Web-based tools, selecting and downloading ICPSR data, and options for local data delivery. Throughout the workshop, an emphasis will be placed on social science concepts and terminology, as well as on practical solutions to service delivery.
Who Should Attend: Anyone who is new to providing services for numeric social science data or is seeking to revitalize an existing service. This is not a course in statistics and attendees are not expected to know how to analyze data.
Online Registration:
http://www.icpsr.umich.edu/icpsrweb/sumprog/2010/index.jsp
Workshop will remain open only until the Summer Program office has received 20 paid applications.
Questions?
If you have questions about registration, fees, travel, housing, or other courses at the ICPSR Summer program, please get in touch with ICPSR directly:
http://www.icpsr.umich.edu/icpsrweb/sumprog/contact.jsp
If you have any questions about the workshop content, please feel free to send email to Chuck or Jim:
Chuck: humphrey at datalib.library.ualberta.ca
Jim: jajacobs at ucsd.edu
Dates: August 9-13, 2010
Location: University of Michigan, Ann Arbor MI.
Fees (Participants from ICPSR member institutions): $1,500
Fees (Participants from institutions that are not members of ICPSR): $3,000
http://www.icpsr.umich.edu/icpsrweb/sumprog/2010/application.jsp
List of ICPSR member institutions and Official Representatives:
http://www.icpsr.umich.edu/icpsrweb/ICPSR/membership/ors.jsp
Information about transportation and housing:
http://www.icpsr.umich.edu/icpsrweb/sumprog/visiting.jsp
This workshop is part of the ICPSR Summer Program in Quantitative Methods of Social Research
http://www.icpsr.umich.edu/icpsrweb/sumprog/index.jsp
---
James A. Jacobs
jajacobs at ucsd.edu
What do we mean by "effective" access to data ? (Part II)
Submitted by moritz on Tue, 2010-01-26 21:50.
In my last post, I described the possibility of a systematic approach to data validation. A key feature of such an approach must be its availability to all who are responsible for data – and of special importance, its capacity to support efficient and timely use by creators or managers of data. Bill Michener (UNM), leader of one of the currently funded DataNet projects, has published a chart describing the problem of “information entropy” [SEE: WK Michener “Meta-information concepts for ecological data management,” Ecological Informatics 1 (2006): 4 ]. Within recent memory, I have heard an ecologist say that were it not possible to generate minimally necessary metadata “in 8 minutes,” he would not do it. Leaving aside -- for now -- the possibility of applying sticks and/or carrots (i.e. law and regulations, norms and incentives), it seems clear that a goal of applications development should be simplicity and ease of use.
[ Within the realm of ecology, a good set of guidelines to making data effectively available was recently published – these guidelines are well worth reviewing and make specific reference to the importance of using "scripted" statistical applications (i.e. applications that generate records of the full sequence of transformations performed on any given data). This recommendation complements the broader notion -- mentioned in my last post -- of using work flow mechanisms like Kepler to document the full process and context of a scientific investigation. SEE “Emerging Technologies: Some Simple Guidelines for Effective Data Management” Bulletin of the Ecological Society of America, April 2009, 205-214. http://www.nceas.ucsb.edu/files/computing/EffectiveDataMgmt.pdf ]
As a sidebar, it is worth noting that virtually all data are “dynamic” in the sense that they may be and are extended, revised, reduced etc. For purposes of publication – or for purposes of consistent citation and coherent argument in public discourse – it is essential that the referent instance of data or “version” of a data set be exactly specified and preserved. (This is analogous to the practice of "time-stamping" the citation of a Wikipedia article...)
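One way to "exactly specify" the referent version of a data set is to cite a cryptographic digest of its bytes alongside the retrieval date. The sketch below is only an illustration of that idea under my own assumptions: the function names, the tiny county table and the citation format are hypothetical, not an established citation standard.

```python
import hashlib
from datetime import datetime, timezone

def dataset_fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies this exact byte sequence."""
    return hashlib.sha256(data).hexdigest()

def citation_stamp(name: str, data: bytes, retrieved: datetime) -> str:
    """Build a citation string: data set name, retrieval time, and digest."""
    return f"{name}, retrieved {retrieved.isoformat()}, sha256:{dataset_fingerprint(data)}"

# Two hypothetical versions of the same small table; only one value differs.
v1 = b"county,population\nAlameda,1500000\n"
v2 = b"county,population\nAlameda,1510000\n"

stamp = citation_stamp(
    "example-county-table", v1, datetime(2010, 1, 26, tzinfo=timezone.utc)
)
print(stamp)

# A revised data set no longer matches the digest recorded in the citation.
print(dataset_fingerprint(v1) == dataset_fingerprint(v2))
```

A reader holding a copy of the data can recompute the digest and confirm it is the instance that was cited, which only helps if that instance has also been preserved somewhere.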
Lest we be distracted by the brightest lights of technology, we should acknowledge that we now have available to us, on our desktops, powerful visualization tools. The development of Geographic Information Systems (GIS) has made it possible to present any and all forms of geo-referenced data as maps. Digital imaging and animation tools give us tremendous expressive power – which can greatly increase the persuasive, polemical effects of any data. (For just two instances among many possible, have a look at presentations at the TED meetings [SEE: http://www.ted.com/ ] or have a look at Many Eyes [SEE: http://manyeyes.alphaworks.ibm.com/manyeyes/ ].) But, these tools notwithstanding, there is always a fundamental obligation to provide for full, rigorous and public validation of data. That is, data must be fit for confident use.
+++++++++++++++
Unanticipated uses of resources are one of the most interesting aspects of resource sharing on the Web. (At the American Museum of Natural History, we made a major investment in developing a comprehensive presentation of the American Museum Congo Expedition (1909-1915) – our site included 3-D presentation of stereopticon slides and one of the first documented uses of the site was by a teacher in Amarillo, Texas who was teaching Joseph Conrad – we received a picture of her entire class wearing our 3-D glasses.) It seems highly unlikely to me that we can anticipate or even should try to anticipate all such uses.
In the early 1980’s, I taught Boolean searching to students at the University of Washington and I routinely advised against attempts to be overly precise in search formulation – my advice was – and is – to allow the user to be the last term in the search argument.
An important corollary to this concept is the notion that metadata creation is a process not an event – and by “process” I mean an iterative, learning process. Clearly some minimally adequate set of descriptive metadata is essential for discovery of data but our applications must also support continuing development of metadata. Social, collaborative tools are ideal for this purpose. (I will not pursue this point here but I believe that a combination of open social tagging and tagging by “qualified” users -- perhaps using applications that can invoke well-formed ontologies – holds our best hope for comprehensive metadata development.)
What do we mean by “effective” access to data?
Submitted by moritz on Mon, 2010-01-25 16:22.
As previously discussed, “free” and “open” dissemination of data are primary values, are fundamental premises for democracy. Data buried behind money walls, or impeded or denied to users by any of a variety of obstacles or “modalities of constraint” (Lawrence Lessig’s phrase) cannot be “effective”. But even when freely and/or openly available, data can be essentially useless.
So what do we mean by “effective”? One possible definition of “statistics” is: “technology for extracting meaning from data in the context of uncertainty”. In the scientific context – and I have been arguing that all data are or should be treated as “scientific” – if data are to be considered valid, they must be subject to a series of tests respecting the means by which meaning is extracted...
By my estimation, these tests in logical order are:
Are the data well defined and logically valid within some reasoned context (for example, a scientific investigation – or as evidentiary support for some proposition)?
-- Is the methodology for collecting the data well formed (this may include selection of appropriate equipment, apparatus, recording devices, software)?
-- Is the prescribed methodology competently executed? Are the captured data integral and is their integrity well specified?
-- To what transformations have primary data been subject?
-- Can each stage of transformation be justified in terms of logic, method, competence and integrity?
-- Can the lineages and provenances of original data be traced back from a data set in hand?
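The last three tests (what transformations, can each be justified, can lineage be traced back) are the ones a "scripted" workflow can answer mechanically. Here is a toy sketch, assuming a hypothetical `TrackedData` class of my own invention, in which every transformation must be described before it is applied and the description travels with the result. Real workflow systems are far richer; this only shows the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TrackedData:
    """Values plus a human-readable record of how they were derived."""
    values: List[float]
    lineage: List[str] = field(default_factory=list)

    def apply(self, description: str,
              fn: Callable[[List[float]], List[float]]) -> "TrackedData":
        # Return a new object so earlier stages stay intact and inspectable.
        return TrackedData(fn(self.values), self.lineage + [description])

# Hypothetical readings; the -1.0 stands in for a sensor error code.
raw = TrackedData([12.0, -1.0, 15.5, 14.0],
                  ["collected: hypothetical field readings"])
cleaned = raw.apply("dropped negative readings",
                    lambda xs: [x for x in xs if x >= 0])
scaled = cleaned.apply("multiplied by 10",
                       lambda xs: [x * 10 for x in xs])

# Trace the data set in hand back to its origin, one step per line.
for step in scaled.lineage:
    print(step)
```

Because `apply` never modifies the earlier object, `raw` and `cleaned` remain available for anyone who wants to check whether a given step was justified.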
The Science Commons [SEE: “Protocol for Implementing Open Access Data” http://www.sciencecommons.org/projects/publishing/open-access-data-protocol/] envisions a time when “in 20 years, a complex semantic query across tens of thousands of data records across the web might return a result which itself populates a new database” and, later in the protocol, imagines a compilation involving 40,000 data sets. Just the prospect of proper citation for the future “meta-analyst” researcher suggests an overwhelming burden.
So, of course, even assuming that individual data sets can be validated in terms of the tests I mention above, how are we to manage this problem of confidence/ assurance of validity in this prospectively super-data-rich environment?
(Before proceeding to this question let’s parenthetically ask how these tests are being performed today? I believe that they are accomplished through a less than completely rigorous series of “certifications” – most basically, various aspects of the peer review process assure that the suggested tests are satisfied. Within most scientific contexts, research groups or teams of scientists develop research directions and focus on promising problems. The logic of investigation, methodology and competence are scrutinized by team members, academic committees, institutional colleagues (hiring, promotion, and tenure processes), by panels of reviewers – grant review groups, independent review boards, editorial boards -- and ultimately by the scientific community at large after publication. Reviews and citation are the ultimate validations of scientific research. In government, data are to some extent or other "certified" by the body or agency responsible.)
If we assume a future in which tens of thousands of data sets are available for review and use, how can any scientist proceed with confidence? (My best assumption, at this point, is that such work will proceed with a presumption of confidence – perhaps little else?)
Jumping ahead, even in a world where confidence in the validity of data can be assured, how can we best assure that valid data are effectively useful?
A year ago in Science a group of bio-medical researchers raised the problem of adequate contextualization of data [SEE: I Sim, et al. “Keeping Raw Data in Context”[letter] Science v 323 6 Feb 2009, p713] Specifically, they suggested:
“a logical model of clinical study characteristics in which all the data elements are standardized to controlled vocabularies and common ontologies to facilitate cross-study comparison and synthesis.” While their focus was on clinical studies in the bio-medical realm, the logic of their argument extends to all data. We already have tools available to us that can specify scientific work flows to a very precise degree. [SEE for example: https://kepler-project.org/ ] It seems entirely possible to me that such tools can be used – in combination with well-formed ontologies built by consensus within disciplinary communities – to systematize the descriptions of scientific investigation and data transformation, and moreover – in combination with socially collaborative applications – to support a systematic process of peer review and evaluation of such work flows.
OK -- so WHAT ABOUT GOVERNMENT INFORMATION??? We’re just government document librarians or just plain citizens trying to make well-informed decisions about policy? Stay tuned…
What is NOT “science”? Why we have a right to “data” as “evidence”…
Submitted by moritz on Mon, 2010-01-11 15:55.
Most of us accept a priori the institutionalized distinction between the sciences and the humanities. If asked, we can tick off the names of “disciplines” that are “scientific” and those that constitute “the humanities”… (The “social sciences” are somehow less centrally – more vaguely? -- “scientific” -- but what do we mean essentially by these distinctions?) [It's worth noting that novelist CP Snow famously posited this distinction in his Cambridge lecture and subsequent book "The Two Cultures" -- ca. 1959.]
We might say that science is “empirical” meaning that it is based upon real, physical evidence? Or perhaps that it’s “inductive” – its theories or “laws” flowing from observations of facts… Or perhaps that it is “quantitative” or "technical" – its conclusions determined by the use of sometimes very complex mathematical logic or by complex apparatus. We might also say that it employs a rigorous methodology that includes exact logical provisions for “falsifiability” [SEE: Karl Popper, The Logic of Scientific Discovery – and elsewhere], for open peer review – including test by replication – and for validation by demonstration of predictive power… Science also is systematically accretive and depends on careful citation and documentation, building upon itself like a coral reef…
But, it strikes me that any humanist should feel uncomfortable at the assumption that the humanities do not – or are incapable of – meeting these standards -- at least most of them in most cases? (I'll leave it to the reader to assess what is most essentially “humanist” – but I often have the uncomfortable sense that the humanities may too often depend for their esoteric authority upon the incoherence of their evidentiary base or upon the imprecision of language or between languages…?)
I attribute "beauty" as a primary motive/value to “the arts”… (The American poet Randall Jarrell once said: “Criticism is the poetry of the prosaic.”) And I heard, anecdotally, a few years ago that the performance artist, Laurie Anderson, was invited to a discussion about “the arts” and “the sciences” and before too long was asking “What are we doing here?” I understood this to be an intuitive recognition that the arts and the sciences are on very similar tracks… I believe that artists are able to operate more spontaneously, intuitively and imaginatively -- perhaps more "aesthetically"? but less "systematically" ? Scientists often operate on that same frontier but with the requirement that they test their intuitions using the scientific method and then publicly disclose their “tests”.
"Belief" is ultimately the subjective preserve of the individual -- and the institutional preserve of religion. Maintaining the distinction between "belief" and reason (or logic) is a fundamental value of the Enlightenment -- particularly in public discourse.
OK so what am I getting at here? And why?
Ultimately all policy -- whether "scientific" or not -- and all human decisions should be based on logical analysis and on evidence. Both evidence and analysis are susceptible to testing, to evaluation and thus to reasoned discussion. Our civil discourse will always be improved by clear specification of analytical logic and by free, open and effective disclosure of empirical evidence or DATA.
Respecting data there is a series of fundamental criteria that must be satisfied to validate its “authenticity” and its probative value (its effectiveness as evidence). As citizens, we have the right to demand that public policy and public decisions be based on well-formed logic and on valid evidence… Discussion that occurs in our public fora should always distinguish between matters of logic and fact and matters of belief.
We’ll pursue these notions – in the context of free, open effective access to data and in the context of science literacy – in future posts…
Data.gov.uk Launches Soon!
Submitted by blakeley on Wed, 2009-11-11 15:22.
Looks like the UK version of data.gov, developed by Sir Tim Berners-Lee, is going to be released soon. It is "language-based" where "linkages are based on human language, rather than hard-coded hyperlinks", a.k.a. the Semantic Web concept that Berners-Lee has been touting for years.
I like the way Nancy Scola of Personal Democracy Forum describes the Semantic Web:
[Berners-Lee] vision is of a web that understands the connections between disparate bits of information in a way similar to how the human mind might effortlessly connect an address on London's Whitehall with the events of World War II that Winston Churchill directed from an underground bunker there. Data woven through with more human ways of interpretation might, just might, make the gap between making government information public and making it useful a little smaller.
The BBC reports that "Data.gov.uk is built with semantic web technology, which will enable the data it offers to be drawn together into links and threads as the user searches...we will also be able to look for patterns...visitors to data.gov.uk will want to make their own mash-ups from the information available."
Yes, and we should be making mashups from our country's data.gov for our library patrons too! Let's get to it! I'll be working on mine and will show you how it can be done.
DataTO.org
Submitted by blakeley on Wed, 2009-11-11 15:03.
Check out DataTO.org, which is similar to data.gov, but users request data sets from the Toronto municipal government. The first phase of the website will allow one to:
...publish a request for data to the community, where members can comment and rate the request. In future iterations of this site, publishers and others will be able to post details of known and existing data sources so that community members can rate them for prioritization. Users will then be able to find data sources that have been published.
Kudos to O'Reilly Radar.
FEC makes data available in multiple formats
Submitted by jajacobs on Thu, 2009-10-29 12:46.
Disclosure Data Catalog, Federal Election Commission
"Each of the files listed here can be downloaded in either csv or xml formats. Each also has a metadata page that describes the information included and the structure of the file itself. There is a pdf version of each file if you need to print the information. You can also subscribe to RSS feeds for each of the files so you're notified whenever new data is available or a change is made."
Also see the Commission's Disclosure Data Blog where the FEC will post information about the files and its future plans. And: they say that "you can get help with any questions about the data we're providing here."
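Having the same file in both csv and xml means you can pick whichever your tools handle best and still end up with the same records. The sketch below shows that with Python's standard library only. The column names (`committee_id`, `total_receipts`), the values and the XML layout are invented for illustration; check each file's metadata page for the real structure.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real files are described on their metadata pages.
CSV_TEXT = "committee_id,total_receipts\nC001,2500.00\nC002,1200.50\n"
XML_TEXT = """<records>
  <record><committee_id>C001</committee_id><total_receipts>2500.00</total_receipts></record>
  <record><committee_id>C002</committee_id><total_receipts>1200.50</total_receipts></record>
</records>"""

def rows_from_csv(text):
    """Parse CSV text into a list of {column: value} dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def rows_from_xml(text):
    """Parse the assumed <records>/<record> layout into the same shape."""
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in record}
            for record in root.findall("record")]

csv_rows = rows_from_csv(CSV_TEXT)
xml_rows = rows_from_xml(XML_TEXT)

# Both formats yield identical records, so downstream analysis is format-agnostic.
total = sum(float(row["total_receipts"]) for row in csv_rows)
print(csv_rows == xml_rows, total)
```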
Free ICPSR Data Conference on the Web
Submitted by jajacobs on Sun, 2009-09-06 09:55.
ICPSR 2009: Real Data in a Virtual World
ICPSR (the Inter-University Consortium for Political and Social Research) is the large social science data archive at the University of Michigan. Every second year, ICPSR hosts a meeting (in Ann Arbor Michigan) for its "Official Representatives" -- one person at each ICPSR member institution. This year, the meeting (October 5-9) is open to all and the meeting is on the web instead of in Ann Arbor.
At the link above, you can find the week-long program and sign up for individual sessions.
Government information specialists, particularly those with responsibilities for data and statistics, should set aside time for this! Sessions are interesting and informative. Some examples from the program:
- Census 2010 & American Community Survey
- Delivering Research Opportunities to Undergraduates
- Online Data Analysis Tools
US Federal Government IT Dashboard: Data and Visualization
Submitted by Kramer-Smyth on Fri, 2009-07-31 19:47.
If you enjoy interactive data visualizations, make sure you visit the US Federal IT Dashboard. This site is meant for use by both the public and the staff of federal agencies. The goal is to make it easy to explore how the federal government is spending its IT dollars.
The major components of the dashboard are:
- Performance Dashboard: supports viewing of major IT investments, filterable by agency
- Data Feeds: will let you select data to download or create a "dynamically XML feed"
- Analysis Visualizations: lets you chart and animate any combination of 15 IT spending statistics from 2002 to present
This blurb off the FAQ page gives a great overview of what this site is all about:
"The IT Dashboard provides the public with an online window into the details of Federal information technology investments and provides users with the ability to track the progress of investments over time. The IT Dashboard displays data received from agency reports to the Office of Management and Budget (OMB), including general information on over 7,000 Federal IT investments and detailed data for nearly 800 of those investments that agencies classify as "major." The performance data used to track the 800 major IT investments is based on milestone information displayed in agency reports to OMB called "Exhibit 300s." Agency CIOs are responsible for evaluating and updating select data on a monthly basis, which is accomplished through interfaces provided on the website."
What features do you want for your catalog of govt data?
Submitted by jrjacobs on Fri, 2009-07-24 15:41.
Data is definitely getting sexy. Jonathan Gray of the Open Knowledge Foundation asks "What features should be included in a catalogue of open government data?" and points to a few other data repositories being built on the state and country level (like my own city of SF's CivicDB!). He also mentions the Sunlight Foundation's plan to build on and expand data.gov with a national data catalog that I had meant to write about a couple of weeks ago (go Sunlight!). So I'm putting this question to you, FGI's faithful readers. I'm sure you'll have a thing or three to add to the list of requirements for open data catalogs.
Here are a few suggestions for those building catalogues for (open) government data based on our experience developing CKAN:
- Make the catalogue itself open!
- Let others download the catalogue data in bulk (not just via an API)
- Include information on how to get the data, and how it can be used
- Make it versioned!
[originally tweeted by @EllnMllr. Go ahead and follow her!]
ALA Annual GODORT Update Meeting: "Need Data – but don’t know where to go?"
Submitted by blakeley on Sat, 2009-07-11 06:44.
Death and Taxes
Submitted by justgrimes on Sun, 2009-05-31 20:05.
Death and Taxes - A Graphical Visualization of the Federal Budget
Death and Taxes is a large representational graph and poster of the federal budget. It contains over 500 programs and departments and almost every program that receives over 200 million dollars annually. The data is straight from the president's 2009 budget request and will be debated, amended, and approved by Congress to begin the fiscal year. All of the item circles are proportional in size to their spending totals and the percentage change from 2008 is included to spot trends and disproportion.