Monitoring, Modeling, and Memory

Dynamics of Data and Knowledge in Scientific Cyberinfrastructures

Archive for February, 2011

LTER goes after climate change

Posted by Paul N. Edwards on February 23, 2011

LTER is getting involved in climate change studies. Sounds like the reporter didn’t really investigate LTER’s original purpose, though.

“The record snows across the United States this winter may be seen as a harbinger of the extreme weather expected from global warming, but figuring out how much the planet is warming and what the impact might be will take long-term studies. The Long-Term Ecological Research project, started by the National Science Foundation in 1980, is doing just that, with 26 sites, most located in the U.S., collecting data related to climate change. And at a symposium in Washington, March 2, seven researchers will present results from a sampling of LTER projects.”

Full story here.

Also, we should eat more insects.

Posted in Uncategorized | Leave a Comment »

Data.Rescue@Home

Posted by Paul N. Edwards on February 18, 2011

I knew this was coming: crowdsourcing climate data.

Data.Rescue@Home is an internet-based attempt to digitize historical weather data from all over the globe and make the digitised data available to everybody. Two projects are currently online: German radiosonde data from the Second World War and meteorological station data from Tulagi (Solomon Islands) for the first half of the 20th century.

You log in, look at a scanned image of a weather record, and enter the data as numbers on a form.

Not much progress yet: the site has been up and running since October 2010, but only about 150 of 2,000 scanned images have been coded. Where are the masses when you need them?

Posted in Uncategorized | Leave a Comment »

New climate variability results: models and data, again

Posted by Paul N. Edwards on February 18, 2011

The New York Times reported yesterday on two new Nature papers on climate change, which use computer simulation to link extreme precipitation events to anthropogenic global warming. They are expected to stir up debate again.

Meanwhile, a few weeks ago the 20th-Century Reanalysis Project reported on recent results of the longest-term weather data reanalysis project yet, collecting every scrap of available weather data from 1871-2008 and running them through a weather forecast model to “fill in the blanks” for what’s missing.

A salient finding from this study: changes in the North Atlantic Oscillation (also see the North Atlantic Oscillation theme site) appear to be driven throughout the study period primarily by natural variability. In other words, the reanalysis isn’t seeing an effect of global warming on variability in the NAO.

The reanalysis data go back to 1871 — but as they go back in time, they get thinner and thinner. Most data prior to the 1950s are from the surface only. The reanalysis model fills in the missing data. So the large majority of data in the pre-1950s reanalysis are created by the model.

The Nature studies are looking at an entirely different kind of variability, i.e. frequency of extreme precipitation events in the UK (one study) and the Northern Hemisphere (the second study). (It’s worth jumping to the actual articles from the links given on the Nature news page.) These studies compare observational data with results from simulation models with and without anthropogenic forcing (i.e. greenhouse gases and other human influences on climate). The results: (a) natural variability alone can’t account for the increased northern hemisphere precipitation in the second half of the 20th century, and (b) anthropogenic factors, added to the simulation models, doubled the risk of the floods experienced in the UK in 2000.

This, combined with the comments on the two Nature pieces, makes for a lovely skeptic paradox. The skeptics are very happy with the results from the model-driven reanalysis data which (they think) confirm their views. (Another nail in the coffin of AGW, one wrote.) But they roundly reject the idea that simulation models could explain the significant increase in extreme precipitation.

By the way, Piers Corbyn, mentioned in the Kevin Crean comment on the Nature news page, runs a commercial long-term weather prediction service in the UK using his own “solar/lunar” model, whose details he will not reveal and which has never been peer reviewed. He’s had some notable successes in forecasting major storms months in advance. He places bets on his own forecasts (and sometimes wins). He’s a skeptic in the Christopher Monckton vein. (Monckton, by the way, claims to be a hereditary member of the House of Lords, but the Lords are having none of it.)

I’m going to be working on an op-ed about this over the weekend. Comments welcome.

 

Posted in Uncategorized | 2 Comments »

[Taxacom]: Data persistence

Posted by Paul N. Edwards on February 11, 2011

From an Ars Technica post, by way of Taxacom:

CERN scientists and researchers from several other facilities have grouped together to preserve data by creating DPHEP (Data Preservation in High Energy Physics). DPHEP recommends that research budgets provide for a data archivist position. The data archivist will preserve data along with key supplementary information that is necessary to interpret and put the data in perspective for future generations. They also recommend creating virtualized software that simulates the computers of today, so whatever programs current physicists use for their data workup can be used long after present technology expires.

Two points here, following up on another post from earlier today. First, DPHEP is upping the metadata ante considerably by requiring not just publication of code but also the production and maintenance of emulators that could run that code, even much later. Again: is this worth the effort? When? How much effort? At what cost?

Second, though: institutionalizing positions for data archivists would make a lot of sense. Such positions would go far toward solving a sociotechnical problem in the most flexible way, i.e. with people rather than (only) technology. Here’s where training comes in, and where there’s a potentially huge role for iSchools and their graduates.

Posted in Uncategorized | Leave a Comment »

Science special issue on “Dealing with Data”

Posted by Paul N. Edwards on February 11, 2011

Reposting a pointer from Cliff Lynch —

The February 11, 2011 issue of Science has a special section titled “Dealing with Data” with a number of papers and articles covering data intensive science and data curation issues.

They have set up a website that consolidates some of the material from this issue and some related topical material from other Science journals (Signaling, Translational Medicine, Careers) for public access (registration required for non-subscribers).

Posted in Uncategorized | 1 Comment »

Replicating results with published data + code: very hard to do…

Posted by Paul N. Edwards on February 11, 2011

James Howison gave a fascinating talk here yesterday about scientific software. Part of his argument was that different sciences have taken different routes to both producing and using specialized software. These range from widely adopted, full-on production codes, which tend to involve their authors in maintenance work that can go on for years (but may produce no scientific credit for them), to scripts written for personal use and never shared at all.

In the case of shared codes, Howison argued, the problem of making them work on different platforms, and in different software contexts (as operating systems and other software packages needed by the scientific software change around them), is far from trivial. It frequently means that scientific software packages don’t live very long.

In questions, one of our faculty pointed out that some economics journals have required authors to publish both data and code for some time. Didn’t this simply take care of the problems of (a) credit for software development and (b) replicability of results? Howison responded by citing McCullough, B. D., McGeary, K. A., & Harrison, T. D. (2006). Lessons from the JMCB Archive. Journal of Money, Credit, and Banking 38(4), 1093–1107 (direct link for UM people here.)

Here are a few tidbits from that article, relevant not only to our work on data, models, and software, but also to our thinking about metadata. Some highlights here are in the original, some are mine:

We examine the online archive of the Journal of Money, Credit, and Banking, in which an author is required to deposit the data and code that replicate the results of his paper. We find that most authors do not fulfill this requirement. Of more than 150 empirical articles, fewer than 15 could be replicated.

Part of the reason for this was noncompliance; despite the policy, only 58 articles included both data and code.

…58 archive entries had at least some data and some code. For each, we attempted to use the supplied data and code to replicate the published results. We made minor alterations to data and code to try to get the code to run with the data, but we did not attempt major alterations. Consequently, for some articles that we could not replicate, it is possible that there exists some data and code combination that will reproduce the reported results, but we are certain that it is not the combination that is in the JMCB archive. Since journal policy requires authors to deposit in the archive the data and code that will replicate their results, if we cannot use the data and code in the archive to replicate an author’s results, it is fair to say that the author did not honor the policy.

Many authors took what can only be considered a desultory approach to fulfilling the requirement, not even caring whether the data would run with the code. In many cases, the author had specifically not provided the data that ran with the code, and instead had provided the data in some alternate format. The obvious implication of such an action is that it makes replication difficult, sometimes requiring much effort to put the data into a format that would run with the code. For example, one author’s code reads an “.xls” file, but the [data] provided is in “.prn” format. Another author’s code calls a “.rat” file and his provided data are just the output from a “print” command that would take great effort to turn into a machine readable data file. Occasionally, authors provided data in a program-specific format, so that researchers who did not have that program could not access the data.

One author provides no readme file and two data files with no column headers: we are supposed to guess the names of the variables! Other authors seem to think that the entire world shares the exact same hard drive layout, with “C:\MYDATA\MYPROJECT\” sprinkled liberally throughout their code. Of course, a would-be replicator has to find and change all these. Moreover, the author might not realize all the data/subroutine files that his code utilizes, and forget to include said data/subroutine in his replication files. For example, some authors forgot to include code for a subroutine that existed in yet another subdirectory, and similarly other researchers forgot data files.

We recommend that all data be provided in ASCII format, and that the version of the code submitted to the archive call these same ASCII files. Additionally, the first program should print summary statistics on all the variables, so that subsequent researchers can be sure that they have loaded the data correctly.

Good luck with that.
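To be fair, the recommendation itself is not hard to act on. Here is a minimal Python sketch of what a “first program” along these lines might look like. It is only an illustration, not anything taken from the JMCB article: the `data` folder, the function names, and the assumption of a comma-separated ASCII file with a header row of numeric columns are all mine.

```python
import csv
import statistics
from pathlib import Path

# Hypothetical layout: data files sit in a "data" folder next to this script,
# so nothing depends on one person's C:\MYDATA\MYPROJECT\ drive layout.
try:
    PROJECT_DIR = Path(__file__).resolve().parent
except NameError:  # e.g. when pasted into an interactive session
    PROJECT_DIR = Path.cwd()
DATA_DIR = PROJECT_DIR / "data"


def load_columns(path):
    """Read a plain-ASCII CSV with a header row into {column name: [floats]}."""
    with open(path, newline="", encoding="ascii") as handle:
        reader = csv.DictReader(handle)
        columns = {name: [] for name in reader.fieldnames}
        for row in reader:
            for name, value in row.items():
                columns[name].append(float(value))
    return columns


def summarize(columns):
    """Return per-variable summary statistics for a replicator to compare."""
    return {
        name: {
            "n": len(values),
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }
        for name, values in columns.items()
    }


def print_summary(summary):
    """Print one line per variable, as the first thing the program does."""
    for name, stats in summary.items():
        print(f"{name}: n={stats['n']} mean={stats['mean']:.4f} "
              f"min={stats['min']} max={stats['max']}")
```

A replicator would call something like `print_summary(summarize(load_columns(DATA_DIR / "sample.csv")))` (the file name is made up) and compare the printed numbers with those the authors report. Whether authors would actually bother is another matter.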

Another problem, of course, is software versioning:

Simply having data and code and possessing the same software package does not guarantee that a researcher will be able to replicate the author’s results. We know of a package that in two successive releases produced different results for the same “calculate the correlation matrix” problem. We know also of a few packages that, in successive releases, produced different answers for the same nonlinear estimation problem.
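One partial mitigation is to record the exact software versions alongside the results, so a later replicator at least knows which releases produced them. The sketch below is my own illustration in Python, using only the standard library; the function name and the package names passed to it are placeholders, and of course knowing the version number does not help if that version can no longer be obtained or run.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect interpreter, OS, and package versions for a replication archive."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed here; worth flagging
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "numpy" and "pandas" are placeholder names; list whatever the analysis uses.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

The resulting JSON could be deposited in the archive next to the data and code.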

The issue of software metadata — commented code — rose to the top as a primary concern:

…Putting code in an archive is not simply a matter of depositing uncommented, unclear code. In fact, the code is a better record of what was actually done to the data than is the article. Moreover, the myriad minor decisions for which there simply is insufficient space in print are revealed in the code, and hopefully made clear via extensive commenting of the code. We recommend that the readme file list all the replication files with a brief description of each.
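The readme recommendation, at least, lends itself to a mechanical check. The following Python sketch is my own illustration, not the article’s: given a folder and a hand-written mapping of file names to descriptions (both hypothetical), it drafts the readme listing and reports files that are described but absent, or present but undescribed, which is roughly the forgotten-subroutine problem quoted earlier.

```python
from pathlib import Path


def build_readme(archive_dir, descriptions):
    """Return (readme text, missing files, undescribed files) for an archive.

    `descriptions` maps a relative file name to a one-line description.
    """
    archive = Path(archive_dir)
    # Every file actually present, as a path relative to the archive root.
    present = {
        p.relative_to(archive).as_posix()
        for p in archive.rglob("*")
        if p.is_file()
    }
    missing = sorted(name for name in descriptions if name not in present)
    undescribed = sorted(present - set(descriptions))
    lines = ["Replication files", ""]
    for name in sorted(descriptions):
        lines.append(f"{name}: {descriptions[name]}")
    return "\n".join(lines) + "\n", missing, undescribed
```

It cannot tell whether the descriptions are any good, only whether they exist.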

Writing code for replicable research is not as easy as it sounds… It takes an enormous amount of effort but, then again, this code does represent a contribution to the cumulative body of knowledge: in terms of actually building the body of knowledge and enabling others to make use of an author’s research, the code is in many ways no less important than the article itself. For example, the code should be written and commented so that someone with a different package can build on the existing research. Consequently, if an author writes an article using RATS and a second researcher uses TSP, as a general rule the second researcher should not need to obtain the RATS reference manual and learn RATS just to extend the author’s results. As a simple example, consider the following RATS code:

linreg(spread=x4) y
# constant x1 x2 x3
restrict 1
#2
#1 -1

It is obvious enough that this is some sort of regression of y on a constant and three independent variables, but the rest of the code is not intuitive. It may reproduce the results in the paper, but it hardly helps a TSP user understand what was done. Even RATS users may have to consult the reference manual. A few comment lines would remedy the situation…

Maybe. Maybe not. That depends on the comments, on how distant the user might be from the discipline or subdiscipline of the original authors, and on how old the software might be.

Howison says this article has had some follow-ons, which I haven’t looked at yet. He pointed me to Baiocchi, G. (2007). Reproducible research in computational economics: guidelines, integrated approaches, and open source software. Computational Economics 30:19–40, which includes the following quote:

They [McCullough et al., discussed above] convincingly argue that, though most empirical work could still not be reproduced, the requirement of a data and code archive should be adopted by more journals and that stricter rules that ensure compliance from the author should be introduced.

Yet another resort to the stick, not the carrot, to get metadata — even while recognizing that metadata won’t solve most of the problems.

The MMM response here might be: is such a policy worth the effort? When? How much effort is enough? And how do we know?

Clearly it would be worth it in some cases — but how many? Many results aren’t significant enough to warrant replication. And many codes would be impossible to use even 5 or 10 years after publication, due to the unavailability of the original versions and/or software contexts.

Alternatively, MMM would ask, if someone really wants to reproduce a result, isn’t it likely that s/he will contact the authors of the article for help? The obsession with metadata products obscures the major role of ad hoc processes — communication among scientists — as a normal and often fully adequate way of obtaining metadata. Another MMM response: the amount and type of metadata required will change with the would-be user’s distance in time and/or disciplinary context from those of the original authors. What’s adequate next year for economists looking at JMCB may be impossibly thin for economic historians looking at the same information 20 years from now.

Posted in Current Reading, Uncategorized | 6 Comments »

 