Monitoring, Modeling, and Memory

Dynamics of Data and Knowledge in Scientific Cyberinfrastructures

Science special issue on “Dealing with Data”

Posted by Paul N. Edwards on February 11, 2011

Reposting a pointer from Cliff Lynch —

The February 11, 2011 issue of Science has a special section titled “Dealing with Data” with a number of papers and articles covering data intensive science and data curation issues.

They have set up a website that consolidates some of the material from this issue and some related topical material from other Science journals (Signaling, Translational Medicine, Careers) for public access (registration required for non-subscribers).

Posted in Uncategorized | 1 Comment »

Replicating results with published data + code: very hard to do…

Posted by Paul N. Edwards on February 11, 2011

James Howison gave a fascinating talk here yesterday about scientific software. Part of his argument was that different sciences have taken different routes to both producing and using specialized software. These range from widely adopted, full-on production codes, which tend to involve their authors in maintenance work that can go on for years (but may produce no scientific credit for them), to scripts written for personal use and never shared at all.

In the case of shared codes, Howison argued, the problem of making them work on different platforms, and in different software contexts (as operating systems and other software packages needed by the scientific software change around them), is far from trivial. It frequently means that scientific software packages don’t live very long.

In questions, one of our faculty pointed out that some economics journals have required authors to publish both data and code for some time. Didn’t this simply take care of the problems of (a) credit for software development and (b) replicability of results? Howison responded by citing McCullough, B. D., McGeary, K. A., & Harrison, T. D. (2006). Lessons from the JMCB Archive. Journal of Money, Credit, and Banking 38(4), 1093–1107 (direct link for UM people here.)

Here are a few tidbits from that article, relevant not only to our work on data, models, and software, but also to our thinking about metadata. Some highlights here are in the original, some are mine:

We examine the online archive of the Journal of Money, Credit, and Banking, in which an author is required to deposit the data and code that replicate the results of his paper. We find that most authors do not fulfill this requirement. Of more than 150 empirical articles, fewer than 15 could be replicated.

Part of the reason for this was noncompliance; despite the policy, only 58 articles included both data and code.

…58 archive entries had at least some data and some code. For each, we attempted to use the supplied data and code to replicate the published results. We made minor alterations to data and code to try to get the code to run with the data, but we did not attempt major alterations. Consequently, for some articles that we could not replicate, it is possible that there exists some data and code combination that will reproduce the reported results, but we are certain that it is not the combination that is in the JMCB archive. Since journal policy requires authors to deposit in the archive the data and code that will replicate their results, if we cannot use the data and code in the archive to replicate an author’s results, it is fair to say that the author did not honor the policy.

Many authors took what can only be considered a desultory approach to fulfilling the requirement, not even caring whether the data would run with the code. In many cases, the author had specifically not provided the data that ran with the code, and instead had provided the data in some alternate format. The obvious implication of such an action is that it makes replication difficult, sometimes requiring much effort to put the data into a format that would run with the code. For example, one author’s code reads an “.xls” file, but the data provided is in “.prn” format. Another author’s code calls a “.rat” file and his provided data are just the output from a “print” command that would take great effort to turn into a machine readable data file. Occasionally, authors provided data in a program-specific format, so that researchers who did not have that program could not access the data.

One author provides no readme file and two data files with no column headers: we are supposed to guess the names of the variables! Other authors seem to think that the entire world shares the exact same hard drive layout, with “C:\MYDATA\MYPROJECT\” sprinkled liberally throughout their code. Of course, a would-be replicator has to find and change all these. Moreover, the author might not realize all the data/subroutine files that his code utilizes, and forget to include said data/subroutine in his replication files. For example, some authors forgot to include code for a subroutine that existed in yet another subdirectory, and similarly other researchers forgot data files.

We recommend that all data be provided in ASCII format, and that the version of the code submitted to the archive call these same ASCII files. Additionally, the first program should print summary statistics on all the variables, so that subsequent researchers can be sure that they have loaded the data correctly.

Good luck with that.
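To make the recommendation concrete, here is a minimal sketch of what a "first program" could look like. It is written in Python rather than in any package the JMCB authors used, and it rests on several assumptions: the ASCII file is comma-separated with a header row, every column is numeric, and the file name `sample.csv`, the column names, and the function names are all made up for illustration. The data path is passed in as an argument instead of being hard-coded to one person's drive layout.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def load_ascii_table(path):
    """Read a comma-separated ASCII file with a header row into named columns of floats."""
    with open(path, newline="", encoding="ascii") as handle:
        reader = csv.DictReader(handle)
        columns = {name: [] for name in (reader.fieldnames or [])}
        for row in reader:
            for name in columns:
                columns[name].append(float(row[name]))
    return columns


def summarize(columns):
    """Return count, mean, min, and max for every variable."""
    return {
        name: {
            "n": len(values),
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }
        for name, values in columns.items()
    }


# Demo with a made-up two-variable file written to a temporary directory,
# so the sketch runs anywhere without depending on a fixed path.
with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "sample.csv"  # hypothetical file name
    data_file.write_text(
        "rate,spread\n1.5,0.2\n2.5,0.4\n3.5,0.9\n", encoding="ascii"
    )
    for name, stats in summarize(load_ascii_table(data_file)).items():
        print(name, stats)
```

A replicator who sees different counts or ranges than the ones the author reports knows right away that the data did not load as intended, which is the whole point of the recommendation.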

Another problem, of course, is software versioning:

Simply having data and code and possessing the same software package does not guarantee that a researcher will be able to replicate the author’s results. We know of a package that in two successive releases produced different results for the same “calculate the correlation matrix” problem. We know also of a few packages that, in successive releases, produced different answers for the same nonlinear estimation problem.
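One partial remedy is to store a record of the software environment next to the results. The sketch below uses Python as a stand-in (the article itself concerns econometrics packages such as RATS and TSP), and the function name `environment_record` and the package names passed to it are illustrative assumptions. It does not make old versions obtainable; it only tells a later reader which versions to look for.

```python
import json
import platform
from importlib import metadata


def environment_record(packages):
    """Collect interpreter, platform, and package versions for a replication archive."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# The package names here are examples only; list whatever the analysis imports.
print(json.dumps(environment_record(["numpy", "pip"]), indent=2, sort_keys=True))
```

The JSON output could be saved as one more file in the archive, alongside the readme.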

The issue of software metadata — commented code — rose to the top as a primary concern:

…Putting code in an archive is not simply a matter of depositing uncommented, unclear code. In fact, the code is a better record of what was actually done to the data than is the article. Moreover, the myriad minor decisions for which there simply is insufficient space in print are revealed in the code, and hopefully made clear via extensive commenting of the code. We recommend that the readme file list all the replication files with a brief description of each.

Writing code for replicable research is not as easy as it sounds… It takes an enormous amount of effort but, then again, this code does represent a contribution to the cumulative body of knowledge: in terms of actually building the body of knowledge and enabling others to make use of an author’s research, the code is in many ways no less important than the article itself. For example, the code should be written and commented so that someone with a different package can build on the existing research. Consequently, if an author writes an article using RATS and a second researcher uses TSP, as a general rule the second researcher should not need to obtain the RATS reference manual and learn RATS just to extend the author’s results. As a simple example, consider the following RATS code:

linreg(spread=x4) y
# constant x1 x2 x3
restrict 1
#2
#1 -1

It is obvious enough that this is some sort of regression of y on a constant and three independent variables, but the rest of the code is not intuitive. It may reproduce the results in the paper, but it hardly helps a TSP user understand what was done. Even RATS users may have to consult the reference manual. A few comment lines would remedy the situation…

Maybe. Maybe not. That depends on the comments, on how distant the user might be from the discipline or subdiscipline of the original authors, and on how old the software might be.

Howison says this article has had some follow-ons, which I haven’t looked at yet. He pointed me to Baiocchi, G. (2007). Reproducible research in computational economics: guidelines, integrated approaches, and open source software. Computational Economics 30:19–40, which includes the following quote:

They [McCullough et al., discussed above] convincingly argue that, though most empirical work could still not be reproduced, the requirement of a data and code archive should be adopted by more journals and that stricter rules that ensure compliance from the author should be introduced.

Yet another resort to the stick, not the carrot, to get metadata — even while recognizing that metadata won’t solve most of the problems.

The MMM response here might be: is such a policy worth the effort? When? How much effort is enough? And how do we know?

Clearly it would be worth it in some cases — but how many? Many results aren’t significant enough to warrant replication. And many codes would be impossible to use even 5 or 10 years after publication, due to the unavailability of the original versions and/or software contexts.

Alternatively, MMM would ask, if someone really wants to reproduce a result, isn’t it likely that s/he will contact the authors of the article for help? The obsession with metadata products obscures the major role of ad hoc processes — communication among scientists — as a normal and often fully adequate way of obtaining metadata. Another MMM response: the amount and type of metadata required will change with the would-be user’s distance in time and/or disciplinary context from those of the original authors. What’s adequate next year for economists looking at JMCB may be impossibly thin for economic historians looking at the same information 20 years from now.

Posted in Current Reading, Uncategorized | 7 Comments »

Critical Code Studies – digital humanities

Posted by Paul N. Edwards on January 27, 2011

I’ve been stuck on this mailing list for a while (Society for Literature, Science, and the Arts).

Occasionally, interesting bits float by. Here’s one of them.

- Paul

From: Mark Marino <markcmarino>

Date: January 27, 2011 3:06:07 AM EST

To: litsci-l

Subject: Critical Code Studies news

Reply-To: litsci-l, Mark Marino <markcmarino>

Hi, all,

At several SLSAs I’ve given presentations on what I call Critical Code Studies, a way of using computer source code as an entryway into discussions of digital objects and culture.

Right now there are places to explore and extend those discussions:

1) HASTAC Scholars forum:
A very lively debate (happening as I type this), featuring a main thread and some objects for “code critiques.”

General thread:
http://www.hastac.org/forums/hastac-scholars-discussions/critical-code-studies

Code Critiques:
http://www.hastac.org/forums/hastac-scholars-discussions/code-critiques

2) CCS @ USC conference proceedings:
http://vectorsjournal.org/thoughtmesh/critcode
These just went online last night. They feature the texts of the talks, videos, and code, as well as an opportunity to join the discussion by hitting the “Peer Review” tab on each talk or panel.

The proceedings are published under USC’s Vectors journal on the Thoughtmesh platform, an exciting venue that in many ways recreates the conference experience of intersecting conversations, without all the cab fare and wrinkled outfits.

3) CCS Working Group in electronic book review
http://www.electronicbookreview.com/thread/firstperson/ningislanded

We’re editing and publishing the weekly threads from last year’s Critical Code Studies working group. The first week and an introductory essay are up now. The following weeks will appear over the next several months.

4) The Critical Code Studies blog
http://criticalcodestudies.com

Please join us for these discussions as we explore ways of talking about and through analyses of code. These are the conversations that will develop this field and the CCS panels to come at future SLSAs.

Best,
Mark Marino
Writing Program
University of Southern California
http://WriterResponseTheory.org
http://CriticalCodeStudies.com
--
The Litsci-L archive is viewable on the Web at:
http://litsci.org

Posted in Uncategorized | Leave a Comment »

Online and lightweight qual coding tool

Posted by dribes on January 26, 2011

An interesting possibility for qualitative data coding and analysis.
Free, online, and easily supports collaborative activity.

What is ASTOUNDING is how lightweight it is compared to the cluttered
horrors of NVIVO.

A concern would be the long term sustainability of the toolset
(project or facility?). They do allow you to export your data, but I
don’t know what the export file looks like…

david.

Originally from Matt Burton:

Sean Munson just showed me this, Saturate App, a web application for
collaborative qualitative data storage, coding, and analysis.
Check out this overview video, it looks pretty impressive.

this is INFINITELY more usable & collaborative than NVIVO, especially
on a mac or linux.


mcb

Posted in Uncategorized | Leave a Comment »

“Modeling” memory is easier than “monitoring” memory

Posted by archer on January 25, 2011

In reviewing some interview notes, I saw this segment from an interview with someone who works for the Earth System Grid. ESG is planning to add observational data to the data portal in addition to its current collection of model data. I asked him what it will be like to move from a system that handles just data output from models to include data collected from observational systems. His reply:

“Oh yeah it’s going to be a big jump. Because the model data is easy compared to the observational data…”

“What makes model data easy compared to observational data?”

“Well, for one thing, it’s all already in a nice gridded format. I mean, you got the nice 2D and 3D pieces, there doesn’t tend to be any missing data, like… I mean, observational data requires all the work just to be able to take it from what the center says to something that a human can use. And it’s already in a pretty well-defined format, either GRIB or NetCDF or something like that. It’s just probably… I mean, it’s… Since it’s an idealized representation of the world, I guess, in some ways the data is seen as kind of an idealized data format and data that it’s a lot easier to… Easier doesn’t mean easy but… I’m reading articles about observational data and I’ve accessed enough of it that it’s really, really hard sometimes.”

We haven’t often had our three themes of monitoring, modeling, and memory come up as analytical concepts, but this instance was striking because it nicely showed a relationship between them. An idealized system for generating data results in an idealized data format that is easier to store. Model runs with non-idealized data can be repeated, keeping the data cleaner. I don’t want to pretend that model data is always clean or uncomplicated, but it does seem that in some real senses it could be simpler than observational data.

Posted in Reflections, Uncategorized | Leave a Comment »

Data, data, everywhere…

Posted by Paul N. Edwards on January 22, 2011

We’ve been thinking mainly about data in science, with its associated problems of storing, finding, and forgetting. But it’s not just science.

Here’s an article about explosive data growth in government, and how much it’s costing us:

Our nation is drowning in data. At any given time, federal agencies use more electronic storage units than could fill every NFL stadium from Oakland to Foxboro. At last count, the US government owns or leases at least 2100 data centers, and spends about half of its multi-billion dollar IT budget on digital storage. The United States Census Bureau alone maintains about 2560 terabytes of information — more data than is contained in all the academic libraries in America, and the equivalent of about 50 million four-door filing cabinets of text documents. In addition to the federal deluge, tens of thousands of municipal and state facilities maintain data ranging from driver’s-license pics to administrative e-mails — or at least they’re required to.

An interesting point raised by this article is that even as storage burdens become crushing, human beings to help with organizing and finding data are losing their jobs, especially at the state and municipal levels. Data.gov has big press and big ambitions, but much of the stuff lower down in the system is rotting away.

Posted in News | Leave a Comment »

On time: Victorian “time table of the world’s principal cities”

Posted by Paul N. Edwards on January 21, 2011

We were talking about time again. Steve finally started reading Robert Grudin’s Time and the Art of Living, one of the most beautiful and worthwhile books I know.

Now, check out this Victorian-era infographic (1883) showing over 100 world cities with their local times relative to Washington D.C. (at the center of the temporal universe, according to the graphic.)

Look twice. Because the times at each city vary not by today’s standard (multiples of one hour), but by multiples of one minute. Or less.

When it’s noon in Washington D.C., it’s 7:49 PM in Mecca and 6:33 PM in Warsaw.

There’s no information with the graphic to tell you the basis of time at each location, but given the era it was probably loosely centered on the solar clock. Noon at each location was the moment the sun reached its zenith. A local observatory would have sent out a time signal by telegraph, and/or lowered a time ball mounted on a high mast, and/or fired a gun to mark the moment. Locals would have set their clocks and watches to that signal.

Observatories and telegraph companies also got into the business of selling time signals to businesses within a hundred miles or so of their location. Further west or east than that and it would’ve been time (ho ho) to get a different signal.
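The arithmetic behind those odd-looking offsets is simple: the earth turns 360 degrees in 24 hours, so each degree of longitude is worth 4 minutes of local solar time. Here is a rough sketch in Python. The longitudes are my own approximations for each city, not values taken from the graphic, and the calculation ignores whatever reference points and corrections the graphic's makers actually used.

```python
from datetime import datetime, timedelta

MINUTES_PER_DEGREE = 4  # 24 hours * 60 minutes / 360 degrees of longitude

# Approximate longitudes in degrees (east positive, west negative).
# These are assumed values for illustration, not taken from the graphic.
LONGITUDE = {"Washington": -77.04, "Mecca": 39.83, "Warsaw": 21.01}


def local_time_offset(reference, city):
    """Difference in local mean solar time between a city and a reference city."""
    degrees_east = LONGITUDE[city] - LONGITUDE[reference]
    return timedelta(minutes=degrees_east * MINUTES_PER_DEGREE)


noon = datetime(1883, 1, 1, 12, 0)  # the date is arbitrary; only the clock time matters
for city in ("Mecca", "Warsaw"):
    local = noon + local_time_offset("Washington", city)
    print(city, local.strftime("%I:%M %p"))
```

With these assumed longitudes the results land within a minute or two of the times shown on the graphic, which fits the guess that the graphic is based on local solar time.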

It’s part of an exhibit of Victorian graphics at BibliOdyssey, also worth checking out.

Posted in Current Reading, Reflections | 1 Comment »

CFP on methods for studying virtual environments and online social networks

Posted by Paul N. Edwards on January 19, 2011

Caught this in Technoscience and thought our grad students should know about it.

This journal special issue encourages submissions from graduate students and junior academics.

Methodological approaches to the study of virtual environments and online social networks

Deadline: March 15 2011

http://gjss.org

Call for Papers and Book Reviews: Methodological approaches to the study of virtual environments and online social networks. The Graduate Journal of Social Science (GJSS) announces a Call for Papers and Book Reviews for a special issue dealing with methodological approaches to the study of virtual environments and online social networks. The journal encourages the submission of work by MSc/MA/MS, MPhil, PhD students and junior academics from all geographic regions. All papers are submitted to a blind peer review process. The special issue is scheduled for December 2011.

Posted in Publications, Uncategorized | Leave a Comment »

Science review of A Vast Machine

Posted by Paul N. Edwards on January 15, 2011

A review by Richard Somerville just came out. You can see it here.

In the same review, Somerville discusses philosopher Eric Winsberg’s new book Science in the Age of Computer Simulation. Winsberg is one of a few intrepid philosophers who have taken up the challenge of understanding the logic of simulation and modeling, which lie at the core of modern science (and which I discuss extensively in A Vast Machine.)

From the review:

Winsberg suggests that philosophy of [contemporary] science… ought to concern itself with the subject of simulating complex phenomena within existing theory, as opposed to its traditional focus on the creation of novel scientific theories. Winsberg concludes,

[W]hat we might call the ontological relationship between simulations and experiments is quite complicated. Is it true that simulations are, after all, a particular species of experiment? I have tried to argue against this claim, while at the same time insisting that the differences between simulation and experiment are more subtle than some of the critics of the claim have suggested. Most important, I have tried to argue that we should disconnect questions about the identity of simulations and experiments from questions of the epistemic power of simulations.

Philosophy has been trailing the actual state of science for a long time now, so it’s good to see this kind of work coming out.

I’m afraid, though, that it’s still trailing the bleeding edge — we’ve entered an age of data-intensive science, which presents its own epistemic challenges: for example, how much does theory matter when statistical analysis of huge datasets reveals strong correlations? If predictive power is your main goal, sometimes data can take the place of explanation. (Not sure I actually believe this, but it’s a compelling point of view.) Take a look at Hey et al., The Fourth Paradigm if this kind of thing interests you.

Posted in News, Reflections, Reviews, Uncategorized | 1 Comment »

Time

Posted by Paul N. Edwards on January 14, 2011

We were talking about time and rhythm today – here are a few references:

Dohrn-van Rossum, Gerhard. History of the Hour: Clocks and Modern Temporal Orders. Chicago: University of Chicago Press. 1996. Focuses on the medieval world, but covers the ancient world through the present. Fascinating discussion of the medieval order — when hours were of variable length (1/12 of the time between sunset and sunrise, or vice versa) — and its evolution into the modern temporal order with hours and other units having equal length. Clocks and bells as signaling systems, going off all day and much of the night as well to mark various prayers, community tasks, come-in-from-the-fields, etc. — the medieval world was REALLY LOUD!

Grudin, Robert. Time and the Art of Living. New York: Ticknor & Fields. 1988. I adore this book. Put it by your bedside – numbered paragraphs, à la Wittgenstein, each a small meditation on time.

Høeg, Peter. Borderliners. New York: Farrar Straus and Giroux. 1994. Same guy that wrote Smilla’s Sense of Snow. An extremely controlled, possibly autobiographical story about growing up in an orphanage in Denmark, but much of the book is about the psychological experience of time.

Then, of course, there’s the immortal Pink Floyd song “Time,” from Dark Side of the Moon. Maybe the best thing ever written about mortality.

Posted in Reflections, Uncategorized | 1 Comment »

 