Monitoring, Modeling, and Memory

Dynamics of Data and Knowledge in Scientific Cyberinfrastructures

Archive for the ‘Current Reading’ Category

Replicating results with published data + code: very hard to do…

Posted by Paul N. Edwards on February 11, 2011

James Howison gave a fascinating talk here yesterday about scientific software. Part of his argument was that different sciences have taken different routes to both producing and using specialized software. These range from widely adopted, full-on production codes, which tend to involve their authors in maintenance work that can go on for years (but may produce no scientific credit for them), to scripts written for personal use and never shared at all.

In the case of shared codes, Howison argued, the problem of making them work on different platforms, and in different software contexts (as operating systems and other software packages needed by the scientific software change around them), is far from trivial. It frequently means that scientific software packages don’t live very long.

In questions, one of our faculty pointed out that some economics journals have required authors to publish both data and code for some time. Didn’t this simply take care of the problems of (a) credit for software development and (b) replicability of results? Howison responded by citing McCullough, B. D., McGeary, K. A., & Harrison, T. D. (2006). Lessons from the JMCB Archive. Journal of Money, Credit, and Banking 38(4), 1093–1107.

Here are a few tidbits from that article, relevant not only to our work on data, models, and software, but also to our thinking about metadata. Some highlights here are in the original, some are mine:

We examine the online archive of the Journal of Money, Credit, and Banking, in which an author is required to deposit the data and code that replicate the results of his paper. We find that most authors do not fulfill this requirement. Of more than 150 empirical articles, fewer than 15 could be replicated.

Part of the reason for this was noncompliance; despite the policy, only 58 articles included both data and code.

…58 archive entries had at least some data and some code. For each, we attempted to use the supplied data and code to replicate the published results. We made minor alterations to data and code to try to get the code to run with the data, but we did not attempt major alterations. Consequently, for some articles that we could not replicate, it is possible that there exists some data and code combination that will reproduce the reported results, but we are certain that it is not the combination that is in the JMCB archive. Since journal policy requires authors to deposit in the archive the data and code that will replicate their results, if we cannot use the data and code in the archive to replicate an author’s results, it is fair to say that the author did not honor the policy.

Many authors took what can only be considered a desultory approach to fulfilling the requirement, not even caring whether the data would run with the code. In many cases, the author had specifically not provided the data that ran with the code, and instead had provided the data in some alternate format. The obvious implication of such an action is that it makes replication difficult, sometimes requiring much effort to put the data into a format that would run with the code. For example, one author’s code reads an “.xls” file, but the data provided are in “.prn” format. Another author’s code calls a “.rat” file and his provided data are just the output from a “print” command that would take great effort to turn into a machine readable data file. Occasionally, authors provided data in a program-specific format, so that researchers who did not have that program could not access the data.

One author provides no readme file and two data files with no column headers: we are supposed to guess the names of the variables! Other authors seem to think that the entire world shares the exact same hard drive layout, with “C:\MYDATA\MYPROJECT\” sprinkled liberally throughout their code. Of course, a would-be replicator has to find and change all these. Moreover, the author might not realize all the data/subroutine files that his code utilizes, and forget to include said data/subroutine in his replication files. For example, some authors forgot to include code for a subroutine that existed in yet another subdirectory, and similarly other researchers forgot data files.

We recommend that all data be provided in ASCII format, and that the version of the code submitted to the archive call these same ASCII files. Additionally, the first program should print summary statistics on all the variables, so that subsequent researchers can be sure that they have loaded the data correctly.

Good luck with that.
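The summary-statistics half of that recommendation, at least, takes very little code. Here is a minimal sketch in Python, used only for illustration (the article's own examples are RATS and TSP). It assumes a comma-separated ASCII file with a header row and purely numeric columns. The names `summarize` and `print_summary` and the sample values are hypothetical, not taken from the article.

```python
import csv
import io
import statistics


def summarize(handle):
    """Return {column: (n, mean, min, max)} for a headed, comma-separated file.

    Assumes every column is numeric; a real archive would need to decide
    how to treat dates, labels, and missing values.
    """
    reader = csv.DictReader(handle)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return {
        name: (len(values), statistics.fmean(values), min(values), max(values))
        for name, values in columns.items()
    }


def print_summary(summary):
    """Print one line per variable so a replicator can check the load."""
    for name, (n, mean, low, high) in summary.items():
        print(f"{name:>10}  n={n}  mean={mean:.4f}  min={low:.4f}  max={high:.4f}")


if __name__ == "__main__":
    # In-memory stand-in for an ASCII data file (hypothetical values).
    sample = io.StringIO("y,x1\n1.0,2.0\n3.0,4.0\n")
    print_summary(summarize(sample))
```

In real use the handle would come from opening a file by a path relative to the script, not from a hard-coded location like the one quoted above, so that the archive runs wherever it is unpacked.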

Another problem, of course, is software versioning:

Simply having data and code and possessing the same software package does not guarantee that a researcher will be able to replicate the author’s results. We know of a package that in two successive releases produced different results for the same “calculate the correlation matrix” problem. We know also of a few packages that, in successive releases, produced different answers for the same nonlinear estimation problem.
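Recording which versions produced a result does not make old releases obtainable, but it does tell a would-be replicator what to look for. A minimal sketch, assuming a Python-based workflow (the article discusses econometrics packages, not Python). The function name `environment_manifest` and the package names in the demo are hypothetical placeholders.

```python
import json
import platform
import sys
from importlib import metadata


def environment_manifest(packages):
    """Describe the interpreter, the OS, and the installed version of each named package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed here; record that fact rather than hide it.
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Placeholder package names, chosen only for the demo.
    print(json.dumps(environment_manifest(["numpy", "pandas"]), indent=2))
```

Saving this output next to the results is cheap. Whether anyone can still install those versions years later is a separate question.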

The issue of software metadata — commented code — rose to the top as a primary concern:

…Putting code in an archive is not simply a matter of depositing uncommented, unclear code. In fact, the code is a better record of what was actually done to the data than is the article. Moreover, the myriad minor decisions for which there simply is insufficient space in print are revealed in the code, and hopefully made clear via extensive commenting of the code. We recommend that the readme file list all the replication files with a brief description of each.

Writing code for replicable research is not as easy as it sounds… It takes an enormous amount of effort but, then again, this code does represent a contribution to the cumulative body of knowledge: in terms of actually building the body of knowledge and enabling others to make use of an author’s research, the code is in many ways no less important than the article itself. For example, the code should be written and commented so that someone with a different package can build on the existing research. Consequently, if an author writes an article using RATS and a second researcher uses TSP, as a general rule the second researcher should not need to obtain the RATS reference manual and learn RATS just to extend the author’s results. As a simple example, consider the following RATS code:

linreg(spread=x4) y
# constant x1 x2 x3
restrict 1
#2
#1 -1

It is obvious enough that this is some sort of regression of y on a constant and three independent variables, but the rest of the code is not intuitive. It may reproduce the results in the paper, but it hardly helps a TSP user understand what was done. Even RATS users may have to consult the reference manual. A few comment lines would remedy the situation…

Maybe. Maybe not. That depends on the comments, on how distant the user might be from the discipline or subdiscipline of the original authors, and on how old the software might be.

Howison says this article has had some follow-ons, which I haven’t looked at yet. He pointed me to Baiocchi, G. (2007). Reproducible research in computational economics: guidelines, integrated approaches, and open source software. Computational Economics 30:19–40, which includes the following quote:

They [McCullough et al., discussed above] convincingly argue that, though most empirical work could still not be reproduced, the requirement of a data and code archive should be adopted by more journals and that stricter rules that ensure compliance from the author should be introduced.

Yet another resort to the stick, not the carrot, to get metadata — even while recognizing that metadata won’t solve most of the problems.

The MMM response here might be: is such a policy worth the effort? When? How much effort is enough? And how do we know?

Clearly it would be worth it in some cases — but how many? Many results aren’t significant enough to warrant replication. And many codes would be impossible to use even 5 or 10 years after publication, due to the unavailability of the original versions and/or software contexts.

Alternatively, MMM would ask, if someone really wants to reproduce a result, isn’t it likely that s/he will contact the authors of the article for help? The obsession with metadata products obscures the major role of ad hoc processes — communication among scientists — as a normal and often fully adequate way of obtaining metadata. Another MMM response: the amount and type of metadata required will change with the would-be user’s distance in time and/or disciplinary context from those of the original authors. What’s adequate next year for economists looking at JMCB may be impossibly thin for economic historians looking at the same information 20 years from now.

Posted in Current Reading, Uncategorized | 6 Comments »

On time: Victorian “time table of the world’s principal cities”

Posted by Paul N. Edwards on January 21, 2011

We were talking about time again. Steve finally started reading Robert Grudin’s Time and the Art of Living, one of the most beautiful and worthwhile books I know.

Now, check out this Victorian-era infographic (1883) showing over 100 world cities with their local times relative to Washington D.C. (at the center of the temporal universe, according to the graphic.)

Look twice. Because the times at each city vary not by today’s standard (multiples of one hour), but by multiples of one minute. Or less.

When it’s noon in Washington D.C., it’s 7:49 PM in Mecca and 6:33 PM in Warsaw.

There’s no information with the graphic to tell you the basis of time at each location, but given the era it was probably loosely centered on the solar clock. Noon at each location was the moment the sun reached its zenith. A local observatory would have sent out a time signal by telegraph, and/or lowered a time ball mounted on a high mast, and/or fired a gun to mark the moment. Locals would have set their clocks and watches to that signal.
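The odd-looking minute offsets follow from simple arithmetic: the Earth turns 360 degrees in 24 hours, so local mean solar time shifts by 4 minutes for every degree of longitude. Here is a rough sketch in Python. It assumes approximate modern longitudes for the three cities (about 77.04° W for Washington, 39.83° E for Mecca, 21.01° E for Warsaw; these values are not taken from the graphic) and ignores the difference between mean and apparent solar time.

```python
MINUTES_PER_DEGREE = 4.0  # 24 h * 60 min / 360 degrees of longitude


def offset_minutes(longitude_deg, reference_longitude_deg):
    """Minutes by which local mean solar time runs ahead of the reference meridian.

    Longitudes are in degrees, east positive and west negative.
    """
    return (longitude_deg - reference_longitude_deg) * MINUTES_PER_DEGREE


def local_time_at_reference_noon(longitude_deg, reference_longitude_deg):
    """Clock reading (24-hour HH:MM, truncated to the minute) when it is noon at the reference."""
    total = (12 * 60 + offset_minutes(longitude_deg, reference_longitude_deg)) % (24 * 60)
    hours, minutes = divmod(int(total), 60)
    return f"{hours:02d}:{minutes:02d}"


if __name__ == "__main__":
    WASHINGTON = -77.04  # assumed approximate longitude
    for city, longitude in [("Mecca", 39.83), ("Warsaw", 21.01)]:
        print(city, local_time_at_reference_noon(longitude, WASHINGTON))
```

With these assumed longitudes the sketch gives 19:47 for Mecca and 18:32 for Warsaw, within a couple of minutes of the graphic's 7:49 PM and 6:33 PM. The remaining gap presumably reflects the reference points and longitude values in use in 1883.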

Observatories and telegraph companies also got into the business of selling time signals to businesses within a hundred miles or so of their location. Further west or east than that and it would’ve been time (ho ho) to get a different signal.

It’s part of an exhibit of Victorian graphics at BibliOdyssey, also worth checking out.

Posted in Current Reading, Reflections | 1 Comment »

G.E.R. Lloyd on disciplines

Posted by gbowker on July 14, 2010

 Humanist Discussion Group, Vol. 24, No. 100.
         Centre for Computing in the Humanities, King's College London
                       www.digitalhumanities.org/humanist
                Submit to: humanist@lists.digitalhumanities.org

        Date: Fri, 11 Jun 2010 07:42:02 +0100
        From: Willard McCarty <willard.mccarty@mccarty.org.uk>
        Subject: disciplines

Those interested in the historical dimension of disciplinarity will be
glad to know about G. E. R. Lloyd's latest book, Disciplines in the
Making (Oxford, 2009), in which he examines the development of
philosophy, mathematics, history, medicine, art, law, religion and
science from their beginnings, using comparative materials, chiefly from
ancient Greece and China. In the last footnote of the book
(unfortunately omitted by the publisher, here recovered from Lloyd
himself), he notes that,

> Lip-service is sometimes paid to the advantages of a mastery of a
> variety of disciplines, and polymaths such as Leonardo and Newton are
> held up as models of human genius. But when it comes to implementing
> programmes of collaborative research, the complaint is still often
> made that each of the participants approaches the problems too much
> influenced by the particular ways they were taught to handle them in
> their original specialisations.  (not on p. 181)

The great examples we have of major collaborative undertakings from the
sciences -- greatest of all, perhaps, the Manhattan Project -- involved
experts cooperating, sometimes made to cooperate by a commanding leader
such as Oppenheimer. At our local level, we see (but so far have not
studied) the beginnings of the sort of mastery Lloyd here speaks of, in
the settings and situations the digital humanities are capable of
bringing about. Lloyd's book (unsurprisingly when you think about it) is
a sobering, and thrilling, (re)minder of how large and complex the world
of disciplinarity is.

The story of incommensurability among ways of knowing and communicating
is told e.g. in the story of the Tower of Babel, with its prior vision
of one universal language, or we might say, one universal discipline.
But before that story was told, and ever since, poets and scholars have
not stopped triangulating on that which can never be reached except in
such visions. The scholar's way is exemplified magnificently by Lloyd's
book. Read it tonight!

Posted in Current Reading, News, People, Uncategorized | Leave a Comment »

Lines, a brief history, by Tim Ingold

Posted by jillian on January 15, 2010

For today’s call we read chapters 4 and 6 of Ingold, which Geof suggested as an interesting read. Ingold proposes concepts of lines, traces, threads, surfaces, and braids, and uses applications like genealogy and circuits to help explore the concepts. While this may seem like an odd read for understanding CyberInfrastructure, we found in it a new platform for unpacking CI development.

An interesting point from Leigh was the contrast between the typical layered model used to describe infrastructure and the linear model Ingold proposes. Comparing the line and the layer has forced us to ask what the relationship is between these two geometric ways of depicting progress. Layers accrue in a linear progression, but are not inherently linear. This is especially important given the move towards Cloud Computing, which is distinctly not layered.

We are now thinking of how we can use the notion of lines and specifically braiding to understand the various interplays between collaborative rhythms and other CI processes that fall within our gaze. All in all, a very interesting read and one we are excited to explore more fully beyond our one-hour conference call.

Posted in Current Reading | Leave a Comment »

Thévenot and Schmidt & Simone readings

Posted by archer on December 2, 2009

This week we have been reading Thévenot 2009 and Schmidt and Simone 1996. (What a contrast in writing! I found Schmidt and Simone to be much clearer and easier to read.)

Thévenot

At the top of pg. 795, Thévenot mentions that the heavy costs of activities related to standardization “prevent the most competent experts from participation in standardization work.” I think this is an important point that we may see play out in the dynamic between scientists and IT people working on scientific cyberinfrastructure. Do scientists delegate important aspects of standardization work to IT people? How do teams ensure that scientists can provide input into the process at appropriate times? Similarly, if teams can find ways to reduce metadata friction, they may be more successful at obtaining feedback from busy scientists.

I am annoyed by and uncomfortable with the language of “arbitrariness” about choosing forms – although I welcome attempts to persuade me otherwise. I have two problems with this. First, “arbitrary” implies a coin flip to me. Forms may be chosen for more reasons than pure technical efficiency (industrial worth), which Thévenot seems to acknowledge. But I don’t think it’s arbitrary if other aspects of worth, or even decision dynamics outside of those, end up leading to a certain decision regarding form. Second, I think we need to be careful not to lump all types of form (or standards, or metadata) together. Some may have much more arbitrariness, or contention, than others. You can insist that there is some ambiguity in any given form – but I think there is more ambiguity in human skin color than in age. Some standards may easily be agreed upon while others are contentious – and it is helpful to focus our energy on allowing continued debate and flexibility in the recording of data about the challenging categories.

One of Thévenot’s core conclusions is that:

“Substantialist reduction tends to inspire the belief that the good being sought has been made real, that once the correct elements with the right properties have been assembled, ‘good’ need not signify anything more than conformity to the formulation of the standard and its measurement. This reduction omits the disquieting face of the engagement with its dynamic exigencies of having to adjust to one’s dependencies on the environment, and with the perspective of guaranteeing a particular good.”

This I find helpful. Yes, standards serve as an infrastructure. Most of the time we can’t bother thinking through the details of how we should categorize or structure our work, so we just lean upon the pre-decided standard. But in so doing we miss opportunities to reexamine the tensions and perspectives that relate to the way the standard was chosen. Thévenot terms this sort of operation “quietude,” whereas disquietude is the infrastructural inversion that rethinks the standard. He calls this inversion “blinking” – the flitting between the usual quietude and those moments of disrupted disquiet.

In scientific cyberinfrastructure, one of the primary goals is scientific discovery. Opportunities for novel breakthroughs may be particularly suppressed by dominating quietude. It will be interesting to look for ways that the projects we are studying deal with disruption, and when they choose to “blink” and reexamine their standards to make room for discovery.

Schmidt and Simone

I appreciated Schmidt and Simone’s recognition that artifact designers can’t know everything – coordination protocols and articulation work will change – and so artifacts should support both temporary and permanent modification by users of their coordination protocols. (It strikes me that for power-users, open-sourced code affords the ability to both modify it for your own use and contribute more permanent changes/improvements back to the repository.) I also appreciated their view of the design object as a sociotechnical one: we can decide what work to try to address with a technical system and what to leave to social components.

The first 34 pages seem more straightforward; my question is then what to do with the Ariadne notation that they develop and figure 4 (on pg. 190). How can we usefully apply this? It is not obvious to me how to use this on our work, but I would be very interested in discussing this more.

Posted in Current Reading | Leave a Comment »

Science Since Babylon

Posted by jillian on May 30, 2009

I am currently reading Derek J. de Solla Price’s Science Since Babylon, in an effort to get back to my theories of science roots.

Posted in Current Reading | Leave a Comment »

 