James Howison gave a fascinating talk here yesterday about scientific software. Part of his argument was that different sciences have taken different routes to both producing and using specialized software. These range from widely adopted, full-on production codes, which tend to involve their authors in maintenance work that can go on for years (but may produce no scientific credit for them), to scripts written for personal use and never shared at all.
In the case of shared codes, Howison argued, the problem of making them work on different platforms, and in different software contexts (as operating systems and other software packages needed by the scientific software change around them), is far from trivial. It frequently means that scientific software packages don’t live very long.
In questions, one of our faculty pointed out that some economics journals have required authors to publish both data and code for some time. Didn’t this simply take care of the problems of (a) credit for software development and (b) replicability of results? Howison responded by citing McCullough, B. D., McGeary, K. A., & Harrison, T. D. (2006). Lessons from the JMCB Archive. Journal of Money, Credit, and Banking 38(4), 1093–1107.
Here are a few tidbits from that article, relevant not only to our work on data, models, and software, but also to our thinking about metadata. Some highlights here are in the original, some are mine:
We examine the online archive of the Journal of Money, Credit, and Banking, in which an author is required to deposit the data and code that replicate the results of his paper. We find that most authors do not fulfill this requirement. Of more than 150 empirical articles, fewer than 15 could be replicated.
Part of the reason for this was noncompliance; despite the policy, only 58 articles included both data and code.
…58 archive entries had at least some data and some code. For each, we attempted to use the supplied data and code to replicate the published results. We made minor alterations to data and code to try to get the code to run with the data, but we did not attempt major alterations. Consequently, for some articles that we could not replicate, it is possible that there exists some data and code combination that will reproduce the reported results, but we are certain that it is not the combination that is in the JMCB archive. Since journal policy requires authors to deposit in the archive the data and code that will replicate their results, if we cannot use the data and code in the archive to replicate an author’s results, it is fair to say that the author did not honor the policy.
Many authors took what can only be considered a desultory approach to fulfilling the requirement, not even caring whether the data would run with the code. In many cases, the author had specifically not provided the data that ran with the code, and instead had provided the data in some alternate format. The obvious implication of such an action is that it makes replication difficult, sometimes requiring much effort to put the data into a format that would run with the code. For example, one author’s code reads an “.xls” file, but the data provided is in “.prn” format. Another author’s code calls a “.rat” file and his provided data are just the output from a “print” command that would take great effort to turn into a machine readable data file. Occasionally, authors provided data in a program-specific format, so that researchers who did not have that program could not access the data.
One author provides no readme file and two data files with no column headers: we are supposed to guess the names of the variables! Other authors seem to think that the entire world shares the exact same hard drive layout, with “C:\MYDATA\MYPROJECT\” sprinkled liberally throughout their code. Of course, a would-be replicator has to find and change all these. Moreover, the author might not realize all the data/subroutine files that his code utilizes, and forget to include said data/subroutine in his replication files. For example, some authors forgot to include code for a subroutine that existed in yet another subdirectory, and similarly other researchers forgot data files.
We recommend that all data be provided in ASCII format, and that the version of the code submitted to the archive call these same ASCII files. Additionally, the first program should print summary statistics on all the variables, so that subsequent researchers can be sure that they have loaded the data correctly.
Good luck with that.
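To be fair, the recommendation quoted above is cheap to follow on a technical level. Here is a minimal Python sketch of what it asks for, assuming a comma-separated ASCII file with a header row. The file name jmcb_data.csv and the function names are hypothetical, mine rather than anything from the article. The script looks for the data file next to itself instead of on a hard-coded drive path, and prints summary statistics for every numeric column so that a replicator can check the load.

```python
import csv
import io
import statistics
from pathlib import Path

# Hypothetical file name; resolved relative to this script, not to C:\MYDATA\...
DATA_FILE = "jmcb_data.csv"


def data_path(name=DATA_FILE):
    """Locate a data file next to this script (or in the working directory)."""
    try:
        base = Path(__file__).resolve().parent
    except NameError:  # __file__ is undefined in an interactive session
        base = Path.cwd()
    return base / name


def summarize(handle):
    """Return {column: (n, mean, min, max)} for each all-numeric CSV column."""
    rows = list(csv.reader(handle))
    if not rows:
        return {}
    header, records = rows[0], rows[1:]
    summary = {}
    for i, name in enumerate(header):
        try:
            numbers = [float(record[i]) for record in records]
        except (ValueError, IndexError):
            continue  # skip columns that are non-numeric or have missing cells
        if numbers:
            summary[name] = (
                len(numbers),
                statistics.fmean(numbers),
                min(numbers),
                max(numbers),
            )
    return summary


def print_summary(summary):
    """Print one line of summary statistics per numeric column."""
    for name, (n, mean, low, high) in summary.items():
        print(f"{name:>12}  n={n}  mean={mean:.4f}  min={low:.4f}  max={high:.4f}")


if __name__ == "__main__":
    path = data_path()
    if path.exists():
        with open(path, newline="") as f:
            print_summary(summarize(f))
    else:
        # No data file present: fall back to a tiny made-up sample.
        sample = io.StringIO("y,x1,label\n1.0,2.0,a\n3.0,4.0,b\n")
        print_summary(summarize(sample))
```

The hard part is not the code, of course. It is getting authors to deposit the file that the code actually reads.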
Another problem, of course, is software versioning:
Simply having data and code and possessing the same software package does not guarantee that a researcher will be able to replicate the author’s results. We know of a package that in two successive releases produced different results for the same “calculate the correlation matrix” problem. We know also of a few packages that, in successive releases, produced different answers for the same nonlinear estimation problem.
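One partial mitigation is to record exactly which versions produced the published numbers. This documents the problem rather than solving it, since nothing here makes an old release obtainable again. A minimal Python sketch, assuming the analysis itself was run in Python; the function name environment_manifest and the example package name are hypothetical, chosen only for illustration:

```python
import json
import platform
from importlib import metadata


def environment_manifest(packages=()):
    """Collect interpreter, OS, and package versions for a replication archive."""
    manifest = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            manifest["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            manifest["packages"][name] = None  # not installed in this environment
    return manifest


if __name__ == "__main__":
    # "numpy" stands in for whatever packages the analysis depends on.
    print(json.dumps(environment_manifest(["numpy"]), indent=2))
```

Depositing the resulting JSON alongside the data and code at least tells a later reader which release to go looking for.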
The issue of software metadata — commented code — rose to the top as a primary concern:
…Putting code in an archive is not simply a matter of depositing uncommented, unclear code. In fact, the code is a better record of what was actually done to the data than is the article. Moreover, the myriad minor decisions for which there simply is insufficient space in print are revealed in the code, and hopefully made clear via extensive commenting of the code. We recommend that the readme file list all the replication files with a brief description of each.
Writing code for replicable research is not as easy as it sounds… It takes an enormous amount of effort but, then again, this code does represent a contribution to the cumulative body of knowledge: in terms of actually building the body of knowledge and enabling others to make use of an author’s research, the code is in many ways no less important than the article itself. For example, the code should be written and commented so that someone with a different package can build on the existing research. Consequently, if an author writes an article using RATS and a second researcher uses TSP, as a general rule the second researcher should not need to obtain the RATS reference manual and learn RATS just to extend the author’s results. As a simple example, consider the following RATS code:
linreg(spread=x4) y
# constant x1 x2 x3
restrict 1
#2
#1 -1
It is obvious enough that this is some sort of regression of y on a constant and three independent variables, but the rest of the code is not intuitive. It may reproduce the results in the paper, but it hardly helps a TSP user understand what was done. Even RATS users may have to consult the reference manual. A few comment lines would remedy the situation…
Maybe. Maybe not. That depends on the comments, on how distant the user might be from the discipline or subdiscipline of the original authors, and on how old the software might be.
Howison says this article has had some follow-ons, which I haven’t looked at yet. He pointed me to Baiocchi, G. (2007). Reproducible research in computational economics: guidelines, integrated approaches, and open source software. Computational Economics 30:19–40, which includes the following quote:
They [McCullough et al., discussed above] convincingly argue that, though most empirical work could still not be reproduced, the requirement of a data and code archive should be adopted by more journals and that stricter rules that ensure compliance from the author should be introduced.
Yet another resort to the stick, not the carrot, to get metadata — even while recognizing that metadata won’t solve most of the problems.
The MMM response here might be: is such a policy worth the effort? When? How much effort is enough? And how do we know?
Clearly it would be worth it in some cases — but how many? Many results aren’t significant enough to warrant replication. And many codes would be impossible to use even 5 or 10 years after publication, due to the unavailability of the original versions and/or software contexts.
Alternatively, MMM would ask, if someone really wants to reproduce a result, isn’t it likely that s/he will contact the authors of the article for help? The obsession with metadata products obscures the major role of ad hoc processes — communication among scientists — as a normal and often fully adequate way of obtaining metadata. Another MMM response: the amount and type of metadata required will change with the would-be user’s distance in time and/or disciplinary context from those of the original authors. What’s adequate next year for economists looking at JMCB may be impossibly thin for economic historians looking at the same information 20 years from now.