Monitoring, Modeling, and Memory

Dynamics of Data and Knowledge in Scientific Cyberinfrastructures


Replicating results with published data + code: very hard to do…

Posted by Paul N. Edwards on February 11, 2011

James Howison gave a fascinating talk here yesterday about scientific software. Part of his argument was that different sciences have taken different routes to both producing and using specialized software. These range from widely adopted, full-on production codes, which tend to involve their authors in maintenance work that can go on for years (but may produce no scientific credit for them), to scripts written for personal use and never shared at all.

In the case of shared codes, Howison argued, the problem of making them work on different platforms, and in different software contexts (as operating systems and other software packages needed by the scientific software change around them), is far from trivial. It frequently means that scientific software packages don’t live very long.

In questions, one of our faculty pointed out that some economics journals have required authors to publish both data and code for some time. Didn’t this simply take care of the problems of (a) credit for software development and (b) replicability of results? Howison responded by citing McCullough, B. D., McGeary, K. A., & Harrison, T. D. (2006). Lessons from the JMCB Archive. Journal of Money, Credit, and Banking 38(4), 1093–1107 (direct link for UM people here.)

Here are a few tidbits from that article, relevant not only to our work on data, models, and software, but also to our thinking about metadata. Some highlights here are in the original, some are mine:

We examine the online archive of the Journal of Money, Credit, and Banking, in which an author is required to deposit the data and code that replicate the results of his paper. We find that most authors do not fulfill this requirement. Of more than 150 empirical articles, fewer than 15 could be replicated.

Part of the reason for this was noncompliance; despite the policy, only 58 articles included both data and code.

…58 archive entries had at least some data and some code. For each, we attempted to use the supplied data and code to replicate the published results. We made minor alterations to data and code to try to get the code to run with the data, but we did not attempt major alterations. Consequently, for some articles that we could not replicate, it is possible that there exists some data and code combination that will reproduce the reported results, but we are certain that it is not the combination that is in the JMCB archive. Since journal policy requires authors to deposit in the archive the data and code that will replicate their results, if we cannot use the data and code in the archive to replicate an author’s results, it is fair to say that the author did not honor the policy.

Many authors took what can only be considered a desultory approach to fulfilling the requirement, not even caring whether the data would run with the code. In many cases, the author had specifically not provided the data that ran with the code, and instead had provided the data in some alternate format. The obvious implication of such an action is that it makes replication difficult, sometimes requiring much effort to put the data into a format that would run with the code. For example, one author’s code reads an “.xls” file, but the data provided are in “.prn” format. Another author’s code calls a “.rat” file and his provided data are just the output from a “print” command that would take great effort to turn into a machine readable data file. Occasionally, authors provided data in a program-specific format, so that researchers who did not have that program could not access the data.

One author provides no readme file and two data files with no column headers: we are supposed to guess the names of the variables! Other authors seem to think that the entire world shares the exact same hard drive layout, with ‘‘C:\MYDATA\MYPROJECT\” sprinkled liberally throughout their code. Of course, a would-be replicator has to find and change all these. Moreover, the author might not realize all the data/subroutine files that his code utilizes, and forget to include said data/subroutine in his replication files. For example, some authors forgot to include code for a subroutine that existed in yet another subdirectory, and similarly other researchers forgot data files.
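To make the hard-coded-path and forgotten-file complaints concrete, here is a minimal Python sketch of one way around them. It is not from the article: the file names in REQUIRED_FILES and both helper functions are hypothetical, chosen only for illustration. The idea is to resolve everything relative to the script's own location and to check a short manifest before running anything.

```python
from pathlib import Path

# Hypothetical manifest; a real archive would list its own files here.
REQUIRED_FILES = ["data/series.csv", "code/estimate.py"]


def archive_root(script_file):
    """Return the directory that contains the given script."""
    return Path(script_file).resolve().parent


def find_missing(root, required):
    """Return the manifest entries that are not files under root."""
    root = Path(root)
    return [name for name in required if not (root / name).is_file()]
```

A script in the archive could call find_missing(archive_root(__file__), REQUIRED_FILES) as its first step and stop with a clear message if the list is not empty. That would not guarantee replication, but it would catch the missing-subroutine case before a replicator goes hunting for it.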

We recommend that all data be provided in ASCII format, and that the version of the code submitted to the archive call these same ASCII files. Additionally, the first program should print summary statistics on all the variables, so that subsequent researchers can be sure that they have loaded the data correctly.

Good luck with that.
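For what it's worth, the "print summary statistics first" recommendation quoted above is cheap to follow. Here is a rough sketch in Python, assuming a comma-separated ASCII file with a header row, all-numeric columns, and no missing values (none of which the article promises). The summarize function and the sample data are made up for illustration.

```python
import csv
import io
import statistics


def summarize(text_stream):
    """Return {column: (count, mean, min, max)} for a headered CSV stream."""
    reader = csv.DictReader(text_stream)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # Assumes every cell is numeric; real data may need cleaning first.
            columns[name].append(float(value))
    return {
        name: (len(vals), statistics.mean(vals), min(vals), max(vals))
        for name, vals in columns.items()
    }


# Made-up sample standing in for a deposited ASCII data file.
SAMPLE = "y,x1\n1.0,2.0\n3.0,4.0\n"

for name, (n, mean, low, high) in summarize(io.StringIO(SAMPLE)).items():
    print(f"{name}: n={n} mean={mean:.3f} min={low} max={high}")
```

A later researcher can compare these numbers with the ones printed by the original run to confirm the data loaded the same way, which is all the recommendation asks for.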

Another problem, of course, is software versioning:

Simply having data and code and possessing the same software package does not guarantee that a researcher will be able to replicate the author’s results. We know of a package that in two successive releases produced different results for the same “calculate the correlation matrix” problem. We know also of a few packages that, in successive releases, produced different answers for the same nonlinear estimation problem.
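Recording versions does not fix this problem, but it at least tells a later reader which release produced the published numbers. Here is a small sketch using only Python's standard library. The environment_record function is hypothetical, and "numpy" is just an example package name, not something the article mentions.

```python
import platform
import sys
from importlib import metadata


def environment_record(packages):
    """Map the interpreter and each named package to its installed version."""
    record = {
        "python": platform.python_version(),
        "platform": sys.platform,
    }
    for name in packages:
        try:
            record[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            record[name] = "not installed"
    return record


# "numpy" is only an example; list whatever the analysis actually imports.
print(environment_record(["numpy"]))
```

Saving this record next to the results is a kind of software metadata, though whether anyone can still install those versions a decade later is a separate question.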

The issue of software metadata — commented code — rose to the top as a primary concern:

…Putting code in an archive is not simply a matter of depositing uncommented, unclear code. In fact, the code is a better record of what was actually done to the data than is the article. Moreover, the myriad minor decisions for which there simply is insufficient space in print are revealed in the code, and hopefully made clear via extensive commenting of the code. We recommend that the readme file list all the replication files with a brief description of each.

Writing code for replicable research is not as easy as it sounds… It takes an enormous amount of effort but, then again, this code does represent a contribution to the cumulative body of knowledge: in terms of actually building the body of knowledge and enabling others to make use of an author’s research, the code is in many ways no less important than the article itself. For example, the code should be written and commented so that someone with a different package can build on the existing research. Consequently, if an author writes an article using RATS and a second researcher uses TSP, as a general rule the second researcher should not need to obtain the RATS reference manual and learn RATS just to extend the author’s results. As a simple example, consider the following RATS code:

linreg(spread=x4) y
# constant x1 x2 x3
restrict 1
#2
#1 -1

It is obvious enough that this is some sort of regression of y on a constant and three independent variables, but the rest of the code is not intuitive. It may reproduce the results in the paper, but it hardly helps a TSP user understand what was done. Even RATS users may have to consult the reference manual. A few comment lines would remedy the situation…

Maybe. Maybe not. That depends on the comments, on how distant the user might be from the discipline or subdiscipline of the original authors, and on how old the software might be.

Howison says this article has had some follow-ons, which I haven’t looked at yet. He pointed me to Baiocchi, G. (2007). Reproducible research in computational economics: guidelines, integrated approaches, and open source software. Computational Economics 30:19–40, which includes the following quote:

They [McCullough et al., discussed above] convincingly argue that, though most empirical work could still not be reproduced, the requirement of a data and code archive should be adopted by more journals and that stricter rules that ensure compliance from the author should be introduced.

Yet another resort to the stick, not the carrot, to get metadata — even while recognizing that metadata won’t solve most of the problems.

The MMM response here might be: is such a policy worth the effort? When? How much effort is enough? And how do we know?

Clearly it would be worth it in some cases — but how many? Many results aren’t significant enough to warrant replication. And many codes would be impossible to use even 5 or 10 years after publication, due to the unavailability of the original versions and/or software contexts.

Alternatively, MMM would ask, if someone really wants to reproduce a result, isn’t it likely that s/he will contact the authors of the article for help? The obsession with metadata products obscures the major role of ad hoc processes — communication among scientists — as a normal and often fully adequate way of obtaining metadata. Another MMM response: the amount and type of metadata required will change with the would-be user’s distance in time and/or disciplinary context from those of the original authors. What’s adequate next year for economists looking at JMCB may be impossibly thin for economic historians looking at the same information 20 years from now.

Posted in Current Reading, Uncategorized | 7 Comments »

Critical Code Studies – digital humanities

Posted by Paul N. Edwards on January 27, 2011

I’ve been stuck on this mailing list for a while (Society for Literature, Science, and the Arts).

Occasionally, interesting bits float by. Here’s one of them.

- Paul

From: Mark Marino <markcmarino>

Date: January 27, 2011 3:06:07 AM EST

To: litsci-l

Subject: Critical Code Studies news

Reply-To: litsci-l, Mark Marino <markcmarino>

Hi, all,

At several SLSA’s I’ve given presentations on what I call Critical Code Studies, a way of using computer source code as an entry way into discussions of digital objects and culture.

Right now there are places to explore and extend those discussions:

1) HASTAC Scholars forum:
A very lively debate (happening as I type this), featuring a main thread and some objects for “code critiques.”

General thread:
http://www.hastac.org/forums/hastac-scholars-discussions/critical-code-studies

Code Critiques:
http://www.hastac.org/forums/hastac-scholars-discussions/code-critiques

2) CCS @ USC conference proceedings:
http://vectorsjournal.org/thoughtmesh/critcode
These just went online last night. They feature the texts of the talks, videos, and code, as well as an opportunity to join the discussion by hitting the “Peer Review” tab on each talk or panel.

The proceedings are published under USC’s Vectors journal on the Thoughtmesh platform, an exciting venue that in many ways recreates the conference experience of intersecting conversations, without all the cab fare and wrinkled outfits.

3) CCS Working Group in electronic book review
http://www.electronicbookreview.com/thread/firstperson/ningislanded

We’re editing and publishing the weekly threads from last year’s Critical Code Studies working group. The first week and an introductory essay are up now. The following weeks will appear over the next several months.

4) The Critical Code Studies blog
http://criticalcodestudies.com

Please join us for these discussions as we explore ways of talking about and through analyses of code. These are the conversations that will develop this field and the CCS panels to come at future SLSAs.

Best,
Mark Marino
Writing Program
University of Southern California
http://WriterResponseTheory.org
http://CriticalCodeStudies.com
—-
The Litsci-L archive is viewable on the Web at:
http://litsci.org

Posted in Uncategorized | Leave a Comment »

Online and lightweight qual coding tool

Posted by dribes on January 26, 2011

An interesting possibility for qualitative data coding and analysis.
Free, online, and easily supports collaborative activity.

What is ASTOUNDING is how lightweight it is compared to the cluttered
horrors of NVIVO.

A concern would be the long term sustainability of the toolset
(project or facility?). They do allow you to export your data, but I
don’t know what the export file looks like…

david.

Originally from Matt Burton:

Sean Munson just showed me this, Saturate App, a web application for
collaborative qualitative data storage, coding, and analysis.
Check out this overview video, it looks pretty impressive.

this is INFINITELY more usable & collaborative than NVIVO, especially
on a mac or linux.


mcb

Posted in Uncategorized | Leave a Comment »

“Modeling” memory is easier than “monitoring” memory

Posted by archer on January 25, 2011

In reviewing some interview notes, I saw this segment from an interview with someone who works for the Earth System Grid. ESG is planning to add observational data to the data portal in addition to its current collection of model data. I asked him what it will be like to move from a system that handles just data output from models to include data collected from observational systems. His reply:

“Oh yeah it’s going to be a big jump. Because the model data is easy compared to the observational data…”

“What makes model data easy compared to observational data?”

“Well, for one thing, it’s all already in a nice gridded format. I mean, you got the nice 2D and 3D pieces, there doesn’t tend to be any missing data, like… I mean, observational data requires all the work just to be able to take it from what the center says to something that a human can use. And it’s already in a pretty well-defined format, either GRIB or NetCDF or something like that. It’s just probably… I mean, it’s… Since it’s an idealized representation of the world, I guess, in some ways the data is seen as kind of an idealized data format and data that it’s a lot easier to… Easier doesn’t mean easy but… I’m reading articles about observational data and I’ve accessed enough of it that it’s really, really hard sometimes.”

We haven’t often had our three themes of monitoring, modeling, and memory come up as analytical concepts, but this instance was striking because it nicely showed a relationship between them. An idealized system for generating data results in an idealized data format that is easier to store. Model runs with non-idealized data can be repeated, keeping the data cleaner. I don’t want to pretend that model data is always clean or uncomplicated, but it does seem that in some real senses it could be simpler than observational data.

Posted in Reflections, Uncategorized | Leave a Comment »

CFP on methods for studying virtual environments and online social networks

Posted by Paul N. Edwards on January 19, 2011

Caught this in Technoscience and thought our grad students should know about it.

This journal special issue encourages submissions from graduate students and junior academics.

Methodological approaches to the study of virtual environments and online social networks

Deadline: March 15 2011

http://gjss.org.

Call for Papers and Book Reviews: Methodological approaches to the study of virtual environments and online social networks

The Graduate Journal of Social Science (GJSS) announces a Call for Papers and Book Reviews for a special issue dealing with methodological approaches to the study of virtual environments and online social networks. The journal encourages the submission of work by MSc/MA/MS, MPhil, PhD students and junior academics from all geographic regions. All papers are submitted to a blind peer review process. The special issue is scheduled for December 2011.

Posted in Publications, Uncategorized | Leave a Comment »

Science review of A Vast Machine

Posted by Paul N. Edwards on January 15, 2011

A review by Richard Somerville just came out. You can see it here.

In the same review, Somerville discusses philosopher Eric Winsberg’s new book Science in the Age of Computer Simulation. Winsberg is one of a few intrepid philosophers who have taken up the challenge of understanding the logic of simulation and modeling, which lie at the core of modern science (and which I discuss extensively in A Vast Machine).

From the review:

Winsberg suggests that philosophy of [contemporary] science… ought to concern itself with the subject of simulating complex phenomena within existing theory, as opposed to its traditional focus on the creation of novel scientific theories. Winsberg concludes,

[W]hat we might call the ontological relationship between simulations and experiments is quite complicated. Is it true that simulations are, after all, a particular species of experiment? I have tried to argue against this claim, while at the same time insisting that the differences between simulation and experiment are more subtle than some of the critics of the claim have suggested. Most important, I have tried to argue that we should disconnect questions about the identity of simulations and experiments from questions of the epistemic power of simulations.

Philosophy has been trailing the actual state of science for a long time now, so it’s good to see this kind of work coming out.

I’m afraid, though, that it’s still trailing the bleeding edge — we’ve entered an age of data-intensive science, which presents its own epistemic challenges: for example, how much does theory matter when statistical analysis of huge datasets reveals strong correlations? If predictive power is your main goal, sometimes data can take the place of explanation. (Not sure I actually believe this, but it’s a compelling point of view.) Take a look at Hey et al., The Fourth Paradigm if this kind of thing interests you.

Posted in News, Reflections, Reviews, Uncategorized | 1 Comment »

Time

Posted by Paul N. Edwards on January 14, 2011

We were talking about time and rhythm today – here are a few references:

Dohrn-van Rossum, Gerhard. History of the Hour: Clocks and Modern Temporal Orders. Chicago: University of Chicago Press. 1996. Focuses on the medieval world, but covers the ancient world through the present. Fascinating discussion of the medieval order — when hours were of variable length (1/12 of the time between sunset and sunrise, or vice versa) — and its evolution into the modern temporal order with hours and other units having equal length. Clocks and bells as signaling systems, going off all day and much of the night as well to mark various prayers, community tasks, come-in-from-the-fields, etc. — the medieval world was REALLY LOUD!

Grudin, Robert. Time and the Art of Living. New York: Ticknor & Fields. 1988. I adore this book. Put it by your bedside – numbered paragraphs, à la Wittgenstein, each a small meditation on time.

Høeg, Peter. Borderliners. New York: Farrar Straus and Giroux. 1994. Same guy that wrote Smilla’s Sense of Snow. An extremely controlled, possibly autobiographical story about growing up in an orphanage in Denmark, but much of the book is about the psychological experience of time.

Then, of course, there’s the immortal Pink Floyd song “Time,” from Dark Side of the Moon. Maybe the best thing ever written about mortality.

Posted in Reflections, Uncategorized | 1 Comment »

Grand challenges

Posted by Paul N. Edwards on September 14, 2010

Of interest…

NSF 10-069
Dear Colleague Letter for SBE 2020: Future Research in the Social, Behavioral & Economic Sciences

Dear Colleague:

At the end of the first decade of the 21st century, the social, behavioral, and economic sciences face extraordinary opportunities to address next-generation research challenges.   The landscape is vast and complex, stretching across temporal and spatial dimensions and multiple levels of analysis — from studying the human brain to implications of decision making in a dynamic and fragmented yet interconnected world.   As we look forward 10 or even 20 years, the Directorate for the Social, Behavioral, and Economic Sciences of the National Science Foundation (NSF/SBE) seeks to frame innovative research for the year 2020 and beyond that enhances fundamental knowledge and benefits society in many ways.

This request is part of a process that will help NSF/SBE make plans to support future research. Other activities will include a report by the Directorate’s Advisory Committee about the grand challenges facing the SBE sciences over the next decade and recommendations from the Directorate’s staff. The insights resulting from this process are threefold:  They will inform the substance of future research, the capacities to pursue that research, and the infrastructure to enable investigations that will be increasingly interdisciplinary and international and will involve multiple perspectives and intellectual frameworks, differing scales and contexts, and diverse approaches and methodologies.

As a first step in engaging its community, NSF/SBE invites individuals and groups to contribute white papers outlining grand challenge questions that are both foundational and transformative.  They are foundational in the sense that they reflect deep issues that engage fundamental assumptions behind disciplinary research traditions and are transformative because they seek to leverage current findings to unlock a new cycle of research. We expect these white papers to advance SBE’s mission to study human characteristics and human behaviors in its Social and Economic Sciences and Behavioral and Cognitive Sciences divisions, as well as to be the nation’s resource for understanding the structure and development of science through its Science Resources Statistics division. These white papers must:

  • Explain the challenge question, capability to be created, or scientific strategy; provide context in terms of recent research results and standing questions in the field; suggest the range of disciplines that may contribute, and indicate the implications for future research within and across disciplines.
  • Limit the white paper to 2,000 words with a 200-word maximum abstract and up to 3 references to relevant readings.
  • Include a Creative Commons Attribution Non-Commercial Share Alike license (http://creativecommons.org/about/licenses/) so that the material may be made widely available through the web.
  • Arrive by September 30, 2010 in a Microsoft Word-compatible format. Submit to: http://www.nsf.gov/sbe/sbe_2020/.

NSF/SBE plans to use these contributions over the next year to assist in formulating plans that will guide its strategic scientific thinking. Consequently, we anticipate making all abstracts and papers accessible through the SBE 2020 website. Authors who do not wish to have their papers made available through the website may restrict access to NSF staff.  However, the author(s), title, and abstract will be included in the publicly accessible corpus.

Research is cumulative and progress is at times necessarily incremental.  We invite you, now, to step outside of present demands and to think boldly about future promises. We await your contributions to understanding the future of SBE science.

Posted in Uncategorized | Leave a Comment »

American Scientist review of A Vast Machine

Posted by Paul N. Edwards on August 1, 2010

This review of my book is really nice!!!

Posted in News, Reviews, Uncategorized | Leave a Comment »

A Vast Machine

Posted by Paul N. Edwards on July 16, 2010

I promise not to send you all every single review my book receives, but I can’t resist passing this one along because of its author.

-Paul

Dear Professor Edwards,

First let me introduce myself. Currently I am a faculty member at Princeton, but I did serve for a rather long time first as president of the University of Michigan and later as president of Princeton. More importantly I am currently chairing a committee appointed by the InterAcademy Council [IAC] to review the policies and procedures of the IPCC. Such a review had been requested by the UN. In any case it is in this connection that I read your book A Vast Machine and the real purpose of this brief note is to tell you how much I enjoyed and profited from this wonderful volume. Thank you for taking the time to write it.

Harold T. Shapiro

Posted in Uncategorized | 1 Comment »

 