Over at RealClimate I was prompted to add a comment in the discussion on software and data archiving:
I’m finding the discussion here reminiscent of my own career - 5 years as a postdoc mostly running computational codes of one sort or another, added to my graduate degree in physics, left me with somewhere around 50,000 lines of code I had either written or heavily modified for my purposes (mostly C, some Fortran, some Perl - this was 15 years ago). A few bits and pieces were original and I put some effort in to make them shareable - graphics and PostScript creation, a multi-dimensional function integrator, etc. A few were done as part of much larger projects and at least ended up under proper revision control as a contribution to that project (that was my intro to CVS). But most were one-off things that tested some hypothesis, interpreted some data file, or were some sort of attempt at analysis. 90% of the time they weren’t a lot of use, and spending extra time documenting would have seemed pretty worthless - I used “grep” a lot to find things later. Sure they could have been made public, but nobody would have any idea what command-line arguments I’d used or the processing steps I’d taken, except in those rare instances where I anticipated my own reuse and created an explanatory “README”. It would probably be simpler for another scientist to just do it over from scratch than to try to figure out what I’d done from looking at the code.
And now I’m a professional software developer in a group where we have quite rigorous test and development procedures; everything is checked into a version control system and regularly built and run against regression tests to keep things robust. Nevertheless, I still have a directory with hundreds of one-off scripts that fit in that same category of being easier to rewrite than to generalize, and there’s little purpose in making them publicly available or putting them under version control, since at most I’ll use them as starting points for other scripts rather than re-using them as they are in any significant way.
I’m not sure whether it was Fred Brooks or somebody else, but the rule of thumb I recall reading long ago was that turning a prototype into an internal software product took roughly a factor of 3 more effort, and turning an internal product into something you could publicly distribute (or sell) took roughly another factor of 3 beyond that, or about 9 times the effort of the prototype in total. Software always falls along this spectrum, and most of what scientists use tends to be at the “prototype” level, simply because of the exploratory nature of science. Theoretically it would be nice to have the resources to keep everything clean and nicely polished, but if 90% of it is code you’re never going to re-use, what’s the point?
As a specific example of exploratory prototype-level software I worked on as a postdoc (in Indiana!), I remember my preliminary work on this paper I published in the Journal of Mathematical Physics on one asymptotic form for Laguerre polynomials. As I recall, I started by examining the zeros, trying to find an expression for the location of the zeros of the polynomials in the limit when all three parameters are large. That involved an iterated series of short C programs, each run just a few times, with output to data files of differences, which I then graphed and looked at, trying to spot patterns. At some point I made a guess that was extremely close - and then I had to backtrack mathematically and figure out why my guess worked. Nowhere in the paper is there any mention of, or dependence on, the software I wrote, yet it was critical in formulating my intuition about the problem, and in leading me to the accurate (and rather complex) approximation I ended up publishing.
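To give a flavor of what such a throwaway program looks like, here is a minimal sketch in Python. It is not the original C code, which is not reproduced here, and everything in it is an assumption made for illustration: the function names `laguerre`, `laguerre_zeros` and `gaps` are hypothetical, the zeros are located by a plain grid scan plus bisection, and the "differences" printed are simply the gaps between neighboring zeros, standing in for whatever quantities the original programs actually tabulated.

```python
def laguerre(n, alpha, x):
    """Evaluate the generalized Laguerre polynomial L_n^(alpha)(x)
    using the standard three-term recurrence."""
    if n == 0:
        return 1.0
    prev, curr = 1.0, 1.0 + alpha - x
    for k in range(1, n):
        prev, curr = curr, ((2 * k + 1 + alpha - x) * curr - (k + alpha) * prev) / (k + 1)
    return curr


def laguerre_zeros(n, alpha=0.0, samples_per_zero=200, tol=1e-12):
    """Find the zeros of L_n^(alpha) by scanning for sign changes, then bisecting.

    Assumes alpha > -1, so all zeros lie in (0, 4n + 2*alpha + 2), and assumes
    the grid is fine enough that no two zeros share one grid cell.
    """
    upper = 4 * n + 2 * alpha + 2
    steps = samples_per_zero * n
    zeros = []
    lo, f_lo = 0.0, laguerre(n, alpha, 0.0)
    for i in range(1, steps + 1):
        hi = upper * i / steps
        f_hi = laguerre(n, alpha, hi)
        if f_hi == 0.0:
            # A grid point landed exactly on a zero.
            zeros.append(hi)
        elif f_lo * f_hi < 0:
            # Sign change: refine the bracket [a, b] by bisection.
            a, b, fa = lo, hi, f_lo
            while b - a > tol:
                mid = 0.5 * (a + b)
                fm = laguerre(n, alpha, mid)
                if fa * fm <= 0:
                    b = mid
                else:
                    a, fa = mid, fm
            zeros.append(0.5 * (a + b))
        lo, f_lo = hi, f_hi
    return zeros


def gaps(values):
    """Differences between successive entries of a sorted list."""
    return [b - a for a, b in zip(values, values[1:])]


if __name__ == "__main__":
    # Dump zeros and gaps for a few (n, alpha) pairs, ready for plotting.
    for n, alpha in [(5, 0.0), (10, 0.0), (10, 5.0)]:
        zs = laguerre_zeros(n, alpha)
        print(f"# n={n} alpha={alpha}")
        for k, (z, gap) in enumerate(zip(zs, [float("nan")] + gaps(zs)), start=1):
            print(f"{k:3d} {z:14.8f} {gap:14.8f}")
```

A script like this gets run a handful of times with different parameters, the output gets plotted, and then it is edited into the next variant or abandoned, which is exactly why polishing it rarely seems worth the effort.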
The process in this and many similar examples is very far from that of writing a program from detailed specifications, validating it in some fashion, and then running it and trusting the results. It is rather an iterative process of building confidence and fitting pieces together to get a coherent picture. In some ways it is a bit like the more iterative agile methods that software gurus advocate these days, except the final product is not a software product in itself, but rather scientific understanding about the behavior of whatever system it is you are modeling.
And once you have that scientific understanding, doing anything further with the software you used to build it often seems quite beside the point.
Comments
John Mashey added a note on the RealClimate thread pointing out it was Fred Brooks, in figure 1.1 (first figure of his book). The commentary accompanying the figure is quite worthwhile to clarify what he meant (the terminology is old, but you can get the drift from this pretty well - and the relative effort issues haven't changed even with modern test-first or version control methods):
From Fred Brooks, "The Mythical Man-Month", 20th anniversary edition, chapter 1, pp. 4-6, "The Tar Pit":