Misc scribbles

On over-reproducibility

2016-03-05

Recently, https://twitter.com/arjunrajlab posted about how perhaps we are aiming at "over-reproducibility". I think this is interesting, and I would generally agree that not everyone needs to achieve total automation of their whole pipeline, but the post does a lot of "blaming your tools" and disparages good development practices with regard to version control and figure generation.

I think the complaint that version control and automated figures are not for everyone is probably true, but it overgeneralizes a different problem. For example, students are not "trained" to work with Git, and they are not "trained" to do software engineering. In fact, even computer science students are generally not "trained" to do those things (computer science != software engineering). But that doesn't mean your lab needs to forgo all of those tools. Software development can be incredibly complex and sophisticated, and it's important to make sure things are "done right": high-quality, easy-to-reproduce software is really about process and engineering. That is also why there is no one true way to do reproducibility. Maybe Arjun doesn't have a reproducible workflow right now, but what about 5 years down the road, when he suddenly has a great framework for such things? This happens all the time in software development (for example, how long ago was it that "push to deploy" did not exist? how often would you just edit files live on your site? now that is seen as bad practice!). That said, processes for software quality can evolve pretty organically, so even though some best practices exist, people can grow their own quality environment.

Even if we agree that software development + version control = good, there are still a lot of complaints about it in the blog post. For example, the complaint that git is too hard is pretty silly, and the xkcd comic about calling over a graph theorist doesn't really help. Speaking as a working software developer, I think that version control simply helps define a disciplined way of working. Version control makes you analyze your progress, summarize it as a commit message, format the code properly, make sure it passes tests, and then talk to your collaborators about accepting it. Dropbox might accomplish some of those things, but I really doubt it covers that full scope. Arjun seems to agree with using version control for some of his lab's software development, so again, there is a spectrum of needs being met. Nevertheless, there are some weird comments about whether commit messages are like a "lab notebook". Hint: they are not; write documentation for your project, or keep a separate journal, blog, or wiki. Commit messages, in my opinion, should be about one line, and the changes themselves should be self-explanatory. But another big argument in the blog post is whether version control works for something like paper writing, and I believe this underscores something else: that paper writing is really a pretty messy procedure.

I think the "Google Docs" mode of writing is probably fine for many things, but it still needs a gatekeeper to incorporate the comments from coauthors and reviewers into the document in an organized way. In my experience as a "gatekeeper" while writing my senior thesis, I organized the paper with knitr, automated figure generation in R wherever possible, and then converted the paper to .docx to share with my advisors. I would then take their comments on the .docx and incorporate them back into my paper. This could be seen as burdensome ("why not just use Google Docs?"), but I felt it was a good way to fold review into a reproducible workflow.
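For concreteness, here is a minimal sketch of that kind of pipeline in R. The file names are made up, and rmarkdown::render is just one way to drive knitr, not necessarily the exact commands I used back then:

    # Hypothetical sketch: render one knitr/R Markdown source two ways.
    # "paper.Rmd" and the output file names are placeholders.
    library(rmarkdown)

    # A reproducible, typeset copy for myself (figures regenerate from code chunks).
    render("paper.Rmd", output_format = "pdf_document")

    # A .docx copy for advisors, so they can leave comments in Word,
    # which I then fold back into the .Rmd by hand.
    render("paper.Rmd", output_format = "word_document",
           output_file = "paper-for-review.docx")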

Now, this pipeline spares your PI from having to learn git to effectively make a pull request on your paper. That's a good thing... and we still have reproducibility. But what about the figures themselves? I said I had knitr for reproducible figures, but what about everyone else? I think figures have high value, so people might want to invest more reproducibility in them. The blog post claimed that making "complex" publication-quality figures is difficult (i.e. the plea for Adobe Illustrator), but look at the annotation functions in ggplot2 and its faceting for multi-panel figures; I found the annotation functions very easy to pick up. There is also the ongoing debate about ggplot2 vs base graphics on the simplystatistics blog, which covers making publication-quality figures, and last I checked, I think the ggplot2 camp was winning. I don't know how it works in high-profile journals like Nature, because it looks like they just redo all the figures to give them a consistent style, but that doesn't mean your original figure should be irreproducible.
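To make that concrete, here is a tiny hedged sketch; the dataset (the built-in mtcars), the annotation coordinates, and the output file name are all arbitrary examples, not anything from a real paper:

    # Hypothetical example of ggplot2 annotation + faceting for a multi-panel figure.
    library(ggplot2)

    p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
      geom_point() +
      facet_wrap(~ cyl) +                                    # one panel per cylinder count
      annotate("text", x = 4.5, y = 32, label = "heavier cars") +  # free-form label, no Illustrator needed
      labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
      theme_bw()

    # The figure is regenerated by the script, not drawn by hand.
    ggsave("figure1.pdf", p, width = 7, height = 3)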

The debate about reproducible figures is pretty tangible in things like microscopy images, too. Just look at the large amount of discussion on pubpeer about image fraud and possible duplications; that community clearly has some pretty sophisticated tools for hunting down possibly manipulated microscopy images. These things also lead to investigations: in the high-profile retraction case over STAP cells, it looks like the investigating committee simply asked how some figures were made, and upon finding that lab members did not know, a paper was retracted. The RetractionWatch blog covers these investigations: http://retractionwatch.com/2016/02/26/stap-stem-cell-researcher-obokata-loses-another-paper/

You can't depend on other people to back your figure up, so you need to take responsibility for making sure your papers and your work are reproducible (yes, there is a spectrum of reproducibility, but I believe version control is a great example of highly disciplined work). I also don't think that just having folders on some hard drive is a good way to do things. There is a saying in software development: "if it's not in version control, it doesn't exist." That's not to say that version control is for everything; big data obviously has trouble being stored in git. But that shouldn't block you from creating reproducible analyses.

Another example from the over-reproducibility blog post says that if you have "analysis1" and "analysis2", version control advocates would tell you to delete analysis1 and just remember that it is in your history. I think this is just a different issue. If you actually care about both analyses, make them separate repositories, each with a basic README.md explaining what it does, and stop worrying about it. Having one repository containing too many miscellaneous scripts is an anti-pattern: stop making repositories called "bioinfo-scripts" that are just a mish-mash of analysis scripts! Make each repository purpose-driven, built around well-defined tasks. This is also an argument against leaning on REPL tools: your R REPL history is not a reproducible script. Turn your code into a script that generates well-defined outputs and runs from the command line, as sketched below. (Windows users: you might not appreciate this because the command line on Windows is crippled, but you still have to make things run on the command line.)
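As a hedged illustration of what I mean (everything here is hypothetical: the file names, the column names, and the aggregation), a script like this can be re-run with Rscript and produces a well-defined output file:

    # analysis1.R -- run as: Rscript analysis1.R [input.csv] [output.csv]
    # A self-contained script instead of a pile of REPL history.
    args    <- commandArgs(trailingOnly = TRUE)
    infile  <- if (length(args) >= 1) args[1] else "data/counts.csv"      # hypothetical default paths
    outfile <- if (length(args) >= 2) args[2] else "results/summary.csv"

    counts <- read.csv(infile)                                            # assumes 'group' and 'count' columns
    summary_table <- aggregate(count ~ group, data = counts, FUN = mean)  # one number per group

    dir.create(dirname(outfile), showWarnings = FALSE, recursive = TRUE)
    write.csv(summary_table, outfile, row.names = FALSE)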

Now, I wish I could say that I live by my own words, but having been involved in coauthoring several papers, I have to admit that it is really a messy procedure despite my best intentions as an editor and coauthor. I wish things were better!

On over-reproducibility: there is no such thing! There are pretty good arguments for automating most of a process, especially one that is run repeatedly, to remove human error, because we meat-machines genuinely get things wrong all the time.

And, as my parents would say around the dinner table, "you can always have more, but you can never have less"... so you're not going to reach a point of over-reproducibility. We shouldn't cargo-cult it as the only way to do science, but it's not a bad thing to have.