Welcome to the web home of the Stoltzfus research group at The Institute Formerly Known as CARB.

Best practices for scientific programmers - top ten

Today I'm teaching a session on "Best Practices" in a "Programming for Biologists" course.  My course materials are online (feedback from other instructors is welcome).  I'll start out with my "top ten" list:

  • Interface, interface, interface
  • Modularize
  • Write code to be understood
  • Write tests and trap errors
  • Stamp your output
  • Use revision control
  • Make use of prior art
  • Create an installable package
  • Make your project open source
  • Set up a project management infrastructure

The first 3 are universally important, not just for scientists.  For scientists without formal training, I would stress the practice of designing interfaces before writing the "guts" of the code.  To stress the importance of interfaces, we have an exercise to write the skeleton of a script that has only 1 line of bioinformatics (going to NCBI to get something), and all the rest is interface-- including command-line options, help message, internal documentation, output-translation.  
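
As a rough illustration of that exercise, here is a minimal Python sketch of such a skeleton. It is not the actual course material: the script name `fetch_seq.py`, the option names, and the choice of the NCBI E-utilities `efetch` service with FASTA output are all assumptions made for the example. The point is the proportion: one call to NCBI, and everything else is interface.

```python
"""fetch_seq.py (hypothetical name): fetch one FASTA record from NCBI.

Almost everything here is interface: options, help text, documentation,
and output handling. Only one statement talks to NCBI.
"""
import argparse
import sys
import urllib.parse
import urllib.request

# NCBI E-utilities efetch endpoint (assumed here as the way to "get something").
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_parser():
    """Define the command-line interface before writing any 'guts'."""
    parser = argparse.ArgumentParser(
        description="Fetch a sequence record from NCBI in FASTA format.")
    parser.add_argument("accession", help="NCBI accession, e.g. NM_000546")
    parser.add_argument("-d", "--db", default="nucleotide",
                        choices=["nucleotide", "protein"],
                        help="NCBI database to query (default: %(default)s)")
    parser.add_argument("-o", "--output", default="-",
                        help="output file, or '-' for standard output")
    return parser


def build_url(accession, db):
    """Translate the user's options into an efetch query URL."""
    query = urllib.parse.urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"})
    return f"{EFETCH_URL}?{query}"


def main(argv=None):
    """Parse options, fetch the record, and write it where the user asked."""
    args = build_parser().parse_args(argv)
    url = build_url(args.accession, args.db)
    # The one line of bioinformatics: ask NCBI for the record.
    with urllib.request.urlopen(url) as response:
        record = response.read().decode("utf-8")
    # Output translation: a named file or standard output.
    if args.output == "-":
        sys.stdout.write(record)
    else:
        with open(args.output, "w") as handle:
            handle.write(record)
    return 0
```

Calling `main()` from the usual `if __name__ == "__main__":` guard turns this into a command-line script, and `-h` then prints a help message generated entirely from the interface definition.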

To prioritize the others, we have to take into account the typical conditions of scientific programming.  In my experience, the typical scientific software product:  

  • is supposed to do one thing accurately and reproducibly 
  • has one or a few users 
  • has a short (project-specific) lifespan 

This kind of software typically is not-- and does not need to be-- robust (to various inputs), optimized for performance, or well documented. For the scientific programmer developing scripts for a project, the focus should be on accuracy and provenance, so that when you are getting ready to write up your results for publication, you know exactly what was done and when it was done. Revision control is part of keeping an accurate record-- it lets you know which version of a script was current at each point in time.  Stamping all of your output files with name, date, and version is also critical to record-keeping.
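
One way to stamp output is sketched below in Python. The header layout, the function names `make_stamp` and `write_stamped`, and the hand-maintained `__version__` string are assumptions for illustration, not a prescribed format. The idea is only that every output file carries the script name, version, date, and command that produced it.

```python
import datetime
import getpass

__version__ = "0.1.0"  # hypothetical: bump by hand or derive from a revision tag


def make_stamp(script_name, version, argv, when=None, user=None):
    """Return comment lines recording what ran, which version, who, and when."""
    when = when or datetime.datetime.now().astimezone()
    user = user or getpass.getuser()
    return [
        f"# generated by: {script_name} version {version}",
        f"# run by: {user}",
        f"# date: {when.isoformat(timespec='seconds')}",
        f"# command: {' '.join(argv)}",
    ]


def write_stamped(path, rows, stamp):
    """Write the stamp lines, then tab-separated data rows, to path."""
    with open(path, "w") as handle:
        for line in stamp:
            handle.write(line + "\n")
        for row in rows:
            handle.write("\t".join(str(value) for value in row) + "\n")
```

In a real script the stamp would typically be built from `sys.argv` and `__version__`, so that the record matches whatever revision control says was current on that date.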

Even if you are just writing scripts for yourself, it's helpful to get in the habit of writing scripts as though your friends and colleagues were going to use them as well.  This means keeping the interfaces clean and providing internal documentation.  Even if you are the only one who uses your scripts, you will find this helpful-- every programmer learns within a few years that we quickly forget things like how a script was used, or why it was written in a particular way.

Most scientific programmers do not spend nearly enough time incorporating tests and traps into their code, even though both are critical for getting accurate and robust code. And of course, once you get to the stage of having a package of inter-dependent software parts, it is essential to have a test suite that allows you to maintain functionality while adapting and improving the code.
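
To make "tests and traps" concrete, here is a small Python sketch. The function `gc_content` and its rules (ignore `N`, reject anything outside `ACGTN`) are invented for the example. The traps are the explicit checks that stop bad input before it turns into a plausible-looking wrong number, and the tests pin down the expected behavior so it survives later changes.

```python
import unittest

VALID_BASES = set("ACGTN")


def gc_content(sequence):
    """Return the fraction of G and C among the unambiguous bases (A, C, G, T)."""
    # Traps: fail loudly on input the calculation was never meant to handle.
    if not isinstance(sequence, str):
        raise TypeError(f"expected a string, got {type(sequence).__name__}")
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    bad = set(seq) - VALID_BASES
    if bad:
        raise ValueError(f"unexpected characters: {sorted(bad)}")
    counted = len(seq) - seq.count("N")
    if counted == 0:
        raise ValueError("sequence contains only N")
    return (seq.count("G") + seq.count("C")) / counted


class TestGcContent(unittest.TestCase):
    def test_known_values(self):
        self.assertEqual(gc_content("GGCC"), 1.0)
        self.assertEqual(gc_content("ATAT"), 0.0)
        self.assertAlmostEqual(gc_content("acgt"), 0.5)

    def test_ambiguous_bases_are_ignored(self):
        self.assertAlmostEqual(gc_content("GNNA"), 0.5)

    def test_bad_input_is_trapped(self):
        with self.assertRaises(ValueError):
            gc_content("ACGU")  # RNA slipped into a DNA pipeline
        with self.assertRaises(ValueError):
            gc_content("")
        with self.assertRaises(TypeError):
            gc_content(None)
```

Saved in a module, the tests can be run with `python -m unittest`, and rerun after every change to the code.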

If I had a second class session to use, I would focus the whole thing on writing tests and traps.

The state of the art in re-usable trees

We recently completed an analysis of current practices for archiving trees and associated data, current practices for re-using trees, and barriers to re-use experienced by users (Stoltzfus, O'Meara, Whitacre, Mounce, Kumar, Rosauer & Vos).  The results of this analysis have helped us to think strategically about how technology and standards can be used to facilitate data re-use, so as to promote integrative and synthetic science.  

The project started at the 2010 TDWG meeting in Woods Hole, where the Phylogenetic Standards interest group held a workshop.  Dan Rosauer, Jamie Whitacre, Torsten Eriksson and I wanted to assess the current state of the art in publishing trees that can be linked into that big world of data out there.  Over the next year, this project gathered collaborators and morphed into a larger analysis of sharing and re-use of phylogenetic trees and associated data.  Probably the most interesting thing we did was to get a more systematic sense of what is going on "in the wild" by examining several samples of randomly or arbitrarily chosen phylogeny-related papers (ones that match the term "phylogen*").  We discovered that producers of phylogenies rarely make their results easily accessible by archiving them.  Most trees remain on someone's hard drive, apparently.  In spite of some interest in a MIAPA (minimum information about a phylogenetic analysis) standard, currently there are no community standards to guide users as to what kinds of data and metadata to include in order to facilitate data re-use.  Users who are interested in re-using published trees face many barriers due to the difficulty of discovering, accessing, decoding, interpreting and evaluating phylogenetic results.

Nevertheless, in spite of the generally dismal state of things, we found a lot of room for optimism. While the overall rate of archiving is low, various types of information are being made available.  While most studies do not rely on re-used data (other than sequences from GenBank), a large minority of studies re-use alignments or trees.  We actually found 5 different studies that use the APG (Angiosperm Phylogeny Group) tree for plants via Phylomatic, which provides grafting and pruning operations so that users can make a custom tree for the set of species they wish to analyze.  And of course, there are some high-profile cases of re-use, the most extensive of which is probably TimeTree, which synthesizes information from nearly 1000 publications, together with a "tree of life" (I'm putting that in quotation marks so as not to offend purists, because the tree is actually the NCBI taxonomy hierarchy), to provide users with estimated dates of divergence.  TimeTree literally gets tens of thousands of queries per month.  

Our overall impression is that, due to recent developments in regard to policies, software, infrastructure, and community organizing, evolutionary informatics is poised for a great leap forward-- if a broader community of stakeholders can get involved.  
