by Graham Williams, Director of Data Science, Microsoft
Programming is an art and a way we express ourselves. As we write our programs we should keep in mind that someone else is very likely to be reading them. We can make our programs more accessible through a clear presentation of the messages we are sharing.
As data scientists we also practice this art of programming. Indeed, even more so: we live and breathe data, and we aim to share the narrative of our discoveries through the programs we write over that data. Writing programs so that others understand why and how we analysed our data is crucial. Data science is so much more than simply building black box analyses and models; we should be seeking to expose and share the process and, particularly, the knowledge that is discovered from the data.
Style is important in making the code we share readily accessible. Dictating a style to others is a sensitive issue. We thrive on our freedom to innovate and to express ourselves how we want, but we also need consistency in how we do that, and a style guide supports that consistency. A style guide also helps us journey through a new language, providing a foundation for developing, over time, our own style in that language.
Through a style guide we share the tips and tricks for communicating clearly through our programs. We communicate through the language, a language that also happens to be executable by a computer. In this language we follow precisely specified syntax to develop sentences, paragraphs, and whole stories. Whilst there is infinite leeway in how we express ourselves in any language, we can share a common set of principles as our style guide.
Over the years, the styles developed for many different languages have evolved together with the medium for interacting with computers. I have a style guide for R that presents my current personal choices. This is the style guide I suggest (even require) for projects I lead, and it will appear in an upcoming book.
I hope the guide might be useful to others. It augments the other R style guides out there by providing the rationale for my choices. Irrespective of whether specific style suggestions suit you or not, choose your own and use them consistently. Do focus on communicating with others in the first instance and secondarily on the execution of your code (critical though it is). Think of writing programs as writing narratives for others to read, to enjoy, to learn from and to build upon. It is a creative act to communicate well with our colleagues, so be creative with style.
Hands On Data Science: Sharing R Code — With Style
Thanks for the helpful tips.
The one suggestion I have concerns the distinction between the piping operator of magrittr & dplyr, "%>%", and the layer addition operator of ggplot2, "+". Given their different semantics, I've taken to placing the ggplot() call and its added layers inside an expression block of its own within the flow of the "pipe". For example:
ds %>%
  group_by(location) %>%
  mutate(rainfall=cumsum(risk_mm)) %>%
  {
    ggplot(., aes(date, rainfall)) +
      geom_line() +
      facet_wrap(~location) +
      theme(axis.text.x=element_text(angle=90))
  }
Note that this requires explicit specification of the data as first argument to ggplot() using the special ".".
Also the pipe can simply continue with processing of the plot object after the expression block by adding another "%>%" and whatever function call is required or another expression block to add more layers.
(The indentation above is the one I use, and it complies with your suggested style. If it is lost when the comment is rendered, paste the example into RStudio's editor, select it, and type Ctrl-I.)
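To illustrate continuing the pipe after the expression block, here is a minimal sketch. It assumes the magrittr and ggplot2 packages are installed; the `rain` data frame and its values are made up for illustration and merely stand in for `ds`.

```r
library(magrittr)
library(ggplot2)

# Hypothetical stand-in for the `ds` dataset (values invented for illustration).
rain <- data.frame(
  date     = rep(as.Date("2016-10-01") + 0:2, times = 2),
  location = rep(c("Canberra", "Sydney"), each = 3),
  risk_mm  = c(0, 1.2, 0.4, 3.0, 0, 2.2)
)

rain_plot <-
  rain %>%
  {
    # First block: build the plot; "." is the data arriving from the pipe.
    ggplot(., aes(date, risk_mm)) +
      geom_line() +
      facet_wrap(~location)
  } %>%
  {
    # Second block: "." is now the plot object, so more layers can be added.
    . +
      labs(x = "Date", y = "Rainfall (mm)")
  }
```

Inside each block "." refers to whatever arrived from the left of the preceding "%>%", so the second block receives the plot rather than the data. An ordinary function call such as print() could follow in the same way.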
Posted by: Michael Thompson | October 26, 2016 at 00:37
Great stuff.
Although my background is largely as a SAS programmer in a variety of environments, I'm struck by the closeness of our evolutionary paths. For example, using ##s to clearly indicate the "run-order" of a series of related modules, including 00 to indicate the setup section, is exactly what I do.
One thing you don't explicitly say is that the use of single-character or cryptically abbreviated names is very bad practice, unless the context is one where the norms are well established (such as x, y as positional parameters for a function). Every object should have a name that is self-evident to the next person who inherits the code, who could be you, the original author, three years later. Self-documentation is the easiest and best approach.
Posted by: Doug Dame | October 26, 2016 at 13:26
Hi Michael,
Nice idea to include the ggplot() and layers within an expression block. When properly formatted (as you note) the plot story stands out nicely as a separate stanza amongst the narrative rather than otherwise getting lost amongst the trees!
Thanks for the suggestion.
Posted by: Graham Williams | October 27, 2016 at 17:30
Hi Doug,
Thanks for the feedback.
The ##_stepname.R concept has certainly been around for a long time. It is also useful in scripts and configuration files, and over the years it has been used to order system initialisations in Linux/Unix and no doubt elsewhere.
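As a rough sketch of how the ##_stepname.R convention can drive run order, the base R snippet below writes three tiny scripts into a temporary directory and sources them in sorted order. The file names and the `run_log` variable are hypothetical, chosen only for illustration.

```r
# Hypothetical project directory, created under tempdir() for this sketch.
project_dir <- file.path(tempdir(), "rain_project")
dir.create(project_dir, showWarnings = FALSE)

# Create the step scripts deliberately out of order; each just logs its name.
steps <- c("20_model.R", "00_setup.R", "10_ingest.R")
for (step in steps) {
  writeLines(sprintf('run_log <- c(run_log, "%s")', step),
             file.path(project_dir, step))
}

# Two-digit prefixes mean an alphabetical sort is also the run order.
run_log <- character(0)
scripts <- sort(list.files(project_dir,
                           pattern = "^[0-9]{2}_.*\\.R$",
                           full.names = TRUE))
for (script in scripts) source(script, local = TRUE)

print(run_log)
```

Leaving gaps in the numbering (00, 10, 20) makes it easy to slot in a new step later without renaming the others.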
Good point about short variable names. I agree completely that each object should be named in a self-explanatory way, though without telling the whole story in the name itself (i.e., avoid overly long variable names).
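A small base R sketch of the trade-off, using made-up data and names that are purely illustrative:

```r
# Cryptic: the reader has to work out what d, l, x and r stand for.
d <- data.frame(l = c("Canberra", "Sydney", "Canberra"),
                x = c(1.2, 3.0, 0.4))
r <- tapply(d$x, d$l, sum)

# Self-explanatory: the names carry the meaning without further comment.
rainfall <- data.frame(location = c("Canberra", "Sydney", "Canberra"),
                       risk_mm  = c(1.2, 3.0, 0.4))
rain_by_location <- tapply(rainfall$risk_mm, rainfall$location, sum)

# Too long: the name tries to tell the whole story.
# total_rainfall_in_mm_summed_over_all_days_for_each_location <- ...

# Short names remain fine where convention is well established,
# such as x and y as positional arguments.
euclidean_distance <- function(x, y) sqrt(sum((x - y)^2))
```

Both versions compute the same totals; only the second can be read without tracing back to where each object was created.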
Posted by: Graham Williams | October 27, 2016 at 17:41