By Jay Emerson and Mike Kane
We’re very happy to announce our recent publication with Steve Weston in the Journal of Statistical Software (JSS), “Scalable Strategies for Computing with Massive Data”, JSS Volume 55 Issue 14. In a nutshell:
This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware.
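To make the abstract concrete, here is a minimal sketch of how the two packages can work together. It is our own illustration rather than code from the paper. It assumes the doParallel backend (any registered foreach backend would do), a two-worker local cluster, and a small made-up matrix; `col_sums` is a hypothetical helper name.

```r
library(foreach)
library(doParallel)  # assumed backend; other foreach backends could be registered instead
library(bigmemory)

# A shared-memory matrix: 1000 rows, 4 columns, with column j filled with the value j.
x <- big.matrix(nrow = 1000, ncol = 4, type = "double", init = 0)
for (j in 1:4) x[, j] <- rep(as.numeric(j), 1000)

# A descriptor is a small object that lets another R process attach to the same
# memory instead of receiving a copy of the data.
desc <- describe(x)

# Hypothetical helper: sum each column, one loop iteration per column.
# The loop body is identical no matter which backend is registered.
col_sums <- function(desc, ncols) {
  foreach(j = seq_len(ncols), .combine = c, .packages = "bigmemory") %dopar% {
    xs <- attach.big.matrix(desc)  # attach to the shared matrix
    sum(xs[, j])
  }
}

# Run sequentially on a single machine.
registerDoSEQ()
seq_sums <- col_sums(desc, 4)

# Run the same loop in parallel on a local cluster of two workers.
cl <- makeCluster(2)
registerDoParallel(cl)
par_sums <- col_sums(desc, 4)
stopCluster(cl)
registerDoSEQ()  # restore the sequential backend

print(seq_sums)
print(par_sums)
```

Only the backend registration changes between the two runs; the loop itself is untouched. Passing the descriptor rather than the matrix means each worker attaches to the existing data, which is the point of combining the two packages.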
We also welcome Pete Haverty from Genentech as an author on the bigmemory package. Pete and his colleagues at Genentech have made some substantial improvements to the package and are some of the heaviest users of these extensions (at least, to the best of our knowledge).
Secondly, we’d like to announce a new package, BH, with lead author and maintainer Dirk Eddelbuettel (he of Rcpp fame, also the first user and constructive critic of bigmemory). BH contains a subset of the Boost headers used by bigmemory and other packages, some of which are in active development and not yet on CRAN:
Boost provides free peer-reviewed portable C++ source libraries. A large part of Boost is provided as C++ template code, which is resolved entirely at compile-time without linking. This package aims to provide the most useful subset of Boost libraries for template use among CRAN packages. By placing these libraries in this package, we offer a more efficient distribution system for CRAN, as replication of this code in the sources of other packages is avoided.
New libraries from Boost may be included upon request (though we limit the package to header-only libraries, with no compiled code). Please visit our new GitHub site for more information.
Finally, we’d like to call attention to a change in JSS software license policy. With the publication of “Scalable Strategies for Computing with Massive Data”, JSS now accepts software licensed under either GPL-2 or GPL-3. GPL-3 in turn is compatible with Apache-2.0, and all these licenses are compatible with Boost’s very permissive BSL-1.0 license. This should help to broaden the range of software contributions documented and reviewed in JSS, and we are grateful to the Editors of JSS for this shift in policy.