Hadley Wickham's dplyr package is an amazing tool for restructuring, filtering, and aggregating data sets using its elegant grammar of data manipulation. By default, it works on in-memory data frames, which means you're limited to the amount of data you can fit into R's memory. Hadley also provided an extension mechanism to make dplyr work with external data sources, and so Hong Ooi created the dplyrXdf package to work with Xdf data files. With dplyrXdf you can manipulate data files of virtually unlimited size using R, and even use the pipe operator %>% from the magrittr package.
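As a rough illustration of the grammar and the pipe operator mentioned above, here is a minimal sketch on an ordinary in-memory data frame. It uses the built-in `mtcars` data set and plain dplyr only. The column choices and the `mpg_by_cyl` name are just for illustration and do not come from the cheat sheet.

```r
library(dplyr)  # loading dplyr also makes the %>% pipe available

# mtcars ships with base R, so this runs without any external files.
# Each verb takes a data frame and returns a new one, so the steps chain.
mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                              # keep the more powerful cars
  group_by(cyl) %>%                                 # one group per cylinder count
  summarise(n_cars = n(), mean_mpg = mean(mpg)) %>% # aggregate within each group
  arrange(desc(mean_mpg))                           # highest average mpg first

print(mpg_by_cyl)
```

The idea behind dplyrXdf is that a pipeline in this style can be pointed at an Xdf data source instead of a data frame. That variant is not shown here because it needs Microsoft R Client or Microsoft R Server to run.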
To use the dplyrXdf package, you will need to use Microsoft R Client (free download for Windows) or Microsoft R Server (on Windows, Linux, Hadoop or HDInsight with Spark). The Xdf files you create can then be used with the big-data functions of the included ScaleR package, enabling you to use R to perform statistical analysis of files hundreds of gigabytes in size.
To help you get started with the dplyrXdf package, Hong has created a new dplyrXdf cheat sheet (pdf). This handy, printable two-page document explains how dplyrXdf:
- Extends the dplyr framework to large, on-disk data sets
- Simplifies the current interface to xdf functionality
- Handles the task of file management for the user
- Is transparent to other xdf-aware functions
It also includes some extended examples of working with big data in dplyrXdf and analyzing it with the ScaleR package. To download the cheat sheet, click on the link below.
Microsoft Advanced Analytics: dplyrXdf cheat sheet