by Joseph Rickert

One of the remarkable features of the R language is its adaptability. Motivated by R's popularity, and helped by R's expressive power and transparency, developers working on other platforms display what looks like inexhaustible creativity in providing seamless interfaces to software that complements R's strengths. The H2O R package that connects to 0xdata's H2O software (Apache 2.0 License) is an example of this kind of creativity.

According to the 0xdata website, H2O is "The Open Source In-Memory, Prediction Engine for Big Data Science". Indeed, H2O offers an impressive array of machine learning algorithms. The H2O R package provides functions for building GLM, GBM, K-means, Naive Bayes, Principal Components Analysis, Principal Components Regression, Random Forests and Deep Learning (multi-layer neural net) models. Examples, with timing information, of running all of these models on fairly large data sets are available on the 0xdata website, and execution speeds are very impressive. In this post, I thought I would start a little slower and look at H2O from an R point of view.

H2O is a Java Virtual Machine application that is optimized for doing "in memory" processing of distributed, parallel machine learning algorithms on clusters. A "cluster" here is a software construct that can be fired up on your laptop, on a server, or across the multiple nodes of a cluster of real machines, including computers that form a Hadoop cluster. According to the documentation, a cluster's "memory capacity is the sum across all H2O nodes in the cluster". So, as I understand it, if you were to build a 16-node cluster of machines, each having 64GB of DRAM, and installed H2O on all of them, then you could run the H2O machine learning algorithms using a terabyte of memory.

Underneath the covers, the H2O JVM sits on an in-memory, non-persistent key-value (KV) store that uses a distributed Java memory model. The KV store holds state information, all results and the big data itself. H2O keeps the data in a heap. When the heap gets full, i.e. when you are working with more data than physical DRAM, H2O swaps to disk. (See Cliff Click's blog for the details.) The main point here is that the data is not in R: R only has a pointer to the data, an S4 object containing the IP address, port and key name for the data sitting in H2O.

The H2O R package communicates with the H2O JVM over a REST API: R sends RCurl commands and H2O sends back JSON responses. Data ingestion, however, does not happen via the REST API. Rather, an R user calls a function that causes the data to be directly parsed into the H2O KV store. The H2O R package provides several functions for doing this, including: h2o.importFile(), which imports and parses files from a local directory; h2o.importURL(), which imports and parses files from a website; and h2o.importHDFS(), which imports and parses HDFS files sitting on a Hadoop cluster.

So much for the background: let's get started with H2O. The first thing you need to do is get Java running on your machine. If you don't already have Java, the default download ought to be just fine. Then fetch and install the H2O R package. Note that the h2o.jar executable is currently shipped with the h2o R package. The following code from the 0xdata website ran just fine from RStudio on my PC:

```r
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload = TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download, install and initialize the H2O package for R.
install.packages("h2o", repos = (c("http://s3.amazonaws.com/h2o-release/h2o/rel-kahan/5/R",
                                   getOption("repos"))))
library(h2o)
localH2O = h2o.init()

# Finally, let's run a demo to see H2O at work.
demo(h2o.glm)
```


Note that the function h2o.init() uses the defaults to start up H2O on your local machine. Users can also provide parameters to specify an IP address and port number in order to connect to a remote instance of H2O running on a cluster. h2o.init(Xmx = "10g") will start up the H2O KV store with 10GB of RAM. demo(h2o.glm) runs the GLM demo to let you know that everything is working just fine. I will save examining the model for another time. Instead, let's look at some other H2O functionality.

The first thing to get straight with H2O is when you are working in R and when you are working in the H2O JVM. The H2O R package implements several R functions that are wrappers to H2O native functions. "H2O supports an R-like language" (see A Note on R), but sometimes things behave differently than an R programmer might expect.

For example, the R code:

```r
y <- apply(iris[, 1:4], 2, sum)
y
```

produces the following result:

```
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       876.5        458.6        563.7        179.9
```

Now, let's see how things work in H2O. The following code loads the H2O package, starts a local instance of H2O, uploads the iris data set shipped with the H2O R package into the H2O instance, and produces a very R-like summary.

```r
library(h2o)            # Load the H2O package
localH2O = h2o.init()   # Initialize a local H2O instance

# Upload the iris file from the H2O package into the local H2O instance
iris.hex <- h2o.uploadFile(localH2O,
                           path = system.file("extdata", "iris.csv", package = "h2o"),
                           key = "iris.hex")
summary(iris.hex)
```

However, the apply() function from the H2O R package behaves a bit differently:

```
x <- apply(iris.hex[, 1:4], 2, sum)
x
IP Address: 127.0.0.1
Port      : 54321
Parsed Data Key: Last.value.17
```

Instead of returning the results, it returns a reference to the H2O object in which the results are stored. You can see this by looking at the structure of x.

```
str(x)
Formal class 'H2OParsedData' [package "h2o"] with 3 slots
  ..@ h2o  :Formal class 'H2OClient' [package "h2o"] with 2 slots
  .. .. ..@ ip  : chr "127.0.0.1"
  .. .. ..@ port: num 54321
  ..@ key  : chr "Last.value.17"
  ..@ logic: logi FALSE
H2O dataset 'Last.value.17': 4 obs. of 1 variable:
 $ C1: num 876.5 458.1 563.8 179.8
```

You can get the data out by coercing x into a data frame.

```
df <- as.data.frame(x)
df
     C1
1 876.5
2 458.1
3 563.8
4 179.8
```

So, as one might expect, there are some differences that take a little getting used to. However, the focus ought not to be on the differences from R, but on the potential of having some capabilities for manipulating huge data sets from within R. In combination, the H2O R package functions h2o.ddply() and h2o.addFunction(), the latter of which lets users push a new function into the H2O JVM, do a fine job of providing some ddply() features for H2O data sets.

The following code loads one year of the airlines data set from my hard drive into the H2O instance, gives me the dimensions of the data, and lets me know what variables I have.

```
path <- "C:/DATA/Airlines_87_08/2008.csv"
air2008.hex <- h2o.uploadFile(localH2O, path = path, key = "air2008")
dim(air2008.hex)
[1] 7009728      29
colnames(air2008.hex)
```

Then, using h2o.addFunction(), define a function to compute the average departure delay, and create a new H2O data set without the DepDelay missing values that would otherwise blow up the added function.

```r
# Define a function to compute the average of column 16 (DepDelay)
fun = function(df) { sum(df[, 16]) / nrow(df) }
h2o.addFunction(localH2O, fun)   # Push the function to H2O

# Filter out missing values
air2008.filt = air2008.hex[!is.na(air2008.hex$DepDelay), ]
head(air2008.filt)
```

Finally, run h2o.ddply() to get average departure delay by day of the week and pull down the results from H2O.

```
airlines.ddply = h2o.ddply(air2008.filt, "DayOfWeek", fun)
as.data.frame(airlines.ddply)
  DayOfWeek        C1
1         2  8.976897
2         6  8.645681
3         7 11.568973
4         4  9.772897
5         1 10.269990
6         5 12.158036
7         3  8.289761
```

Exactly what you would expect!
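For comparison, the same group-wise average on an ordinary in-memory data frame can be computed with base R's tapply(). The data frame below uses toy numbers I made up, not the real airline file:

```r
# Group-wise mean in base R (toy data, for illustration only)
toy <- data.frame(DayOfWeek = c(1, 1, 2, 2, 2, 3),
                  DepDelay  = c(10, 12, 8, 9, 10, 5))
tapply(toy$DepDelay, toy$DayOfWeek, mean)
#  1  2  3
# 11  9  5
```

h2o.ddply() does the same kind of split-apply-combine work, but inside the H2O KV store rather than in R's memory.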

Having h2o.ddply() limited to functions that can be pushed to H2O may seem restrictive to some. However, in the context of working with huge data sets I don't see this as a problem. Presumably the real data cleaning and preparation will be accomplished by other tools that are appropriate for the environment (e.g. Hadoop) where the data resides. In a future post, I hope to examine H2O's machine learning algorithms more closely. As it stands, from an R perspective H2O appears to be an impressive accomplishment and a welcome addition to the open source world.

In a November 2013 TED talk, Philip Evans describes how data will transform business by fundamentally reshaping strategy. He describes how, as the internet revolution moves into its third decade, defined by data sharing, falling transaction costs are removing the glue holding vertically integrated companies together and allowing smaller, more focused companies to compete and thrive.

Meanwhile, an op-ed in the New York Times describes several pitfalls that can arise in the implementation of Big Data, from combining data collected under different protocols, to feedback loops contaminating predictive models, to failure to anticipate how subjects will react to predictions from models (and thereby affect their predictive power). These issues will be familiar to statisticians and most data scientists, but seem destined to be rediscovered in the world of Big Data.

by Joseph Rickert

Generalized Linear Models have become part of the fabric of modern statistics, and logistic regression, at least, is a "go to" tool for data scientists building classification applications. The ready availability of good GLM software and the interpretability of its results make logistic regression a good baseline classifier. Moreover, Paul Komarek argues that, with a little bit of tweaking, the basic iteratively reweighted least squares algorithm used to evaluate the maximum likelihood estimates can be made robust and stable enough to allow logistic regression to challenge specialized classifiers such as support vector machines.
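The iteratively reweighted least squares (IRLS) algorithm Komarek refers to is compact enough to sketch in a few lines of base R. This is a minimal illustration of the idea only, not the actual glm() implementation (the function name and defaults here are my own):

```r
# Minimal IRLS sketch for logistic regression. X is a design matrix
# (including an intercept column), y is a 0/1 response vector.
irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(max_iter)) {
    eta <- drop(X %*% beta)
    mu  <- 1 / (1 + exp(-eta))   # inverse logit link
    w   <- mu * (1 - mu)         # IRLS weights
    z   <- eta + (y - mu) / w    # working response
    # Solve the weighted least squares normal equations
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}

# On well-behaved data this agrees with glm() to high precision:
irls_logistic(cbind(1, mtcars$wt), mtcars$am)
coef(glm(am ~ wt, data = mtcars, family = binomial))
```

The robustness work Komarek describes concerns exactly the cases where these iterations misbehave, for example under separation.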

It is relatively easy to figure out how to code a GLM in R. Even a total newcomer to R is likely to discover that the glm() function is part of the core R language within a minute or so of searching. Thereafter, though, it gets more difficult to find the other GLM-related resources that R has to offer. Here is a far from complete, but hopefully helpful, list of resources.

Online documentation that I have found helpful includes the contributed book by Virasakdi Chongsuvivatwong and the tutorials from Princeton and UCLA. Here is a slick visualization of a Poisson model from the Freakonometrics blog.

But finding introductory material on GLMs is not difficult. Almost all of the many books on learning statistics with R have chapters on the GLM, including the classic Modern Applied Statistics with S, by Venables and Ripley, and one of my favorite texts, Data Analysis and Graphics Using R, by Maindonald and Braun. It is more of a challenge, however, to sort through the more than 5,000 packages on CRAN to find additional functions that could help with various specialized aspects of, or extensions to, the GLM. So here is a short list of GLM-related packages.

*Packages to help with convergence and improve the fit*

- glm2 implements a refinement to the iteratively reweighted least squares algorithm in order to help with convergence issues commonly associated with nonstandard link functions.
- brglm fits binomial response models with a bias reduction method.
- safeBinaryRegression provides a function that overloads glm() to test for the existence of the maximum likelihood estimates for binomial models.
- pscl provides goodness-of-fit measures for GLMs.

*Packages for variable selection and regularization*

- bestglm selects a “best” subset of input variables for GLMs using cross validation and various information criteria.
- glmnet provides functions to fit linear regression, binary logistic regression and multinomial regression with convex penalties.
- penalized fits high-dimensional logistic and Poisson models with L1 and L2 penalties.

*Packages for special models*

- mlogit fits multinomial logit models.
- lme4 provides functions to fit mixed-effects GLMs.
- hglm fits hierarchical GLMs with both fixed and random effects.
- glmmML provides functions to fit binomial and Poisson models with clustering.

*Bayesian GLMs*

- arm provides functions for Bayesian GLMs. (Look here for a discussion of how Bayesian ideas can help with GLM problems.)
- bayesm contains functions for Bayesian GLMs including binary and ordinal probit, multinomial logit, multinomial probit models and more.
- MCMCglmm provides functions to fit mixed GLMs using MCMC techniques.

*GLMs for Big Data*

- The bigglm() function in the biglm package fits GLMs that are too big to fit into memory.
- The h2o package from 0xdata provides an R wrapper for the h2o.glm function for fitting GLMs on Hadoop and other platforms.
- speedglm fits GLMs to large data sets using an updating procedure.
- RevoScaleR (Revolution R Enterprise) provides parallel external memory algorithms for fitting GLMs on clusters, Hadoop, Teradata and other platforms.

*Generalized Additive Models (GAMs), which generalize GLMs*

- gam provides functions to fit generalized additive models.
- gamm4 fits mixed GAMs.
- mgcv provides functions to fit GAMs with multiple smoothing methods.
- VGAM provides functions to fit vector GLMs and GAMs.

Beyond the documentation and a list of packages that may be useful, it is also nice to have the benefit of some practical experience. John Mount has written prolifically about logistic regression in his Win-Vector Blog over the past few years. His post, How robust is logistic regression, is an illuminating discussion of convergence issues surrounding the Newton-Raphson / iteratively reweighted least squares algorithm. It contains pointers to examples illustrating the trouble caused by complete or quasi-complete separation, as well as links to the academic literature. This post is a classic, but all of the other posts in the series are very much worth the read.

Finally, as a reminder of the trouble you can get into interpreting t-values from a GLM, here is another classic, a post from the S-News archives on the Hauck-Donner phenomenon.
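To get a quick feel for why those z-values can mislead, here is a toy demonstration (invented data, my own illustration) of the extreme case: under complete separation the slope estimate diverges, its standard error explodes, and the Wald z-statistic collapses toward zero even though the fit is essentially perfect:

```r
# Completely separated toy data: every y = 1 has a larger x than every y = 0
x <- c(1, 2, 3, 4, 6, 7, 8, 9)
y <- c(0, 0, 0, 0, 1, 1, 1, 1)

# glm() warns about fitted probabilities of 0 or 1; suppress for the demo
fit <- suppressWarnings(glm(y ~ x, family = binomial))

# Huge estimate, enormous standard error, p-value near 1
summary(fit)$coefficients
```

The Hauck-Donner phenomenon is the subtler version of this effect: as the fit improves, the Wald statistic can decrease, so a small z-value is not by itself evidence against a coefficient.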

The fine folks behind the Big Data Journal have just published a new e-book, Big Data: Harnessing the Power of Big Data Through Education and Data-Driven Decision Making. (Note: Adobe Flash is required to view the e-book.) In the e-book, you'll find the following technical papers on the topics of Big Data, Data Science, and R:

- *Data Science and its Relationship to Big Data and Data-Driven Decision Making*, by Foster Provost and Tom Fawcett.
- *Predictive Modeling With Big Data: Is Bigger Really Better?*, by Enric Junqué de Fortuny, David Martens, and Foster Provost.
- *Educating the Next Generation of Data Scientists*, a roundtable discussion including Edd Dumbill, Elizabeth D. Liddy, Jeffrey Stanton, Kate Mueller, and Shelly Farnham.
- *Delivering Value from Big Data with Revolution R Enterprise and Hadoop*, by Thomas Dinsmore and Bill Jacobs.

There's also a video introduction to the papers from yours truly. View the e-book — sponsored by Revolution Analytics — here: Big Data: Harnessing the Power of Big Data Through Education and Data-Driven Decision Making

by Joseph Rickert

Recently, I had the opportunity to be a member of a job panel for Mathematics, Economics and Statistics students at my alma mater, CSUEB (California State University East Bay). In the context of preparing for a career in data science a student at the event asked: “Where can I find good data sets?”. This triggered a number of thoughts: the first being that it was time to update the list of data sets that I maintain and blog about from time to time. So, thanks to that reminder I have added a few new links to the page, including a new section called Data Science Practice that links to some of the data sets used as examples in *Doing Data Science* by Rachel Schutt and Cathy O’Neil. Additionally, I have provided a direct link to the BigData Tag on infochimps and pointed out that multiple song data sets are available.

However, to do justice to the student's question it is necessary to give some thought to exactly what a "good" practice data set might look like. Here are three characteristics that I think a practice data set should have to be good:

- It should be big enough to pose some computational challenges without being so big that it requires a cluster or some specialized hardware just to get started.
- It should require some cleaning or pre-processing (making decisions about missing data, for example) but not appear to be hopelessly corrupt; dirty, but not too dirty.
- It should be rich enough that once you have gone through the trouble of accessing and cleaning it there are enough variables or features to suggest multiple questions to analyze, or make it possible to try out different machine learning algorithms.

Here are three data sets that meet these criteria in ascending order of degree of difficulty:

The first suggestion is the MovieLens data set which contains a million ratings applied to over 10,000 movies by more than 71,000 users. The download comes in two sizes, the full set, and a 100K subset. Both versions require working with multiple files.

Near the top of anybody's list of practice data sets, and second on my little list because of degree of difficulty, is the airlines data set from the 2009 ASA challenge. This data set, which contains the arrival and departure information for all domestic flights from 1987 to 2008, has become the "iris" data set for Big Data. With over 123M rows it is too big to fit into your laptop's memory, and with 29 variables of different types it is rich enough to suggest several analyses. Moreover, although the version of the data set maintained on the ASA website is fixed, and therefore perfect for benchmarking, the Research and Innovative Technology Administration (RITA) Bureau of Transportation Statistics continues to add to the data on a monthly basis. Go to RITA to get all of the data collected since the ASA competition ended.

Last on my short list is the Million Song data set. This contains features and metadata for one million songs, originally provided by the music intelligence company Echo Nest. The data are in the specialized HDF5 format, which makes them somewhat of a challenge to access. The data set maintainers do provide wrapper functions to facilitate downloading the data and avoiding some of the complexities of the HDF5 format. However, there are no R wrappers! The last time I checked, the maintainers had a paragraph about there being a problem with their code, along with an invitation for R experts to contact them. (This would clearly be for extra points.) For more details about the contents of the data set look here.

As a final note, it is much easier to use R to analyze the Public Data Sets available through Amazon Web Services now that you can run Revolution R Enterprise in the Amazon Cloud. We hope to have more to say about exactly how to go about doing this in a future post. However, everything you need to get started is in place, including a 14-day free trial (Amazon charges apply) for Revolution R Enterprise. All you need is your own Amazon account.

Please let me know if you have additional links to useful, publicly available data sets that I have missed. We very much appreciate the contributions blog readers have made to the list of data sets.

If you missed our recent webinar Creating Value that Scales, you missed out on a live demonstration of the big-data analytics of Revolution R Enterprise embedded in the drag-and-drop visual workflow interface of Alteryx.

If you want to see how a decision-maker can use the results of workflows created by data scientists, skip ahead to 25:45 to see a demo of a simple web-based interface for creating a sales forecast by store and by product (using the R "arima" time series forecasting function in the background). Or if you'd like to see the Alteryx Designer in action, skip ahead to 34:44 to see how to create an Alteryx workflow to optimize a direct mail campaign using the big-data regression capabilities of Revolution R Enterprise.

You can also download the slides and video from the webinar at the link below.

Revolution Analytics webinars: Creating Value That Scales with Revolution Analytics & Alteryx

by Joseph Rickert

Last July, I blogged about rxDTree(), the RevoScaleR function for building classification and regression trees on very large data sets. As I explained then, this function is an implementation of the algorithm introduced by Ben-Haim and Yom-Tov in their 2010 paper, which builds trees on histograms of the data rather than on the raw data itself. The algorithm is designed for parallel and distributed computing. Consequently, rxDTree() provides the best performance when it is running on a cluster: either a Microsoft HPC cluster or a Linux LSF cluster.

rxDForest() (new with Revolution R Enterprise 7.0) uses rxDTree() to take the next logical step and implement a random forest style algorithm for building both classification and regression forests. Each tree of the ensemble constructed by rxDForest() is built from a bootstrap sample that uses about 2/3 of the original data. The data not used in building a particular tree are used to make predictions with that tree: each point of the original data set is fed through all of the trees that were built without it, and the decision forest prediction for that point combines the individual tree predictions (a majority vote for classification problems; the mean of the predictions for regression problems).
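The bootstrap / out-of-bag (OOB) scheme described above can be sketched in plain R with ordinary rpart trees. This toy version on the iris data is my own illustration of the mechanics, not RevoScaleR code; rxDForest() applies the same idea to out-of-memory data:

```r
# Toy bagging with out-of-bag (OOB) majority voting, using rpart on iris
library(rpart)

set.seed(42)
n <- nrow(iris)
n_tree <- 25
votes <- matrix(NA_character_, nrow = n, ncol = n_tree)

for (b in seq_len(n_tree)) {
  idx <- sample(n, replace = TRUE)      # bootstrap sample (~2/3 unique rows)
  fit <- rpart(Species ~ ., data = iris[idx, ])
  oob <- setdiff(seq_len(n), idx)       # rows this tree never saw
  votes[oob, b] <- as.character(predict(fit, iris[oob, ], type = "class"))
}

# OOB prediction for each row: majority vote over the trees that did not see it
oob_pred <- apply(votes, 1, function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) return(NA_character_)
  names(which.max(table(v)))
})
mean(oob_pred == iris$Species, na.rm = TRUE)   # OOB accuracy estimate
```

Because each row is predicted only by trees that never saw it, the OOB accuracy is an honest estimate of out-of-sample performance, with no separate test set required.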

Only a couple of parameters need to be set to fit a decision forest: nTree specifies the number of trees to grow and mTry specifies the number of variables to sample as split candidates at each tree node. Of course, many more parameters can be set to control the algorithm, including the parameters that control the underlying rxDTree() algorithm.

The following is a small example of the rxDForest() function using the mortgage default data set that can be downloaded from Revolution Analytics' website. Here are the first three lines of data.

```
  creditScore houseAge yearsEmploy ccDebt year default
1         615       10           5   2818 2000       0
2         780       34           5   3575 2000       0
3         735       12           1   3184 2000       0
```

The idea is to see if the variables creditScore, houseAge, etc. are useful in predicting a default. The RevoScaleR code in the accompanying RxDForest file reads in the mortgage data, splits the data into a training file and a test file, uses rxDTree() to build a single tree (just to see what one looks like for this data) and plots the tree. Then rxDForest() is run against the training file to build an ensemble model, and this model is run against the test file to make predictions. Finally, the code plots the ROC curve for the decision forest ensemble model.

Here is what the first few nodes of the tree look like. (The full tree is printed at the bottom of the code in the file above.)

```
Call:
rxDTree(formula = form1, data = "mdTrain", maxDepth = 5)

File: C:\Users\Joe.Rickert\Documents\Revolution\RevoScaleR\mdTrain.xdf
Number of valid observations: 8000290
Number of missing observations: 0

Tree representation:
n= 8000290

node), split, n, deviance, yval
      * denotes terminal node

 1) root 8000290 39472.30000 4.958445e-03
   2) ccDebt< 9085.5 7840182 21402.25000 2.737309e-03
     4) ccDebt< 7844 7384170  8809.46500 1.194447e-03
```

Here is a plot of the right part of the tree, drawn with RevoScaleR's createTreeView() function, which enables plot() to put the graph in your browser.

And, finally, here is the ROC curve for the decision forest model. (The text output describing the model is also in the file containing the code.)

I plan to try rxDForest() out on a cluster with a bigger data set. When I do, I will let you know.

Want to see how you can use a drag-and-drop user interface to run and share R code? Check out our webinar next Wednesday January 29 (hosted by Alteryx and Revolution Analytics): Creating Value That Scales with Revolution Analytics & Alteryx.

In the webinar, Dan Putler (Alteryx's Data Artisan in Residence) will demonstrate the drag-and-drop Alteryx GUI, which provides access to many R functions (and which you can extend with your own). And because it connects with Revolution R Enterprise, you also benefit from its performance and big-data capabilities. During the demo, Dan will show:

- How to create a complete data analysis workflow with the Alteryx drag-and-drop user interface;
- How to run R functions as workflow nodes, and create your own custom R-based nodes;
- How to analyze big data files without memory limitations, thanks to the Revolution R Enterprise XDF out-of-memory data object;
- How to publish an Alteryx workflow to a gallery, with an easy-to-use web-based interface that anyone can use to control its operation.

If you're an R user looking for an easier way to interact with R, or looking for a way to share access to your R code running on big data in an easy-to-use web-based interface, you'll want to register for the webinar at the link below.

Webinar Jan 29: Creating Value That Scales with Alteryx and Revolution Analytics

By Jay Emerson and Mike Kane

We’re very happy to announce our recent publication with Steve Weston in the Journal of Statistical Software (JSS), “Scalable Strategies for Computing with Massive Data”, JSS Volume 55 Issue 14. In a nutshell:

This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the **foreach** package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-specific code. Second, the **bigmemory** package implements memory- and file-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware.

We also welcome Pete Haverty from Genentech as an author on the **bigmemory** package. Pete and his colleagues at Genentech have made some substantial improvements to the package and are some of the heaviest users of these extensions (at least, to the best of our knowledge).

Secondly, we’d like to announce a new package, **BH**, with lead author and maintainer Dirk Eddelbuettel (he of **Rcpp** fame, also the first user and constructive critic of **bigmemory**). **BH** contains a subset of Boost headers used by **bigmemory** and other packages, some in active development and not yet on CRAN:

Boost provides free peer-reviewed portable C++ source libraries. A large part of Boost is provided as C++ template code, which is resolved entirely at compile-time without linking. This package aims to provide the most useful subset of Boost libraries for template use among CRAN packages. By placing these libraries in this package, we offer a more efficient distribution system for CRAN, as replication of this code in the sources of other packages is avoided.
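In practice, a package that wants these headers simply lists BH in its DESCRIPTION file so the Boost headers are on the include path at compile time. A hypothetical excerpt (the package name is invented):

```
Package: myPkg
Imports: Rcpp
LinkingTo: Rcpp, BH
```

Since BH is headers-only, nothing is linked; LinkingTo just makes the headers available when the package's C++ code is compiled.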

New libraries from Boost may be included upon request (though we limit it to headers only, with no compiled code). Please visit our new GitHub site for more information.

Finally, we’d like to call attention to a change in JSS software license policy. With the publication of “Scalable Strategies for Computing with Massive Data” JSS now accepts software licensed under either GPL-2 or GPL-3. GPL-3 in turn is compatible with Apache-2.0, and all these licenses are compatible with Boost’s very permissive BSL-1.0 license. This should help to broaden the software contributions documented and reviewed in JSS, and we are grateful to the Editors of JSS for this shift in policy.

Optimization is something that I hear clients ask for on a fairly regular basis. There are many problems that a few functions to carry out optimization can solve. `R` has much of this functionality in the base product, such as `nlm()` and `optim()`. There are also many packages that address this issue, as well as a task view devoted to it (Optimization and Mathematical Programming).

The reason I hear these requests is that clients wonder how to scale these problems up. Most of the time I am asked if we have a `scaleR` function that does optimization, i.e. optimization for “Big Data.” Though I am sure this will come in the near future, that doesn't mean that you can't do optimization on “Big Data” right now. As mentioned above, `R` has lots of functionality around optimization, and Revolution R Enterprise provides a framework for dealing with “Big Data.” Combining the two means that we can easily carry out optimization on “Big Data.”

Below is a description of some basic optimization in `R`, along with some code to demonstrate it, and then how to do the same sort of thing in Revolution R Enterprise (`RRE`).

`R`

`optim()` is often the first function used in `R` when one tackles an optimization problem, so we will also use it as a first step. In general we are trying to minimize a function, and here that will be the maximum likelihood estimate for a normal distribution. This is of course silly, since there is a closed form for the mean and standard deviation, but it does help clearly demonstrate optimization scaled up.

Let's generate some data for use in this section and in the next “Big Data” section. These data are not that large, but they will be broken into 4 “chunks” to demonstrate out-of-memory processing.
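The post does not show the data-generation code itself, so the following is a plausible stand-in consistent with the results reported below (sample mean and standard deviation both near 4); the size, seed, and distribution parameters are my assumptions:

```r
# Plausible reconstruction of the missing data-generation step (assumed values).
# In the original post these data would also be written in 4 chunks to the XDF
# file "optim-work.xdf" with RevoScaleR for the "Big Data" section.
set.seed(1234)
x <- rnorm(1e6, mean = 4, sd = 4)
theta <- c(mu = mean(x), sigma = sd(x))   # the "true" sample statistics
theta
```

Both elements of `theta` come out close to 4, matching the "standard" row in the output below.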

Below is a very simple maximum likelihood estimation of the mean and standard deviation using the log-likelihood function.

```r
logLikFun <- function(param, mleData) {
  mu    <- param[1]
  sigma <- param[2]
  -sum(dnorm(mleData, mean = mu, sd = sigma, log = TRUE))
}

timing1 <- system.time(
  mle1 <- optim(par = c(mu = 0, sigma = 1), fn = logLikFun, mleData = x)
)
rbind(mle = mle1$par, standard = theta)
##             mu sigma
## mle      3.990 3.993
## standard 3.991 3.994
```

As we can see above, the MLE is slightly different than the actual mean and standard deviation, but that is to be expected. We can also see that the process was pretty quick at 0.05 seconds.

`RRE`

As we saw above, the log-likelihood function only depends on `mu`, `sigma`, and the data. More specifically, it is a sum of a transformation of the data. Since it is a sum, if we calculate sums on parts of the data and then add up those partial sums, the result is equivalent to the sum over all the data. This is exactly what `rxDataStep()` in the `RevoScaleR` package allows us to do, i.e. step through the data, calculate a sum on each chunk, and update the total sum as we go. We can reuse the same log-likelihood function we defined for small data; we just need to define how that function is applied to individual chunks and then combined into a single return value.

```r
library(RevoScaleR)

mleFun <- function(param, mleVar, data, llfun) {
  mleXform <- function(dataList) {
    .rxModify("sumLL", llfunx(paramx, mleData = dataList[[mleVarx]]))
    return(NULL)
  }
  rxDataStep(inData = data, transformFunc = mleXform,
             transformObjects = list(llfunx = llfun, sumLL = 0,
                                     paramx = param, mleVarx = mleVar),
             returnTransformObjects = TRUE,
             reportProgress = 0)$sumLL
}

timing2 <- system.time(
  mle2 <- optim(par = c(mu = 0, sigma = 1), fn = mleFun,
                mleVar = "x", data = "optim-work.xdf", llfun = logLikFun)
)
all.equal(mle1$par, mle2$par)
## [1] TRUE
rbind(mle1 = mle1$par, mle2 = mle2$par, standard = theta)
##             mu sigma
## mle1     3.990 3.993
## mle2     3.990 3.993
## standard 3.991 3.994
```

Again, the MLE is slightly different than the actual mean and standard deviation, but it is equivalent to our in-memory calculation using the log-likelihood. We can also see that the process was a bit slower at 5.6 seconds.

The data we are using are not that large, but they are being processed from disk and iterated over, which means the computation will be slower than the equivalent in-memory computation. We should not be too surprised by this. What is of more interest is that, utilizing the big data framework in `RRE`, the same optimization can be carried out on data much larger than the available RAM, and as the data sizes grow the overhead that we see above is diminished. As with anything involving “Big Data”, this just means that you need to think a bit more about what you are doing (*probably good advice all the time*).