by Derek McCrae Norton - Senior Sales Engineer
Optimization is something I hear clients ask about on a fairly regular basis, and a wide range of problems can be solved with a handful of optimization functions. R has much of this functionality in the base product, such as nlm() and optim(). There are also many packages that address optimization, as well as a task view devoted to it (Optimization and Mathematical Programming).
The reason I hear these requests is that clients wonder how to scale such problems up. Most of the time I am asked if we have a ScaleR function that does optimization, i.e. optimization for “Big Data.” Though I am sure this will come in the near future, that doesn't mean you can't do optimization on “Big Data” right now. As mentioned above, R has lots of functionality around optimization, and Revolution R Enterprise provides a framework for dealing with “Big Data.” Combining the two means that we can easily carry out optimization on “Big Data.”
Below is a description of some basic optimization in R, along with code to demonstrate it, followed by how to do the same sort of thing in Revolution R Enterprise (RRE).
Optimization in R
optim() is often the first function used in R when one tackles an optimization problem, so we will also use it as a first step. In general we are trying to minimize a function; here that will be the negative log-likelihood of a normal distribution, so that the minimizer is the maximum likelihood estimate of the mean and standard deviation. This is of course silly since there is a closed form for the mean and standard deviation, but it does help clearly demonstrate optimization scaled up.
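For reference, the quantity being minimized is the negative log-likelihood of an i.i.d. normal sample,

-\ell(\mu, \sigma) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2,

which is exactly what -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE)) computes in the code below.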
Let's generate some data for use in this section and in the next “Big Data” section. The data are not that large, but they will be broken down into 4 “chunks” to demonstrate out-of-memory processing.
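Here is a minimal sketch of a setup consistent with the code that follows; the seed, sample size, and true parameter values are assumptions (the true mean and standard deviation are taken to be 4, in line with the estimates reported later), not the original values.

library(RevoScaleR)

# Assumed setup: arbitrary seed, 100,000 observations, true mu = sigma = 4.
set.seed(123)
x <- rnorm(1e5, mean = 4, sd = 4)

# Closed-form estimates, used as the "standard" for comparison below.
theta <- c(mu = mean(x), sigma = sd(x))

# Write the same data to an XDF file in 4 blocks for the "Big Data" section.
rxDataStep(inData = data.frame(x = x), outFile = "optim-work.xdf",
           rowsPerRead = length(x) / 4, overwrite = TRUE)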
Below is a very simple maximum likelihood estimation of the mean and standard deviation using the log-likelihood function.
logLikFun <- function(param, mleData) {
    mu <- param[1]
    sigma <- param[2]
    -sum(dnorm(mleData, mean = mu, sd = sigma, log = TRUE))
}

timing1 <- system.time(
    mle1 <- optim(par = c(mu = 0, sigma = 1), fn = logLikFun, mleData = x)
)

rbind(mle = mle1$par, standard = theta)
##             mu sigma
## mle      3.990 3.993
## standard 3.991 3.994
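As a quick aside, optim() also returns a convergence code worth checking, and a bounded method avoids proposing non-positive values of sigma, for which dnorm() returns NaN. A minimal sketch (the lower bound of 1e-8 is an arbitrary choice):

# 0 in $convergence indicates successful convergence.
mle1$convergence

# Same fit with box constraints keeping sigma strictly positive.
optim(par = c(mu = 0, sigma = 1), fn = logLikFun, mleData = x,
      method = "L-BFGS-B", lower = c(-Inf, 1e-8))$par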
As we can see above, the MLE is slightly different from the closed-form mean and standard deviation, but that is to be expected: optim() stops once changes in the objective fall below a numerical tolerance, and the MLE of the standard deviation divides by n rather than n - 1. We can also see that the process was pretty quick, at about 0.05 seconds.
“Big Data” Optimization in RRE
As we saw above, the log-likelihood function depends only on mu, sigma, and the data. More specifically, it is a sum of a transformation of the data. Because it is a sum, calculating a sum on each part of the data and then summing up those partial sums is equivalent to summing over all the data at once. This is exactly what rxDataStep() in the RevoScaleR package allows us to do: step through the data, calculate a sum on each chunk, and update the running total as we go. We can reuse the same log-likelihood function we defined for the small-data case; we just need to define how that function is applied to individual chunks and how the results are combined into a single return value.
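To see that additivity in plain R first (the split into 4 chunks below is arbitrary and purely illustrative):

# The negative log-likelihood is additive over chunks: summing the value
# computed on each chunk matches the value computed on the full vector,
# up to floating-point tolerance.
chunks <- split(x, rep_len(1:4, length(x)))
byChunk <- sum(sapply(chunks, function(ch) logLikFun(c(mu = 0, sigma = 1), mleData = ch)))
all.equal(byChunk, logLikFun(c(mu = 0, sigma = 1), mleData = x))

rxDataStep() performs the same accumulation below, but reading one chunk at a time from disk rather than holding x in memory.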
library(RevoScaleR)

mleFun <- function(param, mleVar, data, llfun) {
    # Transform function applied to each chunk: add this chunk's
    # negative log-likelihood to the running total in sumLL.
    mleXform <- function(dataList) {
        .rxModify("sumLL", llfunx(paramx, mleData = dataList[[mleVarx]]))
        return(NULL)
    }
    # Step through the data on disk, carrying sumLL along as a transform
    # object, and return the accumulated total.
    rxDataStep(inData = data, transformFunc = mleXform,
               transformObjects = list(llfunx = llfun, sumLL = 0,
                                       paramx = param, mleVarx = mleVar),
               returnTransformObjects = TRUE, reportProgress = 0)$sumLL
}

timing2 <- system.time(
    mle2 <- optim(par = c(mu = 0, sigma = 1), fn = mleFun, mleVar = "x",
                  data = "optim-work.xdf", llfun = logLikFun)
)

all.equal(mle1$par, mle2$par)
## [1] TRUE

rbind(mle1 = mle1$par, mle2 = mle2$par, standard = theta)
##             mu sigma
## mle1     3.990 3.993
## mle2     3.990 3.993
## standard 3.991 3.994
Again, the MLE is slightly different from the closed-form mean and standard deviation, but it is identical to our in-memory calculation using the log-likelihood, as all.equal() confirms. We can also see that the process was a bit slower, at about 5.6 seconds.
The data we are using are not that large, but they are being read from disk and iterated over at every evaluation of the objective, which means the computation will be slower than the equivalent in-memory version. We should not be too surprised by this. What is more interesting is that, using the big data framework in RRE, the same optimization can be carried out on data much larger than the available RAM, and as the data size grows the relative overhead we see above diminishes. As with anything involving “Big Data”, this just means that you need to think a bit more about what you are doing (probably good advice all the time).
This is very cool! Could you do an example of GMM following the same principles?
Posted by: Randcore | January 09, 2014 at 09:06
interesting
Posted by: omar | January 13, 2014 at 20:24
this is a nice parallel method. however, with simple models like a normal distribution, you should check if sufficient statistics exist. for the normal distribution, there is no need to loop through each data point, get each data point's log pdf, and sum the results. if you work out the math you will find that you only need the sum and sum of squares from the entire data set, regardless of size, to get the likelihood. therefore you only need one initial loop to get the sufficient statistics, as opposed to looping for each optimization iteration.
however, this is dependent on the underlying statistical model. complicated or convoluted likelihoods, or distributions without a full set of sufficient statistics, will need something like this.
Posted by: Richard | January 16, 2014 at 08:04
Richard, thanks for the comment; however, your point about sufficient statistics is already mentioned in the post, i.e. "This is of course silly since there is a closed form for the mean and standard deviation, but it does help clearly demonstrate optimization scaled up."
Posted by: Derek Norton | January 16, 2014 at 08:21