*by Derek McCrae Norton - Senior Sales Engineer*

Optimization is something that I hear clients ask for on a fairly regular basis. There are many problems that a few functions to carry out optimization can solve. `R`

has much of this functionality in the base product, such as `nlm()`

, and `optim()`

. There are also many packages that address this issue, as well as a task view devoted to it (Optimization and Mathematical Programming).

The reason I hear requests is that clients wonder how to scale these problems up. Most of the time I am asked if we have a `scaleR`

function that does optimization, i.e. optimization for “Big Data.” Though I am sure this will come in the near future, that doesn't mean that you can't do optimization on “Big Data” right now. As mentioned above, `R`

has lots of functionality around optimization, and Revolution R Enterprise provides a framework for for dealing with “Big Data.” Combining these two means that we can easily carry out optimization on “Big Data.”

Below is a description of some basic optimization in `R`

as well as some code to demonstrate, and then how to do the same sort of thing in `Revolution R Enterprise`

(`RRE`

).

## Optimization in `R`

`R`

`optim()`

is often the first function used in `R`

when one tackles an optimization problem, so we will also use it as a first step. In general we are trying to minimize a function, and here that will be the maximum likelihood estimate for a normal distribution. This is of course silly since there is a closed form for the mean and standard deviation, but it does help clearly demonstrate optimization scaled up.

Let's generate some data for use in this section and in the next “Big Data” section. These data are not that large but they will be broken down into 4 “chunks” to demonstrate out of memory processing.

Below is a very simple maximum likelihood estimation of the mean and standard deviation using the log-likelihood function.

logLikFun <- function(param, mleData) { mu <- param[1] sigma <- param[2] -sum(dnorm(mleData, mean = mu, sd = sigma, log = TRUE)) } timing1 <- system.time(mle1 <- optim(par = c(mu = 0, sigma = 1), fn = logLikFun, mleData = x)) rbind(mle = mle1$par, standard = theta) ## mu sigma ## mle 3.990 3.993 ## standard 3.991 3.994

As we can see above, the MLE is slightly different than the actual mean and standard deviation, but that is to be expected. We can also see that the process was pretty quick at 0.05 seconds.

## “Big Data” Optimization in `RRE`

`RRE`

As we saw above, the log-likelihood function only depends on `mu`

, `sigma`

, and the data. More specifically it is a sum of a transformation of the data. Since it is a sum, that means that if we calculate a sum on parts of the data and the sum up the sums, it will be equivalent to the sum of all the data. This is exactly what `rxDataStep()`

in the `RevoScaleR`

package allows us to do, i.e. step through data, calculate a sum on each chunk, and update the total sum as we go. We can reuse the same log-likelihood function we defined for small data, we just need to define how that function will be used on individual chunks and then combined into a single return value.

library(RevoScaleR) mleFun <- function(param, mleVar, data, llfun) { mleXform <- function(dataList) { .rxModify("sumLL", llfunx(paramx, mleData = dataList[[mleVarx]])) return(NULL) } rxDataStep(inData = data, transformFunc = mleXform, transformObjects = list(llfunx = llfun, sumLL = 0, paramx = param, mleVarx = mleVar), returnTransformObjects = TRUE, reportProgress = 0)$sumLL } timing2 <- system.time(mle2 <- optim(par = c(mu = 0, sigma = 1), fn = mleFun, mleVar = "x", data = "optim-work.xdf", llfun = logLikFun)) all.equal(mle1$par, mle2$par) ## [1] TRUE rbind(mle1 = mle1$par, mle2 = mle2$par, standard = theta) ## mu sigma ## mle1 3.990 3.993 ## mle2 3.990 3.993 ## standard 3.991 3.994

Again, the MLE is slightly different than the actual mean and standard deviation, but it is equivalent to our in-memory calculation using the log-likelihood. We can also see that the process was a bit slower at 5.6 seconds.

The data we are using are not that large, but it is being processed from disk and iterated over which means that it will be slower than the equivalent in memory computation. We should not be too surprised by this. What is of more interest is that utilizing the big data framework in `RRE`

, the same optimization can be carried out on data much larger than the available RAM, and as the data sizes grow the overhead that we see above is diminished. As with anything involving “Big Data”, this just means that you need to think a bit more about what you are doing (*Probably good advice all the time*).