by Jamie F Olson
Professional Services Consultant, Revolution Analytics
One challenge in transitioning R code into a production environment is ensuring consistency and reliability. These challenges span a wide variety of issues, but runtime behavior is a particularly important operational characteristic. Specifically, production code should have a consistent, predictable runtime on a given computational infrastructure. Among other things, this makes it possible to plan and scale IT infrastructure based on operational requirements.
Analytics in general, and R in particular, possess certain characteristics that make this challenging. Many statistical models don't have a single, consistent run time cost but are instead based on iterative algorithms that continue until some convergence criterion is met. This means that the actual run time for any given model can depend significantly on the data being modeled. This is particularly important considering that in production analytics workflows, the data is continuously changing.
Let's consider the TBATS exponential smoothing multi-seasonal state space model in the forecast package, fit with the tbats function. We repeatedly take random subsets of the "taylor" data to demonstrate the variability in run time:
library(forecast)
subsample_ts <- function(x, N, f = frequency(x)) {
    x_i <- floor(runif(1, 1, length(x) - N))
    msts(x[x_i:(x_i + N)], f)
}
times <- replicate(10, system.time(tbats(subsample_ts(taylor, 100, c(48, 336)),
                                          use.parallel = FALSE)))
sd(times["elapsed", ])
## [1] 0.9392113
summary(times["elapsed", ])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.963   4.816   5.568   5.522   6.140   7.085
As you can see, certain configurations of the data can require more time to model. To address this, we may want to ensure that any particular model has a pre-defined maximum run time. This does not prevent the run time from varying, but it does ensure an upper bound on run time performance.
There are a variety of tools we can use in R to implement this. For example, the base::setTimeLimit function allows us to set time limits for any "top-level computation".
system.time(Sys.sleep(5))
##    user  system elapsed
##   0.000   0.000   5.005
system.time(local({
    setTimeLimit(elapsed = 1, transient = TRUE)
    Sys.sleep(5)
}))
## Error in Sys.sleep(5): reached elapsed time limit
## Timing stopped at: 0 0 5.006
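The example above uses transient = TRUE, so the limit applies only to the current top-level computation. As a small sketch, the persistent form (the default, transient = FALSE) applies to every subsequent top-level call until the limit is reset:

setTimeLimit(elapsed = 10)   # each subsequent top-level computation is limited to 10s elapsed
# ... run work here; any single top-level call exceeding the limit is interrupted
setTimeLimit(elapsed = Inf)  # remove the limit again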
This simple function can be extremely useful in many scenarios, but it may not be enough in production environments. The setTimeLimit documentation describes a key limitation: Time limits are checked ... "only at points in compiled C and Fortran code identified by the code author."
Note: It's important to mention that just because a function uses C code doesn't mean that setTimeLimit won't work as expected. The "top-level" R code in such functions can be interrupted, as can additional points in the C code that are explicitly checked in the source code, but it's up to the package author to implement these checks. In situations where this is not a concern, you may also be interested in the withTimeout function in the R.utils package.
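For instance, a minimal sketch of withTimeout applied to the earlier Sys.sleep example might look like the following; note that withTimeout is itself built on setTimeLimit, so the same caveat about compiled code applies. The onTimeout = "silent" setting shown here makes a timeout return NULL rather than signal an error:

library(R.utils)
# Limit the expression to roughly 1 second of elapsed time; with
# onTimeout = "silent", a timeout returns NULL instead of raising an error.
res <- withTimeout(Sys.sleep(5), timeout = 1, onTimeout = "silent")
is.null(res)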
This can create a problem for statistical models that offload the computation to C code, like the neural network models for autoregressive processes estimated by the nnetar function from the forecast package. In these cases setTimeLimit may not have the intended effect:
system.time(nnetar(as.ts(taylor)))
##    user  system elapsed
##  50.261   0.017  50.328
system.time(local({
    setTimeLimit(elapsed = 4, transient = TRUE)
    nnetar(as.ts(taylor))
}))
## Error in NROW(x): reached elapsed time limit
## Timing stopped at: 5.144 0.005 5.155
The OpenCPU project contains an eval_fork function, which provides an alternative method of controlling runtime on non-Windows platforms. A slightly simplified version is reproduced below.
The core idea is to use fork to run the desired command in a separate process that is then "killed" after the desired amount of time has elapsed. This is accomplished with the parallel::mcparallel function, which forks the expression into a child process, and parallel::mccollect, which waits up to "timeout" seconds and returns either a result, if the expression completed in time, or "NULL".
eval_fork <- function(..., timeout = 60) {
    # fork a child process to evaluate the expression
    myfork <- parallel::mcparallel({
        eval(...)
    }, silent = FALSE)
    # wait at most `timeout` seconds for a result
    myresult <- parallel::mccollect(myfork, wait = FALSE, timeout = timeout)
    # kill the fork after collect has returned; the negative pid also
    # signals the child's process group
    tools::pskill(myfork$pid, tools::SIGKILL)
    tools::pskill(-1 * myfork$pid, tools::SIGKILL)
    # clean up the finished job
    parallel::mccollect(myfork, wait = FALSE)
    # a NULL result means the expression did not finish before the timeout
    if (is.null(myresult))
        stop("reached elapsed time limit")
    # unwrap after the NULL check, so a timeout is not confused with an
    # expression that legitimately returns NULL
    myresult <- myresult[[1]]
    # return the buffered result
    return(myresult)
}
system.time(eval_fork(nnetar(as.ts(taylor)), timeout = 4))
## Error in eval_fork(nnetar(as.ts(taylor)), timeout = 4): reached elapsed time limit
## Timing stopped at: 0.005 0.023 4.032
We can safely capture the results from eval_fork using try:
mynnet <- try(eval_fork(nnetar(as.ts(taylor)), timeout = 4),
              silent = TRUE)
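Because eval_fork signals an error when the time limit is reached, try returns an object of class "try-error" whenever the model did not finish in time. A minimal sketch of branching on that result follows; the fallback to a seasonal naive forecast is purely illustrative and not part of the original workflow:

if (inherits(mynnet, "try-error")) {
    # The model timed out: log it and fall back to a cheap alternative.
    # (Using snaive() here is an assumption for illustration only; it returns
    # forecasts directly rather than a fitted model.)
    message("nnetar() exceeded its time budget; using snaive() forecasts instead")
    mynnet_fc <- snaive(as.ts(taylor))
}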
These techniques make it possible to meet consistent operational requirements in an environment of changing data and computationally unpredictable algorithms. In the next post on R in Production, we'll talk more about how to use try and similar functions to capture and safely handle warnings and errors.