by Joseph Rickert

While preparing for the DataWeek R Bootcamp that I conducted this week, I came across the following gem. This code, based directly on a Max Kuhn presentation from a couple of years back, compares the efficacy of two machine learning models on a training data set.

```r
#-----------------------------------------
# SET UP THE PARAMETER SPACE SEARCH GRID
ctrl <- trainControl(method="repeatedcv",          # use repeated 10-fold cross-validation
                     repeats=5,                    # do 5 repetitions of 10-fold cv
                     summaryFunction=twoClassSummary, # use AUC to pick the best model
                     classProbs=TRUE)

# Note that the default search grid selects 3 values of each tuning parameter
grid <- expand.grid(.interaction.depth = seq(1,7,by=2), # look at tree depths from 1 to 7
                    .n.trees=seq(10,100,by=5),          # let iterations go from 10 to 100
                    .shrinkage=c(0.01,0.1))             # try 2 values of the learning rate

# BOOSTED TREE MODEL
set.seed(1)
names(trainData)
trainX <- trainData[,4:61]

registerDoParallel(4)        # register a parallel backend for train
getDoParWorkers()

system.time(gbm.tune <- train(x=trainX, y=trainData$Class,
                              method = "gbm",
                              metric = "ROC",
                              trControl = ctrl,
                              tuneGrid = grid,
                              verbose = FALSE))

#---------------------------------
# SUPPORT VECTOR MACHINE MODEL
set.seed(1)
registerDoParallel(4,cores=4)
getDoParWorkers()

system.time(
  svm.tune <- train(x=trainX, y=trainData$Class,
                    method = "svmRadial",
                    tuneLength = 9,                 # 9 values of the cost parameter
                    preProc = c("center","scale"),
                    metric = "ROC",
                    trControl = ctrl)               # same as for gbm above
)

#-----------------------------------
# COMPARE MODELS USING RESAMPLING
# Having set the seed to 1 before running gbm.tune and svm.tune, we have
# generated paired samples for comparing models using resampling.
#
# The resamples function in caret collates the resampling results from the two models
rValues <- resamples(list(svm=svm.tune, gbm=gbm.tune))
rValues$values

#---------------------------------------------
# BOXPLOTS COMPARING RESULTS
bwplot(rValues, metric="ROC")   # boxplot
```

After setting up a grid to search the parameter space of a model, the train() function from the caret package is used to train a generalized boosted regression model (gbm) and a support vector machine (svm). Setting the seed produces paired samples and enables the two models to be compared using the resampling technique described in Hothorn et al., *"The design and analysis of benchmark experiments"*, Journal of Computational and Graphical Statistics (2005), vol. 14 (3), pp. 675-699.
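The pairing deserves a brief note: caret draws its resampling indices from R's random number stream, so resetting the seed before each call to train() reproduces the same cross-validation folds. Here is a minimal base-R illustration of the idea (this is not caret's actual index-generation code, just a sketch of the mechanism):

```r
# Resetting the seed reproduces the same fold assignments, so both
# models are evaluated on identical resamples (base-R illustration only)
set.seed(1)
folds_gbm <- sample(rep(1:10, length.out = 100))  # folds "seen" by model 1

set.seed(1)
folds_svm <- sample(rep(1:10, length.out = 100))  # folds "seen" by model 2

identical(folds_gbm, folds_svm)  # TRUE: the resamples are paired
```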

The performance metric for the comparison is the area under the ROC curve. From examining the boxplots of the resampling distributions for the two models, it is apparent that, in this case, the gbm has the advantage.

Also, notice that the call to registerDoParallel() permits parallel execution of the training algorithms. (The task manager showed all four cores of my laptop maxed out at 100% utilization.)

I chose this example because I wanted to show programmers coming to R for the first time that the power and productivity of R comes not only from the large number of machine learning models implemented, but also from the tremendous amount of infrastructure that package authors have built up, making it relatively easy to do fairly sophisticated tasks in just a few lines of code.

All of the code for this example, along with the rest of my code from the DataWeek R Bootcamp, is available on GitHub.

by Seth Mottaghinejad

Let's review the Collatz conjecture, which says that given a positive integer n, the following recursive algorithm will always terminate:

- if n is 1, stop, otherwise recurse on the following
- if n is even, then divide it by 2
- if n is odd, then multiply it by 3 and add 1

In our last post, we created a function called 'cpp_collatz' which, given an integer vector, returns an integer vector of the corresponding stopping times (the number of iterations for the above algorithm to reach 1). For example, when n = 5 we have

5 -> 3*5+1 = 16 -> 16/2 = 8 -> 8/2 = 4 -> 4/2 = 2 -> 2/2 = 1,

giving us a stopping time of 5 iterations.
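As a refresher, here is a plain-R sketch of the stopping-time computation (the vectorized C++ version, 'cpp_collatz', appears later in this post; 'collatz_r' is just an illustrative name):

```r
# Stopping time of the Collatz iteration for a single positive integer
collatz_r <- function(n) {
  steps <- 0
  while (n != 1) {
    n <- if (n %% 2 == 0) n / 2 else 3 * n + 1
    steps <- steps + 1
  }
  steps
}

collatz_r(5)  # 5, matching the trace above
```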

In today's article, we want to perform some exploratory data analysis to see if we can find any pattern relating an integer to its stopping time. As part of the analysis, we will extract some features out of the integers that could help us explain any differences in stopping times.

Here are some examples of potentially useful features:

- Is the integer a prime number?
- What are its proper divisors?
- What is the remainder upon dividing the integer by some other number?
- What is the sum of its digits?
- Is the integer a triangular number?
- Is the integer a square number?
- Is the integer a pentagonal number?

In case you are encountering these terms for the first time, a triangular number is any number m that can be written as m = n(n+1)/2, where n is some positive integer. To determine if a number is triangular, we can rewrite the above equation as n^2 + n - 2m = 0 and use the quadratic formula to get n = (-1 + sqrt(1 + 8m))/2 and n = (-1 - sqrt(1 + 8m))/2. Since n must be a positive integer, we ignore the latter solution, leaving us with (-1 + sqrt(1 + 8m))/2.

Thus, if plugging m in the above formula results in an integer, we can say that m is a triangular number. Similar rules exist to determine if an integer is square or pentagonal, but I will refer you to Wikipedia for the details.
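A quick sketch of the triangular case (the function name and tolerance value here are my own; the same epsilon idea is applied to the real data later in the post):

```r
# m is triangular iff n = (-1 + sqrt(1 + 8m))/2 is a positive integer;
# compare to the nearest integer within a tolerance rather than using ==
is_triangular_m <- function(m, eps = 1e-9) {
  n <- (sqrt(1 + 8 * m) - 1) / 2
  abs(n - round(n)) < eps
}

is_triangular_m(c(1, 3, 6, 10, 11))  # TRUE TRUE TRUE TRUE FALSE
```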

For the purpose of conducting our analysis, we created some other functions in C++ and R to help us. Let's take a look at these functions:

```r
cat(paste(readLines(file.path(directory, "collatz.cpp")), collapse = "\n"))
```

```cpp
#include <Rcpp.h>
#include <vector>
using namespace Rcpp;

// [[Rcpp::export]]
int collatz(int nn) {
  int ii = 0;
  while (nn != 1) {
    if (nn % 2 == 0) nn /= 2;
    else nn = 3 * nn + 1;
    ii += 1;
  }
  return ii;
}

// [[Rcpp::export]]
IntegerVector cpp_collatz(IntegerVector ints) {
  return sapply(ints, collatz);
}

// [[Rcpp::export]]
bool is_int_prime(int nn) {
  if (nn < 1) stop("int must be greater than 1.");
  else if (nn == 1) return false;
  else if (nn % 2 == 0) return (nn == 2);
  for (int ii = 3; ii * ii <= nn; ii += 2) {
    if (nn % ii == 0) return false;
  }
  return true;
}

// [[Rcpp::export]]
LogicalVector is_prime(IntegerVector ints) {
  return sapply(ints, is_int_prime);
}

// [[Rcpp::export]]
NumericVector gen_primes(int n) {
  if (n < 1) stop("n must be greater than 0.");
  std::vector<int> primes;
  primes.push_back(2);
  int i = 3;
  while (primes.size() < unsigned(n)) {
    if (is_int_prime(i)) primes.push_back(i);
    i += 2;
  }
  return Rcpp::wrap(primes);
}

// [[Rcpp::export]]
NumericVector gen_divisors(int n) {
  if (n < 1) stop("n must be greater than 0.");
  std::vector<int> divisors;
  divisors.push_back(1);
  for (int i = 2; i <= sqrt(n); i++) {
    if (n % i == 0) {
      divisors.push_back(i);
      divisors.push_back(n / i);
    }
  }
  sort(divisors.begin(), divisors.end());
  divisors.erase(unique(divisors.begin(), divisors.end()), divisors.end());
  return Rcpp::wrap(divisors);
}

// [[Rcpp::export]]
bool is_int_perfect(int nn) {
  if (nn < 1) stop("int must be greater than 0.");
  return nn == sum(gen_divisors(nn));
}

// [[Rcpp::export]]
LogicalVector is_perfect(IntegerVector ints) {
  return sapply(ints, is_int_perfect);
}
```

Here is a list of other helper functions in collatz.cpp:

- 'is_prime': given an integer vector, returns a logical vector indicating which elements are prime
- 'gen_primes': given an integer n, generates the first n prime numbers
- 'gen_divisors': given an integer n, returns an integer vector of all its proper divisors
- 'is_perfect': given an integer vector, returns a logical vector indicating which elements are perfect numbers

```r
sum_digits <- function(x) {
  # returns the sum of the individual digits that make up x
  # x must be an integer vector
  f <- function(xx) sum(as.integer(strsplit(as.character(xx), "")[[1]]))
  sapply(x, f)
}
```

As you can guess, many of the above functions (such as 'is_prime' and 'gen_divisors') rely on loops, which makes C++ the ideal place to perform the computation. So we farmed out as much of the heavy-duty computation as possible to C++, leaving R with the task of processing and analyzing the resulting data.

Let's get started. We will perform the analysis on all integers below 10^5, since R is memory-bound and we can run into a bottleneck quickly. But next time, we will show you how to overcome this limitation using the 'RevoScaleR' package, which will allow us to scale the analysis to much larger integers.

One small caveat before we start: I enjoy dabbling in mathematics but I know very little about number theory. The analysis we are about to perform is not meant to be rigorous. Instead, we will attempt to approach the problem using EDA the same way that we approach any data-driven problem.

```r
maxint <- 10^5
df <- data.frame(int = 1:maxint)            # the original number
df <- transform(df, st = cpp_collatz(int))  # stopping times
df <- transform(df,
                sum_digits = sum_digits(int),
                inv_triangular = (sqrt(8*int + 1) - 1)/2,  # inverse triangular number
                inv_square = sqrt(int),                    # inverse square number
                inv_pentagonal = (sqrt(24*int + 1) + 1)/6  # inverse pentagonal number
                )
```

To determine whether a numeric value is an integer, we need to be careful not to use the '==' operator in R, as it is not guaranteed to work because of minute rounding errors. Here's an example:

.3 == .4 - .1 # we expect to get TRUE but get FALSE instead

[1] FALSE

The solution is to check whether the absolute difference between the above numbers is smaller than some tolerance threshold.

eps <- 1e-9

abs(.3 - (.4 - .1)) < eps # returns TRUE

[1] TRUE

```r
df <- transform(df,
                is_triangular = abs(inv_triangular - as.integer(inv_triangular)) < eps,
                is_square = abs(inv_square - as.integer(inv_square)) < eps,
                is_pentagonal = abs(inv_pentagonal - as.integer(inv_pentagonal)) < eps,
                is_prime = is_prime(int),
                is_perfect = is_perfect(int)
                )
df <- df[ , names(df)[-grep("^inv_", names(df))]]
```

Finally, we will create a variable listing all of the integer's proper prime divisors. Every composite integer can be reconstructed out of these basic building blocks, a mathematical result known as the *unique factorization theorem*. We can use the function 'gen_divisors' to get a vector of an integer's proper divisors, and the 'is_prime' function to keep only the ones that are prime. Because the return object must be a singleton, we can then use 'paste' with the 'collapse' argument to join all of the prime divisors into a single comma-separated string.
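The code that builds this variable is not shown in the post; here is a pure-R sketch of the idea ('divisors_r' and 'is_prime_r' are illustrative stand-ins for the C++ 'gen_divisors' and 'is_prime' above):

```r
# Proper divisors of n (brute force; the C++ version is much faster)
divisors_r <- function(n) {
  d <- 1:n
  setdiff(d[n %% d == 0], n)
}

# Simple primality test
is_prime_r <- function(n) {
  if (n < 2) return(FALSE)
  if (n < 4) return(TRUE)  # 2 and 3 are prime
  all(n %% 2:floor(sqrt(n)) != 0)
}

# Join the prime proper divisors into one comma-separated string
prime_divs_string <- function(n) {
  divs <- divisors_r(n)
  primes <- divs[vapply(divs, is_prime_r, logical(1))]
  paste(primes, collapse = ",")
}

prime_divs_string(36084)  # "2,3,31,97"
```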

Lastly, on its own, we may not find the variable 'all_prime_divs' especially helpful. Instead, we can generate multiple flag variables out of it, each indicating whether or not a specific prime number is a divisor of the integer. We will generate 25 flag variables, one for each of the first 25 prime numbers.
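A sketch of how those flags might be generated (the helper name is mine; divisibility by p is checked directly with the modulo operator, which is equivalent to scanning 'all_prime_divs'):

```r
# Add one is_div_by_p logical column per prime p
first_primes <- c(2, 3, 5, 7, 11)  # the post uses the first 25 primes
make_flags <- function(df, primes) {
  for (p in primes) {
    df[[paste0("is_div_by_", p)]] <- df$int %% p == 0
  }
  df
}

toy <- make_flags(data.frame(int = c(10, 21, 22)), first_primes)
toy$is_div_by_2  # TRUE FALSE TRUE
toy$is_div_by_7  # FALSE TRUE FALSE
```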

There are many more features that we could extract from the underlying integers, but we will stop here. As we mentioned earlier, our goal is not to produce rigorous mathematical work, but to show you how the tools of data analysis can be brought to bear on a problem of this nature.

Here's a sample of 10 rows from the data:

```r
df[sample.int(nrow(df), 10), ]
```

```
        int  st sum_digits is_triangular is_square is_pentagonal is_prime
21721 21721 162         13         FALSE     FALSE         FALSE    FALSE
36084 36084 142         21         FALSE     FALSE         FALSE    FALSE
40793 40793 119         23         FALSE     FALSE         FALSE    FALSE
3374   3374  43         17         FALSE     FALSE         FALSE    FALSE
48257 48257  44         26         FALSE     FALSE         FALSE    FALSE
42906 42906  49         21         FALSE     FALSE         FALSE    FALSE
37283 37283  62         23         FALSE     FALSE         FALSE    FALSE
55156 55156  60         22         FALSE     FALSE         FALSE    FALSE
6169   6169 111         22         FALSE     FALSE         FALSE    FALSE
77694 77694 231         33         FALSE     FALSE         FALSE    FALSE
      is_perfect all_prime_divs is_div_by_2 is_div_by_3 is_div_by_5 is_div_by_7
21721      FALSE       7,29,107       FALSE       FALSE       FALSE        TRUE
36084      FALSE      2,3,31,97        TRUE        TRUE       FALSE       FALSE
40793      FALSE         19,113       FALSE       FALSE       FALSE       FALSE
3374       FALSE        2,7,241        TRUE       FALSE       FALSE        TRUE
48257      FALSE      11,41,107       FALSE       FALSE       FALSE       FALSE
42906      FALSE       2,3,7151        TRUE        TRUE       FALSE       FALSE
37283      FALSE        23,1621       FALSE       FALSE       FALSE       FALSE
55156      FALSE        2,13789        TRUE       FALSE       FALSE       FALSE
6169       FALSE         31,199       FALSE       FALSE       FALSE       FALSE
77694      FALSE     2,3,23,563        TRUE        TRUE       FALSE       FALSE
      is_div_by_11 is_div_by_13 is_div_by_17 is_div_by_19 is_div_by_23
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE         TRUE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257         TRUE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE         TRUE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE         TRUE
      is_div_by_29 is_div_by_31 is_div_by_37 is_div_by_41 is_div_by_43
21721         TRUE        FALSE        FALSE        FALSE        FALSE
36084        FALSE         TRUE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE         TRUE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE         TRUE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_47 is_div_by_53 is_div_by_59 is_div_by_61 is_div_by_67
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_71 is_div_by_73 is_div_by_79 is_div_by_83 is_div_by_89
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_97
21721        FALSE
36084         TRUE
40793        FALSE
3374         FALSE
48257        FALSE
42906        FALSE
37283        FALSE
55156        FALSE
6169         FALSE
77694        FALSE
```

We can now move on to looking at various statistical summaries to see if we notice any differences in stopping times (our response variable) when we break up the data in different ways. We will look at the counts, mean, median, standard deviation, and trimmed mean (after throwing out the highest 10 percent) of the stopping times, as well as the correlation between stopping times and the integers. This is by no means a comprehensive list, but it can serve as guidance for deciding which direction to go next.

```r
my_summary <- function(df) {
  res <- with(df, data.frame(
    count = length(st),
    mean_st = mean(st),
    median_st = median(st),
    tmean_st = mean(st[st < quantile(st, .9)]),  # trimmed mean
    sd_st = sd(st),
    cor_st_int = cor(st, int, method = "spearman")
  ))
  res
}
```

To create the above summaries broken up by the flag variables in the data, we will use the 'ddply' function from the 'plyr' package. For example, calling 'ddply' with 'my_summary' and the grouping variable 'is_prime' gives us the summaries for primes and non-primes separately.
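That call would look like `ddply(df, ~ is_prime, my_summary)`. For readers without 'plyr' installed, the same grouping can be sketched in base R ('grouped_summary' is an illustrative name, and only a couple of the summaries are computed):

```r
# Base-R sketch of ddply(df, ~ flag, my_summary): split by the flag,
# summarize each piece, and bind the rows back together
grouped_summary <- function(df, flag) {
  pieces <- split(df, df[[flag]])
  rows <- lapply(names(pieces), function(lev) {
    d <- pieces[[lev]]
    data.frame(level = lev, count = nrow(d), mean_st = mean(d$st))
  })
  do.call(rbind, rows)
}

# toy illustration
toy <- data.frame(st = c(1, 2, 3, 10), is_prime = c(TRUE, TRUE, FALSE, FALSE))
grouped_summary(toy, "is_prime")
#   level count mean_st
# 1 FALSE     2     6.5
# 2  TRUE     2     1.5
```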

To avoid having to manually type every formula, we can pull the flag variables from the data set, generate the strings that will make up the formula, wrap it inside 'as.formula' and pass it to 'ddply'.

```r
flags <- names(df)[grep("^is_", names(df))]
res <- lapply(flags,
              function(nm) ddply(df, as.formula(sprintf('~ %s', nm)), my_summary))
names(res) <- flags
res
```

```
$is_triangular
  is_triangular count   mean_st median_st tmean_st    sd_st cor_st_int
1         FALSE 99554 107.58511        99 96.23178 51.36153  0.1700491
2          TRUE   446  97.11211        94 85.97243 51.34051  0.3035063

$is_square
  is_square count   mean_st median_st tmean_st    sd_st cor_st_int
1     FALSE 99684 107.59743        99 96.25033 51.34206  0.1696582
2      TRUE   316  88.91772        71 77.29577 55.43725  0.4274504

$is_pentagonal
  is_pentagonal count   mean_st median_st tmean_st    sd_st cor_st_int
1         FALSE 99742 107.56948      99.0 96.22466 51.35580  0.1702560
2          TRUE   258  95.52326      83.5 83.86638 53.91514  0.3336478

$is_prime
  is_prime count  mean_st median_st  tmean_st    sd_st cor_st_int
1    FALSE 90408 107.1227        99  95.95392 51.29145  0.1693013
2     TRUE  9592 111.4569       103 100.39643 51.90186  0.1953000

$is_perfect
  is_perfect count  mean_st median_st tmean_st    sd_st cor_st_int
1      FALSE 99995 107.5419        99 96.20092 51.36403  0.1707839
2       TRUE     5  37.6000        18 19.50000 45.06440  0.9000000

$is_div_by_2
  is_div_by_2 count  mean_st median_st  tmean_st    sd_st cor_st_int
1       FALSE 50000 113.5745       106 102.58324 52.25180  0.1720108
2        TRUE 50000 101.5023        94  90.68234 49.73778  0.1737786

$is_div_by_3
  is_div_by_3 count  mean_st median_st tmean_st    sd_st cor_st_int
1       FALSE 66667 107.4415        99 96.15832 51.30838  0.1705745
2        TRUE 33333 107.7322        99 96.94426 51.48104  0.1714432

$is_div_by_5
  is_div_by_5 count  mean_st median_st tmean_st    sd_st cor_st_int
1       FALSE 80000 107.5676        99 96.21239 51.36466  0.1713552
2        TRUE 20000 107.4214        99 96.13873 51.37206  0.1688929

. . .

$is_div_by_89
  is_div_by_89 count  mean_st median_st tmean_st    sd_st cor_st_int
1        FALSE 98877 107.5381        99 96.19965 51.37062  0.1707754
2         TRUE  1123 107.5628        97 96.02096 50.97273  0.1803568

$is_div_by_97
  is_div_by_97 count  mean_st median_st  tmean_st    sd_st cor_st_int
1        FALSE 98970 107.4944        99  96.15365 51.36880 0.17187453
2         TRUE  1030 111.7660       106 101.35853 50.93593 0.07843996
```

As we can see from the above results, most comparisons appear to be non-significant (although, in mathematics, even tiny differences can be meaningful, therefore we will avoid relying on statistical significance here). Here's a summary of trends that stand out as we go over the results:

- On average, stopping times are slightly higher for prime numbers compared to composite numbers.
- On average, stopping times are slightly lower for triangular numbers, square numbers, and pentagonal numbers compared to their corresponding counterparts.
- Despite having lower average stopping times, triangular numbers, square numbers and pentagonal numbers are more strongly correlated with their stopping times than their corresponding counterparts. A larger sample size may help in this case.
- On average, odd numbers have a higher stopping time than even numbers.

This last item, however, could just be a restatement of the first point, since every prime number other than 2 is odd. Let's take a closer look:

```r
ddply(df, ~ is_prime + is_div_by_2, my_summary)
```

```
  is_prime is_div_by_2 count  mean_st median_st  tmean_st    sd_st cor_st_int
1    FALSE       FALSE 40409 114.0744       106 102.91178 52.32495  0.1658988
2    FALSE        TRUE 49999 101.5043        94  90.68434 49.73624  0.1737291
3     TRUE       FALSE  9591 111.4685       103 100.40796 51.89231  0.1950482
4     TRUE        TRUE     1   1.0000         1       NaN       NA         NA
```

When limiting the analysis to odd numbers only, prime numbers now have a lower average stopping time compared to composite numbers, which reverses the trend seen in 1.

And one final point: since having a prime number as a proper divisor is a proxy for being a composite number, it is difficult to read too much into whether divisibility by a specific prime number affects the average stopping time. In any case, no specific prime number stands out in particular. Once again, a larger sample size would give us more confidence in the results.

The above analysis is perhaps too crude and primitive to offer any significant lead, so here are some possible improvements:

- We could think of other features that we could add to the data.
- We could think of other statistical summaries to include in the analysis.
- We could try to scale up by looking at all integers from 1 to 1 billion instead of 1 to 100,000.

In the next post, we will see how we can use the 'RevoScaleR' package to go from an R 'data.frame' (memory-bound) to an XDF file (short for "external data frame"). In doing so, we will achieve the following improvements:

- as the data will no longer be bound by available memory, we will scale with the size of the data,
- since the XDF format is also a distributed data format, we will use the multiple cores available on a single machine to distribute the computation itself by having each core process a separate chunk of the data. On a cluster of machines, the same analysis could then be distributed over the different nodes of the cluster.

Hilary Parker has contributed a lovely article to *Significance*, the magazine of the American Statistical Association and the Royal Statistical Society, on using R to set your Google calendar to mark the time of sunsets. Hilary details the process in the article, but the basic idea is to use the sunrise.set function from the StreamMetabolism package to calculate sunset times at future dates, and then create an R function to write a file of calendar appointments which you can import into Google Calendar. (You can use a similar process to set up calendar events for anything you like — just adapt the R code accordingly.) Now that her phone buzzes to alert her to the upcoming sunset, Hilary has been able to capture some beautiful sunsets framed by the NYC skyline, like this one:

Hilary is a data analyst at Etsy, and writes interesting articles about R and statistics at her blog, Not So Standard Deviations. If — like Hilary — you use R at work, you should also check out her post, Writing an R package from scratch, which includes some great tips on organizing your R functions using personal packages. Hilary also has some great advice on testing R scripts for production use in her useR! 2014 poster, testdat: An R Package for Unit Testing of Tabular data.

Significance: Never miss another sunset with R

*by Nick Elprin, Co-Founder of Domino Data Lab*

We built a platform that lets analysts deploy R code to an HTTP server with one click, and we describe it in detail below. If you have ever wanted to invoke your R model with a simple HTTP call, without dealing with any infrastructure setup or asking for help from developers — imagine Heroku for your R code — we hope you’ll enjoy this.

**Introduction**

Across industries, analytical models are powering core business processes and applications as more companies realize that analytics are key to their competitiveness. R is particularly well suited to developing and expressing such models, but unfortunately, the final step of integrating R code into existing software systems remains difficult. This post describes our solution to this problem: “one-click” publishing of R code to an API server, allowing easy integration between R and other languages, and freeing data scientists to change their models on their own, without help from any developers or engineers.

Today, two problems — one technical, and one organizational — create friction when trying to integrate R code into existing software applications. First, while R is a great language for analytical code, most enterprise software systems are written in more general purpose languages, such as Java, PHP, C#, C++, or even data pipeline tools such as Informatica or Microsoft’s SSIS. Invoking R code from these languages requires some non-trivial technical work, or translation to another language. This leads to the second problem: in most companies, software engineering teams are separate from analytics teams, so when analysts need engineering help, they are forced to compete against other priorities, or they must do their own engineering. Even after an initial deployment of R code, when the model is updated, the deployment process must be repeated, resulting in a painful iteration cycle.

**A Solution: Domino and API Endpoints**

Domino is a platform for doing data science in the enterprise: it provides turnkey functionality for job distribution, version control, collaboration, and model deployment, so that data science teams can be productive without their own engineers and developers. We built our “API Endpoints” feature to address the use case I describe above, reducing the friction associated with integrating R (or Python) models into production systems. Here’s how it works:

Let’s say we are building a library for arithmetic. We have a file, arithmetic.R, with this code:

```r
add <- function(a, b) {
  a + b
}

multiply <- function(a, b) {
  a * b
}
```

Now we’d like to make that code accessible to external applications. You can upload this file to Domino (via our web interface, command-line client, or our R package). Once uploaded, you can define a new “API Endpoint” by specifying this file, and the name of the function to invoke when the API is used.

When we hit “publish,” Domino deploys the script to a low-latency server and listens for incoming HTTP requests. If your script performs any one-time initialization (for example, installing custom R packages, or calculating any model parameters), it would run once upon publishing. When a request comes in, Domino passes the request parameters to your specified R function, and returns the results.

It’s that simple. We can test this with a simple curl command (or the equivalent operation in any modern programming language)

```shell
curl -v -X POST http://rt.dominoup.com:9000/v1/nick/demo/rt \
     -H "Content-Type:application/json" \
     -d '{"parameters": [10,20] }'
```

Our two parameters (10, 20) are the inputs to the function. Domino pulls them out of the incoming HTTP request, passes them to the R function, and marshals the “result” value back in the HTTP response, along with some status info:

```
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< Content-Length: 70
<
* Connection #0 to host rt.dominoup.com left intact
{"runId":"53d3c2986fe0206fee536283","status":"Succeeded","result":200}
```

This technique works for more complex return types, as well, including strings and arrays of multiple values.

Now our R code is accessible via an HTTP call, providing a clean interface to any other system or language. Moreover, analysts can deploy updates to the R code on their own, enabling a faster iteration cycle.

**Training Steps with Scheduled Runs**

A common workflow in machine learning tasks is to create a training task, which might take hours to run, and a classify step, which uses the output of the training step to classify inputs very quickly. Domino lets you schedule recurring tasks, which can automatically publish their results to an API Endpoint you have defined. In this example, we have a “training_task.R” script that runs a computationally intensive regression and saves the resulting parameters to an RData file.

We can then create a classify function (and API Endpoint) which reads in our parameters and quickly classifies incoming requests. Because our scheduled task is set to update the API Endpoint upon completion, the API will get the latest training parameters each night when the training script runs.
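The two-step pattern can be sketched in a few lines of R (file and variable names here are my own; the post's actual training_task.R writes an RData file, for which saveRDS/readRDS is a close analogue):

```r
# training step -- run on a schedule, persist the fitted parameters
params <- list(intercept = 0.5, slope = 2)  # stand-in for a slow regression fit
saveRDS(params, "model_params.rds")

# classify step -- load the latest parameters and score requests quickly
params <- readRDS("model_params.rds")
classify <- function(x) params$intercept + params$slope * x
classify(3)  # 6.5
```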

**Change Management: Production vs Development**

Once your R code is consumed by production processes, it’s critical to be careful when changing it. Domino facilitates this by automatically keeping a revisioned history of all your files, and by keeping your API Endpoints “pinned” to specific revisions of your code until you explicitly update it. As an example, this screenshot shows Domino’s view of the history of our API releases. We could edit our R files all we want, but the most recent API release (v8) would remain pointed to the exact version of the code we had when we published it. And if we ever needed to debug a production issue from a past release, we could go back to the exact version of our code associated with earlier releases simply by clicking the “commit” link.

**Implementation**

Domino actually lets you publish R or Python code as an API Endpoint, but since this is an R blog, I will focus on how we have implemented this for R code. Our API functionality is driven by two powerful tools under the hood:

- Rserve, along with its Java client, lets us programmatically control an in-memory R session.
- Docker lets us isolate users’ code in separate containers, so they cannot interfere with each other or with our host machines.

With these tools, the implementation of our API functionality works roughly like this:

When we “publish” a new endpoint, we get a source_file and a function_name from the user. Then we:

- Create a new Docker container
- Write a conf file telling Rserve to source the source_file script
- In the container, start an Rserve session and tell it to use the conf file we just created

To start Rserve, we invoke R from the command line (e.g., R --no-save --no-environ) and pass it the following STDIN content:

```r
library(Rserve)
Rserve(debug=TRUE, port=$rservePort, args='--vanilla --RS-conf $conf_file')
```

Where $conf_file is the path to a file containing:

```
remote enable
source $source_file
```

By deploying the script and running any initialization when the user hits “publish,” we minimize the work necessary each time the API is invoked. In fact, the overhead of an API request to this published endpoint is only about 150ms. This is critical to support production applications with low latency requirements.

When an HTTP request comes in, we can then do the following:

- Look up the Docker container that corresponds to the requested endpoint
- Build a command_string from the function_name corresponding to the requested endpoint, and the POST parameters from the HTTP request. E.g., “multiply(10, 20)”
- Invoke the Rserve API to execute this command in the process running inside the specific Docker container. In Scala, this is how that looks:

```scala
import org.rosuda.REngine.REXP
import org.rosuda.REngine.Rserve.RConnection

val c = new RConnection(null, 9999) // second parameter is the port
try {
  new RResult(c.eval(command_string))
} finally {
  c.close()
}
```

RConnection.eval() returns an REXP object. Our helper class, RResult, knows how to translate this object into appropriate JSON, so that we can return it in our HTTP response:

```scala
class RResult(val rexp: REXP) {
  def toJson = {
    if (rexp.isString) {
      Json.toJson(rexp.asString)
    } else if (rexp.isVector && rexp.length > 1) {
      Json.toJson(rexp.asDoubles())
    } else if (rexp.isList) {
      val list = rexp.asList
      if (list.isNamed) {
        JsObject(
          for {
            key <- list.keys
          } yield {
            key -> Json.toJson(new RResult(list.at(key)))
          }
        )
      } else {
        JsArray(
          for {
            i <- 0 until list.capacity
          } yield {
            Json.toJson(new RResult(list.at(i)))
          }
        )
      }
    } else if (rexp.isNumeric) {
      JsNumber(rexp.asDouble)
    } else {
      JsString(rexp.asNativeJavaObject.toString)
    }
  }
}
```

**Conclusion**

We are excited to see more organizations making analytical models a central part of their business processes, and we hope that Domino can empower data scientists to accelerate this trend. We are always eager for feedback, so please check out our free trial, including API Endpoints, and let us know what you think at info@dominodatalab.com or on Twitter at @dominodatalab.

*The latest in a series by Daniel Hanson*

**Introduction**

Correlations between holdings in a portfolio are of course a key component in financial risk management. Borrowing a tool common in fields such as bioinformatics and genetics, we will look at how to use heat maps in R for visualizing correlations among financial returns, and examine behavior in both a stable and down market.

While base R contains its own heatmap(.) function, the reader will likely find the heatmap.2(.) function in the R package gplots to be a bit more user friendly. A very nicely written companion article entitled A short tutorial for decent heat maps in R (Sebastian Raschka, 2013), which covers more details and features, is available on the web; we will also refer to it in the discussion below.

We will present the topic in the form of an example.

**Sample Data**

As in previous articles, we will make use of R packages Quandl and xts to acquire and manage our market data. Here, in a simple example, we will use returns from the following global equity indices over the period 1998-01-05 to the present, and then examine correlations between them:

S&P 500 (US)

RUSSELL 2000 (US Small Cap)

NIKKEI (Japan)

HANG SENG (Hong Kong)

DAX (Germany)

CAC (France)

KOSPI (Korea)

First, we gather the index values and convert to returns:

```r
library(xts)
library(Quandl)

my_start_date <- "1998-01-05"
SP500.Q     <- Quandl("YAHOO/INDEX_GSPC",  start_date = my_start_date, type = "xts")
RUSS2000.Q  <- Quandl("YAHOO/INDEX_RUT",   start_date = my_start_date, type = "xts")
NIKKEI.Q    <- Quandl("NIKKEI/INDEX",      start_date = my_start_date, type = "xts")
HANG_SENG.Q <- Quandl("YAHOO/INDEX_HSI",   start_date = my_start_date, type = "xts")
DAX.Q       <- Quandl("YAHOO/INDEX_GDAXI", start_date = my_start_date, type = "xts")
CAC.Q       <- Quandl("YAHOO/INDEX_FCHI",  start_date = my_start_date, type = "xts")
KOSPI.Q     <- Quandl("YAHOO/INDEX_KS11",  start_date = my_start_date, type = "xts")

# Depending on the index, the final price for each day is either
# "Adjusted Close" or "Close Price". Extract this single column for each:
SP500     <- SP500.Q[,"Adjusted Close"]
RUSS2000  <- RUSS2000.Q[,"Adjusted Close"]
DAX       <- DAX.Q[,"Adjusted Close"]
CAC       <- CAC.Q[,"Adjusted Close"]
KOSPI     <- KOSPI.Q[,"Adjusted Close"]
NIKKEI    <- NIKKEI.Q[,"Close Price"]
HANG_SENG <- HANG_SENG.Q[,"Adjusted Close"]

# The xts merge(.) function will only accept two series at a time.
# We can, however, merge multiple columns by downcasting to *zoo* objects.
# Remark: "all = FALSE" uses an inner join to merge the data.
z <- merge(as.zoo(SP500), as.zoo(RUSS2000), as.zoo(DAX), as.zoo(CAC),
           as.zoo(KOSPI), as.zoo(NIKKEI), as.zoo(HANG_SENG), all = FALSE)

# Set the column names; these will be used in the heat maps:
myColnames <- c("SP500","RUSS2000","DAX","CAC","KOSPI","NIKKEI","HANG_SENG")
colnames(z) <- myColnames

# Cast back to an xts object:
mktPrices <- as.xts(z)

# Next, calculate log returns:
mktRtns <- diff(log(mktPrices), lag = 1)
head(mktRtns)
mktRtns <- mktRtns[-1, ]  # Remove resulting NA in the 1st row
```

**Generate Heat Maps**

As noted above, heatmap.2(.) is the function in the gplots package that we will use. For convenience, we’ll wrap this function inside our own generate_heat_map(.) function, as we will call this parameterization several times to compare market conditions.

As for the parameterization, the comments should be self-explanatory, but we’re keeping things simple by eliminating the dendrogram, and leaving out the trace lines inside the heat map and the density plot inside the color legend. Note also the setting Rowv = FALSE; this ensures the ordering of the rows and columns remains consistent from plot to plot. We’re also just using the default color settings; for customized colors, see the Raschka tutorial linked above.

require(gplots)

generate_heat_map <- function(correlationMatrix, title)
{
  heatmap.2(x = correlationMatrix,        # the correlation matrix input
            cellnote = correlationMatrix, # places correlation value in each cell
            main = title,                 # heat map title
            symm = TRUE,                  # configure diagram as standard correlation matrix
            dendrogram = "none",          # do not draw a row dendrogram
            Rowv = FALSE,                 # keep ordering consistent
            trace = "none",               # turns off trace lines inside the heat map
            density.info = "none",        # turns off density plot inside color legend
            notecol = "black")            # set font color of cell labels to black
}

Next, let’s calculate three correlation matrices using the data we have obtained:

- Correlations based on the entire data set from 1998-01-05 to the present
- Correlations of market indices during a reasonably calm period -- January through December 2004
- Correlations of falling market indices in the midst of the financial crisis - October 2008 through May 2009
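The code computing these three matrices did not survive extraction; a minimal sketch, assuming the mktRtns returns object built above (the names corr1, corr2, and corr3 match the calls that follow, while the use of cor() over xts date-range subsets and the rounding are illustrative):

```r
# Sketch: the three correlation matrices, via xts date-range subsetting.
# Rounding keeps the heat map cell labels readable.
corr1 <- round(cor(mktRtns), 2)                     # full history: 1998 - present
corr2 <- round(cor(mktRtns["2004-01/2004-12"]), 2)  # calm markets: 2004
corr3 <- round(cor(mktRtns["2008-10/2009-05"]), 2)  # financial crisis
```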

Now, let’s call our heat map function using the total market data set:

generate_heat_map(corr1, "Correlations of World Market Returns, Jan 1998 - Present")

And then, examine the result:

As expected, we trivially have correlations of 100% down the main diagonal. Note that, as shown in the color key, the darker the color, the lower the correlation. Using the parameters of the heatmap.2(.) function, we set the title with the main = title setting, and the correlations are shown in black thanks to the notecol="black" setting.

Next, let’s look at a period of relative calm in the markets, namely the year 2004:

generate_heat_map(corr2, "Correlations of World Market Returns, Jan - Dec 2004")

This gives us:


Note that in this case, a glance at the darker colors in each of the cells shows that we have even lower correlations than those from our entire data set. This may of course be verified by comparing the numerical values.

Finally, let’s look at the opposite extreme, during the upheaval of the financial crisis in 2008-2009:

generate_heat_map(corr3, "Correlations of World Market Returns, Oct 2008 - May 2009")

This yields the following heat map:

Note that in this case, again just at first glance, we can tell the correlations have increased compared to 2004, by the colors changing from dark to light nearly across the board. While there are some correlations that do not increase all that much, such as the SP500/Nikkei and the Russell 2000/Kospi values, there are others across international and capitalization categories that jump quite significantly, such as the SP500/Hang Seng correlation going from about 21% to 41%, and that of the Russell 2000/DAX moving from 43% to over 57%. So, in other words, portfolio diversification can take a hit in down markets.

**Conclusion**

In this example, we only looked at seven market indices, but for a closer look at how correlations were affected during 2008-09 -- and how heat maps among a greater number of market sectors compared -- the article entitled *Diversification is Broken* is a recommended and interesting read.

**Using package repositories to recreate the past, distribute the present, and protect against the future**

*by Gabriel Becker (@groundwalkergmb)*

**1. Have you ever needed to reach into the distant past …**to recreate a years old result? Take - as an arbitrary example - Anders and Huber's paper on Differential expression analysis for sequence count data from 2010. "2010", the excitable among you might exclaim, "were they programming their R scripts on punch-cards and running them on coal-powered computers??" Perhaps they were, my hyperbolic readers, but hardware is not our concern today.

Replicable scientific results are the engine for advancing human knowledge; non-replicable results, simply put, are not. This holds true regardless of how many eons - in computer years - have passed since the results were originally generated.

**2. They got what??**

Anders and Huber provide Sweave output (a PDF) that contains the code for the plots and output directly reported in their paper, as well as all the datasets used by the code. So let's reproduce their results! Getting the code into an executable form for this is painful, but that is a subject for another time.

Below is their code (I added the suppressMessages() call) for identifying differentially expressed genes for their fly data with a False Discovery Rate (FDR) cutoff of 0.1:

suppressMessages(library(DESeq))
countsTableFly <- read.delim( "fly_RNA_counts.tsv" )
condsFly <- c( "A", "A", "B", "B" )
# Add dummy names to avoid confusion later
rownames( countsTableFly ) <- paste( "Gene", 1:nrow(countsTableFly), sep="_" )

cdsFly <- newCountDataSet( countsTableFly, condsFly )
cdsFly <- estimateSizeFactors( cdsFly )
cdsFly <- estimateVarianceFunctions( cdsFly ) #oops! error!
resFly <- nbinomTest( cdsFly, "A", "B" )

Error: 'estimateVarianceFunctions' is defunct.
Use 'estimateDispersions' instead.
See help("Defunct")
Error in nbinomTest(cdsFly, "A", "B") : Call 'estimateDispersions' first.

It didn't work! We got an error when running their code. The Defunct error suggests that we use the estimateDispersions function instead of the removed estimateVarianceFunctions function, so let's do that …

359?!? The paper (and Sweave output) report that this code gives the result of 864!

Before we go lighting our pitchforks and sharpening our torches, however, we need to remember a simple fact: just because it doesn't work now doesn't mean it didn't work then.

At this point you may be wondering what I'm proposing. Are we supposed to take a quick trip back in time and rerun their code in 2010? Well, sort of. The GRANBase package - among many other things - provides a sort of time machine for your computing environment.

**3. Climbing in through the sessionInfo() window**

Anders and Huber provided a window into the computing environment they used in the form of sessionInfo() output in their Sweave output PDF. "Windows are great", I'm sure you're thinking, "except that I can't use them to get inside and so I'm left out in the cold!"

Well not with that attitude you can't. We're going to use GRANBase to knock out that window and install a door in its place. One that we - and everyone else - can use to reproduce Anders and Huber's results. No more pressing your face up against the glass, drooling at the delicious old software tantalizingly arranged on the other side.

First we have to get our hands on the right version of R - 2.12 in this case. That isn't the easiest thing to do, but for those of you who don't want to install and manage an old R installation you can use an Amazon EC2 instance with this lovingly hand-crafted AMI that contains R 2.12.1 and has switchr installed: ami-94a670fc

NOTE: the authors actually ran their code in 2.12.0, but in general we can usually assume that matching the major (2) and minor (12) versions is sufficient, as we see that it is in this case. The end result (run in R 2.12.1):

library(switchr)
doiRepo = "http://research-pub.gene.com/gran/10.1186/gb-2010-11-10-r106"
switchTo(doiRepo, name = "10.1186/gb-2010-11-10-r106")

Switched to the '10.1186/gb-2010-11-10-r106' computing environment.
54 packages are currently available.
Packages installed in your site library ARE suppressed.
To switch back to your previous environment type switchBack()

If we had never switched to a computing environment based on this particular repository within the version of R being used, the above statement would create the computing environment - by installing all the packages contained in that repository - and remember it for the future. I have spared you the dozens of lines of package installation output, because that's just the type of person I am. Now then, let's try that again.

suppressMessages(library(DESeq))
countsTableFly <- read.delim( "fly_RNA_counts.tsv" )
condsFly <- c( "A", "A", "B", "B" )
# Add dummy names to avoid confusion later
rownames( countsTableFly ) <- paste( "Gene", 1:nrow(countsTableFly), sep="_" )
cdsFly <- newCountDataSet( countsTableFly, condsFly )
cdsFly <- estimateSizeFactors( cdsFly )
cdsFly <- estimateVarianceFunctions( cdsFly ) # works in this environment
resFly <- nbinomTest( cdsFly, "A", "B" )
length( which( resFly$padj < .1 ) )

[1] 864

And just like that we've recreated their result. Now, if we were staying in the same R, we would revert to our original computing environment and go about our day.

switchBack()

Reverted to the 'original' computing environment.
33 packages are currently available.
Packages installed in your site library ARE NOT suppressed.
To switch back to your previous environment type switchBack()

**4. So what just happened?**

The short answer: there is a repository associated with the paper's DOI that contains all the exact package versions used by the authors. This repository encapsulates the R computing environment - up to the actual version of R used - that Anders and Huber used. And it renders that environment recreatable, and thus - along with the original data they provide - their results reproducible.

The repository mechanism provides a natural way to both define specific sets of package versions that make up an R computing environment and make them instantly distributable.

**5. Sure, but how did it get there?**

"Well that's great", you might be thinking, "when we have such a repository; but we're out of luck if the authors of an article don't create one at the time of publication." And you would have largely been right, until now. GRANBase can retrieve the source version of any non-base package that has ever been on CRAN or Bioconductor.

Allow me to repeat that for emphasis. GRANBase can retrieve and build sources **for any version of any non-base package that has ever been released** on CRAN or BioConductor. For example - assuming you have the svn command-line utility installed - this will get you the DESeq version Anders and Huber used:

Furthermore, GRANBase will happily parse sessionInfo() output - either the R object or the text - find and build source versions of the specified packages, and build the repository for you. That is how I got the repository I used to recreate Anders and Huber's work.

"The world is saved!", you're probably shouting, and "GRANBase for President!" You guys sure are excitable. I appreciate the sentiment, but I'm fairly certain that R packages can't hold elected office.

**6. So what was that switching stuff?**

The switchr package provides a formal abstraction for creating, managing, tracking, and switching between "R computing environments" built on top of R's existing library location mechanism. So instead of managing library paths directly, we switched to the appropriate computing environment, ran the authors' computations, confirmed that we got the same results, and then switched back to our normal R installation. Also, as we have seen, switchr is confirmed to work at least as far back as R 2.12.1 (2010), with earlier R versions likely supported but untested.

**7. But won't someone think of the hard drives?**

Every time a paper, blog post, or internal analysis gets done, we copy the tarballs to a new repository and move on with our day. Sounds good. "Madness!", comes the refrain, "Why would you copy a file 920345982430589423509823452340589345 times???" Well, first of all that number seems a bit high to me, even for popular packages. In principle, though, I agree (and thus so does GRANBase).

GRANBase gets around this with the concept of having many "virtual repositories". Virtual repositories are ones that have passed Kindergarten and thus know how to share. Specifically, they share tarballs of exact package versions, so that communally they only need one copy of each version of each package.

The details of how this sharing happens are both unimportant and out of scope here, but the take-away is that there is no duplication of files when many virtual repositories hosted in the same place share package versions. Plus, like the cooks at any reasonable diner, GRANBase can make your packages to order using the version location functionality I mentioned above. This means that with code, data, sessionInfo() output, and a suitable R installation, we can assess and preserve the reproducibility of **every R analysis run in the past, present, or future** ^.

*^Assuming no dependencies not encompassed by R packages*

**8. Other uses for this framework**

- Ensuring the R environment on your local machine is equivalent to the one on a particular computing cluster
- Ensuring everyone collaborating on package development is programming against the same API (set of dependency versions)
- Allowing package maintainers to easily recreate the exact environment a reported bug was detected in
- Having side-by-side release and devel installations of BioConductor

**9. Other things GRANBase does, very briefly**

*SHAMELESS PLUG*

**9.1 Risk reports**

- Compares your currently installed packages to currently available versions
- Assesses possible "ripple" effects of updating packages
- Parses NEWS - when possible - to summarize the number of fixed bugs you are subject to
- Helps reframe whether or not to upgrade as a strategic decision

**9.2 Incremental, multi-source Repository building**

- Build repositories using package sources from many disparate sources (SVN, GitHub, etc.) based on the concept of a package manifest
- Only rebuilds updated packages (version bump) and their reverse dependencies. No wasted testing of unaltered packages
- CI-ready design, we use Jenkins

*END SHAMELESS PLUG*

**10. Related work, and further reading**

See:

- Gentleman and Temple Lang for a discussion of why and how software dependencies should be distributed and packaged with dynamic documents.
- RStudio's packrat is an implementation of Gentleman and Temple Lang's ideas centered around RStudio's concept of an R Project.
- Andrie de Vries' miniCRAN R package explores the concept of creating frozen, self-consistent subsets of CRAN.

Starting with R 2.13, the Bioconductor project provides AMIs that encapsulate core BioC packages, but which also necessarily encapsulate historical R versions. And no, before you ask, starting at 2.13 isn't because there is a vast and highly prescient conspiracy against us … probably.

Dirk Eddelbuettel's work with Docker-based R builds offer the potential for another way of mitigating the pain of the Old-R-versions requirement, though currently that is not the focus of his work. (See ~pg 50)

**11. Acknowledgements**

Cory Barr created an earlier GRAN package which served as a stepping-off point for development of GRANBase.

Thanks to David Smith, Joseph Rickert and Revolution Analytics for allowing me use of their soapbox.

The GRANBase and switchr R packages are Copyright Genentech, Inc. and have been released under the Artistic 2.0 open-source license here and here, respectively.

GRANBase and switchr were developed by Gabriel Becker under the leadership of Michael Lawrence and Robert Gentleman at Genentech Research and Early Development.

Thanks to Michael and Robert for their advice and support throughout the project, and for allowing me to grow its scope beyond what was initially envisioned.

by Andrie de Vries

In my previous post I wrote about how to identify and visualize package dependencies. Within hours, Duncan Murdoch (a member of R-core) identified some discrepancies between my list of dependencies and the visualisation. Since then, I have fixed the discrepancies. In this blog post I attempt to clarify the issues involved in listing package dependencies.

In miniCRAN I expose two functions that provide information about dependencies:

- The function **pkgDep()** returns a character vector with the names of dependencies. Internally, pkgDep() is a wrapper around **tools::package_dependencies()**, a base R function that, well, tells you about package dependencies. My new function is in one way a convenience, but more importantly it sets different defaults (more about this later).
- The function **makeDepGraph()** creates an **igraph** representation of the dependencies.

Take a look at some examples. I illustrate with the package **chron**, because chron neatly illustrates the different roles of Imports, Suggests and Enhances:

- chron **Imports** the base packages **graphics** and **stats**. This means that chron internally makes use of graphics and stats and will always load these packages.
- chron **Suggests** the packages **scales** and **ggplot2**. This means that chron uses some functions from these packages in examples or in its vignettes. However, these functions are not necessary to use chron.
- chron **Enhances** the package **zoo**, meaning that it adds something to the zoo package. These enhancements are made available to you if you have zoo installed.

The function **pkgDep()** exposes not only these dependencies, but also all recursive dependencies. In other words, it answers the question of which packages need to be installed to satisfy all dependencies of dependencies.

This means that the algorithm is as follows:

- First retrieve a list of Suggests and Enhances, using a non-recursive dependency search
- Next, perform a recursive search for all Imports, Depends and LinkingTo

The resulting list of packages should then contain the complete list necessary to satisfy all dependencies. In code:

> library(miniCRAN)

> tags <- "chron"
> pkgDep(tags, suggests=FALSE, enhances=FALSE, includeBasePkgs = TRUE)
[1] "chron"    "graphics" "stats"

> pkgDep(tags, suggests = TRUE, enhances=FALSE)
 [1] "chron"        "RColorBrewer" "dichromat"    "munsell"      "plyr"         "labeling"
 [7] "colorspace"   "Rcpp"         "digest"       "gtable"       "reshape2"     "scales"
[13] "proto"        "MASS"         "stringr"      "ggplot2"

> pkgDep(tags, suggests = TRUE, enhances=TRUE)
 [1] "chron"        "RColorBrewer" "dichromat"    "munsell"      "plyr"         "labeling"
 [7] "colorspace"   "Rcpp"         "digest"       "gtable"       "reshape2"     "scales"
[13] "proto"        "MASS"         "stringr"      "lattice"      "ggplot2"      "zoo"


To create an igraph plot of the dependencies, you can use the function **makeDepGraph()** and plot the results:
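The code block itself did not survive extraction; a minimal sketch of such a call, assuming the chron example above (the exact arguments of the original post are not recoverable):

```r
# Sketch: dependency graph for chron, including suggested and enhanced
# packages so that zoo, scales and ggplot2 appear in the plot:
library(miniCRAN)
p <- makeDepGraph("chron", suggests = TRUE, enhances = TRUE)
set.seed(1)  # for a reproducible graph layout
plot(p)
```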


Note how the dependencies expand to zoo (enhanced), scales and ggplot (suggested) and then recursively from there to get all the Imports and LinkingTo dependencies.

In my previous post I tried to plot the most popular package tags on StackOverflow. Using the updated functionality in the miniCRAN functions, it is easier to understand the structure of the dependencies:

> tags <- c("ggplot2", "data.table", "plyr", "knitr",
+           "shiny", "xts", "lattice")
> pkgDep(tags, suggests = TRUE, enhances=FALSE)
 [1] "ggplot2"      "data.table"   "plyr"         "knitr"        "shiny"        "xts"
 [7] "lattice"      "digest"       "gtable"       "reshape2"     "scales"       "proto"
[13] "MASS"         "Rcpp"         "stringr"      "RColorBrewer" "dichromat"    "munsell"
[19] "labeling"     "colorspace"   "evaluate"     "formatR"      "highr"        "markdown"
[25] "mime"         "httpuv"       "caTools"      "RJSONIO"      "xtable"       "htmltools"
[31] "bitops"       "zoo"          "SparseM"      "survival"     "Formula"      "latticeExtra"
[37] "cluster"      "maps"         "sp"           "foreign"      "mvtnorm"      "TH.data"
[43] "sandwich"     "nlme"         "Matrix"       "bit"          "codetools"    "iterators"
[49] "timeDate"     "quadprog"     "Hmisc"        "BH"           "quantreg"     "mapproj"
[55] "hexbin"       "maptools"     "multcomp"     "testthat"     "mgcv"         "chron"
[61] "reshape"      "fastmatch"    "bit64"        "abind"        "foreach"      "doMC"
[67] "itertools"    "testit"       "rgl"          "XML"          "RCurl"        "Cairo"
[73] "timeSeries"   "tseries"      "its"          "fts"          "tis"          "KernSmooth"
> set.seed(1)
> plot(makeDepGraph(tags, includeBasePkgs=FALSE, suggests=TRUE, enhances=TRUE),
+      legendPosEdge = c(-1, -1), legendPosVertex = c(1, -1), vertex.size=10, cex=0.5)


After my previous post, Duncan Murdoch pointed out that the package **rgl**, suggested by **knitr**, appeared in the list, but not in the plot. This new version of the function fixes that bug, which was introduced because I retrieved the suggests dependencies incorrectly.

EDIT:

A few hours ago the miniCRAN package went live on CRAN. Find miniCRAN at http://cran.r-project.org/web/packages/miniCRAN/index.html

With the growing popularity of R, there is an associated increase in the popularity of online forums to ask questions. One of the most popular sites is StackOverflow, where more than 60 thousand questions have been asked and tagged to be related to R.

On the same page, you can also find related tags. Among the top 15 tags associated with R, several are also packages you can find on CRAN:

- ggplot2
- data.table
- plyr
- knitr
- shiny
- xts
- lattice

It is very easy to install these packages directly from CRAN using the R function install.packages(), but this will also install all of their dependencies.

This leads to the question: How can one determine all these dependencies?

It is possible to do this using the function available.packages() and then query the resulting object.
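As a rough sketch of that base R route (this is not the miniCRAN implementation), one could do something like:

```r
# Sketch: query recursive dependencies directly from the CRAN package
# database. The mirror URL is illustrative; any CRAN mirror works.
pdb  <- available.packages(repos = "https://cran.r-project.org")
deps <- tools::package_dependencies("ggplot2", db = pdb, recursive = TRUE)
length(deps[["ggplot2"]])
```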

But it is easier to answer this question using the functions in a new package, called miniCRAN, that I am working on. I have designed miniCRAN to allow you to create a mini version of CRAN behind a corporate firewall. You can use some of the functions in miniCRAN to list packages and their dependencies, in particular:

- pkgAvail()
- pkgDep()
- makeDepGraph()

I illustrate these functions in the following scripts.

Start by loading miniCRAN and retrieving the available packages on CRAN. Use the function pkgAvail() to do this:

library(miniCRAN)
pkgdata <- pkgAvail(repos = c(CRAN="http://cran.revolutionanalytics.com"),
                    type="source")
head(pkgdata[, c("Depends", "Suggests")])
##                                             Depends              Suggests
## A3          "R (>= 2.15.0), xtable, pbapply"         "randomForest, e1071"
## abc         "R (>= 2.10), nnet, quantreg, MASS"      NA
## abcdeFBA    "Rglpk,rgl,corrplot,lattice,R (>= 2.10)" "LIM,sybil"
## ABCExtremes "SpatialExtremes, combinat"              NA
## ABCoptim    NA                                       NA
## ABCp2       "MASS"                                   NA

Next, use the function pkgDep() to get dependencies of the 7 popular tags on StackOverflow:

tags <- c("ggplot2", "data.table", "plyr", "knitr", "shiny", "xts", "lattice")
pkgList <- pkgDep(tags, availPkgs=pkgdata, suggests=TRUE)
pkgList
##  [1] "abind"        "bit64"        "bitops"       "Cairo"
##  [5] "caTools"      "chron"        "codetools"    "colorspace"
##  [9] "data.table"   "dichromat"    "digest"       "evaluate"
## [13] "fastmatch"    "foreach"      "formatR"      "fts"
## [17] "ggplot2"      "gtable"       "hexbin"       "highr"
## [21] "Hmisc"        "htmltools"    "httpuv"       "iterators"
## [25] "itertools"    "its"          "KernSmooth"   "knitr"
## [29] "labeling"     "lattice"      "mapproj"      "maps"
## [33] "maptools"     "markdown"     "MASS"         "mgcv"
## [37] "mime"         "multcomp"     "munsell"      "nlme"
## [41] "plyr"         "proto"        "quantreg"     "RColorBrewer"
## [45] "Rcpp"         "RCurl"        "reshape"      "reshape2"
## [49] "rgl"          "RJSONIO"      "scales"       "shiny"
## [53] "stringr"      "testit"       "testthat"     "timeDate"
## [57] "timeSeries"   "tis"          "tseries"      "XML"
## [61] "xtable"       "xts"          "zoo"

Wow, look how these 7 packages pull in a combined list of 63 packages (including themselves)!

You can visualise these dependencies graphically by using the function makeDepGraph():

p <- makeDepGraph(pkgList, availPkgs=pkgdata)
library(igraph)
plotColours <- c("grey80", "orange")
topLevel <- as.numeric(V(p)$name %in% tags)
par(mai=rep(0.25, 4))
set.seed(50)
vColor <- plotColours[1 + topLevel]
plot(p, vertex.size=8, edge.arrow.size=0.5, vertex.label.cex=0.7,
     vertex.label.color="black", vertex.color=vColor)
legend(x=0.9, y=-0.9, legend=c("Dependencies", "Initial list"),
       col=c(plotColours, NA), pch=19, cex=0.9)
text(0.9, -0.75, expression(xts %->% zoo), adj=0, cex=0.9)
text(0.9, -0.8, "xts depends on zoo", adj=0, cex=0.9)
title("Package dependency graph")

So, if you wanted to install the 7 most popular R packages (according to StackOverflow), R will in fact download and install up to 63 different packages!

by Daniel Hanson

**Recap and Introduction**

Last time in part 1 of this topic, we used the xts and lubridate packages to interpolate a zero rate for every date over the span of 30 years of market yield curve data. In this article, we will look at how we can implement the two essential functions of a term structure: the forward interest rate, and the forward discount factor.

**Definitions and Notation**

We will apply a mix of notation adopted in the lecture notes Interest Rate Models: Introduction, pp 3-4, from the New York University Courant Institute (2005), along with chapter 1 of the book Interest Rate Models — Theory and Practice (2nd edition, Brigo and Mercurio, 2006). A presentation by Damiano Brigo from 2007, which covers some of the essential background found in the book, is available here, from the Columbia University website.

First, t ≧ 0 and T ≧ 0 represent time values in years.

P(t, T) represents the forward discount factor at time t ≦ T, where T ≦ 30 years (in our case), as seen at time = 0 (ie, our anchor date). In other words, again in US Dollar parlance, this means the value at time t of one dollar to be received at time T, based on continuously compounded interest. Note then that, trivially, we must have P(T, T) = 1.

R(t, T) represents the continuously compounded forward interest rate, as seen at time = 0, paid over the period [t, T]. This is also sometimes written as F(0; t, T) to indicate that this is the forward rate as seen at the anchor date (time = 0), but to keep the notation lighter, we will use R(t, T) as is done in the NYU notes.

We then have the following relationships between P(t, T) and R(t, T), based on the properties of continuously compounded interest:

P(t, T) = exp(-R(t, T)・(T - t)) (A)

R(t, T) = -log(P(t, T)) / (T - t) (B)

Finally, the interpolated market yield curve we constructed last time allows us to find the value of R(0, T) for any T ≦ 30. Then, since by properties of the exponential function we have

P(t, T) = P(0, T) / P(0, t) (C)

we can determine any discount factor P(t, T) for 0 ≦ t ≦ T ≦ 30, and therefore any R(t, T), as seen at time = 0.
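One way to see why (C) holds: by the definition of the forward rate, investing from time 0 to t at R(0, t) and then from t to T at R(t, T) must be equivalent to investing from 0 to T at R(0, T), that is,

exp(R(0, t)・t)・exp(R(t, T)・(T - t)) = exp(R(0, T)・T)

Taking reciprocals and applying (A) to each factor gives P(0, t)・P(t, T) = P(0, T), which is (C).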

**Converting from Dates to Year Fractions**

By now, one might be wondering -- when we constructed our interpolated market yield curve, we used actual dates, but here, we’re talking about time in units of years -- what’s up with that? The answer is that we need to convert from dates to year fractions. While this may seem like a rather trivial proposition -- for example, why not just divide the number of days between the start date and maturity date by 365.25 -- it turns out that, with financial instruments such as bonds, options, and futures, in practice we need to be much more careful. Each of these comes with a specified day count convention, and if not followed properly, it can result in the loss of millions for a trading desk.

For example, consider the Actual / 365 Fixed day count convention:

Year Fraction (ie, T - t) = (Days between Date1 and Date2) / 365

This is one commonly used convention and is very simple to calculate; however, for certain bond calculations, it can become much more complicated, as leap years are considered, as well as local holidays in the country in which the bond is traded, plus more esoteric conditions that may be imposed. To get an idea, look up day count conventions used for government bonds in various countries.
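As a quick illustration (the dates here are hypothetical), even the choice between dividing by 365 and by the naive 365.25 average moves the year fraction:

```r
# Illustrative only: year fractions for the same date pair under
# Actual / 365 Fixed versus a naive 365.25-day average year.
library(lubridate)
d1 <- ymd(20140514)
d2 <- ymd(20190514)
as.numeric((d2 - d1) / 365)    # Actual / 365 Fixed: about 5.0027
as.numeric((d2 - d1) / 365.25) # naive average:      about 4.9993
```

Differences of this size, applied to large notional amounts, are exactly why the convention in effect must be respected.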

In the book by Brigo and Mercurio noted above, the authors in fact replace the “T - t” expression with a function (tau) τ(t, T), which represents the difference in time based upon the day count convention in effect.

Equation (A) then becomes

P(t, T) = exp(-R(t, T)・ τ(t, T))

where τ(t, T) might be, for example, the Actual / 365 Fixed day count convention.

For the remainder of this article, we will implement the “T - t” above as a day count function, as demonstrated in the example to follow.

**Implementation in R**

We will first revisit the example from our previous article on interpolation of market zero rates, and then use this to demonstrate the implementation of term structure functions to calculate forward discount factors and forward interest rates.

*a) The setup from part 1*

Let’s first go back to the example from part 1 and construct our interpolated 30-year market yield curve, using cubic spline interpolation. Both the xts and lubridate packages need to be loaded. The code is republished here for convenience:

require(xts)
require(lubridate)

ad <- ymd(20140514, tz = "US/Pacific")
marketDates <- c(ad, ad + days(1), ad + weeks(1), ad + months(1),
                 ad + months(2), ad + months(3), ad + months(6),
                 ad + months(9), ad + years(1), ad + years(2),
                 ad + years(3), ad + years(5), ad + years(7),
                 ad + years(10), ad + years(15), ad + years(20),
                 ad + years(25), ad + years(30))

# Use substring(.) to get rid of "UTC"/time zone after the dates
marketDates <- as.Date(substring(marketDates, 1, 10))

# Convert percentage formats to decimal by multiplying by 0.01:
marketRates <- c(0.0, 0.08, 0.125, 0.15, 0.20, 0.255, 0.35, 0.55, 1.65,
                 2.25, 2.85, 3.10, 3.35, 3.65, 3.95, 4.65, 5.15, 5.85) * 0.01
numRates <- length(marketRates)
marketData.xts <- as.xts(marketRates, order.by = marketDates)

createEmptyTermStructureXtsLub <- function(anchorDate, plusYears)
{
  # anchorDate is a lubridate here:
  endDate <- anchorDate + years(plusYears)
  numDays <- endDate - anchorDate

  # We need to convert anchorDate to a standard R date to use
  # the "+ 0:numDays" operation.
  # Also, note that we need a total of numDays + 1
  # in order to capture both end points.
  xts.termStruct <- xts(rep(NA, numDays + 1),
                        as.Date(anchorDate) + 0:numDays)
  return(xts.termStruct)
}

termStruct <- createEmptyTermStructureXtsLub(ad, 30)
for(i in (1:numRates)) termStruct[marketDates[i]] <-
  marketData.xts[marketDates[i]]

termStruct.spline.interpolate <- na.spline(termStruct, method = "hyman")
colnames(termStruct.spline.interpolate) <- "ZeroRate"

*b) Check the plot*

plot(x = termStruct.spline.interpolate[, "ZeroRate"], xlab = "Time",
     ylab = "Zero Rate",
     main = "Interpolated Market Zero Rates 2014-05-14 -
             Cubic Spline Interpolation",
     ylim = c(0.0, 0.06), major.ticks = "years",
     minor.ticks = FALSE, col = "darkblue")

This gives us a reasonably smooth curve, preserving the monotonicity of our data points:

*c) Implement functions for discount factors and forward rates*

We will now implement these functions, utilizing equations (A), (B), and (C) above. We will also take advantage of the functional programming feature in R, by incorporating the Actual / 365 Fixed day count as a functional argument, as an example. One could of course implement any other day count convention as a function of two lubridate dates, and pass it in as an argument.

First, let’s implement the Actual / 365 Fixed day count as a function:

# Simple example of a day count function: Actual / 365 Fixed
# date1 and date2 are assumed to be lubridate dates, so that we can
# easily carry out the subtraction of two dates.
dayCountFcn_Act365F <- function(date1, date2)
{
  yearFraction <- as.numeric((date2 - date1)/365)
  return(yearFraction)
}

Next, since the forward rate R(t, T) depends on the forward discount factor P(t, T), let’s implement the latter first:

# date1 and date2 are again assumed to be lubridate dates.
fwdDiscountFactor <- function(anchorDate, date1, date2, xtsMarketData,
                              dayCountFunction)
{
  # Convert lubridate dates to base R dates in order to use as xts indices.
  xtsDate1 <- as.Date(date1)
  xtsDate2 <- as.Date(date2)

  if((xtsDate1 > xtsDate2) | xtsDate2 > max(index(xtsMarketData)) |
     xtsDate1 < min(index(xtsMarketData)))
  {
    stop("Error in date order or range")
  }

  # 1st, get the corresponding market zero rates from our
  # interpolated market rate curve:
  rate1 <- as.numeric(xtsMarketData[xtsDate1])  # R(0, T1)
  rate2 <- as.numeric(xtsMarketData[xtsDate2])  # R(0, T2)

  # P(0, T) = exp(-R(0, T) * (T - 0))  (A), with t = 0 <=> anchorDate
  discFactor1 <- exp(-rate1 * dayCountFunction(anchorDate, date1))
  discFactor2 <- exp(-rate2 * dayCountFunction(anchorDate, date2))

  # P(t, T) = P(0, T) / P(0, t)  (C), with t <=> date1 and T <=> date2
  fwdDF <- discFactor2/discFactor1
  return(fwdDF)
}

Finally, we can then write a function to compute the forward interest rate:

# date1 and date2 are assumed to be lubridate dates here as well.
fwdInterestRate <- function(anchorDate, date1, date2, xtsMarketData, dayCountFunction)
{
  if(date1 == date2) {
    fwdRate <- 0.0 # the trivial case
  } else {
    fwdDF <- fwdDiscountFactor(anchorDate, date1, date2,
                               xtsMarketData, dayCountFunction)
    # R(t, T) = -log(P(t, T)) / (T - t) (B)
    fwdRate <- -log(fwdDF)/dayCountFunction(date1, date2)
  }
  return(fwdRate)
}

*d) Calculate discount factors and forward interest rates*

As an example, suppose we want to get the five year forward three-month discount factor and interest rate:

# Five year forward 3-month discount factor and forward rate:
date1 <- anchorDate + years(5)
date2 <- date1 + months(3)

fwdDiscountFactor(anchorDate, date1, date2, termStruct.spline.interpolate,
                  dayCountFcn_Act365F)
fwdInterestRate(anchorDate, date1, date2, termStruct.spline.interpolate,
                dayCountFcn_Act365F)

# Results are:
# [1] 0.9919104
# [1] 0.03222516

We can also check the trivial case for P(T, T) and R(T, T), where we get 1.0 and 0.0 respectively, as expected:

# Trivial case:
fwdDiscountFactor(anchorDate, date1, date1, termStruct.spline.interpolate,
                  dayCountFcn_Act365F) # returns 1.0
fwdInterestRate(anchorDate, date1, date1, termStruct.spline.interpolate,
                dayCountFcn_Act365F) # returns 0.0

Finally, we can verify that we can recover the market rates at various points along the curve; here, we look at 1Y and 30Y, and can check that we get 0.0165 and 0.0585, respectively:

# Check that we recover market data points:
oneYear <- anchorDate + years(1)
thirtyYears <- anchorDate + years(30)

fwdInterestRate(anchorDate, anchorDate, oneYear,
                termStruct.spline.interpolate,
                dayCountFcn_Act365F) # returns 1.65%
fwdInterestRate(anchorDate, anchorDate, thirtyYears,
                termStruct.spline.interpolate,
                dayCountFcn_Act365F) # returns 5.85%

**Concluding Remarks**

We have shown how one can implement a term structure of interest rates utilizing tools available in the R packages lubridate and xts. We have, however, limited the example to interpolation within the 30 year range of given market data without discussing extrapolation in cases where forward rates are needed beyond the endpoint. This case does arise in risk management for longer term financial instruments such as variable annuity and life insurance products, for example. One simple-minded -- but sometimes used -- method is to fix the zero rate that is given at the endpoint for all dates beyond that point. A more sophisticated approach is to use the financial cubic spline method as described in the paper by Adams (2001), cited in part 1 of the current discussion. However, xts unfortunately does not provide this interpolation method for us out of the box. Writing our own implementation might make for an interesting topic for discussion down the road -- something to keep in mind. For now, however, we have a working term structure implementation in R that we can use to demonstrate derivatives pricing and risk management models in upcoming articles.
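The simple flat-extrapolation method mentioned above can be sketched in a few lines of xts/zoo code. This is a hypothetical helper, not part of the original implementation: it assumes an interpolated curve like termStruct.spline.interpolate, extends the index past the last market date, and carries the endpoint zero rate forward with na.locf() (last observation carried forward).

```r
library(xts)

# Sketch: flat extrapolation of a zero curve beyond its last market date.
# xtsCurve is a single-column xts of zero rates; extraDays is the number of
# calendar days to extend past the endpoint.
flatExtrapolateXts <- function(xtsCurve, extraDays)
{
  lastDate <- max(index(xtsCurve))
  # Empty tail indexed by the dates beyond the endpoint.
  xtsTail <- xts(rep(NA_real_, extraDays), lastDate + 1:extraDays)
  extended <- rbind(xtsCurve, xtsTail)
  # Fill the tail with the endpoint zero rate.
  return(na.locf(extended))
}
```

Every date beyond the original endpoint then simply repeats the 30-year zero rate, which is crude but keeps the curve defined for long-dated cash flows.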

by Joseph Rickert

Predictive Modeling, or “Predictive Analytics”, the term that appears to be gaining traction in the business world, is driving the new “Big Data” information economy. Predictably, there is no shortage of material to be found on this subject. Some discussion of predictive modeling is sure to be found in any reasonably technical presentation of business decision making, forecasting, data mining, machine learning, data science, statistical inference or just plain science. There are hundreds of books that have something worthwhile to say about predictive modeling. However, in my judgment, *Applied Predictive Modeling* by Max Kuhn and Kjell Johnson (Springer 2013) ought to be at the very top of the reading list of anyone who has some background in statistics, who is serious about building predictive models, and who appreciates rigorous analysis, careful thinking and good prose.

The authors begin their book by stating that “the practice of predictive modeling defines the process of developing a model in a way that we can understand and quantify the model’s prediction accuracy on future, yet-to-be-seen data”. They emphasize that predictive modeling is primarily concerned with making accurate predictions and not necessarily with building models that are easily interpreted. Nevertheless, they are careful to point out that “the foundation of an effective predictive model is laid with intuition and deep knowledge of the problem context”. The book is a masterful exposition of the modeling process delivered at a high level of play, with the authors gently pushing the reader to understand the data, to carefully select models, to question and evaluate results, to quantify the accuracy of predictions and to characterize their limitations.

Kuhn and Johnson are intense but not oppressive. They come across like coaches who really, really want you to be able to do this stuff. They write simply and with great clarity. However, the material is not easy. I frequently found myself rereading a passage and almost always found it to be worth the effort. This mostly happened when reading a careful discussion of a familiar topic (i.e. something I thought I understood). For example, Chapter 14 on Classification Trees and Rule-Based Models contains what I thought to be an illuminating discussion of the difference between building trees with grouped categories and taking the trouble to decompose a categorical predictor into binary dummy variables, in effect forcing binary splits for the categories.

*Applied Predictive Modeling* begins with a chapter that introduces the case studies referenced throughout the book. Thereafter, chapters are organized into four parts: General Strategies, Regression Models, Classification Models, and Other Considerations, plus three appendices, including a brief introduction to R (too brief to teach someone R, but adequate to give a programmer new to R enough of an orientation to make sense of the R scripts included in the book). This organization has the virtue of allowing the authors to focus on the specifics of the various models while providing a natural way to repeat and reinforce fundamental principles. For example, regression trees and classification trees have a great deal in common and many authors treat them together. However, by splitting them into separate sections Kuhn and Johnson can focus on the performance measures that are peculiar to each kind of model while getting a second chance to explain fundamental principles and techniques, such as bagging and boosting, that are applicable to both kinds of models.

There are many ways to go about reading *Applied Predictive Modeling*. I can easily envision someone committed to mastering the material reading the text from cover to cover. However, the chapters are pretty much self-contained, and the authors are very diligent about providing back references to topics they have covered previously. You can pretty much jump in anywhere and find your way around. Additionally, the authors take the trouble to include quite a bit of “forward referencing”, which I found to be very helpful. For example, in section 3.6, where the authors mention credit scoring with respect to a discussion of adding predictors to a model, they point ahead to section 4.5, a short discussion of the credit scoring case study. This section, in turn, points ahead to section 11.2 and a discussion of evaluating predicted classes. These forward references encourage and facilitate latching on to a topic and then threading through the book to track it down.

Three major strengths of the book are its fundamental grounding in the principles of statistical inference, the thoroughness with which the case studies are presented, and its use of the R language. The statistical viewpoint is apparent both from the choice of topics presented and from the authors’ overall approach to predictive modeling. Topics that are peculiar to a statistical approach include the presentation of stratified sampling and other sampling techniques in the discussion of data splitting, and the sections on partial least squares and linear discriminant analysis. The real statistical value of the text, however, is embedded in Kuhn and Johnson’s methodology. They take great care to examine the consequences of modeling decisions and continually encourage the reader to challenge the results of particular models. The chapters on data preparation and model evaluation do an excellent job of informally presenting a formal methodology for making inferences. *Applied Predictive Modeling* contains very few equations and very little statistical jargon, but it is infused with statistical thinking. (A side effect of the text is to teach statistics without being too obvious about it. You will know you are catching on if you think the xkcd cartoon in chapter 19 is really funny.)

A nice feature of the case studies is that they are rich enough to illustrate several aspects of the model building process and are used effectively throughout the text. The discussion in Chapter 12 on preparing the University of Melbourne grant funding data set from the Kaggle contest is particularly thorough. This kind of “blow by blow” discussion of why the authors make certain modeling decisions is invaluable.

The R language comes into play in several ways in the text. The most obvious is the section on computing that closes most chapters. These sections contain R code that illustrates the major themes presented in the chapter. To some extent, these brief R statements substitute for the equations that are missing from the text. They provide concrete visual representations of the key ideas, accessible to anyone who makes the effort to learn a little R syntax. The chapter-ending code is itself backed up with an R package available on CRAN, AppliedPredictiveModeling, that contains scripts to reproduce all of the analyses and plots in the text. (This feature makes the text especially well-suited for self study.)

*Applied Predictive Modeling* is resplendent with R graphs and plots, many of them in color, that are integral to the presentation of ideas but also serve to illustrate how easily presentation-level graphs can be created in R. Form definitely follows function here, and it makes for a rather pretty book. One of my favorite plots is the first part of Figure 11.3, reproduced below, which shows the test set probabilities for a logistic regression model of the German Credit data set.

The authors point out that the estimates of bad credit in the right panel are skewed, showing that most estimates predict very low probabilities for bad credit when the credit is, in fact, good, just what you want to happen. In contrast, the estimates of bad credit are flat in the left panel, “reflecting the model’s inability to distinguish bad credit cases”.

Finally, *Applied Predictive Modeling* can be viewed as an introduction to the caret package. There is great depth here. This is not a book that comes with a little bit of illustrative code, icing on a cake so to speak; rather, the included code is just the tip of the iceberg. It provides a gateway to the caret package and the full functionality of R’s machine learning capabilities.

*Applied Predictive Modeling* is a remarkable text. At 600 pages, it is the succinct distillation of years of experience of two expert modelers working in the pharmaceutical industry. I expect that beginners and experienced model builders alike will find something of value here. On my shelf, it sits up there right next to Hastie, Tibshirani and Friedman’s *The Elements of Statistical Learning*.