As a language for statistical computing, R has always had a bias towards linear algebra, and is optimized for operations dealing in complete vectors and matrices. This can be surprising to programmers coming to R from lower-level languages, where iterative programming (looping over the elements of a vector or matrix) is more natural and often more efficient. That's not the case with R, though: Noam Ross explains why vectorized programming in R is a good idea:
If you can express what you want to do in R in a line or two, with just a few function calls that are actually calling compiled code, it’ll be more efficient than if you write a long program, with the added overhead of many function calls. This is not the case in all other languages. Often, in compiled languages, you want to stick with lots of very simple statements, because it allows the compiler to figure out the most efficient translation of the code.
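A quick illustration of this point (my own example, not from Noam's article): summing 2*x with vectorized calls versus an element-by-element loop gives the same answer at a very different cost.

```r
x <- runif(1e6)

system.time(s1 <- sum(2 * x))  # two calls into compiled code

s2 <- 0
system.time(for (xi in x) s2 <- s2 + 2 * xi)  # ~1e6 interpreted iterations

all.equal(s1, s2)  # same answer (up to floating-point rounding)
```

On a typical machine the loop is orders of magnitude slower, even though both compute the same quantity.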
Read Noam's complete article at the link below for a bunch of useful tips and tricks for writing efficient and clear code in the R language using vectorized programming.
Noam Ross: Vectorization in R: Why?
by Daniel Hanson
Last time, we looked at the four-parameter Generalized Lambda Distribution, as a method of incorporating skew and kurtosis into an estimated distribution of market returns, and capturing the typical fat tails that the normal distribution cannot capture. Having said that, however, the Normal distribution can be useful in constructing Monte Carlo simulations, and it is still commonly found in applications such as calculating the Value at Risk (VaR) of a portfolio, pricing options, and estimating the liabilities in variable annuity contracts.
We will start here with a simple example using R, focusing on a single security. Although perhaps seemingly trivial, this lays the foundation for more complex cases, such as multiple correlated securities and stochastic interest rates. Discussion of these topics is planned for articles to come, as well as topics in option pricing.
Single Security Example
Under the oft-used assumption of Brownian Motion dynamics, the return of a single security (e.g., an equity) over a period of time Δt is approximately [see Pelsser, for example]
μΔt + σZ·√Δt    (*)
where μ is the mean annual return of the equity (also called the drift), and σ is its annualized volatility (i.e., standard deviation). Z is a standard Normal random variable, which makes the second term in the expression stochastic. The time t is measured in units of years, so for quarterly returns, for example, Δt = 0.25.
As μ, σ, and Δt are all known values, generating a simulated distribution of returns is a simple task. As an example, suppose we are interested in constructing a distribution of quarterly returns, where μ = 10% and σ = 15%. In order to get a reasonable approximation of the distribution, we will generate n = 10,000 returns.
n <- 10000
# Fixing the seed gives us a consistent set of simulated returns
set.seed(106)
z <- rnorm(n)  # mean = 0 and sd = 1 are defaults
mu <- 0.10
sd <- 0.15
delta_t <- 0.25
# apply to expression (*) above
qtr_returns <- mu*delta_t + sd*z*sqrt(delta_t)
Note that R is “smart enough” here to add the scalar mu*delta_t to each element of the vector in the second term, thus giving us a set of 10,000 simulated returns. Finally, let’s check our results. First, we plot a histogram:
hist(qtr_returns, breaks = 100, col = "green")
This gives us the following:
The symmetric bell shape of the histogram is consistent with the Normal assumption. Checking the annualized mean and variance of the simulated returns,
stats <- c(mean(qtr_returns) * 4, sd(qtr_returns) * 2)  # annualize: mean x 4, sd x sqrt(4) = 2
names(stats) <- c("mean", "volatility")
stats
We get:
mean volatility
0.09901252 0.14975805
which is very close to our original parameter settings of μ = 10% and σ = 15%.
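As an extra sanity check (not part of the original walkthrough), we can compare the simulated quarterly moments directly against their theoretical values μΔt and σ√Δt, and eyeball normality with a Q-Q plot:

```r
set.seed(106)
z <- rnorm(10000)
mu <- 0.10; sigma <- 0.15; delta_t <- 0.25
qtr_returns <- mu * delta_t + sigma * z * sqrt(delta_t)

c(theoretical = mu * delta_t, simulated = mean(qtr_returns))         # both ~0.025
c(theoretical = sigma * sqrt(delta_t), simulated = sd(qtr_returns))  # both ~0.075
qqnorm(qtr_returns); qqline(qtr_returns)  # points should hug the line
```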
Again, this is a rather simple example, but in future discussions, we will see how it extends to using Monte Carlo simulation for option pricing and risk management models.
by Joseph Rickert
Norman Matloff, professor of computer science at UC Davis and founding member of the UCD Dept. of Statistics, has begun posting as Mad(Data)Scientist. (You may know Norm from his book, The Art of R Programming: NSP, 2011.) In his second post (out today) on the new R package freqparcoord, which he wrote with Yinkang Xie, Norm looks into outliers in baseball data.
> library(freqparcoord)
> data(mlb)
> freqparcoord(mlb,3,4:6,7)
We would like to welcome Norm as a new R blogger and we are looking forward to future posts!
Mad(Data)Scientist: More on freqparcoord
by Mike Bowles
Mike Bowles is a machine learning expert and serial entrepreneur. This is the second post in what is envisioned as a four-part series that began with Mike's Thumbnail History of Ensemble Models.
One of the main reasons for using R is the vast array of high-quality statistical algorithms available in R. Ensemble methods provide a prime example. As statistics researchers have advanced the forefront of statistical learning, they have produced R packages that incorporate their latest techniques. The table below demonstrates this by comparing several of the ensemble packages available in R.
Name          Author          Algorithms                     1st Vers Date   Last Update
------------  --------------  -----------------------------  -------------   -----------
ipred         Peters et al    Bagging                        3/29/2002       9/3/2013
adabag        Alfaro et al    AdaBoost and Bagging           6/6/2006        7/5/2012
ada           Culp et al      AdaBoost + Friedman's mods     9/29/2006       7/30/2010
randomForest  Breiman et al   Random Forest                  4/1/2002        1/6/2012
gbm           Ridgeway et al  Stochastic Gradient Boosting   2/21/2003       1/18/2013
party         Hothorn         RF with faster tree growing    6/24/2005       1/17/2014
mboost        Hothorn         Boosting applied to glm, gam   6/16/2006       2/8/2013

Table 1. Ensemble packages available in R
The table gives the package name, the lead author, and the basic contents of the package. The dates in the rightmost two columns are the date of the first version of the package and the date of the last version. The dates more or less track the development of these methods and the publication of the corresponding papers in the area. The date of the last package update is provided to indicate how actively some of these packages are maintained and how active the field remains.
A number of these packages are worth having a look at, even though the methods they implement have been subsumed into newer methods. For example, ipred does bagging, which has been incorporated into both Random Forest and Gradient Boosting. But the ipred package has the ability to incorporate more than one type of base learner. One of the examples in the package documentation incorporates Linear Discriminant Analysis in addition to a Binary Decision Tree. It is hard to find ensemble methods using base learners other than binary decision trees; simultaneously using two (or more) different base learners is unique to this package.
The randomForest algorithm wins machine learning competitions, and the R package implements the methods of the late Professor Leo Breiman of Berkeley, containing the functionality that Prof Breiman describes in his papers. It solves regression and classification problems, has an unsupervised mode, produces marginal plots of prediction versus individual attributes, and ranks attributes by importance. It also produces a similarity matrix measuring how frequently two rows from the input wind up in the same leaf node together. That gives a measure of how close the two rows are in their effect on the trained model.
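A minimal sketch of those features, assuming the randomForest package is installed from CRAN (the built-in iris data set stands in for a real problem):

```r
library(randomForest)  # assumes the CRAN package is installed

set.seed(1)
fit <- randomForest(Species ~ ., data = iris,
                    importance = TRUE,  # rank attributes by importance
                    proximity = TRUE)   # row-by-row similarity matrix

round(importance(fit), 2)  # per-class and overall attribute importance
dim(fit$proximity)         # 150 x 150: how often two rows share a leaf node
```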
The gbm package is heavily used and commercially important. It’s written by Greg Ridgeway and contributors, and incorporates the methods outlined in Professor Jerome Friedman’s papers: regression under mean square and mean absolute loss, binary classification under the AdaBoost penalty and Bernoulli loss, and multiclass classification. The package includes a number of extensions (Cox proportional hazards and pairwise ranking, for example). The gbm package includes visualization tools similar to randomForest's. It will draw 2D or 3D plots showing marginal predicted values versus 1 or 2 of the attributes, and gives a table ranking attributes by importance as a guide for feature engineering. (After loading the package, type example(gbm) at the console.)
The R packages party and mboost reflect continued development of ensemble methods. The party package uses an alternative method for training binary decision trees. The method is called conditional inference trees. The package authors describe in their associated paper how conditional inference trees^{1} reduce bias and reduce training time. In the party package, the authors use Breiman’s Random Forest procedure incorporating conditional inference trees as base learners.
The mboost package approaches generalized linear models and generalized additive models as boosting problems. The connection between boosting and these models is described in Elements of Statistical Learning^{2}, Algorithm 16.1. When used for least squares regression with single attributes as base learners, the method corresponds to Efron’s Least Angle Regression^{3} or Tibshirani’s Lasso regression^{4}. The package authors extend the method to apply to generalized linear and generalized additive models.
Here’s an example of the sort of results these methods will produce. These results are for predicting the compressive strength of concrete based on ingredients in the concrete (water, cement, coarse aggregate, fine aggregate, etc.). The data set comes from the UC Irvine Data Repository. The results come from the gbm package (3000 trees, 10-fold cross-validation, shrinkage = 0.003). In Figure 1, going clockwise from the upper left, are plots of the progress of training (the green line is out-of-sample performance and the black line is in-sample performance), the relative importance of the various ingredients in predicting compressive strength, and the marginal changes in predicted strength as functions of fine aggregate and water. As the figures show, modern ensemble methods are far from being black boxes. Besides delivering predictions, they deliver a significant amount of information about the character of their predictions.
Figure 1 – Outputs from gbm Model for UCI Compressive Strength of Concrete
References
by Joseph Rickert
Generalized Linear Models have become part of the fabric of modern statistics, and logistic regression, at least, is a “go to” tool for data scientists building classification applications. The ready availability of good GLM software and the interpretability of its results make logistic regression a good baseline classifier. Moreover, Paul Komarek argues that, with a little bit of tweaking, the basic iteratively reweighted least squares algorithm used to evaluate the maximum likelihood estimates can be made robust and stable enough to allow logistic regression to challenge specialized classifiers such as support vector machines.
It is relatively easy to figure out how to code a GLM in R. Even a total newcomer to R is likely to figure out that the glm() function is part of the core R language within a minute or so of searching. Thereafter, though, it gets more difficult to find other GLM-related resources that R has to offer. Here is a far from complete, but hopefully helpful, list of resources.
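For instance, here is a minimal logistic regression fit with glm() on simulated data (the variable names and parameter values are illustrative, not drawn from any of the resources below):

```r
set.seed(42)
x <- rnorm(200)
p <- 1 / (1 + exp(-(0.5 + 1.5 * x)))  # true model: logit(p) = 0.5 + 1.5*x
y <- rbinom(200, size = 1, prob = p)

fit <- glm(y ~ x, family = binomial)
coef(fit)  # estimates should land reasonably near 0.5 and 1.5
predict(fit, newdata = data.frame(x = 0), type = "response")  # P(y = 1 | x = 0)
```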
Online documentation that I have found helpful includes the contributed book by Virasakdi Chongsuvivatwong and the tutorials from Princeton and UCLA. Here is a slick visualization of a Poisson model from the Freakonometrics blog.
But finding introductory materials on GLMs is not difficult. Almost all of the many books on learning statistics with R have chapters on the GLM, including the classic Modern Applied Statistics with S, by Venables and Ripley, and one of my favorite texts, Data Analysis and Graphics Using R, by Maindonald and Braun. It is more of a challenge, however, to sort through the more than 5,000 packages on CRAN to find additional functions that could help with various specialized aspects or extensions of the GLM. So here is a short list of GLM-related packages.
Packages to help with convergence and improve the fit
Packages for variable selection and regularization
Packages for special models
Bayesian GLMs
GLMs for Big Data
Generalized Additive Models (GAMs), which generalize GLMs
Beyond the documentation and a list of packages that may be useful, it is also nice to have the benefit of some practical experience. John Mount has written prolifically about logistic regression on his Win-Vector Blog over the past few years. His post, How robust is logistic regression, is an illuminating discussion of convergence issues surrounding Newton-Raphson / Iteratively Reweighted Least Squares. It contains pointers to examples illustrating the trouble caused by complete or quasi-complete separation, as well as links to the academic literature. This post is a classic, but all of the other posts in the series are very much worth the read.
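The core pathology is easy to reproduce in a couple of lines (a made-up toy example): when a predictor separates the classes completely, the likelihood keeps improving as the slope grows, so the coefficients run off toward infinity and R warns that fitted probabilities of 0 or 1 occurred.

```r
# x <= 3 gives y = 0 and x >= 4 gives y = 1: complete separation
x <- 1:6
y <- c(0, 0, 0, 1, 1, 1)

fit <- glm(y ~ x, family = binomial)  # emits convergence warnings
coef(fit)  # enormous, essentially meaningless coefficients
```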
Finally, as a reminder of the trouble you can get into interpreting t-values from a GLM, here is another classic: a post from the S-news archives on the Hauck-Donner phenomenon.
by Seth Mottaghinejad, Analytic Consultant for Revolution Analytics
You may have heard before that R is a vectorized language, but what do we mean by that? One way to read that is to say that many functions in R can operate efficiently on vectors (in addition to singletons). Here are some examples:
> log(1) # input and output are singletons
[1] 0
> log(1:10) # input and output are vectors
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
[8] 2.0794415 2.1972246 2.3025851
> paste("a", "1") # join two strings together into a single string
[1] "a 1"
> paste(letters[1:10], 1:10) # same as above, but vectorwise
[1] "a 1" "b 2" "c 3" "d 4" "e 5" "f 6" "g 7" "h 8" "i 9" "j 10"
Being aware of which functions are vectorized in R and using them can make a big difference when it comes to writing succinct and efficient R code. For example, had you not known that 'paste' is vectorized, instead of the above line of code you would need to loop through each value of your vector, or alternatively use one of the 'apply' family of functions, both of which are shown below.
> vv <- character(10) # initialize an empty character vector of length 10
> for (i in 1:10) vv[i] <- paste(letters[i], i) # fill in the values
> print(vv)
[1] "a 1" "b 2" "c 3" "d 4" "e 5" "f 6" "g 7" "h 8" "i 9" "j 10"
> sapply(1:10, function(i) paste(letters[i], i)) # use sapply instead of a loop
[1] "a 1" "b 2" "c 3" "d 4" "e 5" "f 6" "g 7" "h 8" "i 9" "j 10"
You can check for yourself that neither of the above two approaches is as efficient as using the 'paste' function in a vectorized fashion.
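To check that claim (an illustrative benchmark; exact timings vary by machine), we can time all three approaches on a larger vector:

```r
n <- 1e5
a <- sample(letters, n, replace = TRUE)

loop_paste <- function(a) {
  out <- character(length(a))
  for (i in seq_along(a)) out[i] <- paste(a[i], i)
  out
}

system.time(r1 <- loop_paste(a))                                     # explicit loop
system.time(r2 <- sapply(seq_along(a), function(i) paste(a[i], i)))  # sapply
system.time(r3 <- paste(a, seq_along(a)))                            # vectorized
identical(r1, r3)  # TRUE: all three give the same result
```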
For a more serious demonstration of what a difference vectorization can make, we will look at an R implementation of the Collatz conjecture. (Check out xkcd for an entertaining presentation.) The Collatz conjecture states that if you start with any positive integer n and recursively apply the following rule, you will eventually reach 1: if n is even, divide it by 2; if n is odd, multiply it by 3 and add 1.
The number of iterations it takes for n to reach 1 is called its stopping time. To restate, the Collatz conjecture states that any integer n has a finite stopping time.
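To make the definition concrete, here is the rule traced out for n = 6 (a small illustration, separate from the implementations below):

```r
n <- 6
path <- n
steps <- 0
while (n != 1) {
  n <- if (n %% 2 == 0) n / 2 else 3 * n + 1  # the Collatz rule
  path <- c(path, n)
  steps <- steps + 1
}
path   # 6 3 10 5 16 8 4 2 1
steps  # the stopping time of 6 is 8
```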
Let's try a naive implementation of the Collatz conjecture in R (a better implementation would use memoization, which we will not cover). There are two ways to implement the above in R. Here's the first way:
nonvec_collatz <- function(ints){
  collatz <- function(n) {
    # n is a single integer
    # recursively applies the Collatz rule to n
    # returns the number of iterations it takes to reach 1
    counter <- 0
    while (n != 1){
      counter <- counter + 1
      n <- ifelse(n %% 2 == 0, n / 2, 3*n + 1)
    }
    counter
  }
  # we use sapply to apply the above function to a vector of integers
  sapply(ints, collatz)
}
> set.seed(20)
> nonvec_collatz(sample(20))
[1] 20 17 8 19 4 20 1 0 2 5 3 16 9 7 17 7 12 6 14 9
In the above approach, we write a function 'collatz' which, given a single integer n, determines its stopping time. Since 'collatz' can only take a single integer as input, we pass it to 'sapply' to get it to work on a vector of integers. The above implementation is wrapped in a function which we call 'nonvec_collatz'. Although 'nonvec_collatz' is vectorized in the sense that a vector of integers in gives a vector of integers out, the vectorization was achieved through the use of 'sapply', and so it is only aesthetically vectorized and not necessarily efficient.
In our second approach, we write a function 'vec_collatz' which is truly vectorized:
vec_collatz <- function(ints){
  # we store the number of iterations for each number into niter
  niter <- integer(length(ints))
  # while there remains a number that has not yet converged to 1, run the loop
  while (abs(sum(ints - 1)) > .01){
    niter <- niter + ifelse(ints == 1, 0, 1)
    ints <- ifelse(ints == 1, ints,
                   ifelse(ints %% 2 == 0, ints / 2, 3*ints + 1))
  }
  niter
}
> set.seed(20)
> vec_collatz(sample(20))
[1] 20 17 8 19 4 20 1 0 2 5 3 16 9 7 17 7 12 6 14 9
Notice any differences between 'nonvec_collatz' and 'vec_collatz'? Both functions run a while loop that stops once the number reaches 1. But in 'vec_collatz' the while loop takes advantage of the fact that 'ifelse' is a vectorized function to run the recursive process on the whole vector of integers at once, instead of one integer at a time, as is done by 'nonvec_collatz'.
Let's look at how much more efficient 'vec_collatz' is compared to 'nonvec_collatz':
> library(microbenchmark)
> microbenchmark(vec_collatz(1:10^2), nonvec_collatz(1:10^2), times = 20)
Unit: milliseconds
                    expr       min        lq    median        uq       max neval
     vec_collatz(1:10^2)  20.63166  21.82983  23.13596  24.33374  25.24081    20
  nonvec_collatz(1:10^2)  42.82601  47.40102  49.23785  50.68589  54.94907    20
> microbenchmark(vec_collatz(1:10^3), nonvec_collatz(1:10^3), times = 20)
Unit: milliseconds
                    expr       min        lq    median        uq        max neval
     vec_collatz(1:10^3)  213.4263  221.0164  228.4409  234.2387   260.2224    20
  nonvec_collatz(1:10^3)  778.5368  920.8171  957.5107  982.2117  1062.2317    20
> microbenchmark(vec_collatz(1:10^4), nonvec_collatz(1:10^4), times = 20)
Unit: seconds
                    expr        min         lq     median         uq       max neval
     vec_collatz(1:10^4)   3.138519   3.183883   3.191266   3.276266   3.35356    20
  nonvec_collatz(1:10^4)  12.997234  13.349863  13.396626  13.482648  13.87802    20
The results above indicate that 'vec_collatz' is much more efficient than 'nonvec_collatz', and increasingly so for larger sets of numbers.
Now let's use 'vec_collatz' to draw a density plot for the stopping times for all integers from 1 to n = 10^i where we vary i from 1 to 6 (we create a separate plot for each i).
An interesting pattern emerges when we plot each number against its stopping time.
par(mfrow = c(1, 1))
plot(1:(10^4 - 1), Reduce(c, res[1:4]), xlab = "n", ylab = "stopping time", lwd = 2, col = "blue")
Whether there is any significance to the observed pattern is the subject of a great many dissertations, and we will take a stab at this problem using a statistical approach in a later post.
Most people know R as a statistics/analytics language for analysis of quantitative data, and don't think of it as a tool for processing raw text. But R actually has some quite powerful facilities for processing character data. And as Gaston Sanchez learned, text manipulation is an important part of a modern data scientist's repertoire:
Many years ago I decided to apply for a job in a company that developed data mining applications for big retailers. I was invited for an onsite visit and I went through the typical series of interviews with the members of the analytics team. Everything was going smoothly and I was enjoying all the conversations. Then it came turn to meet the computer scientist. After briefly describing his role in the team he started asking me a bunch of technical questions and tests. Although I was able to answer those questions related with statistics and multivariate analysis, I had a really hard time trying to answer a series of questions related with string manipulations.
I will remember my interview with that guy as one of the most embarrassing moments of my life. That day, the first thing I did when I went back home was to open my laptop, launch R, and start reproducing the tests I failed to solve. It didn't take me that much to get the right answers. Unfortunately, it was too late and the harm was already done. Needless to say I wasn't offered the job. That shocking experience showed me that I was not prepared for manipulating character strings. I felt so bad that I promised myself to learn the basics of strings manipulation and text processing. "Handling and Processing Strings in R" is one of the derived results of that old promise.
Gaston's Creative Commons-licensed 112-page ebook, Handling and Processing Strings in R, is an excellent and comprehensive review of R's string handling capabilities. It covers R's basic string-handling facilities (reading, converting, manipulating and formatting), and also devotes a chapter to the higher-level functions of Hadley Wickham's stringr package. The two chapters on regular expressions are a must-read for anyone who hasn't yet come to grips with the power of regexes for handling string-based data. There are a few practical examples at the end of the ebook (frequency counting, word clouds), but the book sticks mainly to the fundamentals and doesn't stray into semantic analysis. Highly recommended for anyone working with strings or character data in R.
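As a small taste of the territory the ebook covers, here are a few base-R string operations on made-up data:

```r
x <- c("apple: 3", "banana: 12", "cherry: 7")

fruit <- sub(":.*", "", x)               # text before the colon
count <- as.integer(sub(".*: ", "", x))  # numbers after it, via a regex

toupper(fruit)                    # [1] "APPLE" "BANANA" "CHERRY"
sprintf("%s (%d)", fruit, count)  # [1] "apple (3)" "banana (12)" "cherry (7)"
```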
Gaston Sanchez: Handling and Processing Strings in R (via Sharon Machlis)
Graham Williams is the Lead Data Scientist at the Australian Taxation Office, and the creator of Rattle, an open-source GUI for data mining with R. (Check out some recent reviews/demos of Rattle on this blog here and here.) Dr Williams continues his many contributions to the R community with One Page R, a "Survival Guide to Data Science with R".
As the name suggests, it is a one-page listing of key topics for data scientists using R: getting started, dealing with data, descriptive and prescriptive analytics, and other advanced topics. But the name belies the depth of content behind that one page: click through for detailed tutorials on topics including visualizing data with maps, text mining with R, ensembles of decision trees, and much much more. Some topics also include slide-based lecture notes and downloadable R code (follow the *R link).
Explore the content at the One Page R website linked below. And if you're in the NYC area, Dr Williams will be presenting at a free workshop for the CUNY Data Mining Initiative on March 7.
Togaware: One Page R: A Survival Guide to Data Science with R (via KDnuggets)
There's no shortage of web sites listing the current medal standings at Sochi, not least the official Winter Olympics Medal Tally. And here's the same tally, rendered with R:
Click through to see a real-time version of the chart, created with RStudio's Shiny by Ty Henkaline. (By the way, does anyone know if it's possible to embed a live version of the chart in a blog post like this?) If you're looking to create similar real-time charts of Web-based tables, be sure to check out the underlying code by Tyler Rinker that grabs the medal table from the Sochi website, cleans up the data, and plots the medal tally as a chart.
[Updated: The interactive chart by Ty Henkaline was mistakenly attributed to Ramnath Vaidyanathan. Apologies for the error.]
TRinker's R Blog: Sochi Olympic Medals
by Daniel Hanson, QA Data Scientist, Revolution Analytics
Last time, we included a couple of examples of plotting a single xts time series using the plot(.) function (i.e., the version of that function included in the xts package). Today, we’ll look at some quick and easy methods for plotting overlays of multiple xts time series in a single graph. As this information is not explicitly covered in the examples provided with xts and base R, this discussion may save you a bit of time.
To start, let’s look at five sets of cumulative returns for the following ETFs:
SPY: SPDR S&P 500 ETF Trust
QQQ: PowerShares NASDAQ QQQ Trust
GDX: Market Vectors Gold Miners ETF
DBO: PowerShares DB Oil Fund (ETF)
VWO: Vanguard FTSE Emerging Markets ETF
We first obtain the data using quantmod, going back to January 2007:
library(quantmod)
tckrs <- c("SPY", "QQQ", "GDX", "DBO", "VWO")
getSymbols(tckrs, from = "2007-01-01")
Then, extract just the closing prices from each set:
SPY.Close <- SPY[,4]
QQQ.Close <- QQQ[,4]
GDX.Close <- GDX[,4]
DBO.Close <- DBO[,4]
VWO.Close <- VWO[,4]
What we want is the set of cumulative returns for each, in the sense of the cumulative value of $1 over time. To do this, it is simply a case of dividing each daily price in the series by the price on the first day of the series. As SPY.Close[1], for example, is itself an xts object, we need to coerce it to numeric in order to carry out the division:
SPY1 <- as.numeric(SPY.Close[1])
QQQ1 <- as.numeric(QQQ.Close[1])
GDX1 <- as.numeric(GDX.Close[1])
DBO1 <- as.numeric(DBO.Close[1])
VWO1 <- as.numeric(VWO.Close[1])
Then, it’s a case of dividing each series by the price on the first day, just as one would divide an R vector by a scalar. For convenience of notation, we’ll just save these results back into the original ETF ticker names and overwrite the original objects:
SPY <- SPY.Close/SPY1
QQQ <- QQQ.Close/QQQ1
GDX <- GDX.Close/GDX1
DBO <- DBO.Close/DBO1
VWO <- VWO.Close/VWO1
We then merge all of these xts time series into a single xts object (à la a matrix):
basket <- cbind(SPY, QQQ, GDX, DBO, VWO)
Note that is.xts(basket) returns TRUE. We can also have a look at the data and its structure:
> head(basket)
SPY.Close QQQ.Close GDX.Close DBO.Close VWO.Close
2007-01-03 1.0000000 1.000000 1.0000000        NA 1.0000000
2007-01-04 1.0021221 1.018964 0.9815249        NA 0.9890886
2007-01-05 0.9941289 1.014107 0.9682540 1.0000000 0.9614891
2007-01-08 0.9987267 1.014801 0.9705959 1.0024722 0.9720154
2007-01-09 0.9978779 1.019889 0.9640906 0.9929955 0.9487805
2007-01-10 1.0012025 1.031915 0.9526412 0.9517923 0.9460847
> tail(basket)
SPY.Close QQQ.Close GDX.Close DBO.Close VWO.Close
2014-01-10 1.302539       NA 0.5727296 1.082406 0.5118100
2014-01-13 1.285209 1.989130 0.5893833 1.068809 0.5053915
2014-01-14 1.299215 2.027058 0.5750716 1.074166 0.5110398
2014-01-15 1.306218 2.043710 0.5826177 1.092707 0.5109114
2014-01-16 1.304520 2.043941 0.5886027 1.089411 0.5080873
2014-01-17 1.299003 2.032377 0.6070778 1.090647 0.5062901
Note that we have a few NA values here. This will not be of any significant consequence for demonstrating plotting functions, however.
We will now look at how we can plot all five series, overlaid on a single graph. In particular, we will look at the plot(.) functions in both the zoo and xts packages.
The xts package is an extension of the zoo package, so coercing our xts object basket to a zoo object is a simple task:
zoo.basket <- as.zoo(basket)
Looking at head(zoo.basket) and tail(zoo.basket), we will get output that looks the same as what we got for the original xts basket object, as shown above; the date to data mapping is preserved. The plot(.) function provided in zoo is very simple to use, as we can use the whole zoo.basket object as input, and the plot(.) function will overlay the time series and scale the vertical axis for us with the help of a single parameter setting, namely the screens parameter.
Let’s now look at the code and the resulting plot in the following example, and then explain what’s going on:
# Set a color scheme:
tsRainbow <- rainbow(ncol(zoo.basket))
# Plot the overlayed series
plot(x = zoo.basket, ylab = "Cumulative Return", main = "Cumulative Returns",
col = tsRainbow, screens = 1)
# Set a legend in the upper left hand corner to match color to return series
legend(x = "topleft", legend = c("SPY", "QQQ", "GDX", "DBO", "VWO"),
lty = 1, col = tsRainbow)
We started by setting a color scheme, using the rainbow(.) command included in the base R installation. It is convenient, as R will take an arbitrary positive integer value and select a sequence of that many distinct colors. This is a nice feature for the impatient or lazy among us (yes, guilty as charged) who don’t want to be bothered with picking out colors and just want to see the result right away.
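For the curious, rainbow() simply returns a character vector of hex color codes, which is why it can be passed straight to the col parameter:

```r
cols <- rainbow(5)
length(cols)                    # 5
all(substr(cols, 1, 1) == "#")  # TRUE: each entry is a "#RRGGBB"-style string
```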
Next, in the plot(.) command, we assign to x our “matrix” of time series in the zoo.basket object, labels for the horizontal and vertical axes (xlab, ylab), a title for the graph (main), and the colors (col). Last, but crucial, is the parameter setting screens = 1, which tells the plot command to overlay each series in a single graph.
Finally, we include the legend(.) command to place a color legend at the upper left hand corner of the graph. The position (x) may be chosen from the list of keywords "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right" and "center"; in our case, we chose "topleft". The legend parameter is simply the list of ticker names. The lty parameter refers to “line type”, and by setting it to 1, the lines in the legend are shown as solid lines, and as in the plot(.) function, the same color scheme is assigned to the parameter col.
Back to the color scheme, we may at some point need to show our results to a manager or a client, so in that case, we probably will want to choose colors that are easier on the eye. In this case, one can just store the colors into a vector, and then use it as an input parameter. For example, set
myColors <- c("red", "darkgreen", "goldenrod", "darkblue", "darkviolet")
Then, just replace col = tsRainbow with col = myColors in the plot and legend commands:
plot(x = zoo.basket, xlab = "Time", ylab = "Cumulative Return",
main = "Cumulative Returns", col = myColors, screens = 1)
legend(x = "topleft", legend = c("SPY", "QQQ", "GDX", "DBO", "VWO"),
lty = 1, col = myColors)
We then get a plot that looks like this:
While the plot(.) function in zoo gave us a quick and convenient way of plotting multiple time series, it didn’t give us much control over the scale used along the horizontal axis. Using plot(.) in xts remedies this; however, it involves doing more work. In particular, we can no longer input the entire “matrix” object; we must add each series separately in order to layer the plots. We also need to specify the scale along the vertical axis, as in the xts case, the function will not do this on the fly as it did for us in the zoo case.
We will use individual columns from our original xts object, basket. By using basket rather than zoo.basket, we tell R to use the xts version of the function rather than the zoo version (à la an overloaded function in traditional object-oriented programming). Let’s again look at an example and the resulting plot, and then discuss how it works:
plot(x = basket[,"SPY.Close"], xlab = "Time", ylab = "Cumulative Return",
main = "Cumulative Returns", ylim = c(0.0, 2.5), major.ticks= "years",
minor.ticks = FALSE, col = "red")
lines(x = basket[,"QQQ.Close"], col = "darkgreen")
lines(x = basket[,"GDX.Close"], col = "goldenrod")
lines(x = basket[,"DBO.Close"], col = "darkblue")
lines(x = basket[,"VWO.Close"], col = "darkviolet")
legend(x = 'topleft', legend = c("SPY", "QQQ", "GDX", "DBO", "VWO"),
lty = 1, col = myColors)
As mentioned, we need to add each time series separately in this case in order to get the desired overlays. If one were to try x = basket in the plot function, the graph would only display the first series (SPY), and a warning message would be returned to the R session. So, we first use the SPY series as input to the plot(.) function, and then add the remaining series with the lines(.) command. The color for each series is also included at each step (the same colors in our myColors vector).
As for the remaining arguments in the plot command, we use the same axis and title settings in xlab, ylab, and main. We set the scale of the vertical axis with the ylim parameter; noting from our previous example that VWO hovered near zero at the low end, and that DBO reached almost as high as 2.5, we set this range from 0.0 to 2.5. Two new arguments here are the major.ticks and minor.ticks settings. The major.ticks argument represents the periods in which we wish to chop up the horizontal axis; it is chosen from the set
{"years", "months", "weeks", "days", "hours", "minutes", "seconds"}
In the example above, we chose "years". The minor.ticks parameter can take values of TRUE/FALSE, and as we don’t need this for the graph, we choose FALSE. The same legend command that we used in the zoo case can be used here as well (using myColors to indicate the color of each time series plot). Just to compare, let’s change the major.ticks parameter to "months" in the previous example. The result is as follows:
A new package, called xtsExtra, includes a new plot(.) function that provides added functionality, including a legend generator. However, while it is available on R-Forge, it has not yet made it into the official CRAN repository. More sophisticated time series plotting capability can also be found in the quantmod and ggplot2 packages, and we will look at the ggplot2 case in an upcoming post. However, for plotting xts objects quickly and with minimal fuss, the plot(.) function in the zoo package fills the bill, and with a little more effort, we can refine the scale along the horizontal axis using the xts version of plot(.). R help files for each of these can be found by selecting plot.zoo and plot.xts respectively in help searches.
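As a final aside (my own addition), when date-axis handling isn't needed, base R's matplot() can overlay plain numeric series in one call; here is a sketch using simulated prices rather than the ETF data above:

```r
set.seed(1)
returns <- matrix(rnorm(250 * 3, mean = 0, sd = 0.01), nrow = 250)
cumvalue <- apply(1 + returns, 2, cumprod)  # cumulative value of $1, by column

matplot(cumvalue, type = "l", lty = 1,
        col = c("red", "darkgreen", "darkblue"),
        xlab = "Day", ylab = "Cumulative Return")
legend("topleft", legend = c("A", "B", "C"), lty = 1,
       col = c("red", "darkgreen", "darkblue"))
```

The trade-off is that the x-axis is just an observation index, so the zoo/xts methods remain the better fit for dated series.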