*In this post, Revolution engineer Sherry LaMonica shows us how to use the RevoScaleR big-data package in Revolution R Enterprise to do principal components analysis on 50 years of stock market data -- ed.*

Principal components analysis, or PCA, seeks a set of orthogonal axes such that the first axis, or first principal component, accounts for as much variability as possible, and each subsequent axis is chosen to maximize the remaining variance while staying orthogonal to the previous axes. Principal components are typically computed either by a singular value decomposition of the data matrix or by an eigenvalue decomposition of a covariance or correlation matrix; the latter permits us to use the RevoScaleR function *rxCovCor* with the standard R function *princomp*.
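To make the eigenvalue route concrete, here is a minimal sketch in base R. It uses simulated, strongly correlated columns rather than the NYSE data, and the names `x`, `corMat`, `eig`, `pca`, and `sdFromEigen` are assumptions chosen for illustration. It checks that `princomp`, when given a correlation matrix through `covmat`, returns the square roots of that matrix's eigenvalues as component standard deviations, and that the loadings are orthonormal.

```r
# Simulated, strongly correlated columns (illustrative only, not the NYSE data)
set.seed(42)
n <- 1000
base <- rnorm(n)
x <- cbind(open = base + rnorm(n, sd = 0.1),
           high = base + rnorm(n, sd = 0.1),
           low  = base + rnorm(n, sd = 0.1))

# Route 1: eigen decomposition of the correlation matrix
corMat <- cor(x)
eig <- eigen(corMat, symmetric = TRUE)

# Route 2: princomp applied to the same correlation matrix
pca <- princomp(covmat = corMat)

# Component standard deviations are the square roots of the eigenvalues
sdFromEigen <- sqrt(eig$values)
print(all.equal(unname(pca$sdev), sdFromEigen))

# The loadings are orthonormal: crossprod(L) is the identity matrix
L <- unclass(loadings(pca))
print(round(crossprod(L), 10))
```

Because only the correlation matrix is needed, the expensive pass over the data can be done once by a function that works on data too large for memory, and the decomposition itself runs on a small matrix.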

Stock market data for open, high, low, close, and adjusted close from 1962 to 2010 is available from InfoChimps. As you might expect, these data are highly correlated, and principal components analysis can be used for data reduction. We read the original data (a set of 26 comma-separated text files, where each file is represented by a letter in the alphabet) into an .xdf file, NYSE_daily_prices.xdf:

    nyseDataDir <- "C:/Users/Sherry/Downloads/NYSE"
    dataSourceName <- file.path(nyseDataDir, "NYSE_daily_prices")
    dataFileName <- "NYSE_daily_prices.xdf"
    append <- "none"
    for (i in LETTERS) {
      importFile <- paste(dataSourceName, "_", i, ".csv", sep="")
      rxTextToXdf(importFile, dataFileName, stringsAsFactors=TRUE, append=append)
      append <- "rows"
    }

The full data set includes 9.2 million observations of daily open-high-low-close data for some 2800 stocks:

    > rxGetInfoXdf(dataFileName)
    File name: NYSE_daily_prices.xdf
    Number of observations: 9211031
    Number of variables: 9
    Number of blocks: 34

We will use the rxCor function to calculate the Pearson correlation matrix for the specified variables, and pass it to the princomp function:
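A minimal sketch of what this step might look like follows. It is an assumption, not the author's listing: the variable names are taken from the loadings output, `stockPca` is the object name used in that output, and `stockFormula` and `stockCor` are hypothetical names introduced here. The RevoScaleR call is guarded so that the snippet does nothing on a system without the package or the .xdf file.

```r
# File name taken from the import step
dataFileName <- "NYSE_daily_prices.xdf"

# One-sided formula listing the five price variables
# (variable names as they appear in the loadings output)
stockFormula <- ~ stock_price_open + stock_price_high + stock_price_low +
  stock_price_close + stock_price_adj_close

# Runs only where RevoScaleR is loaded and the .xdf file exists
if (exists("rxCor") && file.exists(dataFileName)) {
  # stockCor (assumed name) holds the 5 x 5 Pearson correlation matrix
  stockCor <- rxCor(stockFormula, data = dataFileName)
  # princomp accepts a covariance or correlation matrix through covmat
  stockPca <- princomp(covmat = stockCor)
}
```

Passing the matrix through `covmat` means `princomp` never sees the 9.2 million rows; only the correlation step reads the .xdf file.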

This yields the following output:

    > summary(stockPca)
    Importance of components:
                              Comp.1    Comp.2      Comp.3       Comp.4
    Standard deviation     2.0756631 0.8063270 0.197632281 0.0454173922
    Proportion of Variance 0.8616755 0.1300327 0.007811704 0.0004125479
    Cumulative Proportion  0.8616755 0.9917081 0.999519853 0.9999324005
                                 Comp.5
    Standard deviation     1.838470e-02
    Proportion of Variance 6.759946e-05
    Cumulative Proportion  1.000000e+00

    > loadings(stockPca)

    Loadings:
                          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
    stock_price_open      -0.470 -0.166  0.867
    stock_price_high      -0.477 -0.151 -0.276  0.410 -0.711
    stock_price_low       -0.477 -0.153 -0.282  0.417  0.704
    stock_price_close     -0.477 -0.149 -0.305 -0.811
    stock_price_adj_close -0.309  0.951

                   Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
    SS loadings       1.0    1.0    1.0    1.0    1.0
    Proportion Var    0.2    0.2    0.2    0.2    0.2
    Cumulative Var    0.2    0.4    0.6    0.8    1.0

The default plot method for objects of class princomp is a screeplot, which is a barplot of the variances of the principal components. We can obtain the plot as usual by calling plot with our principal components object:

    > plot(stockPca)

Between them, the first two principal components explain 99% of the variance; we can therefore replace the five original variables by these two principal components with no appreciable loss of information.
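The 99% figure can be checked by hand from the standard deviations printed in the summary output. The sketch below copies those five values and recomputes the variances (the bar heights of the screeplot) and their cumulative share; the names `sdev`, `variances`, `propVar`, and `cumVar` are introduced here for illustration.

```r
# Standard deviations copied from the summary(stockPca) output
sdev <- c(Comp.1 = 2.0756631, Comp.2 = 0.8063270, Comp.3 = 0.197632281,
          Comp.4 = 0.0454173922, Comp.5 = 1.838470e-02)

# Variance of each component: the heights of the screeplot bars
variances <- sdev^2

# Share of total variance per component, then the running total
propVar <- variances / sum(variances)
cumVar <- cumsum(propVar)

print(round(sum(variances), 4))
print(round(cumVar, 4))
```

The variances sum to roughly 5 because the correlation matrix of five variables has trace 5, and the second cumulative value reproduces the 0.9917 reported in the summary.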

Would be nice to see the time series of the two components; we might be able to relate them to some observables.

Posted by: eran | June 17, 2011 at 11:11

I might have said this here before and I will say it again. When doing "Big Data" kind of stuff, do show timing comparisons; otherwise it's meaningless. PCA in general is not that interesting. Also, 9 million rows times a handful of columns don't make big data. Pick some gene data (a billion rows by thousands of columns) and then see how good this is compared to some other tools (e.g. SAS, SPSS, R). Then this PCA would be interesting.

Posted by: nick | June 17, 2011 at 16:39

I suspect you want to be looking at log prices as otherwise your errors are going to be dominated by recent prices.

Posted by: Edward | June 28, 2011 at 01:25

I would even say you have to look at returns not prices. The nonstationarity in the stock prices (or log-prices) will make the correlation coefficient meaningless. After you obtain the principal component of the returns you can obtain the principal component of the stock prices by transforming returns back to prices.

Posted by: Zeno | July 14, 2011 at 01:13