by John Mount Ph. D.

Data Scientist at Win-Vector LLC

In her series on principal components analysis for regression in R, Win-Vector LLC's Dr. Nina Zumel broke the demonstration down into the following pieces:

- Part 1: the proper preparation of data and use of principal components analysis (particularly for supervised learning or regression).
- Part 2: the introduction of *y*-aware scaling to direct the principal components analysis to preserve variation correlated with the outcome we are trying to predict.
- And now Part 3: how to pick the number of components to retain for analysis.

In the earlier parts Dr. Zumel demonstrates common poor practice versus best practice and quantifies the degree of available improvement. In part 3, she moves past the usual "pick the number of components by eyeballing it" non-advice and teaches decisive decision procedures. For picking the number of components to retain for analysis, there are a number of standard techniques in the literature, including:

- Pick 2, as that is all you can legibly graph.
- Pick enough to cover some fixed fraction of the variation (say 95%).
- (for variance scaled data only) Retain components with singular values at least 1.0.
- Look for a "knee in the curve" (the curve being the plot of the singular value magnitudes).
- Perform a statistical test to see which singular values are larger than we would expect from an appropriate null hypothesis or noise process.

Dr. Zumel shows that the last method (designing a formal statistical test) is particularly easy to encode as a permutation test in the *y*-aware setting (there is also an obvious, similarly good bootstrap test). This is well founded and pretty much state of the art. It is also a great example of why to use a scriptable analysis platform (such as R), as it is easy to wrap arbitrarily complex methods into functions and then directly perform empirical tests on these methods. This "broken stick" type test yields the following graph, which identifies five principal components as significant:
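To make the idea concrete, here is a self-contained sketch of such a permutation test (an illustrative reconstruction, not the article's exact code): each singular value of the data matrix is compared against the largest singular value obtained after independently permuting each column, which destroys the between-column structure while preserving the marginal distributions.

```r
# Illustrative permutation test for significant principal components
# (a reconstruction of the idea, not the article's exact code).
set.seed(2016)
X <- matrix(rnorm(200 * 10), ncol = 10)
X[, 2] <- X[, 1] + 0.1 * rnorm(200)   # plant one real correlation
X <- scale(X)

observed <- svd(X)$d                  # observed singular values

# Null distribution: permute each column independently, killing any
# between-column structure, and record the largest singular value.
null_max <- replicate(100, {
  Xp <- apply(X, 2, sample)
  svd(Xp)$d[1]
})

threshold <- quantile(null_max, 0.95)
n_significant <- sum(observed > threshold)  # components to retain
```

Here the test flags only the planted correlated direction; the remaining singular values are indistinguishable from noise.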

However, Dr. Zumel goes on to show that in a supervised learning or regression setting we can further exploit the structure of the problem and replace the traditional component magnitude tests with simple model fit significance pruning. The significance method in this case gets the stronger result of finding the two principal components that encode the known even and odd loadings of the example problem:

In fact that is sort of her point: significance pruning either on the original variables or on the derived latent components is enough to give us the right answer. In general, we get much better results when (in a supervised learning or regression situation) we use knowledge of the dependent variable (the "*y*" or outcome) and do *all* of the following:

- Fit model and significance prune incoming variables.
- Convert incoming variables into consistent response units by *y*-aware scaling.
- Fit model and significance prune resulting latent components.
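As an illustration of the *y*-aware scaling step, here is a base-R sketch (my own toy example, not Dr. Zumel's code): each variable is centered and rescaled by the slope of a single-variable regression of the outcome on it, so variables that barely affect *y* contribute little variance to any subsequent principal components analysis.

```r
# Toy illustration of y-aware scaling (not Dr. Zumel's actual code).
# Each column is centered, then multiplied by the slope of lm(y ~ x),
# so a unit change in the scaled variable is a unit change in predicted y.
set.seed(5)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 * d$x1 + 0.1 * d$x2 + rnorm(100)

y_aware_scale <- function(x, y) {
  slope <- coef(lm(y ~ x))[["x"]]
  (x - mean(x)) * slope
}

scaled <- sapply(d[c("x1", "x2")], y_aware_scale, y = d$y)
apply(scaled, 2, var)  # x1, which drives y, now dominates the variance
```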

The above will become much clearer and much more specific if you click here to read part 3.

No, it's not a Jaqen H'ghar quote. Recently, Hadley Wickham tweeted the following image:

While this image isn't included in Hadley's Advanced R book, he does discuss many of the implications there. The most significant of these is that **creating a copy of an object in R doesn't consume any additional memory**. (Most of the time, anyway: there are exceptions, but I'm not going into them here.) The simplest example is the following:

```
a <- matrix(0, ncol = 1000, nrow = 1000)
b <- a
```

Creating a allocates 8Mb of memory, but creating b requires basically nothing: all that gets created is a new *name*, b, that points to the same data as a. This isn't like a pointer in C, though; if you later modify a or b, a new 8Mb matrix will be created in memory to preserve the logical semantics of there being two separate copies. As a user, you don't have to worry about any of this: it all happens behind the scenes, seamlessly, thanks to the "names have objects" concept Hadley illustrates above.

The same thing applies to the call-by-value semantics of functions: if you pass an object to a function, as in eigen(a), the *semantics* of R are that a copy of the object a is passed to the function eigen, with which it can do as it pleases without modifying the global copy. In practice though, no new memory will be needed for the copy as long as the function doesn't actually mess with the object passed to it. You can prove this to yourself by using the profvis package, or by using the profiling tools in the preview release of RStudio.
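You can also watch a copy happen (or not happen) with base R's `tracemem()`; this quick sketch is my own, and assumes an R build with memory profiling enabled, which the standard CRAN binaries have:

```r
# Watching copy-on-modify with tracemem() (requires an R build with
# memory profiling enabled, as the standard CRAN binaries are).
a <- matrix(0, ncol = 1000, nrow = 1000)
addr_a <- tracemem(a)      # address of a's data
b <- a                     # just a new name; no data is copied
addr_b <- tracemem(b)
identical(addr_a, addr_b)  # both names point at the same 8Mb
b[1, 1] <- 1               # the first write forces the actual copy
addr_b2 <- tracemem(b)
identical(addr_a, addr_b2) # b now has its own data
untracemem(a); untracemem(b)
```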

This is all likely second nature to experienced R programmers, but it can be a bit of a shock to programmers coming to R from other languages. Miles McBain (who learned to program in C++) had a bit of an epiphany on seeing Hadley's diagram, and explored many of the implications in a blog post which is well worth reading. His take-away:

If you’re trying to optimise R while thinking like a C++ coder, you may well be doing more harm than good. I myself have fallen foul of this in an attempt to modify data frames in place with my `pushr` package. It ended up just being syntactic sugar, with no observable performance boost.

If you think you might be in the same boat, take a look at Miles's post linked below.

One weiRd tip: R Has No Primitives

by Joseph Rickert

R / Finance 2016 lived up to expectations and provided the quality networking and learning experience that longtime participants have come to value. Eight years is a long time for a conference to keep its sparkle and pizzazz. But, the conference organizers and the UIC have managed to create a vibe that keeps people coming back. The fact that invited keynote speakers (e.g. Bernhard Pfaff 2012, Sanjiv Das 2013, and Robert McDonald 2014) regularly submit papers in subsequent years is a testimony to the quality and networking importance of the event. My guess is that the single track format, quality presentations, intense compact schedule and pleasant venues comprise a winning formula.

Since I have recently written about the content of this year's conference in preparation for the event, and since most of the presentations are already online for you to examine directly, I'll just present a few personal highlights here.

My favorite single visual from the conference is Bryan Lewis' depiction of corporate "Big Data" architectures as a manifestation of the impulse for completeness, control and dominance that once drove Soviet style central planning. (If you don't read Russian, run google translate on the text in the first panel.)

In his presentation, R in practice, at scale, Bryan presents a lightweight, R-centric architecture built around Redis that is adequate for many "big data" tasks.

Matt Dziubinski's talk on Getting the Most out of Rcpp, High-Performance C++ in Practice, is probably not a talk I would have elected to attend in a multi-track conference, and I would have missed seeing a virtuoso performance. Matt got through over 120 of his 153 prepared slides in a single, lucid stream of clear (but loud) explanations in only 20 minutes. Never stopping to pause, he gave a mini-course in computer science performance evaluation (both hardware and software aspects) that addressed the Why, What and How of it all.

Ryan Hafen's presentation, Interactively Exploring Financial Trades in R, showed how to use a tool chain built around Tessera and the NxCore R package to perform exploratory data analysis on a large NxCore data set containing approximately 1.25 billion records of 47 variables without leaving the R environment. The following slide provides an example of the kinds of insights that are possible.

In his presentation, Quantitative Analysis of Dual Moving Average Indicators in Automated Trading, Douglas Service showed how to use stochastic differential equations and the Itô calculus to derive a closed form solution for expected Log returns under the Luxor trading strategy and a baseline set of simplifying assumptions. If you like seeing the Math you will be pleased to see that Doug provides all of the details.

Michael Kane (glmnetlib: A Low-Level Library for Regularized Regression) discussed the motivations for continuing to improve linear models and showed the progress he is making on re-implementing glmnet which, although very efficient, does not support arbitrary link/family combinations or out-of-memory calculations and is written in the obscure Mortran flavor of Fortran. Kane's goal with his new package (renamed pirls: Penalized, Iteratively Reweighted Least Squares Regression) is to rectify these deficiencies while producing something fast enough to use.

In his presentation Community Finance Teaching Resources with R/Shiny, Matt Brigida showed off some open-source resources for teaching quantitative finance that are based on the new paradigm of GitHub as the place for tech-savvy people to hang out and Shiny as the teaching / presentation tool for making calculations come alive. Check out some of Matt's 5 minute mini lessons. Here is an example from his What is Risk module:

There is much more than I have presented here on the R / Finance conference site. If you are interested in deep Finance and not just the tools I have highlighted, be sure to check out the presentations by Sanjiv Das, Bernhard Pfaff, Matthew Dixon and others. There is plenty of useful R code to be mined in these presentations too.

I would be remiss without mentioning Patrick Burns' keynote presentation which was highly entertaining, novel and thought provoking on many levels: everything a keynote should be. Pat launched his talk by referring to the Sapir-Whorf hypothesis, which posits that language controls how we think, and assigned a similar role to model building. He went on to describe his agent-inspired R simulation model and showed how he calibrated this model to provide a useful tool for investigating ideas such as risk parity, variance targeting and strategies for taxing market pollution. The code for Pat's model is available here, but since his slides are not up on the conference site, and I was apparently too mesmerized to take useful notes, we will have to wait for Pat to post more on his website. (Pat's slides should be available soon.)

Finally, I would like to note that Doug Service and Sanjiv Das won the best paper prizes. This is the second year in a row for Sanjiv to win an R / Finance award. Congratulations to both Doug and Sanjiv!

Recently, I wrote about how it's possible to use predictive models to predict when an airline engine will require maintenance, and use that prediction to avoid unpleasant (and expensive!) delays for passengers on the ground. Planes generate a *lot* of data that can be used to make such predictions: today’s engines have hundreds of sensors and signals that transmit gigabytes of data for each flight. If you have access to data like this, you can generate predictions using Microsoft Azure services using the Predictive Maintenance for Aerospace solution in the Cortana Intelligence Gallery. (If you don't have data but still want to play around with the solution, it will generate simulated data based on this public data set donated by NASA.) The solution automates the process of launching and configuring several Azure services as shown in the architecture diagram below.

If you prefer the manual route, there's also a step-by-step walkthrough on GitHub on deploying the Predictive Maintenance solution. Of particular interest to R users is the Predictive Maintenance Template for SQL Server R Services, which includes R code that runs in the SQL Server database to:

- Predict the Remaining Useful Life, or Time to Failure, of an asset, such as an engine component
- Predict whether an asset will fail within a certain time frame or within a specific time window

In each case, a number of different models are trained in R (decision forests, boosted decision trees, multinomial models, neural networks and Poisson regression) and compared for performance; the best model is automatically selected for predictions.
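The select-the-best-model step looks roughly like this in miniature (a toy sketch of my own using base R and built-in data, not the template's actual code): fit several candidates, score them on a holdout set, and keep the winner.

```r
# Toy version of the model-comparison step (not the template's code):
# fit candidate models, score them on a holdout set, keep the best.
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- transform(iris[idx, ],  fail = as.numeric(Species == "virginica"))
test  <- transform(iris[-idx, ], fail = as.numeric(Species == "virginica"))

candidates <- list(
  logit  = glm(fail ~ Sepal.Length + Petal.Length, binomial, data = train),
  probit = glm(fail ~ Sepal.Length + Petal.Length,
               binomial(link = "probit"), data = train)
)

# Holdout accuracy for each candidate; the winner is used for predictions.
accuracy <- sapply(candidates, function(m)
  mean((predict(m, test, type = "response") > 0.5) == test$fail))
best <- names(which.max(accuracy))
```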

On a related note, Microsoft recently teamed up with aircraft engine manufacturer Rolls-Royce to help airlines get the most out of their engines. Rolls-Royce is turning to Microsoft's Azure cloud-based services — Stream Analytics, Machine Learning and Power BI — to make recommendations to airline executives on the most efficient way to use their engines in flight and on the ground. This short video gives an overview.

Cortana Intelligence Gallery: Predictive Maintenance for Aerospace solution

by John Mount Ph. D.

Data Scientist at Win-Vector LLC

In part 2 of her series on Principal Components Regression, Dr. Nina Zumel illustrates so-called *y*-aware techniques. These often-neglected methods use the fact that for predictive modeling problems we know the dependent variable, outcome or *y*, so we can use it during data preparation *in addition to* using it during modeling. Dr. Zumel shows that incorporating *y*-aware preparation into Principal Components Analysis can capture more of the problem structure in fewer variables. Such methods include:

- Effects-based variable pruning
- Significance-based variable pruning
- Effects-based variable scaling

This recovers more domain structure and leads to better models. Using the foundation set in the first article, Dr. Zumel quickly shows how to move from a traditional *x*-only analysis that fails to preserve a domain-specific relation of two variables to the outcome to a *y*-aware analysis that preserves the relation. In other words, she shows how to move away from a middling result where different values of *y* (rendered as three colors) are hopelessly intermingled when plotted against the first two latent variables found, as shown below.

Dr. Zumel shows how to perform a decisive analysis where *y* is somewhat sortable by each of the first two latent variables *and* the first two latent variables capture complementary effects, making them good mutual candidates for further modeling (as shown below).

Click here (part 2 *y*-aware methods) for the discussion, examples, and references. Part 1 (*x* only methods) can be found here.

Unlike most other statistical software packages, R doesn't have a native data file format. You can certainly import and export data in any number of formats, but there's no native "R data file format". The closest equivalent is the `saveRDS`/`readRDS` function pair, which allows you to serialize an R object to a file and then load it back into a later R session. But these files don't hew to a standardized format (they're essentially a dump of R's in-memory representation of the object), and so you can't read the data with any software other than R.
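For the curious, the round trip looks like this (a minimal sketch):

```r
# Minimal saveRDS()/readRDS() round trip: the file is an R-specific
# serialization, readable only by R.
path <- tempfile(fileext = ".rds")
saveRDS(mtcars, path)
mtcars2 <- readRDS(path)
identical(mtcars, mtcars2)  # the round trip is lossless
unlink(path)
```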

The goal of the feather project, a collaboration of Wes McKinney and Hadley Wickham, is to create a standard data file format that can be used for data exchange by and between R, Python, and any other software that implements its open-source format. Data are stored in a computer-native binary format, which makes the files small (a 10-digit integer takes just 4 bytes, instead of the 10 ASCII characters required by a CSV file), and fast to read and write (no need to convert numbers to text and back again). Another reason why feather is fast is that it's a *column-oriented* file format, which matches R's internal representation of data. (In fact, feather is based on the Apache Arrow framework for working with columnar data stores.) When reading or writing traditional data files with R, it must spend significant time translating the data from column format to row format and back again; with feather, the entire second step in the process below is eliminated.

For users of R 3.3.0 and later, the feather package is now available on CRAN. (Users of older versions of R can install feather from GitHub.) With feather installed, you can read and write R data frames to feather files using simple functions:

`write_feather(mtcars, "mtcars.feather")`

`mtcars2 <- read_feather("mtcars.feather")`

Better yet, the mtcars.feather file can easily be read into Python, using its feather-format package. This example uses the small built-in mtcars data frame, but you should see a significant performance impact when working with larger data. Eduardo Ariño de la Rubia performed some benchmarking of feather, and found it to be significantly faster for ingesting data than other popular R functions. The chart below compares using feather, the data.table package, and `readRDS` to import a 508Mb file of 8.5 million rows and 7 columns:

Feather wasn't the fastest function benchmarked for *writing* data — data.table's `fwrite` function generally performed a bit better — but given that you typically read a file more often than you write it, the speedups should be very noticeable in day-to-day data science activities.

For more on the feather package, check out its announcement from the RStudio blog linked below.

RStudio blog: Feather: A Fast On-Disk Format for Data Frames for R and Python, powered by Apache Arrow

Microsoft R Open 3.2.5 is now available for download. There are no changes of note in the R language engine with this release (R 3.2.5 was largely just a version-number increment).

There's lots that's new on the packages front, though: Microsoft R Open 3.2.5 has a default CRAN snapshot date of May 1, 2016, and there were plenty of updates on CRAN in the month since MRO 3.2.4 was released. (As always, you can use the checkpoint package to access even newer package updates.) This release's Package Spotlight notes new packages for deep learning, fuzzy joins for data frames, ggplot2 color themes inspired by scientific journals like *The Lancet* and TV shows like *The Simpsons*, calculating rolling statistics on time series, and for hexagonal-binning scatterplots for multi-class data (as shown below).

Stay tuned for the next update coming soon, Microsoft R Open 3.3.0, but for now you can download Microsoft R Open 3.2.5 at the link below.

by Joseph Rickert

For quite a few years now we have attempted to maintain the Revolution Analytics' Local R User Group Directory as the complete and authoritative list of R user groups. Meetup groups make this list in one of two ways: we discover the group because they have a web page of some sort proclaiming the group to be focused on the R language or someone from the group writes to us asking to have the group included in our list. We have deliberately pursued a relatively conservative strategy in growing the list, resisting the temptation to include every data science user group that may have had an R related presentation. Even so, we have been pleased to note the slow but steady growth of R user groups, and delighted to see quite a few relatively new meetup groups from South America and Africa make our list which, as of today, includes 235 groups.

However, it is interesting to occasionally take a broader view. Meetup.com, a very popular site for hosting user meetups of all stripes, lists a number of groups who identify with the keyword "R Project For Statistical Computing" in their "We're About" section. Using the site's tools to filter on this keyword will bring you to a page (subject to daily fluctuations) containing somewhere in the neighborhood of 358 meetups spread over 223 cities and 55 countries. The following plot displays the top 25 meetups by number of members as of May 15, 2016.

Most of these are indeed R user groups, or at least data science user groups with an interest in R. However, the presence of *Find a Tech Job in London* in the top 25 indicates that interest in R is spreading to a somewhat wider audience throughout the worldwide tech culture. So, while the Local R User Group Directory is still likely to be your best bet for finding a hard-core RUG, meetup.com may lead you to an R conversation in some surprising places.

My simple code to produce the chart may be downloaded here: Download Scrape_meetup

Apache Spark, the open-source cluster computing framework, will soon see a major update with the upcoming release of Spark 2.0. This update promises to be faster than Spark 1.6, thanks to a run-time compiler that generates optimized bytecode. It also promises to be easier for developers to use, with streamlined APIs and a more complete SQL implementation. (Here's a tutorial on using SQL with Spark.) Spark 2.0 will also include a new "structured streaming" API, which will allow developers to write algorithms for streaming data without having to worry about the fact that streaming data is always incomplete; algorithms written for complete DataFrame objects will work for streams as well.

This update also includes some news for R users. First, the DataFrame object continues to be the primary interface for R (and Python) users. Although the DataSets structure in Spark is more general, using the single-table DataFrame construct makes sense for R and Python, which have analogous native (or near-native, in Python's case) data structures. In addition, Spark 2.0 is set to add a few new distributed statistical modeling algorithms: generalized linear models (in addition to the Normal least-squares and logistic regression models in Spark 1.6); Naive Bayes; survival (censored) regression; and K-means clustering. The addition of survival regression is particularly interesting. It's a type of model used in situations where the outcome isn't always completely known: for example, some (but not all) patients may not yet have experienced remission in a cancer study. It's also used for reliability analysis and lifetime estimation in manufacturing, where some (but not all) parts may have failed by the end of the observation period. To my knowledge, this is the first distributed implementation of the survival regression algorithm.
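For readers who haven't met censored regression before, here is what the single-machine equivalent looks like using the survival package that ships with R (ordinary R, not SparkR):

```r
# Single-machine censored regression with the recommended 'survival'
# package -- the same model family Spark 2.0 distributes.
library(survival)
# 'lung': survival times of lung-cancer patients; in this data set,
# status 2 = death observed, status 1 = censored (alive at last follow-up).
fit <- survreg(Surv(time, status) ~ age + sex, data = lung)
coef(fit)  # intercept plus age and sex effects on log survival time
```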

For R users, these models can be applied to large data sets stored in Spark DataFrames, and then computed using the Spark distributed computing framework. Access to the algorithms is via the SparkR package, which hews closely to standard R interfaces for model training. You'll need to first create a SparkDataFrame object in R, as a reference to a Spark DataFrame. Then, to perform logistic regression (for example), you'll use R's standard glm function, using the SparkDataFrame object as the data argument. (This elegantly uses R's object-oriented dispatch architecture to call the Spark-specific GLM code.) The example below creates a Spark DataFrame, and then uses Spark to fit a logistic regression to it:

```
df <- createDataFrame(sqlContext, iris)
model <- glm(Species ~ Sepal_Length + Sepal_Width, data = df, family = "binomial")
```

The object model contains most (but not all) of the output created by R's traditional glm algorithm, so most standard R functions that work with GLM objects work here as well.

For more on what's coming in Spark 2.0 check out the DataBricks blog post below, and the preview documentation for SparkR contains more info as well. Also, you might want to check out how you can use Spark on HDInsight with Microsoft R Server.

Databricks: Preview of Apache Spark 2.0 now on Databricks Community Edition

John Mount Ph. D.

Data Scientist at Win-Vector LLC

Win-Vector LLC's Dr. Nina Zumel has just started a two part series on Principal Components Regression that we think is well worth your time. You can read her article here. Principal Components Regression (PCR) is the use of Principal Components Analysis (PCA) as a dimension reduction step prior to linear regression. It is one of the best known dimensionality reduction techniques and a staple procedure in many scientific fields. PCA is used because:

- It can find important latent structure and relations.
- It can reduce overfitting.
- It can ease the curse of dimensionality.
- It is used in a ritualistic manner in many scientific disciplines. In some fields it is considered ignorant and uncouth to regress using original variables.

We often find ourselves having to remind readers that this last reason is not actually positive. The standard derivation of PCA involves trotting out the math and showing the determination of eigenvector directions. It yields visually attractive diagrams such as the following.

Wikipedia: PCA

And this leads to a deficiency in much of the teaching of the method: glossing over the operational consequences and outcomes of applying the method. The mathematics is important to the extent it allows you to reason about the appropriateness of the method, the consequences of the transform, and the pitfalls of the technique. The mathematics is also critical to the correct implementation, but that is what one hopes is *already* supplied in a reliable analysis platform (such as R).
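Operationally, the whole PCR pipeline fits in a few lines of base R (my own toy example, not from Dr. Zumel's article): extract the principal components, keep the first few scores, and regress on them.

```r
# Minimal principal components regression in base R (a toy example,
# not from Dr. Zumel's article).
set.seed(3)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$x3 <- d$x1 + 0.01 * rnorm(50)         # a near-duplicate column
d$y  <- d$x1 + d$x2 + rnorm(50, sd = 0.1)

pca    <- prcomp(d[, c("x1", "x2", "x3")], center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:2])   # dimension reduction: keep 2 PCs
fit    <- lm(d$y ~ PC1 + PC2, data = scores)
summary(fit)$r.squared                  # two components suffice here
```

Because the two retained components span the x1/x2 directions that actually drive *y*, the reduced model loses almost nothing; which components to keep, and why, is exactly what the series examines.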

Dr. Zumel uses the expressive and graphical power of R to work through the *use* of Principal Components Regression in an operational series of examples. She works through how Principal Components Regression is typically mis-applied and continues on to how to correctly apply it. Taking the extra time to work through the all-too-common errors allows her to demonstrate and quantify the benefits of correct technique. Dr. Zumel will follow part 1 with a shorter part 2 article demonstrating important "*y*-aware" techniques that squeeze much more modeling power out of your data in predictive analytic situations (which is what regression actually is). Some of the methods are already in the literature, but are still not used widely enough. We hope the demonstrated techniques and included references will give you a perspective to improve how you use or even teach Principal Components Regression. Please read on here.