by John Mount Ph. D.

Data Scientist at Win-Vector LLC

In part 2 of her series on Principal Components Regression Dr. Nina Zumel illustrates so-called *y*-aware techniques. These often neglected methods use the fact that for predictive modeling problems we know the dependent variable, outcome or *y*, so we can use this during data preparation *in addition to* using it during modeling. Dr. Zumel shows the incorporation of *y*-aware preparation into Principal Components Analyses can capture more of the problem structure in fewer variables. Such methods include:

- Effects based variable pruning
- Significance based variable pruning
- Effects based variable scaling.

This recovers more domain structure and leads to better models. Using the foundation set in the first article Dr. Zumel quickly shows how to move from a traditional *x*-only analysis that fails to preserve a domain-specific relation of two variables to outcome to a *y*-aware analysis that preserves the relation. Or in other words how to move away from a middling result where different values of y (rendered as three colors) are hopelessly intermingled when plotted against the first two found latent variables as shown below.

Dr. Zumel shows how to perform a decisive analysis where *y* is somewhat sortable by the each of the first two latent variable *and* the first two latent variables capture complementary effects, making them good mutual candidates for further modeling (as shown below).

Click here (part 2 *y*-aware methods) for the discussion, examples, and references. Part 1 (*x* only methods) can be found here.

Could you please share the link for the part 1 of this article? thank you

Posted by: Sergio Armando Rosales Mazariegos | May 27, 2016 at 15:23

@Sergio, it's the last link in the article above: http://www.win-vector.com/blog/2016/05/pcr_part1_xonly/

Posted by: David Smith | May 27, 2016 at 17:42