If you want to do statistical analysis or machine learning with data in SQL Server, you can of course extract the data from SQL Server and then analyze it in R or Python. But a better way is to run R or Python *within* the database, using Microsoft ML Services in SQL Server 2017. Why?

- **It's faster**. Not only do you get to use the SQL Server instance (which is likely to be faster than your local machine), but you also no longer have to transport the data over a network, which is likely to be the biggest factor, particularly when working with large data sets.
- **It operationalizes your analysis**. Rather than running an ad-hoc analysis on your desktop, you can instead publish it as a stored procedure, and make your analysis available on demand as a SQL query.
- **It's more powerful**. Not only can you use all of open-source R and Python within SQL Server, but you also get access to dedicated Microsoft libraries in R/Python including RevoScaleR/revoscalepy and MicrosoftML/microsoftml, which provide algorithms optimized for large data sets in SQL Server.
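To give a rough feel for what "in-database" R looks like: the `sp_execute_external_script` stored procedure passes the result of a SQL query to the R script as a data frame (named `InputDataSet` by default) and returns whatever data frame the script assigns to `OutputDataSet`. The sketch below is only the R half of that arrangement. The toy data and the `region` and `sales` columns are made up for illustration, and `InputDataSet` is built by hand so the snippet runs in plain R outside SQL Server.

```r
# Stand-in for the rows SQL Server would pass in from the input query.
# (Hypothetical data; inside SQL Server this data frame is supplied for you.)
InputDataSet <- data.frame(
  region = c("east", "east", "west", "west"),
  sales  = c(10, 14, 7, 9)
)

# The analysis itself: mean sales per region.
# Whatever is assigned to OutputDataSet is what the procedure would return.
OutputDataSet <- aggregate(sales ~ region, data = InputDataSet, FUN = mean)
print(OutputDataSet)
```

Because the script only talks to the two data frames, the same code can be developed on the desktop and later wrapped in a stored procedure without changes to the analysis logic.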

If you'd like to learn how to run R within SQL Server, a great resource is the book *SQL Server 2017 Machine Learning Services with R*, by Tomaž Kaštrun and Julie Koesmarno. At over 300 pages, the book covers every detail, including installation and configuration, an overview of data analysis in R, operationalizing R code in SQL Server and via Power BI, and using the aforementioned extension libraries for R. It's mostly aimed at R-using data scientists, but it also includes a chapter that introduces R to database administrators (DBAs). In short, it contains everything you need to know to build production applications with R and SQL Server 2017 (or 2016), backed up with a wealth of examples from industry.

You can purchase a paperback copy from your favorite bookseller, and you can also get a discounted digital edition from the publisher at the link below.

Packt Publishing: SQL Server 2017 Machine Learning Services with R

When it comes to getting things right in data science, most of the focus goes to the data and the statistical methodology used. But when a misplaced parenthesis can throw off your results *entirely*, ensuring correctness in your programming is just as important. A new book published by CRC Press, *Testing R Code* by Richard (Richie) Cotton, provides all the guidance you’ll need to write robust, correct code in the R language.

This is not a book exclusively for package developers: data science, after all, is a combination of programming and statistics, and this book provides lots of useful advice on helping to ensure the programming side of things is correct. The book suggests you begin by using the assertive package to provide alerts when your code runs off the rails. From there, you can move on to creating formal unit tests as you code, using the testthat package. There's also a wealth of useful advice for organizing and writing code that's more maintainable in the long run.
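As a minimal sketch of what a formal unit test with testthat looks like (this example is not taken from the book, and `celsius_to_fahrenheit()` is a hypothetical helper invented for illustration), the tests below would catch exactly the kind of misplaced-parenthesis slip mentioned above:

```r
library(testthat)

# Hypothetical helper: convert temperatures from Celsius to Fahrenheit.
celsius_to_fahrenheit <- function(celsius) {
  # Fail early, with a clear message, on input we cannot handle.
  if (!is.numeric(celsius)) {
    stop("`celsius` must be numeric, not ", class(celsius)[1], ".")
  }
  celsius * 9 / 5 + 32
}

# Check the function against values we know to be correct.
test_that("known reference points convert correctly", {
  expect_equal(celsius_to_fahrenheit(0), 32)
  expect_equal(celsius_to_fahrenheit(100), 212)
  expect_equal(celsius_to_fahrenheit(c(-40, 37)), c(-40, 98.6))
})

# Check that bad input produces the intended error.
test_that("non-numeric input is rejected", {
  expect_error(celsius_to_fahrenheit("hot"), "must be numeric")
})
```

Had the formula been written as `celsius * 9 / (5 + 32)`, the first test would fail immediately instead of quietly skewing downstream results.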

If you *are* a package developer, there's lots here for you too. There are detailed instructions on incorporating unit tests into your package, and how tests interact with version control and continuous integration systems. It also touches on advanced topics relevant to package authors, like debugging C or C++ code embedded via the Rcpp package. And there's more good advice, like how to write better error messages so your users can recover more easily when an error does occur.

Testing is a topic that doesn't get as much attention as it deserves in data science disciplines. One reason may be that it's a fairly dry topic, but Cotton does a good job in making the material engaging with practical examples and regular exercises (with answers in the appendix). Frequent (and often amusing) footnotes help make this an entertaining read (given the material) and hopefully will motivate you to make testing a standard (and early) part of your R programming process.

*Testing R Code* is available from your favorite bookseller now, in hardback and electronic formats.

by Joseph Rickert

Basically, there are two kinds of graphics or plots you can make from a data set: (1) those that allow you to see what is going on with the data, and (2) those you make to communicate what you have found to someone else. When making the first kind, you want to select plots that will enable you to see as much as possible while taking great care not to fool yourself. With the second kind, you ought to select plots that make the features of the data you want to communicate seem obvious. They should be focused on the story you are trying to tell, be free from clutter and have an impact on your target audience.

Antony Unwin’s *Graphical Data Analysis with R* (CRC Press 2015) is a very good read that thoroughly discusses the process and principles behind plots of the first kind while offering considerable guidance about producing those of the second kind. In 14 chapters that extend to nearly 300 pages, Unwin makes superb use of the R language to develop the principles of Graphical Data Analysis (GDA) while demonstrating the interplay of plot making and basic statistical inference that together make for a comprehensive, exploratory analysis of a data set.

The Preface and Chapter 1 set the scene by defining the scope of GDA, illustrating some of its basic principles and offering a new metaphor that ought to replace the tired and misleading idea that good graphics let you “drill down” into the data. This mechanical notion assumes you know where to drill and possess a clear idea of what you are looking for. In stark contrast, Unwin offers the metaphor of a photographer who takes many photographs of an object from multiple angles and in different lighting conditions in order to “grasp a whole object”. Throughout the book Unwin hammers home his central idea that many graphics of a dataset should be drawn “maybe even a large number of them, where each contributes something to the overall picture”.

Chapter 2 presents a fairly comprehensive review of the relevant literature for GDA acknowledging the contributions of Cleveland, Cook, Murrell, Wickham and many others and pointing out software, websites, texts and other resources that a student should find helpful. Because, as the author explains, “There is no complex theory about graphics” it is the practice of experts and successful exemplars that comprise the foundations of the subject. I found this sketch of GDA’s background helpful in coming to see GDA as an emergent discipline in its own right.

Chapters 3 through 7 discuss graphics for continuous variables, categorical data, looking for structure and associations, multivariate continuous data and multivariate categorical data. These chapters comprise the core of the text, describing most of the basic plots available in the R arsenal through numerous examples with multiple data sets. Chapter 6 on *Investigating Multivariate Continuous Data* provides as thorough a discussion of parallel coordinate plots as you are likely to find anywhere.

Not only does Unwin describe the logic and technique of the plots in these core chapters but he is careful to provide thoughtful and measured interpretations. Consider plots 5.5 and 5.6 and their captions, the second and third plots in a series of three describing the geyser data set.

```r
library(ggplot2)
data(geyser, package="MASS")
ggplot(geyser, aes(duration, waiting)) + geom_point() +
  geom_density2d()
```

FIGURE 5.5: The same scatter plot with bivariate density contours. There is evidence of three concentrations of data, two univariate outliers (one eruption with low duration and one with a high waiting time until the next eruption), and one bivariate outlier.

```r
library(hdrcde)
par(mar=c(3.1, 4.1, 1.1, 2.1))
with(geyser, hdr.boxplot.2d(duration, waiting,
  show.points=TRUE, prob=c(0.01,0.05,0.5,0.75)))
```

FIGURE 5.6: Another version of the scatterplot, but now with highest density regions based on a bivariate density estimate. There is less evidence of three data concentrations than in the previous plot and there is a slightly different set of possible outliers.

Of the many possible examples from the book, I have selected these plots to illustrate Unwin’s style because they show how a simple scatter plot enhanced with sophisticated, “built in” mathematical tools can illuminate the complexities of the data. They also serve to illustrate the value of Unwin’s injunction that a good graphic ought to be accompanied by a thought-through caption and highlight the ephemeral boundary between seeing and interpreting.

Chapter 8 – *Getting an Overview*, Chapter 9 – *Graphics and Data Quality: How Good are the Data?*, Chapter 10 – *Comparisons, Comparisons, Comparisons* and Chapter 12 – *Ensemble Graphics and Case Studies* together comprise a mini course on Exploratory Data Analysis.

Chapter 11 – *Graphics for Time Series* provides some good advice and proposes a few alternative suggestions for plotting time series data, including the use of parallel coordinate plots and calendar plots (see the openair package). It is unusual to find any extended discussion on plotting time series data. At the very least, this chapter summarizes a good amount of common sense wisdom.

Chapter 13 – *Some Notes on Graphics with R* provides a short discussion of R’s graphic systems along with tips on R coding for graphics and offers some suggestions for dealing with large data sets.

Chapter 14 provides a very brief summary of the text and a short assessment of the strengths and weaknesses of GDA itself.

The text has some nice pedagogical features that make it appealing for self study. Every chapter is preceded by a short summary of what the chapter is about, ends with a list of the main points covered and offers some exercises that extend the material covered in the text, often pointing to additional R packages. I particularly enjoyed Unwin's brief historical references that accompany examples drawn from the HistData package, like the context surrounding the *Charge of the Light Brigade*.

*Graphical Data Analysis with R* will certainly be valuable to anyone wanting to create better graphics in R. It is sufficiently rich in well-coded ggplot2 examples that it will serve as a good reference even after the basic principles have been assimilated. But, in my view, the book has more to offer than examples and ought to be read more closely, and more widely. Although the examples primarily make use of the ggplot2 system, the principles that they illustrate are much more general, making them useful to anyone with some experience using any robust statistical plotting system including base R graphics, the lattice package, Python, Matlab and others. Also, as I mentioned above, the text provides a fairly complete discussion of exploratory data analysis and could easily be used as the basis for a short course in this subject.

Moreover, Unwin’s careful treatment of the plots he draws, his use of multiple data sets, and his ongoing discussion of the interplay between graphical analysis and statistical inference make *Graphical Data Analysis with R* a suitable text for an alternative first course in statistics. To my mind, a future scientist or intelligent consumer of plots and statistical information would be better served by a modern computational approach to working with data than by most standard introductory courses in statistics which, in the end, amount to little more than a futile attempt to explain what a p-value isn’t.

Who can say how any individual text will be read and valued in these days when e-books dilute the integrity of any printed texts by merging them into the rushing stream of online content? Even so, I think that many readers will find that *Graphical Data Analysis with R* illuminates a shadowy corner of statistical analysis and stands a good chance of becoming recognized as a foundational text for GDA.

by Joseph Rickert

What are you reading? - and what are you recommending to friends, colleagues, and students who want to learn something about R programming? A quick search of Amazon will show that there are several new R books proposed for 2016; but of course, new doesn't necessarily mean better. I fully expect that many new books in all areas of statistics, data science and many other scientific disciplines using R to provide a computational aspect for their exposition will continue to be written for years to come. All of these books will provide windows into learning R for people excited about the particular subject matter. However, so many excellent R based texts have already been published that it will be difficult for these new works to achieve "must buy" status for the R content alone.

Below are my recommendations for good R reads. Some of these books go back a few years, but they continue to hold their value. With the possible exception of books that were based primarily on the S language, good R books don't become obsolete. Unlike some other computer languages, R evolves mostly through new capabilities added by contributed packages, not through changes to the R core. The fact that the dplyr family of packages may make data wrangling more convenient in many circumstances doesn't make a book that teaches data manipulation through base R functions any less relevant. In fact, some might argue that new students should be taught the basic functionality first. I am not a militant traditionalist, but it does seem to me that familiarity with the bare bones basics of the language will help newcomers to gain intuition about how R works.

There are three lists below. The first lists my picks for teaching R programming (top row in the graphic). The second list provides my recommendations for people interested in learning R for data science (second row in the graphic).

The third list is of books on my shelf that I continue to value. For every entry in all three lists I provide a mini or micro review. In a few cases, I point to a more extensive review that I have previously published in this blog. My lists are in no way intended to be complete. But, I apologize right now if I have omitted some really good books. Please let me know about what I have missed by commenting to this post with a mini review of your own.

Advanced R by Hadley Wickham - Anyone who wants to gain a deep understanding of the R language will certainly benefit from this book. More than a reference: the author seeks to provide a conceptual framework for understanding R’s structure and guide readers through R’s idiosyncratic mechanisms pointing out traps, illuminating difficult concepts and providing expert commentary.

The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff – This is still my pick for the best book for people with some programming experience who want to make a serious effort at learning R. Professor Matloff’s interest in teaching the mechanics of programming infused with his deep understanding of both the underlying computer science and statistical theory put this book on top.

Hands-On Programming with R by Garrett Grolemund – If you are not only new to R but new to programming as well, this is the book for you. I have reviewed it more extensively here.

R For Dummies by Andrie de Vries and Joris Meys – A current, concise and insightful reference to core concepts in the R language. A really nice feature of the book is its emphasis on presenting the R ecosystem along with core R concepts. When learning anything new, it is always helpful to understand the big picture. Keep this book by your computer, when you stop referring to it you will be a pretty good R programmer.

Applied Predictive Modeling by Max Kuhn and Kjell Johnson – This book is the master text for predictive analytics, carefully walking through several modeling examples and making expert use of the extensive machine learning tools in R’s caret package. I have described the book more fully here.

Data Mining with Rattle and R by Graham Williams – This is the perfect first book for machine learning with R. The rattle GUI helps get across the machine learning concepts and also produces some pretty good R code to get you started.

Data Science in R: A Case Studies Approach to Computational Reasoning and Data Science by Deborah Nolan and Duncan Temple Lang – My most recent acquisition, this book consists of 12 non-trivial case studies organized under three themes: Data Manipulation and Modeling, Simulation Studies, and Data and Web Technologies. All of the data sets are messy and the projects identify and develop the kind of skills required to undertake open-ended data science projects. The book doesn’t teach R programming, but it shows why R is the appropriate language for doing data science.

Practical Data Science with R by Nina Zumel and John Mount – This book is one of a kind. It moves fluidly between the various stages of the data science process from surface considerations of working with customers to the deep details of various machine learning algorithms. There is quite a bit of original R code that you can use in real projects. Most impressive is the statistical sensibility of the authors who want you to make correct inferences from your data and machine learning models as well as effectively communicate your findings to the people paying the bills.

A First Course in Statistical Programming with R by W. John Braun and Duncan Murdoch – A deceptively thin book that provides a sharp introduction to R and moves quickly through debugging, computational linear algebra, numerical optimization and linear programming.

An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani – This book is the companion to the master text for Machine / Statistical Learning, The Elements of Statistical Learning, and contains plenty of R code. The authors have generously posted pdf versions of both of these books online.

An R Companion to Applied Regression by John Fox and Sanford Weisberg – I have been a fan since the first edition, which is possibly the best introduction to regression analysis with R ever.

Applied Meta-Analysis with R by Ding-Geng Chen and Karl E. Peace – Provides a solid introduction to basic meta-analysis that should be very helpful to people who work in the field and want to move to R.

Bayesian Computation with R by Jim Albert – A concise, undergraduate level introduction to Bayesian Statistics.

Bayesian Essentials with R by Jean-Michel Marin and Christian P. Robert – This is a solid introduction to Bayesian Statistics with lots of useful code.

Data Analysis and Graphics Using R: An Example-Based Approach by John Maindonald and John Braun – A comprehensive introduction to statistical analysis that is most suitable for self-learning. It is also a very handsome book. If you are a book person, this is the one to own.

Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill – A superb book on statistical modeling that is both practical and rigorous, with a modern perspective that should appeal to Bayesians and non-Bayesians alike.

Data Manipulation with R by Phil Spector – A concise introduction to data munging using base R capabilities. This is another book to keep with you while programming.

Doing Bayesian Data Analysis: A Tutorial with R and BUGS by John K. Kruschke – This eclectic and entertaining read is a way to learn both R and Bayesian Analysis simultaneously. It provides lots of R code to build on.

Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models by Julian J. Faraway – Building on the author's text on linear models, this book covers a lot of ground and provides real insight.

Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos - Written to teach time series forecasting to a business audience, this free, online text is a beautiful example of both the open source ethos and of how R can help people with real business problems become productive with a very modest learning curve.

Introduction to Probability with R by Kenneth Baclawski – This is an eclectic little book. There is really not much R in it, but it is a modern introduction to probability theory including stochastic processes with enough R to help you teach yourself the math by experimenting. R is the really easy part of this book.

Introductory Statistics with R by Peter Dalgaard – A classic text with R code to get you doing real statistics very quickly and a great reference for both statistics and R that you will want to hang on to.

Introductory Time Series with R by Paul S.P. Cowpertwait and Andrew C. Metcalfe - Could be the best introduction to time series analysis ever.

Linear Models with R by Julian J. Faraway – A compact course on analyzing linear models using R. It contains several examples and enough R code to thoroughly analyze regression models.

Modern Applied Statistics with S by W.N. Venables and B.D. Ripley – Probably the best introduction to modern computational statistics out there. Even though it is S, most of the code will work in R.

R Cookbook by Paul Teetor – A solid introduction with recipes for carrying out data analyses and basic plots that you will want on your shelf.

R for Everyone: Advanced Analytics and Graphics by Jared P. Lander – An easy read with relevant machine learning examples that will get you started with R.

R for SAS and SPSS Users by Robert A. Muenchen – If you are still using SAS or SPSS you need this book. The author speaks your language, understands where you are coming from and will help you learn some R.

R Graphics Cookbook by Winston Chang – an indispensable reference for R visualizations of all kinds. Read a more complete review here.

R in Action by Robert I Kabacoff - A gentle introduction to R with elegant plots that are model visualizations. You can read more about it here.

Regression Modeling Strategies by Frank E. Harrell, Jr. – An incredible amount of wisdom for how to do statistics backed up with mostly straightforward R code.

R Programming for Bioinformatics by Robert Gentleman – Not only for Bioinformatics. This book provides insight into the structure of the R language for intermediate and advanced programmers.

Software for Data Analysis: Programming with R by John M. Chambers – A text for advanced programmers discussing philosophy and good practices and providing deep insight into R.

Statistical Analysis of Network Data with R by Eric D. Kolaczyk and Gábor Csárdi – This is an indispensable resource for analyzing network data. In addition to a thorough explanation of the igraph package, it works through exponential random graph models and other advanced topics.

Statistical Computing in C++ and R by Randall L. Eubank and Ana Kupresanin – A very approachable introduction to both R and C++ for anyone who wants to understand these languages from the perspective of numerical analysis and the nuts and bolts of linear algebra.

Statistics and Data Analysis for Financial Engineering by David Ruppert and David S. Matteson – If you are interested in financial modeling this book could be your ticket to learning R and the R packages that support time series and financial engineering.

Time Series Analysis with Applications in R by Jonathan D. Cryer and Kung-Sik Chan – A solid undergraduate level introduction to R with step-by-step R code. Very suitable for self study.

XML and Web Technologies for Data Sciences with R by Deborah Nolan and Duncan Temple Lang – Everything a data scientist would ever really want to know about XML documents, JSON and other web technologies and how you can work with them using R.

by Joseph Rickert

I had barely begun reading *Statistics Done Wrong: the Woefully Complete Guide* by Alex Reinhart (No Starch Press 2015) when I started to wonder about the origin of the aphorism "Don't shoot the messenger." It occurred to me that this might be a reference to a primitive emotion that wells up unbidden when you hear bad news in such a way that you know things are not going to get better any time soon.

It was on page 4 that I read: "Even properly done statistics can't be trusted." Ouch! Now, to be fair, the point the author is trying to make here is that it is often not possible, based solely on the evidence contained in a scientific paper, to determine if an author sifted through his data until he turned up something interesting. But, coming as it does after mentioning J.P.A. Ioannidis' conclusion that most published research findings are probably false, that the average scores of medical school faculty on tests of basic statistical knowledge don’t get much better than 75%, and that both pharmaceutical companies and the scientific journals themselves bias research by failing to publish studies with negative results, Reinhart’s sentence really stings. Moreover, Reinhart is so zealous in his efforts to expose the numerous ways a practicing scientist can go wrong in attempting to "employ statistics" that it is reasonable (despite the optimism he expresses in the final chapters) for a reader in the book’s target demographic of practicing scientists with little formal training in statistics to conclude that the subject is just insanely difficult.

Is the practice of statistics just too difficult? Before permitting myself a brief comment on this I’ll start with an easier and more immediate question: Is this book worth reading? To this question, the answer is an unqualified yes.

Anyone starting out on a journey would like to know ahead of time where the road is dangerous, where the hard climbs are, and most of all: where be the dragons? *Statistics Done Wrong* is as good a map to the traps lurking in statistical analysis adventures as you are ever likely to find. In less than 150 pages it covers the pitfalls of p-values, the perils of being underpowered, the disappointments of false discoveries, the follies of mistaking correlation for causation, the evils of torturing data and the need for exploratory analysis to avoid Simpson’s paradox.

About three quarters of the way into the book (Chapter 8), Reinhart moves beyond basic hypothesis testing to consider some of the problems associated with fitting linear models. There follows a succinct but lucid presentation of some essential topics including overfitting, unnecessary dichotomization, variable selection via stepwise regression, the subtle ways in which one can be led into mistaking correlation for causation, the need for clarity in dealing with missing data and the difficulties of recognizing and accounting for bias.

That is a lot of ground to cover, but Reinhart manages it with some style and with an eye for relevant contemporary issues. For example, in his discussion on statistical significance Reinhart says:

And because any medication or intervention usually has some real effect, you can always get a statistically significant result by collecting so much data that you detect extremely tiny but relatively unimportant differences (p9).

And then, he follows up with a very amusing quote from Bruce Thompson's 1992 paper that wryly explains that significance tests on large data sets are often little more than confirmations of the fact that a lot of data was collected. Here we have a “Big Data” problem, deftly dealt with in 1992, but in a journal that no data scientist is ever likely to have read.
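A small simulation makes the point concrete. This is my own sketch, not something from the book: the effect size of 0.01 standard deviations, the sample sizes and the helper `p_value_for()` are arbitrary choices made purely for illustration.

```r
set.seed(42)  # fixed seed so the simulation is repeatable

# Two groups whose true means differ by a trivial 0.01 standard deviations.
tiny_effect <- 0.01

# Simulate n observations per group and return the two-sample t-test p-value.
p_value_for <- function(n) {
  control   <- rnorm(n, mean = 0)
  treatment <- rnorm(n, mean = tiny_effect)
  t.test(treatment, control)$p.value
}

p_small <- p_value_for(100)   # 100 observations per group
p_large <- p_value_for(1e6)   # one million observations per group

cat("p-value, n = 100 per group:", format.pval(p_small), "\n")
cat("p-value, n = 1e6 per group:", format.pval(p_large), "\n")
```

With a million observations per group the standard error of the difference shrinks to roughly 0.0014, so even this practically meaningless effect sits about seven standard errors from zero and the test will almost certainly call it significant.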

The bibliography contained in the notes to each chapter of *Statistics Done Wrong* is a major strength of the book. Nearly every transgression recorded and every lamentable tale of the sorry state of statistical practice is backed up with a reference to the literature. This impressive exercise in scholarly research adds some weight and depth to the book’s contents and increases its usefulness as a guide.

Also, to my surprise and great delight, Reinhart manages a short discussion that elucidates the differences between R. A. Fisher’s conception of p-values and the treatment given by Neyman and Pearson in their formal theory of hypothesis testing. The confounding of these two very different approaches in what Gigerenzer et al. call the “Null Ritual” is perhaps the root cause of most of the misuse and abuse of significance testing in the scientific literature. However, you can examine dozens of the most popular textbooks on elementary statistics and find no mention of it.

In the closing chapters of *Statistics Done Wrong*, Reinhart effects a change of tone and discusses some of the structural difficulties with the practice of statistics in the medical and health sciences that have contributed to the present pandemic of the publication of false, misleading or just plain useless results. Topics include the lack of incentives for researchers to publish inconclusive and negative results, the reluctance of many researchers to share data and the willingness of some to attempt to game the system by deliberately publishing “doctored” results. Reinhart handles these topics nicely and uses them to motivate contemporary work on reproducible research and the need to cultivate a culture of reproducible and open research. Reinhart ends the book with recommendations for the new researcher that allow him to finish on a surprisingly upbeat note. The bearer of bad news concludes by offering hope.

I highly recommend *Statistics Done Wrong* to be read as the author intended: as supplementary material. In the preface, Reinhart writes:

But this is not a textbook, so I will not teach you how to use these techniques in any technical detail. I only hope to make you aware of the most common problems so you are able to pick the statistical technique best suited to your question.

*Statistics Done Wrong* is the kind of study guide that I think could benefit almost anyone slogging through a statistical analysis for the first time. It seems to me that the author achieves his stated goal with admirable economy and just a few shortcomings. The book, which entirely avoids the use of mathematical symbolism, would have benefited from precise definitions of the key concepts presented (p-values, confidence intervals etc.) and from a little R code to back these definitions. These are, however, relatively minor failings.

Now, back to the big question: is the practice of statistics just too difficult? Yes, I think that the catalogue of errors and numerous opportunities for going wrong documented by Reinhart indicates that the practice of statistics is more difficult than it needs to be. My take on why this is so is expressed (perhaps inadvertently) by Reinhart in the statement of his goal for the book quoted above. As long as statistics is conceived and taught as the process of selecting the right technique to answer isolated questions, rather than as an integrated system for thinking with data, we are all going to have a difficult time of it.

by Joseph Rickert

There have been well over a hundred books on R published within the last ten years. Most of these texts, with titles like “Introductory Statistics with R” or “Time Series with R”, offer the reader a way to jump right in and perform some concrete statistical analysis using R’s myriad built-in functions and extensive visualization features. And, while it is true that some R books appear to be little more than a rehash of basic documentation, there are nevertheless scores of carefully written texts from experts that not only illuminate some area of statistics but also demonstrate some good R programming as well. In no small way, I believe these works have contributed to R’s popularity and growth by providing quality application level documentation.

Comparatively few books, however, are focused on teaching R programming itself. So it was a pleasant surprise when a copy of Garrett Grolemund’s “Hands-On Programming with R: Write Your Own Functions and Simulations” (O’Reilly 2015) came my way. This is a superb book: well conceived, unusual in the choice of material and sufficiently streamlined (185 pages not including the appendices) to make it a non-stop beginning-to-end read.

At the very beginning Garrett says:

I want to help you become a data scientist, as well as a computer scientist, so this book will focus on programming skills that are most related to data science.

These skills have to do with solving what Garrett refers to as the "logistical problems" of data science. In the context of the R language, they include acquiring data, manipulating R objects, constructing custom functions, negotiating the R environment and above all, writing vectorized code.

Given the ambitious agenda, "Hands-On Programming with R" starts surprisingly slowly with arithmetic, assignment, useful R functions and basic housekeeping chores: getting help and looking for packages. Then, still slowly and deliberately the text discusses R objects, atomic vectors, data types and data structures. 48 pages in and Garrett is still lingering on attributes. But this discussion is more sophisticated than most authors attempt. The presentation of type, attributes and class, in particular the insight that the concept of class follows directly from attributes, is meant to cultivate a programmer's mindset.

Around page 65, when Garrett gets excited about subsetting, the pace really picks up. If you are hooked and still reading, like I was, by page 112 you will have acquired a working knowledge of scoping rules and environments and be ready for the beguilingly lucid discussion of the S3 class system that begins on page 139. Even if you are an experienced R programmer you may want to borrow a copy of the book and read this. If you really know your stuff, you may not learn anything new, but I bet you will be hard pressed to do a better job of explaining S3 classes to someone else.

After S3 the text moves to considering loops as a prelude to its presentation of vectorized code. This section, which is really the final destination of the book, is exceptionally well done. First, vectorized code is characterized as code that takes advantage of three great features of the R language: fast logical tests, powerful subsetting operations and a multitude of built-in functions that permit element-wise execution. Then the text demonstrates how to put these ideas into practice.

As you can gather, I was impressed by the conceptual formulation of the material. However, the real strength of the text is its sharp presentation of essential elements of the R language through a well-crafted, extended example that forms the spine of the book. "Hands-On Programming with R" is indeed a “hands-on” text that guides and challenges the reader to write good R code. A reader / coder who makes it to the end will have worked through several refinements of a small collection of functions that implement a fairly complex slot machine simulation. This example significantly raises the bar for selecting code examples in any R book. The simulation is rich enough to illustrate all of the R features presented in the text while allowing for refinement and polishing as the final form of the slot machine takes shape. The whole presentation is very tight. Garrett tells a pretty good story. During the final vectorized-code chapter I found myself reading with the delight of anticipation: “Just how is he going to make this code better?”.

I should also mention that the book is notable for what it does not include. This might be the first R book I have encountered that doesn’t develop any statistical models. Not a single regression is fit and there are no plots to speak of (3 histograms and a scatter plot). Certainly, this is the only R book I have come across that mentions data science in the preface that is not replete with Random Forest models and the like. Presumably, all of this will show up in the follow-up book that Garrett promises in the preface.

"Hands-On Programming with R" presents but one carefully thought-through trajectory of many possible R language excursions. It is not to be compared with Hadley Wickham’s encyclopedic "Advanced R" and it contains only a fraction of the material you can find in Norm Matloff’s "The Art of R Programming". But, having worked through "Hands-On Programming with R" both of these texts should be accessible.

Garrett's book is a good read: a technical story with a plot and a few surprises that could help anyone starting out with the R language learn to write some pretty slick code.

If you've lived in or simply love London, a wonderful new book for your coffee-table is *London: The Information Capital*. In 100 beautifully-rendered charts, the book explores the data that underlies the city and its residents. To create most of these charts, geographer James Cheshire and designer Oliver Uberti relied on programs written in R. Using the R programming language not only created beautiful results, it saved time: "a couple of lines of code in R saved a day of manually drawing lines".

Take for example *From Home To Work*, the graphic illustrating the typical London-area commute. R's ggplot2 package was used to draw the individual segments as transparent lines, which when overlaid build up the overall picture of commuter flows around cities and towns. The R graphic was then imported into Adobe Illustrator to set the color palette and add annotations. (FlowingData's Nathan Yau uses a similar process.)

Another example is the chart below of cycle routes in London. (We reported on an earlier version of this chart back in 2012.) As the authors note, "hundreds of thousands of line segments are plotted here, making the graphic an excellent illustration of R’s power to plot large volumes of data."

You can learn more from the authors about how R was used to create the graphics in *London: The Information Capital* and see several more examples at the link below. And if you'd like a copy, you can buy the book here.

London: The Information Capital / Our Process: The Coder and Designer

by Joseph Rickert

Predictive Modeling or “Predictive Analytics”, the term that appears to be gaining traction in the business world, is driving the new “Big Data” information economy. Predictably, there is no shortage of material to be found on this subject. Some discussion of predictive modeling is sure to be found in any reasonably technical presentation of business decision making, forecasting, data mining, machine learning, data science, statistical inference or just plain science. There are hundreds of books that have something worthwhile to say about predictive modeling. However, in my judgment, *Applied Predictive Modeling* by Max Kuhn and Kjell Johnson (Springer 2013) ought to be at the very top of the reading list of anyone who has some background in statistics, who is serious about building predictive models, and who appreciates rigorous analysis, careful thinking and good prose.

The authors begin their book by stating that “the practice of predictive modeling defines the process of developing a model in a way that we can understand and quantify the model’s prediction accuracy on future, yet-to-be-seen data”. They emphasize that predictive modeling is primarily concerned with making accurate predictions and not necessarily building models that are easily interpreted. Nevertheless, they are careful to point out that “the foundation of an effective predictive model is laid with intuition and deep knowledge of the problem context”. The book is a masterful exposition of the modeling process delivered at a high level of play, with the authors gently pushing the reader to understand the data, to carefully select models, to question and evaluate results, to quantify the accuracy of predictions and to characterize their limitations.

Kuhn and Johnson are intense but not oppressive. They come across like coaches who really, really want you to be able to do this stuff. They write simply and with great clarity. However, the material is not easy. I frequently found myself rereading a passage and almost always found it to be worth the effort. This mostly happened when reading a careful discussion of a familiar topic (i.e. something I thought I understood). For example, Chapter 14 on Classification Trees and Rule-Based models contains what I thought to be an illuminating discussion on the difference between building trees with grouped categories and taking the trouble to decompose a categorical predictor into binary dummy variables, in effect forcing binary splits for the categories.
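To make the dummy-variable idea concrete, here is a minimal sketch of decomposing a categorical predictor into binary indicator columns. It is written in Python rather than R and is not taken from the book; the `to_dummies` helper and the `housing` values are hypothetical, chosen only for illustration.

```python
def to_dummies(values):
    """Expand one categorical column into binary (0/1) indicator columns."""
    levels = sorted(set(values))
    # One indicator column per level. A tree built on these columns can only
    # ask "is this level or not", which forces a binary split per category.
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical categorical predictor with three levels.
housing = ["own", "rent", "own", "free", "rent"]
dummies = to_dummies(housing)
for level, column in dummies.items():
    print(level, column)
```

With grouped categories, by contrast, the tree would see the single `housing` column and could split its levels into any two groups at once.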

*Applied Predictive Modeling* begins with a chapter that introduces the case studies that are referenced throughout the book. Thereafter, chapters are organized into four parts: General Strategies, Regression Models, Classification Models and Other Considerations, followed by three appendices, including a brief introduction to R (too brief to teach someone R, but adequate to give a programmer new to R enough of an orientation to make sense of the R scripts included in the book). This organization has the virtue of allowing the authors to focus on the specifics of the various models while providing a natural way to repeat and reinforce fundamental principles. For example, Regression Trees and Classification Trees share a great deal in common and many authors treat them together. However, by splitting them into separate sections Kuhn and Johnson can focus on the performance measures that are peculiar to each kind of model while getting a second chance to explain fundamental principles and techniques such as bagging and boosting that are applicable to both kinds of models.

There are many ways to go about reading *Applied Predictive Modeling*. I can easily envision someone committed to mastering the material reading the text from cover to cover. However, the chapters are pretty much self-contained, and the authors are very diligent about providing back references to topics they have covered previously. You can pretty much jump in anywhere and find your way around. Additionally, the authors take the trouble to include quite a bit of “forward referencing” which I found to be very helpful. As an example, in section 3.6, where the authors mention credit scoring with respect to a discussion on adding predictors to a model, they point ahead to section 4.5, which is a short discussion of the credit scoring case study. This section, in turn, points ahead to section 11.2 and a discussion of evaluating predicted classes. These forward references encourage and facilitate latching on to a topic and then threading through the book to track it down.

Three major strengths of the book are its fundamental grounding in the principles of statistical inference, the thoroughness with which the case studies are presented, and its use of the R language. The statistical viewpoint is apparent both from the choice of topics presented and the authors’ overall approach to predictive modeling. Topics that are peculiar to a statistical approach include the presentation of stratified sampling and other sampling techniques in the discussion of data splitting, and the sections on partial least squares and linear discriminant analysis. The real statistical value of the text, however, is embedded in Kuhn and Johnson’s methodology. They take great care to examine the consequences of modeling decisions and continually encourage the reader to challenge the results of particular models. The chapters on data preparation and model evaluation do an excellent job of informally presenting a formal methodology for making inferences. *Applied Predictive Modeling* contains very few equations and very little statistical jargon but it is infused with statistical thinking. (A side effect of the text is to teach statistics without being too obvious about it. You will know you are catching on if you think the xkcd cartoon in chapter 19 is really funny.)
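For readers who have not met stratified sampling in the context of data splitting, the idea can be sketched in a few lines. This is an illustrative Python sketch under my own assumptions, not the authors' R code: the `stratified_split` function, its parameters and the 80/20 labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced outcome: 80 "good" and 20 "bad" cases.
labels = ["good"] * 80 + ["bad"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] == "bad" for i in test_idx))
```

Because the split is made class by class, the rare class keeps the same share in the training and test sets, which a purely random split does not guarantee.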

A nice feature of the case studies is that they are rich enough to illustrate several aspects of the model building process and are used effectively throughout the text. The discussion in Chapter 12 on preparing the University of Melbourne grant funding data set from the Kaggle contest is particularly thorough. This kind of “blow by blow” discussion of why the authors make certain modeling decisions is invaluable.

The R language comes into play in several ways in the text. The most obvious is the section on computing that closes most chapters. These sections contain R code that illustrates the major themes presented in the chapter. To some extent, these brief R statements substitute for the equations that are missing from the text. They provide concrete visual representations of the key ideas accessible to anyone who makes the effort to learn a little R syntax. The chapter-ending code is itself backed up with an R package available on CRAN, AppliedPredictiveModeling, that contains scripts to reproduce all of the analyses and plots in the text. (This feature makes the text especially well-suited for self study.)

*Applied Predictive Modeling* is resplendent with R graphs and plots, many of them in color, that are integral to the presentation of ideas but which also serve to illustrate how easily presentation-level graphs can be created in R. Form definitely follows function here, and it makes for a rather pretty book. One of my favorite plots is the first part of Figure 11.3, reproduced below, which shows the test set probabilities for a logistic regression model of the German Credit data set.

The authors point out that the estimates of bad credit in the right panel are skewed showing that most estimates predict very low probabilities for bad credit when the credit is, in fact, good - just what you want to happen. In contrast, the estimates of bad credit are flat in the left panel, “reflecting the model’s inability to distinguish bad credit cases”.

Finally, *Applied Predictive Modeling* can be viewed as an introduction to the caret package. There is great depth here. This is not a book that comes with a little bit of illustrative code, icing on a cake so to speak; rather, the included code is just the tip of the iceberg. It provides a gateway to the caret package and the full functionality of R’s machine learning capabilities.

*Applied Predictive Modeling* is a remarkable text. At 600 pages, it is the succinct distillation of years of experience of two expert modelers working in the pharmaceutical industry. I expect that beginners and experienced model builders alike will find something of value here. On my shelf, it sits up there right next to Hastie, Tibshirani and Friedman’s *The Elements of Statistical Learning*.

Most people know R as a statistics/analytics language for analysis of quantitative data, and don't think of it as a tool for processing raw text. But R actually has some quite powerful facilities for processing character data. And as Gaston Sanchez learned, text manipulation is an important part of a modern data scientist's repertoire:

Many years ago I decided to apply for a job in a company that developed data mining applications for big retailers. I was invited for an on-site visit and I went through the typical series of interviews with the members of the analytics team. Everything was going smoothly and I was enjoying all the conversations. Then it came turn to meet the computer scientist. After briefly describing his role in the team he started asking me a bunch of technical questions and tests. Although I was able to answer those questions related with statistics and multivariate analysis, I had a really hard time trying to answer a series of questions related with string manipulations.

I will remember my interview with that guy as one of the most embarrassing moments of my life. That day, the first thing I did when I went back home was to open my laptop, launch R, and start reproducing the tests I failed to solve. It didn't take me that much to get the right answers. Unfortunately, it was too late and the harm was already done. Needless to say I wasn't offered the job. That shocking experience showed me that I was not prepared for manipulating character strings. I felt so bad that I promised myself to learn the basics of strings manipulation and text processing. "Handling and Processing Strings in R" is one of the derived results of that old promise.

Gaston's Creative-Commons licensed 112-page e-book, Handling and Processing Strings in R, is an excellent and comprehensive review of R's string handling capabilities. It covers R's basic string-handling capabilities (reading, converting, manipulating and formatting), and also devotes a chapter to the higher-level functions of Hadley Wickham's stringr package. The two chapters on regular expressions are a must-read for anyone who hasn't yet come to grips with the power of regexes for handling string-based data. There are a few practical examples at the end of the e-book (frequency counting, word clouds) but the book sticks mainly with the fundamentals, and doesn't stray into semantic analysis. Highly recommended for anyone working with strings or character data in R.
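If you haven't seen what regular expressions buy you, here is a small taste. It is a sketch in Python rather than R and is not drawn from the e-book; the sample text and both patterns are made up for illustration, and the email pattern is deliberately simplistic.

```python
import re

# Hypothetical input string.
text = "Contact: alice@example.com, bob@example.org; phone 555-0100."

# Extract everything that looks like an email address (a deliberately simple pattern).
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)

# Replace every run of digits with a placeholder.
masked = re.sub(r"\d+", "#", text)

print(emails)
print(masked)
```

The same two operations, extracting matches and substituting them, are the kind of thing base R's `regmatches`/`gregexpr` and `gsub` are used for.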

Gaston Sanchez: Handling and Processing Strings in R (via Sharon Machlis)

by Joseph Rickert

Every once in a while a single book comes to crystallize a new discipline. If books still have this power in the era of electronic media, "Doing Data Science: Straight Talk from the Frontline" by Rachel Schutt and Cathy O’Neil (O'Reilly, 2013) might just be the book that defines data science. "Doing Data Science", which is based on a course that Rachel taught at Columbia University and to which Cathy contributed, is ambitious and multidimensional. It presents data science in all of its messiness as an open-ended practice that is coalescing around an expanding class of problems; problems which are yielding to an interdisciplinary approach that includes ideas and techniques from statistics, computer science, machine learning, social science and other disciplines.

The book is neither a statistics nor a machine learning text, but there are plenty of examples of statistical models and machine learning algorithms. There is enough R code in the text to get a beginner started on real problems with tools that are immediately useful. There is Python code, a bash shell script, mention of JSON and a down-to-earth discussion of Hadoop and MapReduce that many should find valuable. My favorite code example is the bash script (p 105) that fetches an Enron spam file and performs some basic word count calculations. Its almost casual insertion into the text, without fanfare and with little explanation, provides a low-key example of the kinds of baseline IT/programmer skills that a newly minted statistician must acquire in order to work effectively as a data scientist.
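To give a sense of how little code a basic word count takes, here is a minimal sketch. It is written in Python, not bash, it does not reproduce the book's script, and it skips the download step entirely; the `message` string is a hypothetical stand-in for the contents of a fetched file.

```python
import re
from collections import Counter

# Hypothetical stand-in for the text of a downloaded email.
message = "Free money now! Claim your free prize now, now, now."

# Lowercase the text, pull out runs of letters, and tally them.
words = re.findall(r"[a-z']+", message.lower())
counts = Counter(words)

print(counts.most_common(3))
```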

"Doing Data Science" is fairly well balanced in its fusion of the statistics and machine learning world views, but Rachel’s underlying bias as a PhD statistician comes through when it counts. The grounding in linear models and the inclusion of time series models establish the required inferential skills. The discussion of causality shows how statistical inference is essential to obtaining a deep understanding of how things really work, and the chapter on epidemiology provides a glimpse into just how deep and difficult are the problems that statisticians have been wrestling with for generations. (I found the inclusion of this chapter in a data science book to be a delightful surprise.)

It is not only the selection of material, however, that betrays the book's statistical bias. When the authors take on the big questions their language indicates a statistical mindset. For example, in the discussion following "In what sense does data science deserve the word “science” in its name?" (p114) the authors write: “Every design choice you make can be formulated as an hypothesis, against which you will use rigorous testing and experimentation to validate or refute”. This is the language of a Neyman/Pearson trained statistician trying to pin down the truth. It stands in stark contrast with the machine learning viewpoint espoused in a quote by Kaggle’s Jeremy Howard who, when asked “Can you see any downside to the data-driven, black-box approach that dominates on Kaggle?”, replies:

Some people take the view that you don’t end up with a richer understanding of the problem. But that’s just not true: The algorithms tell you what’s important and what’s not. You might ask why those things are important, but I think that’s less interesting. You end up with a predictive model that works. There is not too much to argue about there.

So, whether you are doing science or not might just be in your intentions and point of view. Schutt and O’Neil do a marvelous job of exploring the tension between the quest for understanding and the blunt success of just getting something that works.

An unusual aspect of the book is its attempt to understand data science as a cultural phenomenon and to place the technology in a historical and social context. Most textbooks in mathematics, statistics and science make no mention of how things came to be. Their authors are just under too much pressure to get on with presenting the material to stop and discuss “just what were those guys thinking?”. But Schutt and O’Neil take the time, and the book is richer for it. Mike Driscoll and Drew Conway, two practitioners who early on recognized that data science is something new, are quoted along with other contemporary data scientists who are shaping the discipline both through their work and how they talk about it.

A great strength of the book is its collection of real-world, big-league examples contributed by the guest lecturers to Rachel’s course. Doug Perlson of Real Direct, Jake Hofman of Microsoft Research, Brian Dalessandro and Claudia Perlich both of Media6Degrees, Kyle Teague of GetGlue, William Cukierski of Kaggle, David Huffaker of Google, Matt Gattis of Hutch.com, Mark Hansen of Columbia University, Ian Wong of Square, John Kelley of Morningside Analytics and David Madigan, Chair of Columbia’s Statistics Department, all bring thoughtful presentations of difficult problems with which they have struggled. The perspective and insight of these practicing data scientists and statisticians are invaluable. Claudia Perlich’s discussion of data leakage alone is probably worth the price of the book.

A minor fault of the book is the occasional lapse into the hip vulgar. Someone being “pissed off” and talking about a model “that would totally suck” are probably innocuous enough phrases, but describing a vector as “huge ass” doesn’t really contribute to clarity. In a book that stresses communication, language counts. Nevertheless, "Doing Data Science" is a really “good read”. The authors have done a remarkable job of integrating class notes, their respective blogs, and the presentations of the guest speakers into a single, engaging voice that mostly speaks clearly to the reader. I think this book will appeal to a wide audience. Beginners asking the question “How do I get into data science?” will find the book to be a guide that will take them a long way. Accomplished data scientists will find a perspective on their profession that they should appreciate as being both provocative and valuable. "Doing Data Science" argues eloquently for a technology that respects humanist ideals and ethical considerations. We should all be asking "What problems should I be working on?", "Am I doing science or not?", and "What are the social and ethical implications of my work?". Finally, technical managers charged with assembling a data science team, and other interested outsiders, should find the book helpful in getting beyond the hype and having a look at what it really takes to squeeze insight from data.