Today is Talk Like A Pirate Day, the perfect day to learn R, the programming language of pirates (arrr, matey!). If you have two-and-a-bit hours to spare, Nathaniel Phillips has created a video tutorial YaRrr! The Pirate's Guide to R which will take you through the basics: installation, basic R operations, and the matrix and data frame objects.

For a more in-depth study of R, there's also a 250-page e-book YaRrr! The Pirate's Guide to R, which covers the basics in greater depth and adds more advanced topics including data visualization, statistical analysis, and writing your own functions.

There's also an accompanying package to the video and book called (appropriately) yarrr that includes datasets from the course, as well as an interesting "Pirate Plot" data visualization that combines raw data, summary statistics, a "bean plot" distribution, and a confidence interval.

For more on The Pirate's Guide to R (and to tip him a beer), follow the link to Nathaniel's blog below.

Nathaniel Phillips: YaRrr! The Pirate’s Guide to R

Take a satellite image, and extract the pixels into a uniform 3-D color space. Then run a clustering algorithm on those pixels to extract a number of clusters. The centroids of those clusters then make a representative palette of the image. Here's the palette of Chicago:

The R package earthtones by Will Cornwell, Mitch Lyons, and Nick Murray — now available on CRAN — does all this for you. Pass the get_earthtones function a latitude and longitude, and it will grab the Google Earth tile at the requested zoom level (8 works well for cities) and generate a palette with the desired number of colors. This Shiny app by Homer Strong uses the earthtones package to make the process even easier: it grabs your current location for the first palette, or you can pass in an address and it geolocates it for another. That's what I used to create the image above. (Another Shiny app by Andrew Clark shows the size of the clusters as a bar chart, but I prefer the simple palettes.) There are a few more examples below, and you can see more in the earthtones vignette. If you find more interesting palettes, let us know where in the world you found them in the comments.
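The core of the clustering idea can be sketched in a few lines of base R. This is a simplified illustration, not the earthtones implementation (which converts the tile to a perceptually uniform color space first); the `img` matrix here is just random stand-in data:

```r
# Simplified sketch of the palette-extraction idea, in base R.
# `img` stands in for a satellite tile: one row per pixel, with
# red/green/blue values in [0, 1]. Random data for illustration.
set.seed(42)
img <- matrix(runif(300 * 3), ncol = 3,
              dimnames = list(NULL, c("r", "g", "b")))

# Cluster the pixels; each cluster centroid is one palette color
n_colors <- 4
km <- kmeans(img, centers = n_colors, nstart = 10)
palette <- rgb(km$centers[, "r"], km$centers[, "g"], km$centers[, "b"])
palette   # four hex color codes, one per cluster
```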

Will Cornwell (github): earthtones

For about three years now, telemetry has been gathered for professional basketball games in the US by SportVU for the NBA. Six cameras track the on-court position of the players and the ball, with a resolution of 25 samples per second.

Combine this movement data with NBA play-by-play data (players, plays, fouls, and points scored — data sadly no longer made available by the NBA), and you have a rich data set for analysis. Naturally, you can read these data files into R, and Rajiv Shah provides several R scripts to facilitate the process. These include functions to import the motion and play-by-play data files and merge them into a data frame, explore the movement data, extract and analyze player trajectories, and calculate metrics on player motion (such as speed and 'jerk').
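As an illustration of the kind of metric those scripts derive, here's a hedged sketch (not Rajiv's code) of computing speed, acceleration, and jerk from hypothetical position samples at the SportVU rate:

```r
# Hypothetical on-court coordinates (in feet) for one player,
# sampled at the SportVU rate of 25 frames per second
hz <- 25
x <- c(10.0, 10.2, 10.5, 10.9, 11.4)
y <- c(20.0, 20.1, 20.1, 20.2, 20.4)

# Distance between consecutive samples, times the sample rate,
# gives speed in feet per second; repeated differencing gives
# acceleration and jerk (the rate of change of acceleration)
speed <- sqrt(diff(x)^2 + diff(y)^2) * hz
accel <- diff(speed) * hz
jerk  <- diff(accel) * hz
```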

James Curley used the same data and extended those scripts to animate NBA plays, such as this basket scored during a December 2015 game between the San Antonio Spurs and the Minnesota Timberwolves. The orange polygon is a measure of player spacing on the court. (Pop a taut rubber band around the players and let go: that's a convex hull.) It would be interesting to extract the area of this convex hull over time as a series, and see if the value relates to scoring opportunities, but that's a task for another time.
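Extracting that area is straightforward in base R; here's a sketch using hypothetical coordinates for the five players in a single frame:

```r
# Hypothetical x/y court coordinates for five players in one frame
px <- c(10, 25, 30, 18, 12)
py <- c(40, 42, 25, 20, 30)

# chull() returns the indices of the convex hull vertices, in order
hull <- chull(px, py)
hx <- px[hull]; hy <- py[hull]

# Shoelace formula gives the area of the hull polygon
area <- abs(sum(hx * c(hy[-1], hy[1]) - c(hx[-1], hx[1]) * hy)) / 2
```

Applying this frame by frame would give the area-over-time series suggested above.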

James used simple ggplot2 functions to plot the positions of the players and the ball on top of a geom extension to draw the court. Each frame was generated from records in the SportVU data, and then assembled into an animated GIF using the gg_animate function. (Many thanks to James for providing the GIF itself.) You can see further details, including the complete R code, at the blog post linked below.

Curley Lab: Animating NBA Play by Play using R

If you have dense data on a continuous scale, an effective way of representing the data visually is to use a **heatmap**, where the values are represented by a color on a continuous scale. For example, this chart from a Wall Street Journal interactive feature (and mentioned in Tal Galili's useR!2016 talk) represents the number of measles cases in each US state and year by a colored square:

(Here's how to create that chart in R.) But, note that scale at the bottom of the chart, mapping measles cases to a color on the rainbow. Here, we'll zoom in on it:

The scale you choose for a heat map is very important, and has a major impact on how the viewer will interpret the data presented. This scale has been chosen with care: while most of the *scale* is red, very few of the *data cells* are red (because the distribution of measles cases is heavily skewed, thanks in particular to the introduction of a vaccine in 1963). A naively chosen scale would wash out the data.
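One way to respect a skewed distribution is to spend most of the color range where most of the data lives, for example by binning with uneven breaks. Here's a minimal base-R sketch (with made-up counts and breaks, not the WSJ scale):

```r
# Hypothetical per-cell case counts with a long right tail
cases <- c(0, 2, 5, 10, 50, 200, 800)

# Uneven breaks: narrow bins at the low end, wide bins at the high end,
# so small differences among typical values remain visible
breaks <- c(-Inf, 1, 5, 25, 100, 500, Inf)
pal <- heat.colors(6)                 # one color per bin
cols <- pal[cut(cases, breaks, labels = FALSE)]
cols  # each count mapped to a bin color
```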

The actual colors you choose are important too. The physics, technology, and neuroscience behind the interpretation of colors is surprisingly complex, but this talk on the default color schemes used in Python's matplotlib does a great job of explaining:

You can easily use the viridis color scales in R as well, thanks to the viridis package by Simon Garnier, which is available on CRAN. The package provides color scales well suited to heatmaps, all carefully chosen for optimized perception and for usefulness to color-impaired viewers.

You can find several examples of using the viridis color palettes in the package vignette, both for base R graphics (including raster) and ggplot2. To get started, just run install.packages("viridis") to install the package from CRAN.
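A minimal example, assuming the ggplot2 and viridis packages are installed; the scale_fill_viridis() call is all it takes to switch a ggplot2 heatmap to the viridis palette:

```r
library(ggplot2)
library(viridis)

# A toy heatmap: tile fill mapped to the viridis scale
df <- expand.grid(x = 1:10, y = 1:10)
df$z <- df$x * df$y

ggplot(df, aes(x, y, fill = z)) +
  geom_tile() +
  scale_fill_viridis()   # perceptually uniform, colorblind-friendly
```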

Github (Simon Garnier): viridis

Power BI, Microsoft's data visualization and reporting platform, has made great strides in the past year integrating the R language. This Computerworld article describes the recent advances with Power BI and R. In short, you can:

- import data into Power BI by using an R script
- cleanse and transform other data sources coming into Power BI using R functions
- create custom charts in a Power BI dashboard using the R language, like these maps
- share R scripts with others for use with Power BI in the R Scripts Showcase
- create dashboards with Power BI desktop and R on your local machine, and share them with others using Power BI Online

Power BI desktop is completely free to download and use, and includes all the features you need to create visualizations, reports and dashboards. (Publishing to Power BI online requires a subscription, though.) Power BI desktop and R are both included in the Data Science Virtual Machine, so that's another easy way to get started.

Sharon Laivand from the Power BI engineering team recently gave a webinar showcasing the capabilities of Power BI and R. Fast-forward to the 29-minute mark to see how to create a report incorporating R-based calculations and graphics, and then share it with others (even people who don't have R installed!) using Power BI online.

For more information about Power BI, visit powerbi.microsoft.com.

I'm excited to share that one of my data science heroes will be a presenter at the Microsoft Data Science Summit in Atlanta, September 26-27. Edward Tufte, the data visualization pioneer, will deliver a keynote address on the future of data analysis and how to draw more credible conclusions from data.

If you're not familiar with Tufte, a great place to start is to read his seminal book The Visual Display of Quantitative Information. First published in 1983 — well before the advent of mainstream data visualization software — this is the book that introduced and/or popularized many concepts familiar in data visualization today, such as small multiples, sparklines, and the data-ink ratio. Check out this 2011 Washington Monthly profile for more background on Tufte's career and influence. Tufte's work also influenced R: you can easily recreate many of Tufte's graphics in the R graphics system, including this famous weather chart.
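For instance, the ggthemes package provides a theme and geoms inspired by Tufte's style; a minimal sketch (assuming ggplot2 and ggthemes are installed):

```r
library(ggplot2)
library(ggthemes)

# A Tufte-style scatterplot: maximal data-ink ratio, with
# "range frame" axes that span only the range of the data
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  geom_rangeframe() +
  theme_tufte()
p
```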

The program for the Data Science Summit looks fantastic, and will also include keynote presentations from Microsoft CEO Satya Nadella and Data Group CVP Joseph Sirosh. Also there's a fantastic crop of Microsoft data scientists (plus yours truly) giving a wealth of practical presentations on how to use Microsoft tools and open-source software for data science. Here's just a sample:

- Jennifer Marsman will speak about building intelligent applications with the Cognitive Services APIs
- Danielle Dean will describe deploying real-world predictive maintenance solutions based on sensor data
- Brandon Rohrer will give a live presentation of his Data Science for Absolutely Everybody series
- Frank Seide will introduce CNTK, Microsoft's open source deep learning toolkit
- Maxim Likuyanov will share some best practices for interactive data analysis and scalable machine learning with Apache Spark
- Rafal Lukawiecki will explain how to apply data science in a business context
- Debraj GuhaThakurta and Max Kaznady will demonstrate statistical modeling on huge data sets with Microsoft R Server and Spark
- David Smith (that's me!) will give some examples of how data science at Microsoft (and R!) is being used to improve the lives of disabled people
- ... and many many more!

Check out the agenda for the breakout sessions on the Data Science Summit page for more. I hope to see you there: it will be a great opportunity to meet with Microsoft's data science team and see some great talks as well. To register, follow the link below.

Microsoft Data Science Summit, September 26-27, Atlanta GA: Register Now

If you need to present two time series spanning the same period, but in wildly different scales, it's tempting to use a time series chart with two separate vertical axes, one for each series, like this one from the Reserve Bank of New Zealand:

Charts like this typically have one or more crossover points, and that crossing suggests to the viewer that one series is now "ahead" of the other. One problem is that **crossover points in dual-axis time series charts are entirely arbitrary**. Changing either the left-hand or right-hand scale (and replotting the data accordingly) will change where the crossover points appear. And since, as is often the case, the scales are automatically chosen to allow each series to use the full vertical space available, just changing the time range of the data plotted will also change the location of the crossover points.

In an excellent blog post, statistician Peter Ellis points out five problems with dual-axis time series charts:

- The designer has to make choices about scales and this can have a big impact on the viewer
- In particular, "cross-over points" where one series crosses another are artifacts of the design choices, not intrinsic to the data; viewers (particularly unsophisticated viewers) will not appreciate this, and will read more significance into a cross-over than is actually there
- They make it easier to lazily associate correlation with causation, not taking into account autocorrelation and other time-series issues
- Because of the issues above, in malicious hands they make it possible to deliberately mislead
- They often look cluttered and aesthetically unpleasing

A simple alternative is to rescale both time series, for example by defining both to have the same nominal value at a specific time: say, both start at 100 on January 1, 2016. This is a useful way to compare the growth in two series since the beginning of the year, and means that both can be represented using a single scale. (If you're using the ggplot2 package in R to plot time series, you can use the stat_index function from Peter's ggseas package to scale time series in this way.) The problem, though, is that you lose some of the interpretability of the chart, having now lost the true scales for *both* time series.
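The rescaling itself is a one-liner; here's a sketch with made-up numbers (stat_index in ggseas does the equivalent within a ggplot2 pipeline):

```r
# Index two series so both start at a nominal value of 100
index100 <- function(x) 100 * x / x[1]

series_a <- c(50, 55, 60, 58)          # hypothetical values
series_b <- c(2000, 2100, 1900, 2200)

index100(series_a)  # 100 110 120 116
index100(series_b)  # 100 105  95 110
```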

All that being said, Peter suggests that there are times when a dual-axis chart can be appropriate, for example when the two axes are conceptually similar (as above, when both are linear monetary scales), and you use a consistent process to set the scales of the vertical axes. Other considerations include color-coding the axes for interpretability, and choosing colors that don't favor one series over the other. Implementing these best practices, Peter has created the dualplot() function for R, which chooses the axes according to a cross-over point you specify. This is equivalent to rescaling the series to have the same value at that specified point, but keeps the real-value axes for interpretability. Here's the above chart, rendered with dualplot() with a crossover point at January 2014:

For more great discussion of the pros and cons of dual-axis time series charts, and the R code for the dualplot() function, follow the link to Peter's blog post below.

Peter's stats stuff: Dual axes time series plots may be ok sometimes after all (via Harlan Harris)

Len Kiefer, Deputy Chief Economist at Freddie Mac, recently published the following chart to his personal blog showing household debt in the United States (excluding mortgage debt). As you can see, student loan debt has steadily increased over the last 13 years and has now eclipsed all other forms of non-mortgage debt:

He also created this animated chart showing the growth (and occasional decline) of all forms of debt (including mortgages). All categories are scaled to the same nominal value in March 2003, and since that time student debt in the US has more than quintupled.

Both charts were created using the R language (the code is included in the blog post linked below). The data come from the New York Federal Reserve, which Len reads into R using fread from the data.table package. The line charts were created using the ggplot2 package, with the ggrepel extension to keep the labels from overlapping. The animated version was created using the saveGIF function of the animation package.
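The overall shape of the static chart can be sketched with a synthetic data frame standing in for the NY Fed data (the real file names, columns, and code are in Len's post; assumes ggplot2 and ggrepel are installed):

```r
library(ggplot2)
library(ggrepel)

# Synthetic stand-in for the debt-by-category data
debt <- data.frame(
  quarter  = rep(1:4, times = 2),
  balance  = c(100, 120, 150, 200, 100, 105, 110, 108),
  category = rep(c("Student loans", "Auto loans"), each = 4)
)

# Line chart with non-overlapping labels at the final observation
ggplot(debt, aes(quarter, balance, color = category)) +
  geom_line() +
  geom_text_repel(data = subset(debt, quarter == 4),
                  aes(label = category))
```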

For more charts (including some interesting by-state charts) and the complete details on the implementation, follow the link to Len's blog below.

Len Kiefer: Consumer Credit Trends (via Sharon Machlis)