by Michael Helbraun

The software business includes travel, and that means hotels. The news that Marriott was acquiring Starwood was of particular interest to me – especially since more than 75% of my 95 nights so far this year on the road have been spent with one of those two companies.

While other folks can evaluate whether the deal makes sense financially, I was just curious how it might affect a business traveler. Looking at the news coverage, some commentators are optimistic and plenty are concerned. Granted, many of the details on how the loyalty programs will be combined won’t be known for some time, but what we do know is where each company maintains properties.

With 4200+ Marriott and 1700+ Starwood properties I was curious where there might be overlap, and how well the deal would help Marriott to grow in new markets. Luckily R can help in this regard.

The first thing to do is to put together a data set. It would have been nice if the companies had clean spreadsheets available publicly, but as is normally the case we end up spending a good portion of our time gathering and preparing data. In this case that meant scraping the property locations from the SPG and Marriott sites and formatting them into a single spreadsheet. While I won’t go into data cleaning here, for a one-time effort on just a few thousand rows of data this was pretty straightforward to do in Excel.

After I had locations for all the properties it was time to bring the data into R to start the analysis. First I was curious where each firm had the most properties, which is simple to do with a cross tab. NYC seems a logical top-5 entry, but Houston and Atlanta are interesting:
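Assuming the cleaned spreadsheet has been read into a data frame with `firm` and `city` columns (both column names are illustrative), a cross tab like this is one way to produce those counts:

```r
# Hypothetical data frame standing in for the scraped property list
hotels <- data.frame(
  firm = c("Marriott", "Marriott", "Starwood", "Marriott", "Starwood"),
  city = c("New York", "Houston", "New York", "Houston", "Atlanta")
)

# Cross tab of property counts: one row per city, one column per firm
counts <- table(hotels$city, hotels$firm)

# Top cities for one firm, sorted descending
head(sort(counts[, "Marriott"], decreasing = TRUE), 10)
```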

**Top 10 Marriott Locations**

So far so good, but to actually put these on a map it’s much easier if the data has latitude and longitude. The geocode() function in the ggmap package can look these up for each address:

library(ggmap)  # provides geocode()

# Geocode the vector of Marriott locations and cache the result to disk
marGeocoded <- cbind(locations, geocode(locations))
save(marGeocoded, file = "D:/Datasets/marGeocoded.RData")
load("D:/Datasets/marGeocoded.RData")

# Repeat for the Starwood locations
locations <- hotToGeo
hotGeocoded <- cbind(locations, geocode(locations))
save(hotGeocoded, file = "D:/Datasets/hotGeocoded.RData")
load("D:/Datasets/hotGeocoded.RData")

Once the lat/long coordinates are merged back into our data set there are a number of ways to plot the results. I’m a fan of the globe plots within Bryan Lewis’s excellent *rthreejs* package. This allows you to stretch a 2D image over a globe which you can then plot on top of and interact with. Here I’ve plotted all the Marriott properties in orange and the Starwood properties in yellow:
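The package is published on CRAN as threejs. A minimal sketch of that globe plot, assuming marGeocoded and hotGeocoded carry the lon/lat columns added by geocoding (globejs() draws a bar at each coordinate whose height is controlled by value):

```r
library(threejs)  # CRAN name of the rthreejs package

# Stack both firms' coordinates, one color per firm
pts <- rbind(
  data.frame(lat = marGeocoded$lat, long = marGeocoded$lon, col = "orange"),
  data.frame(lat = hotGeocoded$lat, long = hotGeocoded$lon, col = "yellow")
)

# Interactive globe with a short bar at every property
globejs(lat = pts$lat, long = pts$long, value = 1, color = pts$col)
```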

From the globe it seemed like the overlap was greatest in the US and Europe. To create a static plot, ggmap is very quick:

library(ggmap)

# Europe map with ggmap
eurPlot <- qmap(location = "Europe", zoom = 4, legend = "bottomright",
                maptype = "terrain", color = "bw", darken = 0.01)

# alpha is a fixed value, not a mapping, so it belongs outside aes()
eurPlot <- eurPlot + geom_point(data = combGeocoded,
                                aes(y = lat, x = lon, colour = firm, size = Counts),
                                alpha = 0.2)

(eurPlot <- eurPlot + scale_size_continuous(range = c(3, 10)))

If we want to create something with interactive pan and zoom, the *leaflet* package is another useful option. It leverages OpenStreetMap tiles and lets you pan and zoom in the browser:
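A minimal sketch, assuming the combined data frame combGeocoded from above with lon, lat, and firm columns:

```r
library(leaflet)

# One color per firm, matching the globe plot's palette
pal <- colorFactor(c("orange", "yellow"), domain = c("Marriott", "Starwood"))

# Interactive OpenStreetMap view with a circle per property
leaflet(combGeocoded) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~lon, lat = ~lat, color = ~pal(firm),
                   radius = 4, stroke = FALSE, fillOpacity = 0.6)
```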

Aggregating and deriving value from low-value info is a great use of R, and this sort of analysis is fun as it gives some additional perspective on a current event. If you would like to play around with this, a copy of the script (Merger analysis) and the relevant data files (HotGeocoded and MarGeocoded) are available for download – let us know what you find in the comments.

Betterment, the online automated investing service, uses R for modeling, analysis and reporting. In a recent blog post on the company website, data scientist Sam Swift suggests using R or Python as open data analysis platforms and goes on to reveal:

Here at Betterment, we use both. We use Python more for data pipeline processes and R more for modeling, analyses, and reporting. But this article is not about the relative merits of these popular modern solutions. It is about the merits of using one of them (or any of the smaller alternatives). To get the most out of a programmatic data analysis workflow, it should be truly end-to-end, or as close as you can get in your environment.

The full blog post is well worth a read, as it lists some excellent best practices for conducting modern data analysis: a system that is reproducible, versionable, scalable and open. Read the whole thing at the link below.

Betterment: Modern Data Analysis: Don’t Trust Your Spreadsheet

by Joseph Rickert

We all "know" that correlation does not imply causation, that unmeasured and unknown factors can confound a seemingly obvious inference. But, who has not been tempted by the seductive quality of strong correlations?

Fortunately, it is also well known that a well-done randomized experiment can account for the unknown confounders and permit valid causal inferences. But what can you do when it is impractical, impossible or unethical to conduct a randomized experiment? (For example, we wouldn't want to ask a randomly assigned cohort of people to go through life with less education to prove that education matters.) One way of coping with confounders when randomization is infeasible is to introduce what economists call instrumental variables. This is a devilishly clever and apparently fragile notion that takes some effort to wrap one's head around.

On Tuesday, October 20th, we at the Bay Area useR Group (BARUG) had the good fortune to have Hyunseung Kang describe the work that he and his colleagues at the Wharton School have been doing to extend the usefulness of instrumental variables. Hyunseung's talk started with elementary notions, like explaining the effectiveness of randomized experiments, described the essential idea of instrumental variables, and developed the background necessary for understanding the new results in this area. The slides from Hyunseung's talk are available for download in two parts from the BARUG website. As with most presentations, these slides are little more than the mute residue of the talk itself. Nevertheless, Hyunseung makes such imaginative use of animation and build slides that the deck is worth working through.

The following slide from Hyunseung's presentation captures the essence of the instrumental approach.

The general idea is that one or more variables, the instruments, are added to the model for the purpose of inducing randomness into the outcome. This has to be done in a way that conforms with the three assumptions mentioned in the figure. The first assumption, A1, is that the instrument variables are relevant to the process. The second assumption, A2, states that randomness is only induced into the exposure variables and not also into the outcome. The third assumption, A3, is a strong one: there are no unmeasured confounders. The claim is that if these three assumptions are met then causal effects can be estimated with coefficients for the exposure variables that are consistent and asymptotically unbiased.
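To make the mechanics concrete, here is a minimal two-stage least squares sketch in base R on simulated data, where a confounder u affects both exposure and outcome but the instrument z affects only the exposure (all variable names and coefficients are illustrative):

```r
set.seed(42)
n <- 5000
u <- rnorm(n)                  # unmeasured confounder
z <- rnorm(n)                  # instrument: affects the exposure only
d <- 0.8 * z + u + rnorm(n)    # exposure, driven by instrument and confounder
y <- 2 * d + 3 * u + rnorm(n)  # outcome; the true causal effect of d is 2

# Naive OLS is biased upward by the confounder
coef(lm(y ~ d))["d"]

# Two-stage least squares: regress exposure on instrument, then regress
# the outcome on the fitted (confounder-free) part of the exposure
d_hat <- fitted(lm(d ~ z))
coef(lm(y ~ d_hat))["d_hat"]   # close to the true effect of 2
```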

In the education example developed by Hyunseung, the instrumental variables are the subject's proximity to 2 year and 4 year colleges. Here is where the "rubber meets the road" so to speak. Assessing the relevancy of the instrumental variables and interpreting their effects are subject to the kinds of difficulties described by Andrew Gelman in his post of a few years back.

In the second part of his presentation Hyunseung presents new work: (1) two methods that provide robust confidence intervals when assumption A1 is violated, (2) a method for implementing a sensitivity analysis to assess the sensitivity of an instrumental variable model to violations of assumptions A2 and A3, and (3) the R package ivmodel that ties it all together.
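The ivmodel package is on CRAN. A sketch of how it might be called, assuming an outcome, exposure and instrument passed as the Y, D and Z arguments (the data here are simulated for illustration; check ?ivmodel for the full interface):

```r
library(ivmodel)

# Illustrative data: y is the outcome, d the exposure, z the instrument
set.seed(1)
z <- rnorm(200)
d <- z + rnorm(200)
y <- 2 * d + rnorm(200)

# Fit an instrumental-variables model; the summary reports estimates and
# confidence intervals, including ones robust to weak instruments
fit <- ivmodel(Y = y, D = d, Z = z)
summary(fit)
```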

To delve even deeper into this topic have a look at the paper: Instrumental Variables Estimation With Some Invalid Instruments and its Application to Mendelian Randomization.

I'm honoured to be giving the opening keynote at the Effective Applications of R Conference (EARL) Conference in Boston on November 2. My presentation will be on the business economics and opportunity of open source data science, with a focus on applications that are now possible given the convergence of big data platforms, cloud technology, and data science software (especially R) charged by the contributions of the open source community.

Given the outstanding calibre of talks at last month's EARL London conference, I can't wait to learn about more uses of R in business and industry. The whole agenda looks great, but a few of the sessions that caught my eye include:

- Mark Sellors' workshop on "Getting Started with Spark and R"
- Gergely Daroczi talking about how CARD.com is analyzing and managing Facebook ads from R
- Oliver Keyes on how Wikipedia (one of the top 10 websites) analyzes traffic data with R
- Ari Lamstein (a frequent contributor to this blog) on mapping census data with R
- Bob Rudis from Verizon, with a behind-the-scenes look at making "the most respected report in information security"
- Tim Hesterberg from Google on measuring brand ad effectiveness
- William Denney from Pfizer on monitoring Phase 1 clinical drug trials

If you haven't yet signed up for EARL Boston (organized and hosted by Mango Solutions), registration is still open here. (Discount academic registrations are sold out, though.) I hope to see you there!

by Joseph Rickert

I have been a big fan of R user groups since I attended my first meeting. There is just something about the vibe of being around people excited about what they are doing that feels good. From a speaker's perspective, presenting at an R user group meeting must be the rough equivalent of doing "stand-up" at a club where you know almost everyone and you are pretty sure people are going to like your material. So while user groups don't necessarily ignite R creativity (people don't do their best work just to present at an R user group meeting), they do help to shine the spotlight on some really good stuff.

I attend all of the Bay Area useR group meetings, and quite a few other R related events throughout the year, but I only get to experience a small fraction of what is going on in the R world. In the spirit of sharing the "wish I was there" feeling, here are a few recent user group presentations from around the globe that look like they were informative, entertaining and motivating.

Tommy O'Dell gave a "Welcome to dplyr" talk to the Western Australia R Group (WARG) on September 10th. This is a very good presentation until near the very end, when it becomes an absolutely great presentation!! Apparently motivated by a desire to use dplyr with R 2.12, an older version of R not supported by dplyr, Tommy deconstructed the dplyr "magic" to write his own package, rdplyr. This is a wonderful example of how curiosity and open source can open up many possibilities. The following slide comes from the section where Tommy explains some of the problems he encountered and how he worked through them.

On the 16th of September, Kevin Little gave a talk to MadR about how he recovered after "hitting the wall" in a failed first attempt to interface with the SurveyMonkey API using the Rmonkey package. Kevin's description of how he worked through the process, which included wading into some JSON scripting, is a motivational case study. Kevin wrote a blog post that provides background for the project and has made his slides available here.

Also in September Jim Porzak, a long-time contributor to the San Francisco Bay Area R community, described a detailed customer segmentation analysis in a presentation to BARUG. The following slide examines the stability of the clusters.

Finally, there is a small treasure trove of relatively recent work at the BaselR presentations page. These include a presentation from Aimee Gott on the Mango Solutions development environment and one from Anne Kuemmel on using simulations to calculate confidence intervals in pharma applications. Also have a look at Daniel Sabanes Bove's presentation on using R to produce Microsoft PowerPoint presentations, and some thoughtful advice from Reinhold Koch on how to go about creating a lively R community within your company.


The Effective Applications of R (EARL) Conference (held last week in London) is well-named. At the event I saw many examples of R being used to solve real-world industry problems with advanced statistics and data visualization. Here are just a few examples:

- **AstraZeneca**, the pharmaceutical company, uses R to design clinical trials, and to predict the ending date of the trials based on planned interim analyses of the data.
- **Allianz**, the financial services company, has deployed a massively-parallel "R-as-a-Service" production environment to support real-time banking processes.
- **KPMG**, one of the "big four" consulting companies, used R to simulate the impact of rule changes to Britain's National Lottery, to better distribute prizes amongst players.
- **Allstate**, the insurance company, uses R to build predictive models to set premiums and calculate risk profiles.
- **Douwe Egberts**, the coffee company, uses R to analyze consumer preferences for coffee, and to design coffee roasts with desired flavour profiles.
- **Atass Sports**, a company that specializes in forecasting sports results, used R to identify cases of match-fixing in professional tennis.

There were also several other examples of industry applications from parallel sessions I couldn't attend, from companies including Lloyds of London, Shell, UBS, Deloitte, UniCredit, BCA Marketplace, TIM Group, PartnerRe, and hosts Mango Solutions.

The next EARL conference will be held in Boston, November 2-4, where I'm honoured to be included as a keynote speaker. I'm looking forward to learning about many more applications of R there!

*by John Mount (more articles) and Nina Zumel (more articles) of Win-Vector LLC*

"Essentially, all models are wrong, but some are useful." – George Box

Here's a caricature of a data science project: your company or client needs information (usually to make a decision). Your job is to build a model to predict that information. You fit a model, perhaps several, to available data and evaluate them to find the best. Then you cross your fingers that your chosen model doesn't crash and burn in the real world. We've discussed detecting if your data has a signal. Now: how do you know that your model is good? And how sure are you that it's better than the models that you rejected?

Notice the Sun in the 4th revolution about the Earth. A very pretty, but not entirely reliable, model.

In this latest "Statistics as it should be" series, we will systematically look at what to worry about and what to check. This is standard material, but presented in a "data science" oriented manner, meaning we are going to consider scoring-system utility in terms of service to a *negotiable* business goal (one of the many ways data science differs from pure machine learning). To organize the ideas into digestible chunks, we are presenting this article as a four-part series. This part (part 1) sets up the specific problem.

Win-Vector blog: HOW DO YOU KNOW IF YOUR MODEL IS GOING TO WORK? PART 1: THE PROBLEM

Zillow, the leading real estate and rental marketplace in the USA, uses R to estimate housing values. Zillow's signature product is the Zestimate, their estimated market value for individual homes, and it's calculated using R in a parallel batch job for 100 million homes nationwide. The process is described in the Slideshare presentation Data Science At Zillow by Nicholas McClure, senior data scientist at Zillow.

This article by Alex Woodie at Datanami provides more detail about the process. R is used in conjunction with other data science tools:

The Zestimate is generated through a series of processes built using various tools, including heavy doses of R, Python, Pandas, Scikit Learn, and GraphLab Create, the graph analytics software developed by Seattle-based Dato (formerly GraphLab).

The company makes extensive use of R, including the development of a proprietary software package called ZPL that functions similarly to MapReduce on Hadoop but runs on a relational database. The company is increasing its use of Python, which Zillow data scientists say is better than R for some things, such as conducting GIS analysis.

You can find more details on Zillow's data science systems at the link below.