Revolutions
http://blog.revolutionanalytics.com/
Learn more about using open source R for big data analysis, predictive modeling, data science and more from the staff of Revolution Analytics.en-US2015-11-24T08:30:00-08:00Mapping out Marriott's Starwood Acquisition
http://blog.revolutionanalytics.com/2015/11/marriott.html
<p>by Michael Helbraun</p>
<p>The software business includes travel, and that means hotels.  The <a href="http://www.economist.com/news/business-and-finance/21678723-giant-hotelier-about-become-behemoth-industrys-largest-deal-2007">news</a> that Marriott was acquiring Starwood was of particular interest to me – especially since more than 75% of my 95 nights so far this year on the road have been spent with one of those two companies.</p>
<p>While other folks can evaluate if the deal makes sense financially, I was just curious how this might affect a business traveler.  Looking at the news there are those <a href="http://fusion.net/story/234938/marriott-starwood-points-after-merger/">optimistic</a> and plenty <a href="http://www.nytimes.com/2015/11/18/upshot/marriott-merger-has-starwood-lovers-nervous.html?_r=0">concerned</a>.  Granted, many of these details on how the loyalty programs will be combined won’t be known for some time, but what we do know is where each company maintains properties. </p>
<p>With 4200+ Marriott and 1700+ Starwood properties I was curious where there might be overlap, and how well the deal would help Marriott to grow in new markets.  Luckily R can help in this regard.</p>
<p>The first thing to do is to put together a data set.  It would have been nice if the companies had cleaned spreadsheets available publicly, but as is normally the case we end up spending a good portion of time gathering and preparing data.  In this case that meant scraping and formatting the data from <a href="https://www.starwoodhotels.com/preferredguest/directory/hotels/all/list.html?display=hotels&language=en_US&pageType=list&regionName=all">SPG</a> and <a href="https://www.marriott.com/rewards/pointsGridPopUpPropertyList.mi">Marriott</a> into a spreadsheet with all their property locations.  While I won’t go into data cleaning here, for a one-time effort on just a few thousand rows of data this was pretty straightforward to do in Excel.</p>
<p>After I had all locations for all properties it was time to bring that data into R to start the analysis.  First I was curious where each firm had the most properties – simple to do with a cross tab.  NYC seems a logical top 5, but Houston and Atlanta, interesting:</p>
<p><strong>Top 10 Marriott Locations<br /> <a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c7f06342970b-pi" style="display: inline;"><img alt="Top10Mar" class="asset asset-image at-xid-6a010534b1db25970b01b7c7f06342970b img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c7f06342970b-320wi" title="Top10Mar" /></a><br /></strong><br /><strong>Top 10 Starwood Locations<br /> <a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb08947fe4970d-pi" style="display: inline;"><img alt="Top10star" class="asset asset-image at-xid-6a010534b1db25970b01bb08947fe4970d img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb08947fe4970d-320wi" title="Top10star" /></a><br /></strong><br />So far so good, but to actually put these on a map it’s much easier if the data has latitude and longitude.  The <em>geocode </em>function within the <em>ggmap</em> package makes this easy; resolution is done against the Google API, and is limited to 2,500 requests/day - so be sure to use save/load.  (Note: there are more than 2500 locations here so I split the task up across a couple machines.  There are other free geocoding options with higher daily limits if you have more data points, like using <a href="https://msdn.microsoft.com/en-us/library/gg427601.aspx">Bing</a>, but that’s a REST based approach.)</p>
<p><span style="font-family: 'courier new', courier;">marGeocoded <- cbind(locations, geocode(locations))<br /></span><span style="font-family: 'courier new', courier;">save(marGeocoded, file="D:/Datasets/marGeocoded.RData")<br /></span><span style="font-family: 'courier new', courier;">load("D:/Datasets/marGeocoded.RData")</span></p>
<p><span style="font-family: 'courier new', courier;"><br /></span><span style="font-family: 'courier new', courier;">locations <- hotToGeo<br /></span><span style="font-family: 'courier new', courier;">hotGeocoded <- cbind(locations, geocode(locations))<br /></span><span style="font-family: 'courier new', courier;">save(hotGeocoded, file="D:/Datasets/hotGeocoded.RData")<br /></span><span style="font-family: 'courier new', courier;">load("D:/Datasets/hotGeocoded.RData")</span></p>
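<p>The save/load advice above can be generalized into a small cache, so that repeated runs never spend quota on a location that was already resolved. The following is a minimal sketch in base R, not the code from the original script: <code>geocode_cached</code>, <code>fake_geocode</code> and the cache file are hypothetical names, the cache uses <code>saveRDS</code>/<code>readRDS</code> rather than save/load, and the stand-in geocoder returns made-up coordinates so the example runs without network access. In practice you would pass <em>ggmap</em>'s <em>geocode</em> as the <code>geocoder</code> argument.</p>

```r
# Hypothetical helper: geocode only the locations not already in the cache,
# and never send more than `daily_limit` new requests in one run.
geocode_cached <- function(locations, geocoder, cache_file, daily_limit = 2500) {
  # Start from the saved results if a cache file exists
  if (file.exists(cache_file)) {
    cache <- readRDS(cache_file)
  } else {
    cache <- data.frame(location = character(0), lon = numeric(0),
                        lat = numeric(0), stringsAsFactors = FALSE)
  }

  # Work out which locations still need a lookup, capped at the daily limit
  todo <- setdiff(unique(locations), cache$location)
  todo <- head(todo, daily_limit)

  if (length(todo) > 0) {
    coords <- geocoder(todo)  # expected to return a data frame with lon and lat
    fresh <- data.frame(location = todo, lon = coords$lon, lat = coords$lat,
                        stringsAsFactors = FALSE)
    cache <- rbind(cache, fresh)
    saveRDS(cache, cache_file)
  }

  # Return cached rows in the order of the input; unresolved locations give NA
  cache[match(locations, cache$location), ]
}

# Stand-in geocoder with made-up coordinates (assumption: for illustration only)
calls <- 0
fake_geocode <- function(x) {
  calls <<- calls + length(x)  # count how many lookups were actually requested
  data.frame(lon = seq_along(x) * 1.0, lat = seq_along(x) * -1.0)
}

cache_file <- tempfile(fileext = ".rds")
first  <- geocode_cached(c("Houston", "Atlanta"), fake_geocode, cache_file)
second <- geocode_cached(c("Atlanta", "Houston", "New York"), fake_geocode, cache_file)
```

<p>In this sketch the second call only looks up "New York", because the other two locations are read back from the cache file. Locations beyond the daily limit come back as NA rows and would be picked up on the next run.</p>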
<p>Once the lat/long coordinates are merged back into our data set there are a number of ways to plot the results.  I’m a fan of the globe plots within Bryan Lewis’s excellent <a href="https://github.com/bwlewis/rthreejs"><em>rthreejs</em></a> package.  This allows you to stretch a 2D image over a globe which you can then plot on top of and interact with.  Here I’ve plotted all the Marriott properties in orange and the Starwood properties in yellow:</p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d17a208d970c-pi" style="display: inline;"><img alt="Globe plot animated" border="0" class="asset asset-image at-xid-6a010534b1db25970b01b8d17a208d970c img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d17a208d970c-800wi" title="Globe plot animated" /></a></p>
<p>After this it seemed like there was the most overlap in the US and Europe.  To create a static plot ggmap is very quick:</p>
<p># Europe map with ggmap</p>
<p><span style="font-family: 'courier new', courier;">eurPlot <- qmap(location = "Europe", zoom = 4, legend = "bottomright", maptype = "terrain", color = "bw", darken = 0.01) </span><br /><span style="font-family: 'courier new', courier;"># alpha is a fixed transparency, so it is set outside aes() rather than mapped as an aesthetic</span><br /><span style="font-family: 'courier new', courier;">eurPlot <- eurPlot + geom_point(data = combGeocoded, aes(y = lat, x = lon, colour = firm, size = Counts), alpha = 0.2)</span><br /><span style="font-family: 'courier new', courier;">(eurPlot <- eurPlot + scale_size_continuous(range = c(3,10)))</span></p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d17a2100970c-pi" style="display: inline;"><img alt="Map_Europe" border="0" class="asset asset-image at-xid-6a010534b1db25970b01b8d17a2100970c image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d17a2100970c-800wi" title="Map_Europe" /></a></p>
<p>If we want to create something with interactive zoom, the <em>leaflet</em> package is another useful one.  It leverages OpenStreetMap and allows you to pan and zoom:</p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb089480ec970d-pi" style="display: inline;"><img alt="Map_us" border="0" class="asset asset-image at-xid-6a010534b1db25970b01bb089480ec970d image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb089480ec970d-800wi" title="Map_us" /></a></p>
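<p>The post shows only the resulting map, so here is a rough sketch of how such an interactive view could be built with <em>leaflet</em>. It is an illustration rather than the code from the original script: the <code>hotels</code> data frame and its property counts are made up, the colours simply reuse the orange/yellow scheme of the globe plot, and the column names <code>lon</code>, <code>lat</code>, <code>firm</code> and <code>Counts</code> are assumed to match those used in the ggmap example above.</p>

```r
library(leaflet)

# Made-up sample rows standing in for the combined geocoded data (assumption)
hotels <- data.frame(
  firm   = c("Marriott", "Starwood", "Marriott"),
  lon    = c(-95.37, -84.39, -74.01),
  lat    = c(29.76, 33.75, 40.71),
  Counts = c(40, 12, 55),
  stringsAsFactors = FALSE
)

# One colour per firm
pal <- colorFactor(c("orange", "yellow"), domain = c("Marriott", "Starwood"))

m <- leaflet(hotels)
m <- addTiles(m)  # default OpenStreetMap tiles
m <- addCircleMarkers(m, lng = ~lon, lat = ~lat,
                      color = ~pal(firm),
                      radius = ~sqrt(Counts),  # scale marker size by property count
                      popup = ~paste(firm, Counts))
m <- addLegend(m, position = "bottomright", pal = pal, values = ~firm)
# Typing `m` at the console renders the pan-and-zoom map
```

<p>Taking the square root of the count for the radius keeps marker area roughly proportional to the number of properties, which is one reasonable choice among several.</p>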
<p>Aggregating and deriving value from low value info is a great use of R, and this sort of analysis is fun as it gives some additional perspective into a current event.  If you would like to play around with this, a copy of the script <span class="asset asset-generic at-xid-6a010534b1db25970b01bb08948123970d img-responsive"><a href="http://revolution-computing.typepad.com/files/merger-analysis.r">Download Merger analysis</a> </span>and relevant data files are available <span class="asset asset-generic at-xid-6a010534b1db25970b01b7c7f0644e970b img-responsive"><a href="http://revolution-computing.typepad.com/files/hotgeocoded.rdata">Download HotGeocoded</a> and  <span class="asset asset-generic at-xid-6a010534b1db25970b01bb08948175970d img-responsive"><a href="http://revolution-computing.typepad.com/files/margeocoded.rdata">Download MarGeocoded</a></span></span> – let us know what you find in the comments.</p>advanced tipsapplicationscurrent eventsRJoseph Rickert2015-11-24T08:30:00-08:00PowerBI adds support for R
http://blog.revolutionanalytics.com/2015/11/powerbi-adds-support-for-r.html
<p>In the latest update released on November 20, <a href="http://blogs.msdn.com/b/powerbi/archive/2015/11/20/announcing-the-power-bi-desktop-november-update.aspx#RScript">PowerBI has added support for R</a>. The desktop edition of <a href="https://powerbi.microsoft.com/en-us/desktop">Microsoft's data visualization and reporting tool</a> now allows you to run an R script to generate data; the resulting data frames from the script can then be used for data visualization or any other activities within Power BI.</p>
<p>This <a href="https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-r-scripts/">PowerBI Support article</a> provides the details. Simply select the new "Execute R Script" option within the Other section of the "Get Data" dialog, and paste in the R script you want to run.</p>
<p><a class="asset-img-link" style="display: inline;" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d17a3d2d970c-pi"><img class="asset asset-image at-xid-6a010534b1db25970b01b8d17a3d2d970c img-responsive" style="width: 350px; display: block; margin-left: auto; margin-right: auto;" title="PowerBI screenshot" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d17a3d2d970c-350wi" alt="PowerBI screenshot" /></a><a class="asset-img-link" href="http://a1.typepad.com/6a0105360ba1c6970c01b7c7f07e51970b-pi"><br /></a>Any dataframes generated by the script (and there can be more than one) become available data sources for further PowerBI operations. PowerBI Desktop can use any R version you have installed on the local machine, including <a href="https://mran.revolutionanalytics.com/open/">Revolution R Open</a>. You can see an example of using R as a data source in the video below:</p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/ErHvpkyQjSg?start=1950" frameborder="0" allowfullscreen=""></iframe> </p>
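<p>As a rough illustration of the kind of script that could be pasted into the "Execute R Script" dialog, the sketch below builds three data frames in base R. The names <code>regions</code>, <code>sales</code> and <code>revenue_by_region</code> and all values are invented for this example; per the description above, each data frame the script generates would be offered as a separate data source.</p>

```r
# Invented example data: each data frame defined here would be offered
# by Power BI Desktop as a separate table to load.
set.seed(42)  # make the random figures reproducible

regions <- data.frame(
  region_id = 1:3,
  region    = c("North", "South", "West"),
  stringsAsFactors = FALSE
)

sales <- data.frame(
  region_id = rep(1:3, each = 4),
  quarter   = rep(paste0("Q", 1:4), times = 3),
  revenue   = round(runif(12, min = 100, max = 500), 1),
  stringsAsFactors = FALSE
)

# A derived table: total revenue per region, also returned as a data frame
revenue_by_region <- aggregate(revenue ~ region_id, data = sales, FUN = sum)
```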
<p>Power BI Support: <a href="https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-r-scripts/">Running R Scripts in Power BI Desktop (Beta)</a></p>MicrosoftRDavid Smith2015-11-23T13:50:17-08:00Because it's Friday: Why you can't take photos of propellers
http://blog.revolutionanalytics.com/2015/11/because-its-friday-why-you-cant-take-photos-of-propellers.html
<p>A few years ago, we shared a video on how the <a href="http://blog.revolutionanalytics.com/2011/09/curved-propellers.html">vertical scanning in digital cameras can distort fast-moving objects</a> like fans and propellers. For example, the rotors in this digital photo of a drone are, in fact, perfectly straight:</p>
<p><a href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b0154357c4aff970c-pi" style="display: inline;"><img alt="Quadcopter-still" border="0" class="asset asset-image at-xid-6a010534b1db25970b0154357c4aff970c" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b0154357c4aff970c-800wi" title="Quadcopter-still" /></a></p>
<p> A <a href="http://imgur.com/gallery/RG7Kd">clever 15-year old on Imgur</a> has created an elegant way of describing the phenomenon using mathematics. Here's his animation:</p>
<p><a class="asset-img-link" href="http://imgur.com/gallery/RG7Kd" style="display: inline;"><img alt="Propellers" border="0" class="asset asset-image at-xid-6a010534b1db25970b01b7c7ef411e970b image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c7ef411e970b-800wi" title="Propellers" /></a></p>
<p>The black line represents the scanning line in the digital camera chip, and the blue curves represent the image captured thanks to the blades moving while the image was being scanned. In the online version of the animation, which was <a href="https://www.desmos.com/calculator/yc9znckbcg">created using the mathematical calculator Desmos</a>, you can even adjust the rotation speed of the propeller. A slower propeller results in slightly bent blades (as in the image above), while a very fast propeller generates multiple virtual blades.</p>
<p>Try it out yourself on a propeller aircraft or even a fast-moving ceiling fan. (If you get any good shots, share 'em in the comments below.) In the meantime, we're signing off for the weekend. See you back here on Monday!</p>randomDavid Smith2015-11-20T11:18:09-08:00How long does it take to get to the airport from NYC?
http://blog.revolutionanalytics.com/2015/11/new-york-taxi-uber.html
<p>Todd W Schneider analyzed a <a href="http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml">database of 1.1 billion taxi rides in New York City</a> from 2009-2015, and discovered some <a href="http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/">interesting insights on how New Yorkers use cabs</a>. For example, here's a map of the drop-off locations of each ride in the database:</p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb089348a2970d-pi" style="display: inline;"><img alt="Taxi_dropoffs_map" border="0" class="asset asset-image at-xid-6a010534b1db25970b01bb089348a2970d img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb089348a2970d-800wi" title="Taxi_dropoffs_map" /></a></p>
<p>The <a href="https://mran.revolutionanalytics.com/documents/what-is-r/">R</a> code to generate this beautiful map is surprisingly simple: just one line to extract the data from a Postgres database, and a few lines of ggplot2 code to render each drop-off as a point on the map, colored by the type of cab (NYC Yellow or regional Green Boro taxis). Note the use of the <span style="font-family: 'courier new', courier;">alpha=</span> argument to make the dots transparent, allowing them to build in intensity according to the number of drop-offs in each location.</p>
<p>
<script src="https://gist.github.com/revodavid/2bb5f1d45129f23c3527.js"></script>
</p>
<p>Todd also used R to calculate from the data the amount of time required to get from various NYC districts to the airport. For example, here's the chart for trips from midtown Manhattan to JFK airport:</p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb08934b11970d-pi" style="display: inline;"><img alt="MN17_JFK" border="0" class="asset asset-image at-xid-6a010534b1db25970b01bb08934b11970d image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb08934b11970d-800wi" title="MN17_JFK" /></a></p>
<p>Note how Todd presents probability bands instead of the medians of trip times by time of day. As anyone who commutes regularly knows, the same trip at the same time of day doesn't always take the same amount of time: there is a <em>distribution</em> of possible trip times, from quick runs to extreme delays. If you leave for your destination with only the median trip time (as shown by most navigation apps) to spare, <strong>you will be late half the time</strong>. Personally, I like to use the 90/90 rule for airport trips: leave at a time that gives me a 90% chance of arriving 90 or more minutes before my flight. This chart helps me follow that rule. For example, at rush hour (around 4PM) you should leave midtown <strong>2 hours and 55 minutes </strong>(85 + 90 minutes) before your flight if you want to have 90 minutes at the airport.</p>
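<p>The arithmetic behind the 90/90 rule can be sketched in a few lines of base R. The trip times below are simulated from a log-normal distribution purely for illustration (they are not Todd's data), and <code>lead_time</code> and <code>format_hm</code> are hypothetical helpers; only the 85-minute figure comes from the chart discussed above.</p>

```r
# Hypothetical helper: minutes before the flight at which to leave, given a
# trip-time estimate and the buffer wanted at the airport.
lead_time <- function(trip_minutes, buffer_minutes = 90) {
  trip_minutes + buffer_minutes
}

# Format a number of minutes as hours and minutes
format_hm <- function(minutes) {
  sprintf("%d h %02d min", as.integer(minutes %/% 60), as.integer(minutes %% 60))
}

# Worked example from the text: 85-minute 90th-percentile trip at rush hour
rush_hour <- lead_time(85)
format_hm(rush_hour)  # "2 h 55 min"

# Simulated trip times (assumption: made-up data, not the taxi data set)
set.seed(1)
trips <- rlnorm(1000, meanlog = log(55), sdlog = 0.3)

median_trip <- median(trips)                         # budget this and you are late half the time
p90_trip    <- quantile(trips, 0.90, names = FALSE)  # budget this and you are late 1 time in 10

# Round the 90th percentile up to a whole minute before adding the buffer
leave_before <- lead_time(ceiling(p90_trip))
```

<p>The gap between <code>median_trip</code> and <code>p90_trip</code> is the extra margin the rule asks for; how large it is depends entirely on how spread out the trip times are.</p>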
<p>For many other charts and analyses of the NYC taxi data, check out Todd's complete blog post below.</p>
<p>Todd W Schneider: <a href="http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/">Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance</a></p>graphicsRDavid Smith2015-11-20T07:56:59-08:00Rated R: Recommended Reading
http://blog.revolutionanalytics.com/2015/11/r-recommended-reading.html
<p>by Joseph Rickert</p>
<p>What are you reading? - and what are you recommending to friends, colleagues, and students who want to learn something about R programming? A quick search of Amazon will show that there are several new R books proposed for 2016; but of course, new doesn't necessarily mean better. I fully expect that many new books in all areas of statistics, data science and many other scientific disciplines using R to provide a computational aspect for their exposition will continue to be written for years to come. All of these books will provide windows into learning R for people excited about the particular subject matter. However, so many excellent R based texts have already been published that it will be difficult for these new works to achieve "must buy" status for the R content alone. </p>
<p>Below are my recommendations for good R reads. Some of these books go back a few years, but they continue to hold their value. With the possible exception of books that were based primarily on the S language, good R books don't become obsolete. Unlike some other computer languages, R evolves mostly through new capabilities added by contributed packages, not through changes to the R core. The fact that the <a href="https://mran.revolutionanalytics.com/package/dplyr/">dplyr</a> family of packages may make data wrangling more convenient in many circumstances doesn't make a book that teaches data manipulation through base R functions any less relevant. In fact, some might argue that new students should be taught the basic functionality first. I am not a militant traditionalist, but it does seem to me that familiarity with the bare bones basics of the language will help newcomers to gain intuition about how R works.</p>
<p>There are three lists below. The first lists my picks for teaching R programming (top row in the graphic). The second list provides my recommendations for people interested in learning R for data science (second row in the graphic).</p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c7ed82ed970b-pi" style="display: inline;"><img alt="Recommended_reading" border="0" class="asset asset-image at-xid-6a010534b1db25970b01b7c7ed82ed970b image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c7ed82ed970b-800wi" title="Recommended_reading" /></a></p>
<p>The third list is of books on my shelf that I continue to value. For every entry in all three lists I provide a mini or micro review. In a few cases, I point to a more extensive review that I have previously published in this blog. My lists are in no way intended to be complete. But, I apologize right now if I have omitted some really good books. Please let me know about what I have missed by commenting to this post with a mini review of your own.</p>
<h3>Learning R</h3>
<p><a href="https://www.crcpress.com/Advanced-R/Wickham/9781466586963">Advanced R</a> by Hadley Wickham - Anyone who wants to gain a deep understanding of the R language will certainly benefit from this book. More than a reference: the author seeks to provide a conceptual framework for understanding R’s structure and guide readers through R’s idiosyncratic mechanisms pointing out traps, illuminating difficult concepts and providing expert commentary.</p>
<p><a href="http://shop.oreilly.com/product/9781593273842.do">The Art of R Programming: A Tour of Statistical Software Design</a> by Norman Matloff – This is <a href="http://blog.revolutionanalytics.com/2011/11/review-of-the-art-of-r-programming-by-norman-matloff.html">still my pick</a> for the best book for people with some programming experience who want to make a serious effort at learning R. Professor Matloff’s interest in teaching the mechanics of programming infused with his deep understanding of both the underlying computer science and statistical theory put this book on top.</p>
<p><a href="http://shop.oreilly.com/product/0636920028574.do">Hands on Programming with R</a> by Garrett Grolemund – If you are not only new to R but new to programming as well this is the book for you. I have reviewed it more extensively <a href="http://blog.revolutionanalytics.com/2015/03/review-of-hands-on-programming-with-r.html">here</a>.</p>
<p><a href="http://www.dummies.com/store/product/R-For-Dummies.productCd-1119962846,navId-322449.html">R For Dummies</a> by Andrie de Vries and Joris Meys – A current, concise and insightful reference to core concepts in the R language. A really nice feature of the book is its emphasis on presenting the R ecosystem along with core R concepts. When learning anything new, it is always helpful to understand the big picture. Keep this book by your computer; when you stop referring to it you will be a pretty good R programmer.</p>
<h3>Data Science with R</h3>
<p><a href="http://www.springer.com/us/book/9781461468486">Applied Predictive Modeling</a> by Max Kuhn and Kjell Johnson – This book is the master text for predictive analytics, carefully walking through several modeling examples and making expert use of the extensive machine learning tools in R’s caret package. I have described the book more fully <a href="http://blog.revolutionanalytics.com/2014/06/review-of-applied-predictive-modeling-by-kuhn-and-johnson.html">here</a>.</p>
<p><a href="http://www.springer.com/us/book/9781441998897">Data Mining with Rattle and R</a> by Graham Williams – This is the perfect first book for machine learning with R. The rattle GUI helps get across the machine learning concepts and also produces some pretty good R code to get you started.</p>
<p><a href="https://www.crcpress.com/Data-Science-in-R-A-Case-Studies-Approach-to-Computational-Reasoning-and/Nolan-Lang/9781482234817">Data Science in R: A Case Studies Approach to Computational Reasoning and Data Science</a> by Deborah Nolan and Duncan Temple Lang – My most recent acquisition, this book consists of 12 non-trivial case studies organized under three themes: Data Manipulation and Modeling, Simulation Studies, and Data and Web Technologies. All of the data sets are messy and the projects identify and develop the kind of skills required to undertake open-ended data science projects. The book doesn’t teach R programming, but it shows why R is the appropriate language for doing data science.</p>
<p><a href="https://www.manning.com/books/practical-data-science-with-r">Practical Data Science with R</a> by Nina Zumel and John Mount – This book is one of a kind. It moves fluidly between the various stages of the data science process from surface considerations of working with customers to the deep details of various machine learning algorithms. There is quite a bit of original R code that you can use in real projects. Most impressive is the statistical sensibility of the authors who want you to make correct inferences from your data and machine learning models as well as effectively   communicate your findings to the people paying the bills.</p>
<h3>The Rest of My Book Shelf</h3>
<p><a href="http://www.cambridge.org/us/academic/subjects/statistics-probability/computational-statistics-machine-learning-and-information-sc/first-course-statistical-programming-r?format=PB">A First Course in Statistical Programming with R</a> by W. John Braun and Duncan Murdoch – A deceptively thin book that provides a sharp introduction to R and moves quickly through debugging, computational linear algebra, numerical optimization and linear programming.   </p>
<p><a href="http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf">An Introduction to Statistical Learning with Applications in R</a> by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani – This book is the companion to the master text for Machine / Statistical Learning, <a href="https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf">The Elements of Statistical Learning</a>, and contains plenty of R code. The authors have generously posted pdf versions of both of these books online.</p>
<p><a href="http://socserv.socsci.mcmaster.ca/jfox/Books/Companion/">An R Companion to Applied Regression</a> by John Fox and Sanford Weisberg – I have been a fan since the first edition, which is possibly the best introduction to regression analysis with R ever.</p>
<p><a href="https://www.crcpress.com/Applied-Meta-Analysis-with-R/Chen-Peace/9781466505995">Applied Meta-Analysis with R</a> by Ding-Geng Chen and Karl E. Peace – Provides a solid introduction to basic meta-analysis that should be very helpful to people who work in the field and want to move to R.</p>
<p><a href="http://www.springer.com/us/book/9780387922973">Bayesian Computation with R</a> by Jim Albert – A concise, undergraduate level introduction to Bayesian Statistics.</p>
<p><a href="http://www.springer.com/us/book/9781461486862">Bayesian Essentials with R</a> by Jean-Michel Marin and Christian P. Robert – This is a solid introduction to Bayesian Statistics with lots of useful code.</p>
<p><a href="http://www.cambridge.org/us/academic/subjects/statistics-probability/computational-statistics-machine-learning-and-information-sc/data-analysis-and-graphics-using-r-example-based-approach-3rd-edition?format=HB">Data Analysis and Graphics Using R: An Example-Based Approach</a> by John Maindonald and John Braun – A comprehensive introduction to statistical analysis that is most suitable for self-learning. It is also a very handsome book. If you are a book person, this is the one to own.</p>
<p><a href="http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/data-analysis-using-regression-and-multilevelhierarchical-models?format=PB">Data Analysis Using Regression and Multilevel/Hierarchical Models</a> by Andrew Gelman and Jennifer Hill – A superb book on statistical modeling that is both practical and rigorous with a modern perspective that should appeal to anyone, Bayesians and non-Bayesians alike.</p>
<p><a href="http://www.springer.com/us/book/9780387747309">Data Manipulation with R</a> by Phil Spector – A concise introduction to data munging using base R capabilities. This is another book to keep with you while programming.</p>
<p><a href="https://sites.google.com/site/doingbayesiandataanalysis/purchase">Doing Bayesian Data Analysis: A Tutorial with R and BUGS</a> by John K. Kruschke – This eclectic and entertaining read is a way to learn both R and Bayesian Analysis simultaneously. It provides lots of R code to build on.</p>
<p><a href="https://www.crcpress.com/Extending-the-Linear-Model-with-R-Generalized-Linear-Mixed-Effects-and/Faraway/9781584884248">Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models</a> by Julian J. Faraway – Building on the author's text on linear models, this book covers a lot of ground and provides real insight.</p>
<p><a href="https://www.otexts.org/fpp">Forecasting: Principles and Practice</a> by Rob J Hyndman and George Athanasopoulos - Written to teach time series forecasting to a business audience, this free, online text is a beautiful example of both the open source ethos and of how R can help people with real business problems become productive with a very modest learning curve.</p>
<p><a href="https://www.crcpress.com/Introduction-to-Probability-with-R/Baclawski/9781420065213">Introduction to Probability with R</a> by Kenneth Baclawski – This is an eclectic little book. There is really not much R in it, but it is a modern introduction to probability theory including stochastic processes with enough R to help you teach yourself the math by experimenting. R is the really easy part of this book.</p>
<p><a href="http://www.springer.com/us/book/9780387790534">Introductory Statistics with R</a> by Peter Dalgaard – A classic text with R code to get you doing real statistics very quickly and a great reference for both statistics and R that you will want to hang on to.</p>
<p><a href="http://www.springer.com/us/book/9780387886978">Introductory Time Series with R</a> by Paul S.P. Cowpertwait and Andrew C. Metcalfe - Could be the best introduction to time series analysis ever.</p>
<p><a href="https://www.crcpress.com/Linear-Models-with-R-Second-Edition/Faraway/9781439887332">Linear Models with R</a> by Julian J. Faraway – A compact course on analyzing linear models using R. It contains several examples and enough R code to thoroughly analyze regression models.</p>
<p><a href="http://www.springer.com/us/book/9780387954578">Modern Applied Statistics with S</a> by W.N. Venables and B.D. Ripley – Probably the best introduction to modern computational statistics out there. Even though it is S, most of the code will work in R.</p>
<p><a href="http://shop.oreilly.com/product/9780596809164.do">R Cookbook</a> by Paul Teetor – A solid introduction with recipes for carrying out data analyses and basic plots that you will want on your shelf.</p>
<p><a href="http://www.pearsonhighered.com/educator/product/R-for-Everyone-Advanced-Analytics-and-Graphics/9780321888037.page">R for Everyone: Advanced Analytics and Graphics</a> by Jared P. Lander – An easy read with relevant machine learning examples that will get you started with R.</p>
<p><a href="http://www.springer.com/us/book/9781461406846">R for SAS and SPSS Users</a> by Robert A. Muenchen – If you are still using SAS or SPSS, you need this book. The author speaks your language, understands where you are coming from and will help you learn some R.</p>
<p><a href="http://shop.oreilly.com/product/0636920023135.do">R Graphics Cookbook</a> by Winston Chang – An indispensable reference for R visualizations of all kinds. Read a more complete review <a href="http://blog.revolutionanalytics.com/2013/02/a-review-of-the-r-graphics-cookbook.html">here</a>.</p>
<p><a href="https://www.manning.com/books/r-in-action">R in Action</a> by Robert I Kabacoff - A gentle introduction to R with elegant plots that are model visualizations. You can read more about it <a href="http://blog.revolutionanalytics.com/2011/12/review-of-r-in-action-by-robert-i-kabacoff.html">here</a>.</p>
<p><a href="http://www.springer.com/us/book/9783319194240">Regression Modeling Strategies</a> by Frank E. Harrell, Jr. – An incredible amount of wisdom on how to do statistics, backed up with mostly straightforward R code.</p>
<p><a href="https://www.crcpress.com/R-Programming-for-Bioinformatics/Gentleman/9781420063677">R Programming for Bioinformatics</a> by Robert Gentleman – Not only for bioinformatics. This book provides insight into the structure of the R language for intermediate and advanced programmers.</p>
<p><a href="http://www.springer.com/us/book/9780387759357">Software for Data Analysis: Programming with R</a> by John M. Chambers – A text for advanced programmers discussing philosophy and good practices and providing deep insight into R.</p>
<p><a href="http://www.springer.com/us/book/9781493909827">Statistical Analysis of Network Data with R</a> by Eric D. Kolaczyk and Gábor Csárdi – This is an indispensable resource for analyzing network data. It contains a thorough explanation of the igraph package and works through exponential random graph models and other advanced topics.</p>
<p><a href="https://www.crcpress.com/Statistical-Computing-in-C-and-R/Eubank-Kupresanin/9781420066500">Statistical Computing in C++ and R</a> by Randall L. Eubank and Ana Kupresanin – A very approachable introduction to both R and C++ for anyone who wants to understand these languages from the perspective of numerical analysis and the nuts and bolts of linear algebra.</p>
<p><a href="http://www.springer.com/us/book/9781493926138">Statistics and Data Analysis for Financial Engineering</a> by David Ruppert and David S. Matteson – If you are interested in financial modeling this book could be your ticket to learning R and the R packages that support time series and financial engineering.</p>
<p><a href="http://www.springer.com/us/book/9780387759586">Time Series Analysis with Applications in R</a> by Jonathan D. Cryer and Kung-Sik Chan – A solid undergraduate level introduction to R with step-by-step R code. Very suitable for self-study.</p>
<p><a href="http://www.springer.com/us/book/9781461478997">XML and Web Technologies for Data Sciences with R</a> by Deborah Nolan and Duncan Temple Lang – Everything a data scientist would ever really want to know about XML documents, JSON and other web technologies and how you can work with them using R.</p>
<p> </p>RreviewsJoseph Rickert2015-11-19T08:30:00-08:00Enhancements to the AzureML package to connect R to AzureML Studio
http://blog.revolutionanalytics.com/2015/11/azureml-update.html
by Andrie de Vries We have written on several occasions about AzureML, the Microsoft machine learning studio that is part of the Cortana Analytics suite: Running R in the Azure ML cloud Call R functions from any application with the AzureML package Using miniCRAN in Azure ML In September we announced that the AzureML package for R allows you to publish R functions as Azure web services. This is a brilliantly easy way to deploy your functions to other users and clients. For example, you can publish a function from R, then consume that function from Excel! I am pleased...<p>by Andrie de Vries</p>
<p>We have written on several occasions about AzureML, the Microsoft machine learning studio that is part of the <a href="http://www.microsoft.com/en-us/server-cloud/cortana-analytics-suite/overview.aspx">Cortana Analytics</a> suite:</p>
<ul>
<li><a href="http://blog.revolutionanalytics.com/2014/11/r-on-azure-ml.html">Running R in the Azure ML cloud</a></li>
<li><a href="http://blog.revolutionanalytics.com/2015/09/publishing-r-models-as-a-service-with-azure-ml.html">Call R functions from any application with the AzureML package</a></li>
<li><a href="http://blog.revolutionanalytics.com/2015/10/using-minicran-in-azure-ml.html">Using miniCRAN in Azure ML</a></li>
</ul>
<p>In September we announced that the <a href="https://github.com/RevolutionAnalytics/AzureML">AzureML package for R</a> allows you to publish R functions as Azure web services. This is a brilliantly easy way to deploy your functions to other users and clients. For example, you can publish a function from R, then consume that function from Excel!</p>
<p>I am pleased to announce that we have completed a significant rewrite of the AzureML package. This rewrite adds several enhancements. Specifically, AzureML now also allows you to interact with:</p>
<ul>
<li>Workspace: connect to and manage AzureML workspaces</li>
<li>Datasets: upload and download datasets to and from AzureML workspaces</li>
<li>Experiments: download intermediate datasets from AzureML experiments</li>
</ul>
<p>We have also significantly enhanced the functionality to publish and consume models:</p>
<ul>
<li>Publish: define a custom function or train a model and publish it as an Azure Web Service</li>
<li>Consume: use available web services from R in a variety of convenient formats</li>
</ul>
<h2>Interacting with datasets</h2>
<p>This version of the AzureML package adds new functionality to interact with datasets and experiments.</p>
<p>The code to do this is very simple:</p>
<pre class="r geshifilter-R" style="padding-left: 30px;"><span style="font-size: 10pt;"># Create a workspace object
ws <- workspace()
 
# List datasets
<a href="http://inside-r.org/r-doc/datasets">datasets</a>(ws, <a href="http://inside-r.org/r-doc/stats/filter">filter</a> = "sample")
 
# Download a dataset
<a href="http://inside-r.org/r-doc/graphics/frame">frame</a> <- download.datasets(ws, name = "Forest fires data")
<a href="http://inside-r.org/r-doc/utils/head">head</a>(<a href="http://inside-r.org/r-doc/graphics/frame">frame</a>)</span></pre>
<p>As expected, this displays the first few lines of the resulting data frame:</p>
<p style="padding-left: 30px;"><span style="font-family: 'courier new', courier;">  X Y month day FFMC  DMC    DC  ISI temp RH wind rain area</span><br /><span style="font-family: 'courier new', courier;">1 7 5   mar fri 86.2 26.2  94.3  5.1  8.2 51  6.7  0.0    0</span><br /><span style="font-family: 'courier new', courier;">2 7 4   oct tue 90.6 35.4 669.1  6.7 18.0 33  0.9  0.0    0</span><br /><span style="font-family: 'courier new', courier;">3 7 4   oct sat 90.6 43.7 686.9  6.7 14.6 33  1.3  0.0    0</span><br /><span style="font-family: 'courier new', courier;">4 8 6   mar fri 91.7 33.3  77.5  9.0  8.3 97  4.0  0.2    0</span><br /><span style="font-family: 'courier new', courier;">5 8 6   mar sun 89.3 51.3 102.2  9.6 11.4 99  1.8  0.0    0</span><br /><span style="font-family: 'courier new', courier;">6 8 6   aug sun 92.3 85.3 488.0 14.7 22.2 29  5.4  0.0    0</span></p>
<h2>Publishing an R function as a web service</h2>
<p>We made many improvements to the mechanism underlying the functionality to publish a web service. </p>
<p>In particular, it is now very easy to provide a data frame as input to the publishing function. You no longer have to specify the classes of every column. Instead, the <span style="font-family: 'courier new', courier;">publishWebService()</span> function automatically determines the column classes of the inputs as well as the results.</p>
<p>To illustrate, here is an example from the help:</p>
<pre class="r geshifilter-R" style="padding-left: 30px;"><span style="font-size: 10pt;">ws <- workspace()
 
# Publish a simple model using the sleepstudy data from lme4
 
<a href="http://inside-r.org/r-doc/base/library">library</a>(<a href="http://inside-r.org/packages/cran/lme4">lme4</a>)
<a href="http://inside-r.org/r-doc/base/set.seed">set.seed</a>(1)
train <- sleepstudy[<a href="http://inside-r.org/r-doc/base/sample">sample</a>(<a href="http://inside-r.org/r-doc/base/nrow">nrow</a>(sleepstudy), 120),]
m <- <a href="http://inside-r.org/r-doc/stats/lm">lm</a>(Reaction ~ Days + Subject, <a href="http://inside-r.org/r-doc/utils/data">data</a> = train)
 
# Define a prediction function to publish based on the model:
sleepyPredict <- <a href="http://inside-r.org/r-doc/base/function">function</a>(newdata){
<a href="http://inside-r.org/r-doc/stats/predict">predict</a>(m, newdata=newdata)
}
 
ep <- publishWebService(ws, fun = sleepyPredict, name="sleepy lm",
inputSchema = sleepstudy,
<a href="http://inside-r.org/r-doc/base/data.frame">data.frame</a>=TRUE)
 
# OK, try this out, and compare with raw data
ans = consume(ep, sleepstudy)$ans
<a href="http://inside-r.org/r-doc/graphics/plot">plot</a>(ans, sleepstudy$Reaction)</span> </pre>
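<p>The automatic detection of column classes mentioned above can be pictured with base R alone. The following is only a sketch of the kind of information a data frame already carries, not the package's internal code; the data frame <code>toy</code> and its values are hypothetical stand-ins for <code>sleepstudy</code>:</p>
<pre class="r"><code># Hypothetical stand-in with the same column names as lme4's sleepstudy
toy <- data.frame(
  Reaction = c(249.6, 258.7, 250.8),
  Days     = c(0, 1, 2),
  Subject  = factor(c("308", "308", "308"))
)

# Each column records its own class, so an input schema can be read
# off the data frame instead of being typed in by hand
col_classes <- sapply(toy, class)
print(col_classes)</code></pre>
<p>Passing a whole data frame as <code>inputSchema</code>, as in the help example, presumably lets the package collect this sort of information on your behalf.</p>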
<h2>Installation instructions</h2>
<p>Right now, the new version is only available on <a href="https://github.com/RevolutionAnalytics/AzureML">GitHub</a>. To install the package, use:</p>
<pre class="r geshifilter-R" style="padding-left: 30px;"><span style="font-size: 10pt;">if(!<a href="http://inside-r.org/r-doc/base/require">require</a>("devtools")) <a href="http://inside-r.org/r-doc/utils/install.packages">install.packages</a>("devtools")
devtools::install_github("RevolutionAnalytics/AzureML")</span></pre>
<p>Additional resources:</p>
<p>The package has extensive help with many examples as well as a vignette.  You can also:</p>
<ul>
<li>view the vignette at <a href="https://htmlpreview.github.io/?https://github.com/RevolutionAnalytics/AzureML/blob/master/vignettes/getting_started.html">Getting Started with the AzureML Package</a>.</li>
<li>take a look at the <a href="https://github.com/RevolutionAnalytics/AzureML/wiki/Bug-bash-instructions">bug bash instructions</a> - walk-through guide with installation and configuration instructions as well as sample code</li>
</ul>
<p> Github: AzureML, <a href="https://github.com/RevolutionAnalytics/AzureML">An R interface to AzureML experiments, datasets, and web services</a></p>developer tipsMicrosoftpackagesRAndrie de Vries2015-11-18T10:15:51-08:00Fun with Simpson's Paradox: Simulating Confounders
http://blog.revolutionanalytics.com/2015/11/fun-with-simpsons-paradox-simulating-confounders.html
Bob Horton Sr Data Scientist, Microsoft Wikipedia describes Simpson’s paradox as “a trend that appears in different groups of data but disappears or reverses when these groups are combined.” Here is the figure from the top of that article (you can click on the image in Wikipedia then follow the “more details” link to find the R code used to generate it. There is a lot of R in Wikipedia). I rearranged it a bit to put the values in a dataframe, to make it a bit easier to think of the “color” column as a confounding variable: x y...<p>Bob Horton<br />Sr Data Scientist, Microsoft</p>
<div class="container-fluid main-container">
<div class="section level1" id="fun-with-simpsons-paradox-simulating-confounders">
<p>Wikipedia describes <a href="https://en.wikipedia.org/w/index.php?title=Simpson%27s_paradox&oldid=686754969">Simpson’s paradox</a> as “a trend that appears in different groups of data but disappears or reverses when these groups are combined.” Here is the figure from the top of that article (you can click on the image in Wikipedia then follow the “<a href="https://commons.wikimedia.org/wiki/File:Simpson%27s_paradox_continuous.svg">more details</a>” link to find the R code used to generate it. There is a lot of R in Wikipedia).</p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c7ec2f0e970b-pi" style="display: inline;"><img alt="Simpson_categorical-1" border="0" class="asset asset-image at-xid-6a010534b1db25970b01b7c7ec2f0e970b image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c7ec2f0e970b-800wi" title="Simpson_categorical-1" /></a></p>
<p>I rearranged it to put the values in a data frame, which makes it easier to think of the “color” column as a confounding variable:</p>
<table>
<thead>
<tr class="header">
<th align="right">x</th>
<th align="right">y</th>
<th align="right">color</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="right">1</td>
<td align="right">6</td>
<td align="right">1</td>
</tr>
<tr class="even">
<td align="right">2</td>
<td align="right">7</td>
<td align="right">1</td>
</tr>
<tr class="odd">
<td align="right">3</td>
<td align="right">8</td>
<td align="right">1</td>
</tr>
<tr class="even">
<td align="right">4</td>
<td align="right">9</td>
<td align="right">1</td>
</tr>
<tr class="odd">
<td align="right">8</td>
<td align="right">1</td>
<td align="right">2</td>
</tr>
<tr class="even">
<td align="right">9</td>
<td align="right">2</td>
<td align="right">2</td>
</tr>
<tr class="odd">
<td align="right">10</td>
<td align="right">3</td>
<td align="right">2</td>
</tr>
<tr class="even">
<td align="right">11</td>
<td align="right">4</td>
<td align="right">2</td>
</tr>
</tbody>
</table>
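<p>The model fits that follow refer to a data frame called <code>simpson_data</code>, whose construction is not shown. A minimal sketch, assuming it simply holds the eight rows of the table above (the names <code>slope_marginal</code> and <code>slope_adjusted</code> are introduced here only for illustration):</p>
<pre class="r"><code># Assumed construction: the eight rows of the table, one column per variable
simpson_data <- data.frame(
  x     = c(1, 2, 3, 4, 8, 9, 10, 11),
  y     = c(6, 7, 8, 9, 1, 2, 3, 4),
  color = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Slope of x when color is ignored, and when color is included in the model
slope_marginal <- coefficients(lm(y ~ x, data = simpson_data))[["x"]]
slope_adjusted <- coefficients(lm(y ~ x + color, data = simpson_data))[["x"]]</code></pre>
<p>Within each color group, <code>y</code> rises by exactly one unit per unit of <code>x</code>, which is why the adjusted slope comes out as 1 while the pooled slope is negative.</p>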
<p>If we do not consider this confounder, we find that the coefficient of x is negative (the dashed line in the figure above):</p>
<pre class="r"><code>coefficients(lm(y ~ x, data=simpson_data))</code></pre>
<pre><code>## (Intercept) x
## 8.3333333 -0.5555556</code></pre>
<p>If we do take the confounder into account, we see the coefficient of x is positive:</p>
<pre class="r"><code>coefficients(lm(y ~ x + color, data=simpson_data))</code></pre>
<pre><code>## (Intercept) x color
## 17 1 -12</code></pre>
<p>In his book <em>Causality</em>, Judea Pearl makes a more sweeping statement regarding Simpson’s paradox: “Any statistical relationship between two variables may be reversed by including additional factors in the analysis.” [Pearl2009]</p>
<p>That sounds fun; let’s try it.</p>
<p>First we’ll make variables <code>x</code> and <code>y</code> with a simple linear relationship. I’ll use the same slopes and intercepts as in the Wikipedia figure, both to show the parallel and to demonstrate the incredible cosmic power I have to bend coefficients to my will.</p>
<pre class="r"><code>set.seed(1)
N <- 3000
x <- rnorm(N)
m <- -0.5555556
b <- 8.3333333
y <- m * x + b + rnorm(length(x))
plot(x, y, col="gray", pch=20, asp=1)
fit <- lm(y ~ x)
abline(fit, lty=2, lwd=2)</code></pre>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb088ffe9b970d-pi" style="display: inline;"><img alt="Scatterplot" border="0" class="asset asset-image at-xid-6a010534b1db25970b01bb088ffe9b970d image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb088ffe9b970d-800wi" title="Scatterplot" /></a></p>
<p>When we look at the slope of the regression line determined by fitting the model, it is almost exactly equal to the constant <code>m</code> that we used to determine <code>y</code>.</p>
<pre class="r"><code>coefficients(fit)</code></pre>
<pre><code>## (Intercept) x
## 8.3284021 -0.5358175</code></pre>
<p>We get out what we put in; the coefficient of x is close to the slope we originally gave <code>y</code> when we generated it (-0.5555556). This is the ‘effect’ of <code>x</code>, in that a one unit increase in <code>x</code> apparently changes <code>y</code> by this amount (a decrease, since the slope is negative).</p>
<p>Now think about how to concoct a confounding variable to reverse the coefficient of <code>x</code>. This figure shows one way to approach the problem – group the points into a set of parallel stripes, with the stripes sloping in a different direction from the overall dataset:</p>
<pre class="r"><code>m_new <- 1 # the new coefficient we want x to have
cdf <- confounded_data_frame(x, y, m_new, num_grp=10) # see function below
striped_scatterplot(y ~ x, cdf) # also see below</code></pre>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d175a27d970c-pi" style="display: inline;"><img alt="Striped_scatterplot" border="0" class="asset asset-image at-xid-6a010534b1db25970b01b8d175a27d970c image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d175a27d970c-800wi" title="Striped_scatterplot" /></a></p>
<p>The stripes were made by specifying a reference line with a slope equal to the x-coefficient we want to achieve, and calculating the distance to that line for each point. Putting these distances into categories (by rounding off some multiple of the distance) then groups the points into stripes (shown as colors in the figure). A regression line was then fitted separately to the set of points within each stripe. The regression lines for the stripes on the very ends can be a bit wild, since these groups are very small and scattered, but the ones near the center, representing the majority of the data points, have a quite consistent slope.</p>
<p>The equation for determining the <a href="https://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line">distance from a point to a line</a> is (of course) right there in Wikipedia.</p>
<p>With a little rearranging to express the line in terms of y-intercept (<code>b</code>) and slope (<code>m</code>), and leaving off the absolute value so that points below the line have negative distances (and thus end up in a different group from the stripe with a positive distance of the same magnitude), we get this function:</p>
<pre class="r"><code>point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)</code></pre>
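<p>As a quick sanity check of the signed distance, here is a small sketch with arbitrarily chosen test points (they are not part of the original analysis). For the reference line through the origin with slope 1, a point one unit directly above the line should be 1/sqrt(2) away, the mirror-image point below should get the same magnitude with a negative sign, and a point on the line should get zero:</p>
<pre class="r"><code>point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)

# Reference line: y = 1*x + 0
d_above <- point_line_distance(b = 0, m = 1, x = 0, y = 1)   # above the line: positive
d_below <- point_line_distance(b = 0, m = 1, x = 0, y = -1)  # below the line: negative
d_on    <- point_line_distance(b = 0, m = 1, x = 2, y = 2)   # on the line: zero

print(c(above = d_above, below = d_below, on = d_on))</code></pre>
<p>Keeping the sign is what lets the later grouping step tell a stripe above the reference line apart from the stripe at the same distance below it.</p>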
<p>Here are functions for putting the points into stripewise groups, determining the regression coefficients for each group, and putting it all together into a figure:</p>
<pre class="r"><code>confounded_data_frame <- function(x, y, m, num_grp){
  b <- 0 # intercept doesn't matter
  d <- point_line_distance(b, m, x, y)
  # rescale distances into (0, 1) so that ceiling() yields groups 1..num_grp
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d)) # avoid 0 and 1
  data.frame(x=x, y=y,
    group=as.factor(sprintf("grp%02d", ceiling(num_grp*(d_scaled)))))
}

find_group_coefficients <- function(data){
  # fit y ~ x separately within each group; one row of coefficients per group
  coef <- t(sapply(levels(data$group),
    function(grp) coefficients(lm(y ~ x, data=data[data$group==grp,]))))
  # drop groups where the intercept or slope could not be estimated
  coef[!is.na(coef[,1]) & ! is.na(coef[,2]),]
}

striped_scatterplot <- function(formula, grouped_data){
  # blue on top and red on bottom, to match the Wikipedia figure
  colors <- rev(rainbow(length(levels(grouped_data$group)), end=2/3))
  plot(formula, grouped_data, bg=colors[grouped_data$group], pch=21, asp=1)
  grp_coef <- find_group_coefficients(grouped_data)
  # if some coefficients get dropped, colors won't match exactly
  for (r in 1:nrow(grp_coef))
    abline(grp_coef[r,1], grp_coef[r,2], col=colors[r], lwd=2)
}</code></pre>
<p>Note that the regression lines for each group are not exactly parallel to the stripes. This is because linear regression is about minimizing the squared error on the y-axis, not the distance of points from the line. However, the thinner the stripes are, the closer the group regression lines are to our target slope. If we make a large number of thin stripes, the coefficient of <code>x</code> when the groups are taken into account is essentially the same as the slope of the reference line we used to orient the stripes:</p>
<pre class="r"><code>cdf100 <- confounded_data_frame(x, y, m_new, num_grp=100)
# without confounder
coefficients(lm(y ~ x, cdf100))['x']</code></pre>
<pre><code>## x
## -0.5358175</code></pre>
<pre class="r"><code># with confounder
coefficients(lm(y ~ x + group, cdf100))['x']</code></pre>
<pre><code>## x
## 0.9961566</code></pre>
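<p>To see the effect of stripe width more directly, one could sweep the number of stripes and record the adjusted coefficient of <code>x</code> each time. The sketch below repeats the data generation and helper functions from above so that it runs on its own; the stripe counts in <code>stripe_counts</code> and the name <code>adjusted_slope</code> are arbitrary choices made for illustration:</p>
<pre class="r"><code>set.seed(1)
N <- 3000
x <- rnorm(N)
y <- -0.5555556 * x + 8.3333333 + rnorm(length(x))

point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)

confounded_data_frame <- function(x, y, m, num_grp){
  d <- point_line_distance(0, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d))
  data.frame(x=x, y=y,
    group=as.factor(sprintf("grp%02d", ceiling(num_grp*(d_scaled)))))
}

# Coefficient of x, adjusted for group, for several stripe counts
stripe_counts <- c(10, 30, 100)  # illustrative values only
adjusted_slope <- sapply(stripe_counts, function(k)
  coefficients(lm(y ~ x + group, confounded_data_frame(x, y, 1, k)))[["x"]])
names(adjusted_slope) <- stripe_counts
print(adjusted_slope)</code></pre>
<p>With more, thinner stripes there is less room for <code>y</code> to vary around the reference slope inside each group, so the adjusted coefficient should move toward the target value of 1.</p>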
<p>This approach gives us the power to synthesize simulated confounders that can change the coefficient of <code>x</code> to pretty much any value we choose when a model is fitted with the confounder taken into account. Plus, it makes pretty rainbows.</p>
<p>While Simpson’s Paradox is typically described in terms of categorical confounders, the same reversal principle applies to continuous confounders. But that’s a topic for another post.</p>
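<p>As a rough preview of the continuous case, one could skip the grouping step and use the signed distance itself as a numeric covariate. This is only a sketch of the mechanism, not the follow-up analysis referred to above, and the variable names are hypothetical. Because <code>d</code> is an exact linear function of <code>x</code> and <code>y</code>, the fit is perfect by construction, so this is not a realistic confounder:</p>
<pre class="r"><code>set.seed(1)
x <- rnorm(3000)
y <- -0.5555556 * x + 8.3333333 + rnorm(length(x))

m_new <- 1  # the coefficient we want x to have
# Signed distance to the reference line y = m_new * x, kept as a number
d <- (y - m_new * x) / sqrt(m_new^2 + 1)

slope_unadjusted <- coefficients(lm(y ~ x))[["x"]]      # negative, as before
slope_adjusted   <- coefficients(lm(y ~ x + d))[["x"]]  # equals m_new up to rounding
print(c(unadjusted = slope_unadjusted, adjusted = slope_adjusted))</code></pre>
<p>Rearranging the definition of <code>d</code> gives <code>y = m_new * x + sqrt(m_new^2 + 1) * d</code>, which is exactly the model being fitted; the striped version can be read as a coarsened form of the same idea.</p>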
<div class="section level2" id="references">
<h2>References</h2>
<p>[Pearl2009]: Pearl, J. Causality: Models, Reasoning and Inference (2nd ed.). Cambridge University Press, New York, 2009.</p>
</div>
</div>
</div>
<script>// <![CDATA[
// add bootstrap table styles to pandoc tables
$(document).ready(function () {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
// ]]></script>
<script>// <![CDATA[
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
// ]]></script>RstatisticsJoseph Rickert2015-11-17T08:30:00-08:00New surveys show continued popularity of R
http://blog.revolutionanalytics.com/2015/11/new-surveys-show-continued-popularity-of-r.html
Two recent surveys — one based on LinkedIn skills data, and another a direct survey of data miners — show that R remains the most popular software for statistical data analysis. In a study of skills associated with LinkedIn profiles by RJmetrics (and also reported on Forbes), "data analysis" was unsurprisingly the skill most associated with self-proclaimed data scientists. Of specific software skills listed, R was the most common, closely followed by Python. Meanwhile, Karl Rexer is preparing the results from his latest survey of data analytics professionals. When the survey was last published in 2013, R was the most...<p>Two recent surveys — one based on LinkedIn skills data, and another a direct survey of data miners — show that R remains the <a href="http://blog.revolutionanalytics.com/popularity/">most popular software for statistical data analysis</a>.</p>
<p>In a study of <a href="https://rjmetrics.com/resources/reports/the-state-of-data-science/">skills associated with LinkedIn profiles by RJmetrics</a> (and also <a href="http://www.forbes.com/sites/gilpress/2015/10/21/the-number-of-data-scientists-has-doubled-over-the-last-4-years/">reported on Forbes</a>), "data analysis" was unsurprisingly the skill most associated with self-proclaimed data scientists. Of specific software skills listed, R was the most common, closely followed by Python.</p>
<p><a class="asset-img-link" href="http://a5.typepad.com/6a0105360ba1c6970c01b8d1772ea5970c-pi"><img alt="RJMetrics R Skills" class="asset asset-image at-xid-6a0105360ba1c6970c01b8d1772ea5970c img-responsive" src="http://a5.typepad.com/6a0105360ba1c6970c01b8d1772ea5970c-450wi" style="width: 450px; display: block; margin-left: auto; margin-right: auto;" title="RJMetrics R Skills" /></a></p>
<p>Meanwhile, Karl Rexer is preparing the results from his latest <a href="http://www.rexeranalytics.com/">survey of data analytics professionals</a>. When the survey was <a href="http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html">last published in 2013</a>, R was the most popular tool and it was growing rapidly in popularity. Karl has shared some preliminary data from the upcoming survey report (now titled the "2015 Data Science Survey"), where R's popularity continues to surge. 76% of respondents use R for data analysis, and 36% use R as their primary tool (compared to 7% for the runner-up, SAS). R has continued to grow in both measures in every year since the Rexer survey was launched:</p>
<p><a class="asset-img-link" href="http://a7.typepad.com/6a0105360ba1c6970c01b8d1773127970c-pi"><img alt="Rexer 2015 preview" class="asset asset-image at-xid-6a0105360ba1c6970c01b8d1773127970c img-responsive" src="http://a7.typepad.com/6a0105360ba1c6970c01b8d1773127970c-450wi" style="width: 450px; display: block; margin-left: auto; margin-right: auto;" title="Rexer 2015 preview" /></a></p>
<p>We'll have more coverage on the 2015 Rexer Data Science Survey when the complete results are published later this year.</p>popularityRDavid Smith2015-11-16T13:19:36-08:00Because it's Friday: Magnets, how do they work?
http://blog.revolutionanalytics.com/2015/11/because-its-friday-magnets.html
Despite some skepticism (audio NSFW), we have a pretty good grasp on how magnets work, and some people have used that knowledge to do some pretty cool things with them. For example, a Norwegian startup is trying to launch a magnet-based game that allows you to create some spectacular chain reactions: Meanwhile, some scientists demonstrate the power of the 4 Tesla magnet contained within a decommissioned MRI machine (via IFLscience): Also check out what happened when they finally did decommission the MRI (and especially the FAQ in the comments, where I learned about the dangers of "air ice"). That's all...<p>Despite <a href="https://youtu.be/_-agl0pOQfs?t=1m48s">some skepticism</a> (audio NSFW), we have a pretty good grasp on how magnets work, and some people have used that knowledge to do some pretty cool things with them. For example, a <a href="http://www.magination.no/">Norwegian startup</a> is trying to launch a magnet-based game that allows you to create some spectacular chain reactions:</p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb0890647a970d-pi" style="display: inline;"><img alt="Epic-collapse_1" border="0" class="asset asset-image at-xid-6a010534b1db25970b01bb0890647a970d image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb0890647a970d-800wi" title="Epic-collapse_1" /></a></p>
<p>Meanwhile, some scientists demonstrate the power of the 4 Tesla magnet contained within a decommissioned MRI machine (<a href="http://www.iflscience.com/health-and-medicine/how-dangerous-are-magnetic-items-near-mri-machine">via</a> IFLscience):</p>
<p><iframe allowfullscreen="" frameborder="0" height="281" src="https://www.youtube.com/embed/6BBx8BwLhqg" width="500"></iframe> </p>
<p>Also <a href="https://www.youtube.com/watch?v=9SOUJP5dFEg">check out what happened</a> when they finally did decommission the MRI (and especially the FAQ in the comments, where I learned about the dangers of "air ice").</p>
<p>That's all for this week! See you back here on Monday, and have a great weekend.</p>randomDavid Smith2015-11-13T15:21:14-08:00In case you missed it: October 2015 roundup
http://blog.revolutionanalytics.com/2015/11/in-case-you-missed-it-october-2015-roundup.html
In case you missed them, here are some articles from October of particular interest to R users. A video from the PASS 2015 conference in Seattle shows R running within SQL Server 2016. The preview for SQL Server 2016 includes Revolution R Enterprise (as SQL Server R Services). A way of dealing with confounding variables in experiments: instrumental variable analysis with the ivmodel package for R. The new dplyrXdf package allows you to manipulate large, out-of-memory data sets in the XDF format (used by the RevoScaleR package) using dplyr syntax. Some guidelines for using explicit parallel programming (e.g. the parallel...<p>In case you missed them, here are some articles from October of particular interest to <a href="https://mran.revolutionanalytics.com/documents/what-is-r/" target="_self">R</a> users. </p>
<p>A video from the PASS 2015 conference in Seattle shows <a href="http://blog.revolutionanalytics.com/2015/10/demo-r-in-sql-server-2016.html">R running within SQL Server 2016</a>. The preview for <a href="http://blog.revolutionanalytics.com/2015/10/revolution-r-now-available-with-sql-server-community-preview.html">SQL Server 2016 includes Revolution R Enterprise</a> (as SQL Server R Services). </p>
<p>A way of dealing with confounding variables in experiments: <a href="http://blog.revolutionanalytics.com/2015/10/instrumental-variables.html">instrumental variable analysis with the ivmodel package</a> for R.</p>
<p>The new <a href="http://blog.revolutionanalytics.com/2015/10/the-dplyrxdf-package.html">dplyrXdf package</a> allows you to <a href="http://blog.revolutionanalytics.com/2015/10/using-the-dplyrxdf-package.html">manipulate large, out-of-memory data sets in the XDF format</a> (used by the RevoScaleR package) using dplyr syntax.</p>
<p>Some <a href="http://blog.revolutionanalytics.com/2015/10/edge-cases-in-using-the-intel-mkl-and-parallel-programming.html">guidelines for using explicit parallel programming</a> (e.g. the parallel package) with the implicit multithreading provided by Revolution R Open.</p>
<p><a href="http://blog.revolutionanalytics.com/2015/10/ross-ihaka-in-the-economist.html">Ross Ihaka was featured</a> in a full-page advertisement for the University of Auckland in The Economist.</p>
<p>A comparison of <a href="http://blog.revolutionanalytics.com/2015/10/party-with-the-first-tribe.html">fitting decision trees in R with the party and rpart packages</a>.</p>
<p>The <a href="http://blog.revolutionanalytics.com/2015/10/updates-to-the-foreach-package-and-its-friends.html">foreach suite of packages for parallel programming in R has been updated</a>, and now includes support for progress bars when using doSNOW.</p>
<p>The "reach" package allows you to <a href="http://blog.revolutionanalytics.com/2015/10/reach-for-your-matlab-data-with-r.html">call Matlab functions directly from R</a>.</p>
<p>A review of <a href="http://blog.revolutionanalytics.com/2015/10/the-5th-tribe-support-vector-machines-and-caret.html">support vector machines (SVMs) in R</a>.</p>
<p>A presentation (with sample code) shows <a href="http://blog.revolutionanalytics.com/2015/10/previewing-using-revolution-r-enterprise-inside-sql-server.html">how to call Revolution R Enterprise from SQL Server 2016</a>.</p>
<p>A tutorial on using the miniCRAN package to <a href="http://blog.revolutionanalytics.com/2015/10/using-minicran-in-azure-ml.html">set up packages for use with R in Azure ML</a>.</p>
<p>Asif Salam shows how to use the RDCOMClient package to <a href="http://blog.revolutionanalytics.com/2015/10/programmatically-create-interactive-powerpoint-slides-with-r.html">construct interactive PowerPoint slide shows with R</a>.</p>
<p>A <a href="http://blog.revolutionanalytics.com/2015/10/learning-r-oct-2015.html">directory of online R courses</a> for all skill levels.</p>
<p>Using R's <span style="font-family: 'courier new', courier;">nls()</span> optimizer to <a href="http://blog.revolutionanalytics.com/2015/10/parameters-and-percentiles-the-gamma-distribution.html">solve a problem in Bayesian inference</a>.</p>
<p>A professor uses the miniCRAN package to <a href="http://blog.revolutionanalytics.com/2015/10/using-minicran-on-site-in-iran.html">deliver R packages to offline facilities</a> in Turkey and Iran.</p>
<p>Amanda Cox, graphics editor at the New York Times, <a href="http://blog.revolutionanalytics.com/2015/10/amanda-cox-on-using-r-at-the-nyt.html">calls R "the greatest software on Earth"</a> in a podcast.</p>
<p><a href="http://blog.revolutionanalytics.com/2015/10/hadley-wickhams-ask-me-anything-on-reddit.html">Hadley Wickham answered many questions</a> in a Reddit "Ask Me Anything" session.</p>
<p>A roundup of <a href="http://blog.revolutionanalytics.com/2015/10/r-user-groups-highlight-r-creativity.html">several talks given at R user group meetings</a> around the world.</p>
<p>General interest stories (not related to R) in the past month included: <a href="http://blog.revolutionanalytics.com/2015/10/chess-piece-moves.html">visualizing the movements of chess pieces</a>, <a href="http://blog.revolutionanalytics.com/2015/10/because-its-friday-faceon.html">real-time face replication</a>, a <a href="http://blog.revolutionanalytics.com/2015/10/because-its-friday-mapping-antineutrinos.html">world map of antineutrinos</a>, a <a href="http://blog.revolutionanalytics.com/2015/10/because-its-friday-a-transformation.html">gender transformation</a>, and a <a href="http://blog.revolutionanalytics.com/2015/10/because-its-friday-are-we-selling-radium-underpants.html">warning about "big data" applications</a>.</p>
<p>As always, thanks for the comments and please send any suggestions to me at <a href="mailto:davidsmi@microsoft.com">davidsmi@microsoft.com</a>. Don't forget you can follow the blog using an RSS reader, via <a href="http://blogtrottr.com/" target="_self">email using blogtrottr</a>, or by following me on Twitter (I'm <a href="http://twitter.com/revodavid">@revodavid</a>). You can find roundups of previous months <a href="http://blog.revolutionanalytics.com/roundups/">here</a>.</p>RrandomDavid Smith2015-11-13T08:29:07-08:00