Revolutions
http://blog.revolutionanalytics.com/
Learn more about using open source R for big data analysis, predictive modeling, data science and more from the staff of Revolution Analytics.
en-US
2015-10-08T08:30:00-07:00
Learning R: Index of Online R Courses, October 2015
http://blog.revolutionanalytics.com/2015/10/learning-r-oct-2015.html
<p>by Joseph Rickert</p>
<p>Early October: somewhere the leaves are turning brilliant colors, temperatures are cooling down and that back to school feeling is in the air. And for more people than ever before, it is going to seem to be a good time to commit to really learning R. I have some suggestions for R courses below, but first: What does it mean to learn R anyway? My take is that the answer depends on a person's circumstances and motivation.</p>
<p>I find the following graphic to be helpful in sorting things out.</p>
<p>  <a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb087dc1b0970d-pi" style="display: inline;"><img alt="Untitled" border="0" class="asset asset-image at-xid-6a010534b1db25970b01bb087dc1b0970d image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb087dc1b0970d-800wi" title="Untitled" /></a></p>
<p>The X axis is time on Malcolm Gladwell's "<a href="http://www.amazon.com/Outliers-Story-Success-Malcolm-Gladwell/dp/0316017930/ref=sr_1_1?ie=UTF8&qid=1444196385&sr=8-1&keywords=outliers" target="_self">Outliers</a>" scale. His idea is that it takes 10,000 hours of real effort to master anything, whether R, Python or rock and roll guitar. The Y axis lists increasingly difficult R tasks, and the arrows within the plot area label increasingly proficient types of R users.</p>
<p>The point I want to make here is that a significant amount of very productive R work happens in the area around the red ellipse. So, while there is no avoiding "10,000" hours of hard work to become an R Jedi knight, a curious and motivated person can master enough R to accomplish his or her programming goals with a more modest commitment. There are three main reasons for this:</p>
<ol>
<li>R's functional programming style is very well suited for statistical modeling, data visualization and data science tasks</li>
<li>The 7,000<sup>+</sup> packages available in the R ecosystem provide tens of thousands of functions that make it possible to accomplish quite a bit without having to write much code</li>
<li>Numerous high-quality books and online materials are devoted to teaching statistical theory and data science with R</li>
</ol>
<p>If you have a background in some area of statistics or data science, a viable strategy for learning R is to identify a resource that works for you and just jump into the middle of things, picking up R as you go along.</p>
<p>The lists below link to courses that can either start you on a formal programming path, or help you become a productive R user in a particular application area. Some of the courses are "live events" that you take with a cohort of students, others are set up for self study.</p>
<p>The courses devoted to teaching R as a programming language are</p>
<ul>
<li><a href="https://www.coursera.org/course/datascitoolbox?utm_medium=email&utm_source=recommendations&utm_campaign=recommendationsEmail%7Erecs_email_2015_10_04" target="_self">The Data Scientist’s Toolbox</a> </li>
<li><a href="https://www.coursera.org/course/rprog" target="_self">R Programming</a></li>
<li><a href="https://www.edx.org/course/introduction-r-programming-microsoft-dat204x-0?gclid=CLeklP3zrMgCFVKPfgodhGAA0A" target="_self">Introduction to R Programming</a></li>
<li><a href="https://www.datacamp.com/courses/free-introduction-to-r" target="_self">Introduction to R</a></li>
<li><a href="http://www.statistics.com/r-programming-intro-1/" target="_self">R Programming - Introduction 1</a></li>
<li><a href="https://www.coursera.org/learn/programacion-estadistica-r" target="_self">Introduction a la programacion estadistica con R</a></li>
<li><a href="http://tryr.codeschool.com/" target="_self">O’Reilly Code School</a></li>
</ul>
<p>The first two courses above are from Coursera's <a data-reactid=".1n37nbui48w.0.1.0.0.0.1.1.1.0.3:$1.0.1" href="https://www.coursera.org/specializations/jhudatascience?utm_medium=courseDescripTop">Data Science Specialization</a> sequence. Taught by Roger Peng, Jeff Leek and Brian Caffo, they are probably the gold standard for MOOC R courses. I am a little late with this post: The Data Scientist's Toolbox started this past Monday, but there is still time to catch up. The third course, Introduction to R Programming, is a relatively new edX course from <a href="https://www.edx.org/school/microsoft" target="_self">Microsoft's online offerings</a> that is getting great <a href="https://www.edx.org/course/introduction-r-programming-microsoft-dat204x-0?gclid=CLeklP3zrMgCFVKPfgodhGAA0A#!" target="_self">reviews</a>. The fourth course on the list is a solid introduction to R from DataCamp. R Programming - Introduction 1 is a beginner's introduction to R taught by Paul Murrell or Tal Galili. Next listed are a Spanish-language introduction to R from Coursera and O'Reilly's interactive Code School course.</p>
<p>These next three lists contain courses from DataCamp and statistics.com and online resources from RStudio that introduce more advanced features of R by building on basic R programming skills. Note that the final course on the DataCamp list introduces Big Data features of Revolution R Enterprise, which is available in the <a href="https://azure.microsoft.com/en-us/marketplace/partners/revolution-analytics/revolution-r-enterprise/" target="_self">Azure Marketplace</a>.</p>
<p><a href="https://www.datacamp.com/" target="_self">DataCamp</a></p>
<ul>
<li><a href="https://www.datacamp.com/courses/intermediate-r" target="_self">Intermediate R</a></li>
<li><a href="https://www.datacamp.com/courses/ggvis-data-visualization-r-tutorial" target="_self">Data Visualization in R with ggvis</a></li>
<li><a href="https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial" target="_self">Data Manipulation with dplyr</a></li>
<li><a href="https://www.datacamp.com/courses/data-table-data-manipulation-r-tutorial" target="_self">Data Analysis in R, the data.table Way</a></li>
<li><a href="https://www.datacamp.com/courses/reporting-with-r-markdown" target="_self">Reporting with R Markdown</a></li>
<li><a href="https://www.datacamp.com/courses/big-data-revolution-r-enterprise-tutorial" target="_self">Big Data Analysis with Revolution R Enterprise</a></li>
</ul>
<p><a href="http://www.statistics.com/" target="_self">statistics.com</a></p>
<ul>
<li><a href="http://www.statistics.com/r-programming-intro-2/" target="_self">R Programming Intro 2</a></li>
<li><a href="http://www.statistics.com/r-programming-advanced/">R Programming  Advanced</a></li>
<li><a href="http://www.statistics.com/r-programming-intermediate/" target="_self">R Programming Interm </a></li>
<li><a href="http://www.statistics.com/graphics-in-r/">R Graphics</a></li>
<li><a href="http://www.statistics.com/visualization-in-r-with-ggplot2/">R ggplot2</a></li>
</ul>
<p><a href="https://www.rstudio.com/" target="_self">RStudio</a></p>
<ul>
<li><a href="https://www.rstudio.com/resources/training/online-learning/" target="_self">RStudio Online Learning</a></li>
<li><a href="http://blog.rstudio.org/2014/11/06/introduction-to-data-science-with-r-video-workshop/" target="_self">Introduction to Data Science with R video workshop</a></li>
</ul>
<p>This next section lists courses that use R to teach various quantitative disciplines. They come from the major MOOCs and from the non-MOOC providers DataCamp and statistics.com.</p>
<p><strong><a href="https://www.coursera.org/" target="_self">Coursera</a> Courses</strong></p>
<ul>
<li><a href="https://www.coursera.org/course/statistics" target="_self">Data Analysis and Statistical Inference</a></li>
<li><a href="https://www.coursera.org/course/devdataprod" target="_self">Developing Data Products</a></li>
<li><a href="https://www.coursera.org/course/exdata" target="_self">Exploratory Data Analysis</a></li>
<li><a href="https://www.coursera.org/course/getdata" target="_self">Getting and Cleaning Data</a></li>
<li><a href="https://www.coursera.org/course/compfinance" target="_self">Introduction to Computational Finance and Financial Econometrics</a></li>
<li><a href="https://www.coursera.org/course/causaleffects" target="_self">Measuring Causal Effects in the Social Sciences</a></li>
<li><a href="https://www.coursera.org/course/regmods" target="_self">Regression Models</a></li>
<li><a href="https://www.coursera.org/course/repdata" target="_self">Reproducible Research</a></li>
<li><a href="https://www.coursera.org/course/statinference" target="_self">Statistical Inference</a></li>
<li><a href="https://www.coursera.org/course/stats1" target="_self">Statistics One</a></li>
</ul>
<p><strong><a href="https://www.edx.org/" target="_self">edX</a> Courses</strong></p>
<ul>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-1-statistics-harvardx-ph525-1x" target="_self">Data Analysis for Life Sciences 1: Statistics and R</a></li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-2-harvardx-ph525-2x" target="_self">Data Analysis for Life Sciences 2: Introduction to Linear Models and Matrix Algebra</a></li>
<li><a href="https://www.edx.org/course/data-analysis-life-sciences-6-high-harvardx-ph525-6x" target="_self">Data Analysis for Life Sciences 6: High-performance Computing for Reproducible Genomics</a></li>
<li><a href="https://www.edx.org/course/explore-statistics-r-kix-kiexplorx-0" target="_self">Explore Statistics with R</a></li>
<li><a href="https://www.edx.org/course/sabermetrics-101-introduction-baseball-bux-sabr101x-0" target="_self">Sabermetrics 101: Introduction to Baseball Analytics</a></li>
</ul>
<p><strong><a href="https://www.udacity.com/" target="_self">Udacity</a> Course</strong></p>
<ul>
<li><a href="https://www.edx.org/course/sabermetrics-101-introduction-baseball-bux-sabr101x-0" target="_self">Sabermetrics 101: Introduction to Baseball Analytics</a></li>
</ul>
<p>DataCamp</p>
<ul>
<li><a href="https://www.datacamp.com/courses/introduction-to-machine-learning-with-R" target="_self">Introduction to Machine Learning</a></li>
<li><a href="https://www.datacamp.com/introduction-to-statistics" target="_self">A Hands-On Introduction to Statistics with R</a></li>
</ul>
<p>statistics.com</p>
<ul>
<li><a href="http://www.statistics.com/bayesian-r/">Bayesian - R</a></li>
<li><a href="http://www.statistics.com/data-mining-r/">Data Mining - R</a></li>
<li><a href="http://www.statistics.com/mapping-r/">Mapping in R</a></li>
<li><a href="http://www.statistics.com/modeling-in-r/">R Modeling</a></li>
<li><a href="http://www.statistics.com/r-for-statistical-analysis/">R Statistics</a></li>
</ul>
<p>Finally, here are a couple of Google apps and Swirl, a new platform for teaching and learning R, that may be useful for learning on the go.</p>
<ul>
<li><a href="https://play.google.com/store/apps/details?id=appinventor.ai_RInstructor.R2" target="_self">R instructor</a></li>
<li><a href="https://play.google.com/store/apps/details?id=com.spykeburn.rprogrammingtutorial" target="_self">R Programming</a></li>
<li><a href="http://swirlstats.com/" target="_self">Swirl</a></li>
</ul>
<p>It's time to "go back to school" and make some headway against those 10,000 hours.</p>
advanced tips, beginner tips, courses, Microsoft, R
Joseph Rickert
2015-10-08T08:30:00-07:00
Parameters and percentiles (the gamma distribution)
http://blog.revolutionanalytics.com/2015/10/parameters-and-percentiles-the-gamma-distribution.html
<p>by Andrie de Vries</p>
<p>In one of <a href="http://www.johndcook.com/blog/top/" target="_self">John D. Cook</a>'s blog posts of 2010 (<a href="http://www.johndcook.com/blog/2010/01/31/parameters-from-percentiles/" target="_self">Parameters and Percentiles</a>), he poses the following problem:</p>
<p style="padding-left: 30px;"><em>The doctor says 10% of patients respond within 30 days of treatment and 80% respond within 90 days of treatment. Now go turn that into a probability distribution. That’s a common task in Bayesian statistics, capturing expert opinion in a mathematical form to create a prior distribution.</em></p>
<p>John then discusses how this level of information is highly valuable in statistical inference. The reason is that quite often this is the kind of information you might be able to elicit from a domain expert. It is then up to you as the statistician / data scientist to use this information. In <a href="https://en.wikipedia.org/wiki/Bayesian" target="_self">Bayesian</a> statistics, for example, you can use this information to construct a Bayesian prior distribution. In particular, he demonstrates how this expectation can be modeled with a <a href="https://en.wikipedia.org/wiki/Gamma_distribution" target="_self">gamma distribution</a> and shows how to solve the problem analytically.</p>
<p>In this post I demonstrate how to solve the problem using the <a href="https://en.wikipedia.org/wiki/Non-linear_least_squares" target="_self">non-linear least squares</a> solver in R, using the <span style="font-family: 'courier new', courier;">nls()</span> function.</p>
<p>But first, take a look at some of the properties of the <a href="https://en.wikipedia.org/wiki/Gamma_distribution" target="_self">gamma distribution</a>. The gamma is a general family of distributions. Both the <a href="https://en.wikipedia.org/wiki/Exponential_distribution" target="_self">exponential</a> and the <a href="https://en.wikipedia.org/wiki/Chi-squared_distribution" target="_self">chi-squared</a> distributions are special cases of the gamma.</p>
<p>The gamma distribution takes two parameters. The first defines the shape and the second the scale. If shape equals 1, the gamma is exactly the exponential distribution. If shape is k/2 and scale is 2, the gamma is the chi-squared distribution with k degrees of freedom.</p>
<p>To create the plots, you can use the function <span style="font-family: 'courier new', courier;">curve()</span> to do the actual plotting, and <span style="font-family: 'courier new', courier;">dgamma()</span> to compute the gamma density. In this grid of plots, the shape parameter varies horizontally (from 1 on the left to 6 on the right). At the same time, the scale parameter varies vertically (from 0.1 at the top to 1.0 at the bottom).</p>
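<p>The two special cases can be checked numerically. The following is a minimal sketch using only base R density functions: a gamma with shape 1 should match the exponential with rate 1/scale, and a gamma with shape k/2 and scale 2 should match the chi-squared with k degrees of freedom. The grid <span style="font-family: 'courier new', courier;">x</span> and the values of <span style="font-family: 'courier new', courier;">scale</span> and <span style="font-family: 'courier new', courier;">k</span> are arbitrary choices for illustration, not values from John's post.</p>

```r
# Evaluate the densities on a grid of positive values.
x <- seq(0.1, 10, by = 0.1)

# Shape 1: the gamma density equals the exponential density with rate 1 / scale.
scale <- 2  # arbitrary value, chosen for illustration
gamma_vs_exp <- isTRUE(all.equal(dgamma(x, shape = 1, scale = scale),
                                 dexp(x, rate = 1 / scale)))

# Shape k / 2 and scale 2: the gamma density equals the chi-squared density
# with k degrees of freedom.
k <- 5  # arbitrary degrees of freedom, chosen for illustration
gamma_vs_chisq <- isTRUE(all.equal(dgamma(x, shape = k / 2, scale = 2),
                                   dchisq(x, df = k)))

print(c(exponential = gamma_vs_exp, chi_squared = gamma_vs_chisq))
```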
<p><a class="asset-img-link" style="display: inline;" onclick="window.open( this.href, '_blank', 'width=640,height=480,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0' ); return false" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb087e3044970d-popup"><img class="asset asset-image at-xid-6a010534b1db25970b01bb087e3044970d img-responsive" title="Gamma" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb087e3044970d-500wi" alt="Gamma" /></a></p>
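<p>A grid like this one could be drawn roughly as follows. This is only a sketch, not the code behind the figure: the text gives just the end points of each parameter range, so the intermediate scale values, the x-axis range and the panel margins are all assumptions.</p>

```r
# Assumed parameter grid: only the end points (shape 1 to 6, scale 0.1 to 1.0)
# come from the text; the intermediate scale values are guesses.
shapes <- 1:6
scales <- c(0.1, 0.4, 0.7, 1.0)

# One panel per (scale, shape) pair: rows are scales, columns are shapes.
old_par <- par(mfrow = c(length(scales), length(shapes)),
               mar = c(2, 2, 2, 1))
for (sc in scales) {
  for (sh in shapes) {
    # curve() evaluates the expression over x; dgamma() gives the density.
    curve(dgamma(x, shape = sh, scale = sc), from = 0, to = 5,
          xlab = "", ylab = "", cex.main = 0.8,
          main = sprintf("shape = %d, scale = %.1f", sh, sc))
  }
}
par(old_par)  # restore the previous graphics settings
```

<p>Because <span style="font-family: 'courier new', courier;">par(mfrow = ...)</span> fills the grid row by row, the outer loop runs over scale and the inner loop over shape.</p>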
<p>Next, you can use the function <span style="font-family: 'courier new', courier;">nls()</span> to solve the problem as posed by John Cook. The <span style="font-family: 'courier new', courier;">nls()</span> function takes a model formula and estimates the parameters that minimise the sum of squared residuals, which acts as the loss function here. In the posed problem, the residuals are the differences between the quantiles of a hypothetical gamma distribution, calculated by <span style="font-family: 'courier new', courier;">qgamma()</span>, and the expected values posed by the problem.</p>
<p>The <span style="font-family: 'courier new', courier;">nls()</span> solver is sensitive to the starting conditions, but easily finds a solution:</p>
<p><a class="asset-img-link" style="display: inline;" onclick="window.open( this.href, '_blank', 'width=640,height=480,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0' ); return false" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d16421ca970c-popup"><img class="asset asset-image at-xid-6a010534b1db25970b01b8d16421ca970c img-responsive" title="Solution" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d16421ca970c-500wi" alt="Solution" /></a></p>
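<p>The same idea can be sketched with the general-purpose minimiser <span style="font-family: 'courier new', courier;">optim()</span> in place of <span style="font-family: 'courier new', courier;">nls()</span>. This is an alternative illustration under stated assumptions, not the code from the gist linked at the end of this post: the squared-error loss, the log-scale parameterisation and the starting values are my own choices.</p>

```r
# Expert statements from the problem: P(T <= 30) = 0.1 and P(T <= 90) = 0.8.
probs <- c(0.1, 0.8)
days  <- c(30, 90)

# Sum of squared differences between the gamma quantiles and the target days.
# Parameters are optimised on the log scale so shape and scale stay positive.
loss <- function(log_par) {
  q <- qgamma(probs, shape = exp(log_par[1]), scale = exp(log_par[2]))
  sum((q - days)^2)
}

# Starting values (shape 3, scale 20) are a rough guess, not from the post.
fit <- optim(par = log(c(3, 20)), fn = loss,
             control = list(reltol = 1e-12, maxit = 2000))
shape_hat <- exp(fit$par[1])
scale_hat <- exp(fit$par[2])

# Check: the fitted distribution should reproduce the two stated percentiles.
fitted_probs <- pgamma(days, shape = shape_hat, scale = scale_hat)
print(c(shape = shape_hat, scale = scale_hat))
print(fitted_probs)
```

<p>With two statements and two parameters the loss can be driven essentially to zero, so comparing <span style="font-family: 'courier new', courier;">fitted_probs</span> with 0.1 and 0.8 is a quick way to confirm the fit.</p>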
<p>To replicate this example, you can use this code:</p>
<p> </p>
<script src="https://gist.github.com/andrie/2ea43547ee02f3aa0e36.js"></script>
advanced tips, data science, R, statistics
Andrie de Vries
2015-10-07T09:30:33-07:00
Using miniCRAN on site in Iran
http://blog.revolutionanalytics.com/2015/10/using-minicran-on-site-in-iran.html
<p>by <a href="http://plen.ku.dk/english/employees/?pure=en/persons/88066" target="_self">Jens Carl Streibig</a>, Professor Emeritus at University of Copenhagen</p>
<p><em>Editor's introduction: for background on the miniCRAN package, see our previous blog posts:</em></p>
<ul>
<li><em><a href="http://blog.revolutionanalytics.com/2014/10/introducing-minicran.html" target="_self">Introducing miniCRAN</a></em></li>
<li><em><a href="http://blog.revolutionanalytics.com/2015/04/r-in-myanmar.html" target="_self">Using R in Myanmar</a></em></li>
</ul>
<p>MiniCRAN saves my neck when out in regions where a seamlessly running internet is an exception rather than the rule. R is definitely the programme to offer universities and research institutions in agriculture because it is open source, no money involved, and the help, although sometimes a bit nerdy, is easy to access. I usually tell my students not to buy books on specific topics because R is dynamic and within a couple of years some of the functions in the book are obsolete, which discourages the average user. Look instead at the documentation at <a href="https://cran.r-project.org/manuals.html" target="_self">r-project.org</a> or on <a href="http://rseek.org/" target="_self">rseek.org</a>.</p>
<p>I have recently been teaching in Turkey and Iran. Sometimes the internet is OK; other times it is not. Previously it was a struggle to get the particular packages downloaded and installed via <a href="https://www.rstudio.com/" target="_self">RStudio</a>. In a workshop in Iran we could not download the essential packages. A shrewd student downloaded the dependencies and distributed the zip files to her fellow students. After some glitches we got everything up and running.</p>
<div class="photo-wrap photo-xid-6a017d41eeee1a970c01b7c7d2fb65970b" id="photo-xid-6a017d41eeee1a970c01b7c7d2fb65970b" style="display: inline-block; width: 320px;"><a class="asset-img-link" href="http://a5.typepad.com/6a017d41eeee1a970c01b7c7d2fb65970b-pi"><img alt="Picture1" class="asset asset-image at-xid-6a017d41eeee1a970c01b7c7d2fb65970b img-responsive" src="http://a5.typepad.com/6a017d41eeee1a970c01b7c7d2fb65970b-320wi" title="Picture1" /></a>
<div class="photo-caption caption-xid-6a017d41eeee1a970c01b7c7d2fb65970b" id="caption-xid-6a017d41eeee1a970c01b7c7d2fb65970b">Iranian students learning about R</div>
</div>
<p>When I became aware of <a href="https://cran.r-project.org/web/packages/miniCRAN/index.html" target="_self">miniCRAN</a> at the <a href="http://user2015.math.aau.dk/" target="_self">useR!2015</a> meeting, all my R problems were almost solved. With help from the maintainer, Andrie de Vries at Revolution Analytics, we got it to work when I gave a workshop on dose-response, also in Iran, two weeks ago. Everything went all right for those students who could not install the packages at home. Some Windows versions were in a poor state of repair, so they could not run RStudio and we had to provide all the dependencies, but that was no problem: they were all in the miniCRAN repository.</p>
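<p>For readers who want to prepare a similar offline repository, the workflow might look roughly like the sketch below. It assumes the miniCRAN package is installed and that the repository is built while a connection is available. The helper names <span style="font-family: 'courier new', courier;">build_offline_repo()</span> and <span style="font-family: 'courier new', courier;">repo_url()</span>, the mirror URL, the package name and the drive path are hypothetical and not taken from the workshop.</p>

```r
# Build a local CRAN-like repository for offline installs.
# Assumes miniCRAN is installed and this function is called while online.
# The default mirror URL and package type are assumptions.
build_offline_repo <- function(pkgs, path,
                               cran = "https://cran.r-project.org",
                               type = "win.binary") {
  # Resolve the requested packages plus their dependencies.
  pkg_list <- miniCRAN::pkgDep(pkgs, repos = cran, type = type,
                               suggests = FALSE)
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  # Download the packages and write the repository index files.
  miniCRAN::makeRepo(pkg_list, path = path, repos = cran, type = type)
  invisible(pkg_list)
}

# Turn a local directory into a file:// URL that install.packages() accepts.
repo_url <- function(path) {
  paste0("file:///", normalizePath(path, winslash = "/", mustWork = FALSE))
}

# Usage (not run here, because it needs network access). The package name
# "drc" and the path "E:/miniCRAN" are hypothetical examples:
# build_offline_repo("drc", "E:/miniCRAN")
# install.packages("drc", repos = repo_url("E:/miniCRAN"), type = "win.binary")
```

<p>Once built, the repository folder can be copied to a USB stick or shared drive, and students can install from it without any internet access.</p>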
<div class="photo-wrap photo-xid-6a017d41eeee1a970c01b7c7d2fba6970b" id="photo-xid-6a017d41eeee1a970c01b7c7d2fba6970b" style="display: inline-block; width: 300px;"><a class="asset-img-link" href="http://a6.typepad.com/6a017d41eeee1a970c01b7c7d2fba6970b-pi"><img alt="Picture2" class="asset asset-image at-xid-6a017d41eeee1a970c01b7c7d2fba6970b img-responsive" src="http://a6.typepad.com/6a017d41eeee1a970c01b7c7d2fba6970b-500wi" title="Picture2" /></a>
<div class="photo-caption caption-xid-6a017d41eeee1a970c01b7c7d2fba6970b" id="caption-xid-6a017d41eeee1a970c01b7c7d2fba6970b">Jens Carl Streibig teaching his class in Iran</div>
</div>
beginner tips, packages, R
Andrie de Vries
2015-10-06T08:30:00-07:00
Amanda Cox on using R at the NYT
http://blog.revolutionanalytics.com/2015/10/amanda-cox-on-using-r-at-the-nyt.html
<p>For <a href="http://blog.revolutionanalytics.com/2009/06/nyt-charts-michael-jacksons-pop-hits.html" target="_self">more than six years</a>, the New York Times has been using the <a href="https://mran.revolutionanalytics.com/documents/what-is-r/" target="_self">R language</a> to develop and implement much of the fantastic data journalism on the website and in the newspaper. A few months ago graphics editor Amanda Cox was <a href="http://datastori.es/ds-56-amanda-cox-nyt" target="_self">interviewed for the Data Stories podcast</a>, where she described the process for creating the interactive data visualizations at the Times. Some highlights of the podcast include Amanda describing R as "<a href="http://datastori.es/ds-56-amanda-cox-nyt/#t=8:43.922" target="_self">The greatest software on Earth</a>", and the background behind the visualization below which tells a unique story based on where you live.</p>
<p><a class="asset-img-link" href="http://www.nytimes.com/interactive/2015/05/03/upshot/the-best-and-worst-places-to-grow-up-how-your-area-compares.html?_r=1&abt=0002&abg=1" style="display: inline;" target="_self"><img alt="NYT Poverty" border="0" class="asset asset-image at-xid-6a010534b1db25970b01b8d1631ef2970c image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d1631ef2970c-800wi" title="NYT Poverty" /></a></p>
<p>This chart, by the way, was cited by the White House Chief Data Scientist DJ Patil as an exemplar of the positive impact of <a href="http://www.data.gov/" target="_self">open data</a>, in his <a href="https://www.oreilly.com/ideas/what-the-white-house-needs-from-you" target="_self">keynote at the Strata conference</a> last week.</p>
<p>To listen to the interview, and to find useful links to the resources Amanda mentions during the interview, follow the link below.</p>
<p>Data Stories: <a href="http://datastori.es/ds-56-amanda-cox-nyt" target="_self">Amanda Cox on Working With R, NYT Projects, Favorite Data</a></p>
graphics, R
David Smith
2015-10-05T13:36:50-07:00
Because it's Friday: Are we selling radioactive underpants?
http://blog.revolutionanalytics.com/2015/10/because-its-friday-are-we-selling-radium-underpants.html
<p>I just got back from the Strata+Hadoop World conference in New York, and amongst the usual talks on the technology and applications of big data and data science ran a new thread: data ethics. DJ Patil, the US government's chief data scientist, made a <a href="https://www.oreilly.com/ideas/what-the-white-house-needs-from-you" target="_self">call for comments on data ethics in his keynote</a> and in a follow-up discussion session. But the talk that still sticks in my head is the keynote by Maciej Ceglowski where he compared Big Data to the nuclear energy industry: a new technology once revered (you could at one time purchase radium underpants!) but which is now reviled by the public. It sounds alarmist, but the talk is genuinely thought-provoking, and a warning that if we don't begin to seriously consider the consequences of wide-scale data collection and analysis, our industry risks the same fate. Despite the dark message it's entertainingly delivered, and well worth your 20 minutes.</p>
<p><iframe allowfullscreen="" frameborder="0" height="281" src="https://www.youtube.com/embed/GAXLHM-1Psk" width="500"></iframe> </p>
<p>Something to think about over the weekend. We'll be back on Monday -- see you then.</p>
random
David Smith
2015-10-02T12:38:13-07:00
Hadley Wickham's "Ask Me Anything" on Reddit
http://blog.revolutionanalytics.com/2015/10/hadley-wickhams-ask-me-anything-on-reddit.html
<p><a class="asset-img-link" href="http://a5.typepad.com/6a0105360ba1c6970c01b7c7d79abd970b-pi" style="float: right;"><img alt="Hadley wickham AMA" class="asset asset-image at-xid-6a0105360ba1c6970c01b7c7d79abd970b img-responsive" src="http://a5.typepad.com/6a0105360ba1c6970c01b7c7d79abd970b-200wi" style="width: 200px; margin: 0px 0px 5px 5px;" title="Hadley wickham AMA" /></a></p>
<p>Hadley Wickham, RStudio's Chief Scientist and prolific author of R books and packages, conducted an AMA (Ask Me Anything) session on Reddit this past Monday. The session was tremendously popular, generating more than 500 questions/comments and promoting the AMA to the front page of Reddit.</p>
<p>If you're not familiar with Hadley's work (which would be a surprise if you're an R user), his own introduction in the Reddit AMA post will fill you in:</p>
<p><em>Broadly, I'm interested in the process of data analysis/science and how to make it easier, faster, and more fun. That's what has led to the development of my most popular packages like <a href="http://ggplot2.org/">ggplot2</a>, <a href="https://github.com/hadley/dplyr">dplyr</a>, <a href="https://github.com/hadley/tidyr">tidyr</a>, <a href="https://github.com/hadley/stringr">stringr</a>. This year, I've been particularly interested in making it as easy as possible to get data into R. That's lead to my work on the <a href="http://github.com/rstats-db/DBI">DBI</a>, <a href="https://github.com/hadley/haven">haven</a>, <a href="https://github.com/hadley/readr">readr</a>, <a href="https://github.com/hadley/readxl">readxl</a>, and <a href="https://github.com/hadley/httr">httr</a> packages. Please feel free to ask me anything about the craft of data science.</em></p>
<p><em>I'm also broadly interested in the craft of programming, and the design of programming languages. I'm interested in helping people see the beauty at the heart of R and learn to master it as easily as possible. As well as a number of packages like <a href="https://github.com/hadley/devtools">devtools</a>, <a href="https://github.com/hadley/testthat">testthat</a>, and <a href="https://github.com/klutometis/roxygen">roxygen2</a>, I've written two books along those lines: <a href="http://adv-r.had.co.nz/">Advanced R</a>, which teaches R as a programming language, mostly divorced from its usual application as a data analysis tool; and <a href="http://r-pkgs.had.co.nz/">R packages</a>, which teaches software development best practices for R: documentation, unit testing, etc.</em></p>
<p>Check out the comments at the link below, where you'll find insights from Hadley on <a href="https://www.reddit.com/r/dataisbeautiful/comments/3mp9r7/im_hadley_wickham_chief_scientist_at_rstudio_and/cvgxw4l" target="_self">the best way to teach R</a>, <a href="https://www.reddit.com/r/dataisbeautiful/comments/3mp9r7/im_hadley_wickham_chief_scientist_at_rstudio_and/cvh7g6m" target="_self">Big Data in R</a>, <a href="https://www.reddit.com/r/dataisbeautiful/comments/3mp9r7/im_hadley_wickham_chief_scientist_at_rstudio_and/cvh7vxs" target="_self">the elegance (or otherwise) of the R language</a>, <a href="https://www.reddit.com/r/dataisbeautiful/comments/3mp9r7/im_hadley_wickham_chief_scientist_at_rstudio_and/cvh96s0" target="_self">being productive</a>, <a href="https://www.reddit.com/r/dataisbeautiful/comments/3mp9r7/im_hadley_wickham_chief_scientist_at_rstudio_and/cvhb1d4" target="_self">the best BBQ</a>, and much more.</p>
<p>Reddit: <a href="https://www.reddit.com/r/dataisbeautiful/comments/3mp9r7/im_hadley_wickham_chief_scientist_at_rstudio_and/" tabindex="1">I'm Hadley Wickham, Chief Scientist at RStudio and creator of lots of R packages (incl. ggplot2, dplyr, and devtools). I love R, data analysis/science, visualisation: ask me anything!</a></p>
profiles, R
David Smith
2015-10-02T09:37:25-07:00
R User Groups Highlight R Creativity
http://blog.revolutionanalytics.com/2015/10/r-user-groups-highlight-r-creativity.html
<p>by Joseph Rickert</p>
<p>I have been a big fan of R user groups since I attended my first meeting. There is just something about the vibe of being around people excited about what they are doing that feels good. From a speaker's perspective, presenting at an R user group meeting must be the rough equivalent of doing "stand-up" at a club where you know almost everyone and you are pretty sure people are going to like your material. So while user groups don't necessarily ignite R creativity (people don't do their best work just to present at an R user group meeting), they do help to shine the spotlight on some really good stuff.</p>
<p>I attend all of the Bay Area useR group meetings, and quite a few other R-related events throughout the year, but I only get to experience a small fraction of what is going on in the R world. In the spirit of sharing the "wish I was there" feeling, here are a few recent user group presentations from around the globe that look like they were informative, entertaining and motivating.</p>
<p>Tommy O'Dell gave a <a href="http://rpubs.com/datalove/WARG2015" target="_self">"Welcome to dplyr" talk</a> to the Western Australia R Group (<a href="http://www.meetup.com/Western-Australia-R-Group-WARG/" target="_self">WARG</a>) on September 10th. This is a very good presentation until near the very end, when it becomes an absolutely great presentation!! Apparently, motivated by a desire to use <a href="https://mran.revolutionanalytics.com/package/dplyr/" target="_self">dplyr</a> with R 2.12, an older version of R not supported by dplyr, Tommy deconstructed the dplyr "magic" to write his own package, <a href="https://github.com/datalove/rdplyr" target="_self">rdplyr</a>. This is a wonderful example of how curiosity and open source can open up many possibilities. The following slide comes from the section where Tommy explains some of the problems he encountered and how he worked through them.</p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb0879c3e1970d-pi" style="display: inline;"><img alt="Dplyr_to" border="0" class="asset asset-image at-xid-6a010534b1db25970b01bb0879c3e1970d image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb0879c3e1970d-800wi" title="Dplyr_to" /></a></p>
<p>On the 16th of September, Kevin Little gave a talk to <a href="https://groups.google.com/forum/#!forum/maduser" target="_self">MadR</a> about how he recovered after "hitting the wall" in a failed first attempt to interface to the SurveyMonkey API using the <a href="https://mran.revolutionanalytics.com/package/Rmonkey/" target="_self">Rmonkey</a> package. Kevin's description of how he worked through the process, which included wading into some JSON scripting, is a motivational case study. Kevin wrote a <a href="http://iecodesign.com/index.php/239-rmonkey-business" target="_self">blog post</a> that provides background for the project and has made his slides <a href="http://www.iecodesign.com/RMonkeyBusiness.pdf" target="_self">available here</a>.</p>
<p>Also in September, Jim Porzak, a long-time contributor to the San Francisco Bay Area R community, described a detailed customer segmentation analysis in a <a href="http://files.meetup.com/1225993/BARUG_Sep2015_JPorzak_CustomerSegmentation.pdf" target="_self">presentation</a> to <a href="http://www.meetup.com/R-Users/" target="_self">BARUG</a>. The following slide examines the stability of the clusters.</p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c7d65e4e970b-pi" style="display: inline;"><img alt="JP_segmation" border="0" class="asset asset-image at-xid-6a010534b1db25970b01b7c7d65e4e970b image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c7d65e4e970b-800wi" title="JP_segmation" /></a></p>
<p>Finally, there is a small treasure trove of relatively recent work at the BaselR <a href="http://www.baselr.org/Presentations.html" target="_self">presentations page</a>. These include a <a href="http://www.baselr.org/presentations/20150716_CIforFunctions_Anne%20Kuemmel.pptx" target="_self">presentation from Aimee Gott</a> on the Mango Solutions development environment and <a href="http://www.baselr.org/presentations/20150716_CIforFunctions_Anne%20Kuemmel.pptx" target="_self">one from Anne Kuemmel</a> on using simulations to calculate confidence intervals in pharma applications. Also have a look at <a href="http://www.baselr.org/presentations/Introducing%20ReporteRs_Daniel%20Sabanes%20Bove.pptx" target="_self">Daniel Sabanes Bove's presentation</a> on using R to produce Microsoft PowerPoint presentations, and some <a href="http://www.baselr.org/presentations/Creating%20a%20lively%20R%20Community_Reinhold%20Koch.pptx" target="_self">thoughtful advice from Reinhold Koch</a> on how to go about creating a lively R community within your company.</p>
<p>_______________________________________________________________________________________</p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d16018b2970c-pi" style="display: inline;"><img alt="Mindset" border="0" class="asset asset-image at-xid-6a010534b1db25970b01b8d16018b2970c image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d16018b2970c-800wi" title="Mindset" /></a></p>
<p>_______________________________________________________________________________________</p>
<p> Let us all adopt this mindset!!</p>applicationsopen sourcepackagesRuser groupsJoseph Rickert2015-10-01T08:30:00-07:00Resizing plots in the R kernel for Jupyter notebooks
http://blog.revolutionanalytics.com/2015/09/resizing-plots-in-the-r-kernel-for-jupyter-notebooks.html
<p>by Andrie de Vries</p>
<p>A few weeks ago <a href="http://blog.revolutionanalytics.com/2015/09/using-r-with-jupyter-notebooks.html" target="_self">I wrote about the Jupyter notebooks project</a> and the R kernel. In the comments, I was asked how to resize the plots in a Jupyter notebook.</p>
<p>The answer is that the <a href="https://github.com/IRkernel" target="_self">IRkernel project</a> contains not only the <a href="https://github.com/IRkernel/IRkernel" target="_self">IRkernel package</a> itself, but also the <a href="https://github.com/IRkernel/repr" target="_self"><span style="font-family: 'courier new', courier;">repr</span></a> package. The <span style="font-family: 'courier new', courier;">repr</span> package provides "<em>String and byte representations for all kinds of R objects</em>".</p>
<p>I had to dig a little to uncover the meaning behind this rather cryptic description. What I found was that the package provides wrappers around all kinds of R objects, including plots. Now, anybody who has used R has at some point asked the question "<a href="http://stackoverflow.com/questions/7144118/how-to-save-a-plot-as-image-on-the-disk" target="_self">How to save a plot as image on the disk?</a>". The answer is well-known: use a device like <span style="font-family: 'courier new', courier;">png()</span> to capture the output and save the plot to a png file on disk.</p>
<p>Now, <span style="font-family: 'courier new', courier;">IRkernel</span> uses exactly this technique, and the <span style="font-family: 'courier new', courier;">repr</span> package gives you control over the device.</p>
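<p>For readers who have not used a graphics device directly, the capture-to-file technique mentioned above can be sketched in plain R as follows. This is only an illustration of the general <span style="font-family: 'courier new', courier;">png()</span> pattern, not the kernel's actual code; the file name and the 4 x 3 inch size are arbitrary choices made for this example.</p>
<pre class="r"><code># Open a PNG device, draw the plot, then close the device to write the file.
out_file <- file.path(tempdir(), "example-plot.png")  # illustrative file name
png(out_file, width = 4, height = 3, units = "in", res = 100)
plot(1:10, main = "Saved via the png() device")
dev.off()

# Confirm that the image was written to disk
file.exists(out_file)</code></pre>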
<p>Very simply, you need to modify two <span style="font-family: 'courier new', courier;">repr</span> settings, using a call to <span style="font-family: 'courier new', courier;">options()</span>. The default <span style="font-family: 'courier new', courier;">repr</span> settings are for plots to be 7 inches wide and 7 inches high.</p>
<p>To set the plot width and height to something else, e.g. 4 inches wide and 3 inches high, use:</p>
<p style="padding-left: 30px;"><span style="font-family: 'courier new', courier;">options(repr.plot.width=4, repr.plot.height=3)</span></p>
<h2>Example in Jupyter</h2>
<p>Here is an example of setting the plot width to different sizes in the same notebook. In the first plot I set the width to 4 inches, and in the second I set the width to 8 inches. In both cases the height is the same: 3 inches rather than the default 7 inches.</p>
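<p>A minimal sketch of two such notebook cells might look like the following. It is not the code from the gist; the built-in <span style="font-family: 'courier new', courier;">cars</span> data set is used here purely as a stand-in for whatever is being plotted.</p>
<pre class="r"><code># First plot: 4 inches wide, 3 inches high
options(repr.plot.width = 4, repr.plot.height = 3)
plot(cars)

# Second plot: same height, but 8 inches wide
options(repr.plot.width = 8)
plot(cars)</code></pre>
<p>Because these are ordinary R options, each setting stays in effect for all later plots in the notebook until it is changed again.</p>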
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d15d8ab6970c-popup" onclick="window.open( this.href, '_blank', 'width=640,height=480,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0' ); return false" style="display: inline;"><img alt="Jupyter" class="asset asset-image at-xid-6a010534b1db25970b01b8d15d8ab6970c img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d15d8ab6970c-500wi" title="Jupyter" /></a></p>
<h2>The code</h2>
<p>Here is the full code listing:</p>
<p>
<script src="https://gist.github.com/andrie/4e5b6b5cc13dc6be8ae7.js" type="text/javascript"></script>
</p>Andrie de Vries2015-09-30T10:55:14-07:00Why Big Data? Learning Curves
http://blog.revolutionanalytics.com/2015/09/why-big-data-learning-curves.html
<p><em>by Bob Horton<br />Microsoft Senior Data Scientist</em></p>
<div class="container-fluid main-container">
<p>Learning curves are an elaboration of the idea of validating a model on a test set, and have been widely popularized by Andrew Ng’s <a href="https://www.coursera.org/learn/machine-learning">Machine Learning course</a> on Coursera. Here I present a simple simulation that illustrates this idea.</p>
<p>Imagine you use a sample of your data to train a model, then use the model to predict the outcomes on data where you know what the real outcome is. Since you know the “real” answer, you can calculate the overall error in your predictions. The error on the same data set used to train the model is called the <em>training error</em>, and the error on an independent sample is called the <em>validation error</em>.</p>
<p>A model will commonly perform better (that is, have lower error) on the data it was trained on than on an independent sample. The difference between the training error and the validation error reflects <em>overfitting</em> of the model. Overfitting is like memorizing the answers for a test instead of learning the principles (to borrow a metaphor from the <a href="https://en.wikipedia.org/wiki/Overfitting">Wikipedia</a> article). Memorizing works fine if the test is exactly like the study guide, but it doesn’t work very well if the test questions are different; that is, it doesn’t generalize. In fact, the more a model is overfitted, the higher its validation error is likely to be. This is because the spurious correlations the overfitted model memorized from the training set most likely don’t apply in the validation set.</p>
<p>Overfitting is usually more extreme with small training sets. In large training sets the random noise tends to average out, so that the underlying patterns are clearer. But in small training sets, there is less opportunity for averaging out the noise, and accidental correlations consequently have more influence on the model. Learning curves let us visualize this relationship between training set size and the degree of overfitting.</p>
<p>We start with a function to generate simulated data:</p>
<pre class="r"><code>sim_data <- function(N, noise_level=1){
  # three categorical inputs, each with 10 possible values (A through J)
  X1 <- sample(LETTERS[1:10], N, replace=TRUE)
  X2 <- sample(LETTERS[1:10], N, replace=TRUE)
  X3 <- sample(LETTERS[1:10], N, replace=TRUE)
  # base level 100, plus 10 when X1 equals X2, plus Gaussian noise
  y <- 100 + ifelse(X1 == X2, 10, 0) + rnorm(N, sd=noise_level)
  data.frame(X1, X2, X3, y)
}</code></pre>
<p>The input columns X1, X2, and X3 are categorical variables which each have 10 possible values, represented by capital letters <code>A</code> through <code>J</code>. The outcome is cleverly named <code>y</code>; it has a base level of 100, but if the values in the first two <code>X</code> variables are equal, this is increased by 10. On top of this we add some normally distributed noise. Any other pattern that might appear in the data is accidental.</p>
<p>Now we can use this function to generate a simulated data set for experiments.</p>
<pre class="r"><code>set.seed(123)
data <- sim_data(25000, noise_level=10)</code></pre>
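<p>As a quick sanity check on the simulation (my own addition, not part of the original analysis), we can confirm that roughly one row in ten has matching values of X1 and X2, and that the mean outcome in those rows sits about 10 above the base level. The <code>matched</code> variable is a name introduced only for this sketch.</p>
<pre class="r"><code># Same generator as above, repeated so that this block runs on its own
sim_data <- function(N, noise_level=1){
  X1 <- sample(LETTERS[1:10], N, replace=TRUE)
  X2 <- sample(LETTERS[1:10], N, replace=TRUE)
  X3 <- sample(LETTERS[1:10], N, replace=TRUE)
  y <- 100 + ifelse(X1 == X2, 10, 0) + rnorm(N, sd=noise_level)
  data.frame(X1, X2, X3, y)
}

set.seed(123)
data <- sim_data(25000, noise_level=10)

matched <- data$X1 == data$X2
mean(matched)                   # share of rows with X1 == X2; expected near 0.1
tapply(data$y, matched, mean)   # group means; expected near 100 (FALSE) and 110 (TRUE)</code></pre>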
<p>There are many possible error functions, but I prefer the root mean squared error:</p>
<pre class="r"><code>rmse <- function(actual, predicted) sqrt( mean( (actual - predicted)^2 ))</code></pre>
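<p>To make the definition concrete, here is a small worked example on made-up numbers (the vectors are purely illustrative). Two of the three predictions are exact and one is off by 2, so the mean squared error is 4/3 and the RMSE is its square root.</p>
<pre class="r"><code>rmse <- function(actual, predicted) sqrt( mean( (actual - predicted)^2 ))

actual    <- c(1, 2, 3)
predicted <- c(1, 2, 5)
rmse(actual, predicted)   # sqrt(4/3), about 1.155</code></pre>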
<p>To generate a learning curve, we fit models at a series of different training set sizes, and calculate the training error and validation error for each model. Then we will plot these errors against the training set size. Here the parameters are a model formula, the data frame of simulated data, the validation set size (vss), the number of different training set sizes we want to plot, and the smallest training set size to start with. The largest training set will be all the rows of the dataset that are not used for validation.</p>
<pre class="r"><code>run_learning_curve <- function(model_formula, data, vss=5000, num_tss=30, min_tss=1000){
  library(data.table)
  max_tss <- nrow(data) - vss
  # evenly spaced training set sizes, rounded to whole numbers of rows
  tss_vector <- round(seq(min_tss, max_tss, length.out=num_tss))
  data.table::rbindlist( lapply (tss_vector, function(tss){
    # draw a fresh validation set, then sample the training set from the remaining rows
    vs_idx <- sample(1:nrow(data), vss)
    vs <- data[vs_idx,]
    ts_eligible <- setdiff(1:nrow(data), vs_idx)
    ts <- data[sample(ts_eligible, tss),]
    fit <- lm( model_formula, ts)
    training_error <- rmse(ts$y, predict(fit, ts))
    validation_error <- rmse(vs$y, predict(fit, vs))
    # one row per error type for this training set size
    data.frame(tss=tss,
               error_type = factor(c("training", "validation"),
                                   levels=c("validation", "training")),
               error=c(training_error, validation_error))
  }) )
}</code></pre>
<p>We’ll use a formula that considers all combinations of the input columns. Since these are categorical inputs, they will be represented by dummy variables in the model, with each combination of variable values getting its own coefficient.</p>
<pre class="r"><code>learning_curve <- run_learning_curve(y ~ X1*X2*X3, data)</code></pre>
<p>With this example, you get a series of warnings:</p>
<pre><code>## Warning in predict.lm(fit, vs): prediction from a rank-deficient fit may be
## misleading</code></pre>
<p>This is R trying to tell you that you don’t have enough rows to reliably fit all those coefficients. In this simulation, training set sizes above about 7500 don’t trigger the warning, though as we’ll see the curve still shows some evidence of overfitting.</p>
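<p>To see why the smaller training sets struggle, it helps to count the coefficients this formula asks for. The sketch below (my own illustration, using a full grid of factor levels instead of the simulated data) builds the design matrix and counts its columns. With 10 levels per variable and all interactions included, there is one coefficient for each of the 10 x 10 x 10 combinations.</p>
<pre class="r"><code># Every combination of the three categorical inputs, as factors
grid <- expand.grid(X1 = LETTERS[1:10], X2 = LETTERS[1:10], X3 = LETTERS[1:10])

# Number of columns in the design matrix = number of coefficients to estimate
n_coef <- ncol(model.matrix(~ X1*X2*X3, grid))
n_coef   # 1000</code></pre>
<p>A training set of 1000 rows therefore has no more rows than coefficients, and many combinations will not occur in it at all, which is what makes the fit rank-deficient.</p>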
<pre class="r"><code>library(ggplot2)
ggplot(learning_curve, aes(x=tss, y=error, linetype=error_type)) +
  geom_line(size=1, col="blue") + xlab("training set size") +
  geom_hline(yintercept=10, linetype=3)  # dotted reference line at the noise level (sd = 10)</code></pre>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c7d3d956970b-pi" style="display: inline;"><img alt="LC-1" border="0" class="asset asset-image at-xid-6a010534b1db25970b01b7c7d3d956970b image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b7c7d3d956970b-800wi" title="LC-1" /></a></p>
</div>
<p>In this figure, the X-axis represents different training set sizes and the Y-axis represents error. Validation error is shown in the solid blue line on the top part of the figure, and training error is shown by the dashed blue line in the bottom part. As the training set sizes get larger, these curves converge toward a level representing the amount of irreducible error in the data. This plot was generated using a simulated dataset where we know exactly what the irreducible error is; in this case it is the standard deviation of the Gaussian noise we added to the output in the simulation (10; the root mean squared error is essentially the same as standard deviation for reasonably large sample sizes). We don’t expect any model to reliably fit this error since we know it was completely random.</p>
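<p>The claim about irreducible error can be checked directly. In the sketch below (an illustration I have added, with a different seed and sample size from the main simulation), we predict with the true generating rule itself, so the only remaining error is the noise. The resulting RMSE should land very close to the noise standard deviation of 10.</p>
<pre class="r"><code>rmse <- function(actual, predicted) sqrt( mean( (actual - predicted)^2 ))

set.seed(42)
N <- 100000
X1 <- sample(LETTERS[1:10], N, replace=TRUE)
X2 <- sample(LETTERS[1:10], N, replace=TRUE)
noise <- rnorm(N, sd=10)

signal <- 100 + ifelse(X1 == X2, 10, 0)   # the true underlying pattern
y <- signal + noise

rmse(y, signal)   # error of a "perfect" model; close to 10
sd(noise)         # standard deviation of the noise; also close to 10</code></pre>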
<p>One interesting thing about this simulation is that the underlying system is very simple, yet it can take many thousands of training examples before the validation error of this model gets very close to optimum. In real life, you can easily encounter systems with many more variables, much higher cardinality, far more complex patterns, and of course lots and lots of those unpredictable variations we call “noise”. You can easily encounter situations where truly enormous numbers of samples are needed to train your model without excessive overfitting. On the other hand, if your training and validation error curves have already converged, more data may be superfluous. Learning curves can help you see if you are in a situation where more data is likely to be of benefit for training your model better.</p>
advanced tipsbig datadata scienceRstatisticsJoseph Rickert2015-09-29T08:30:00-07:00Call R functions from any application with the AzureML package
http://blog.revolutionanalytics.com/2015/09/publishing-r-models-as-a-service-with-azure-ml.html
<p>If you've developed a useful function in R (say, a function to make a forecast or prediction from a statistical model), you may want to call that function from an application other than R. For example, you might want to display the forecast (calculated in R) as part of a desktop, web-based or mobile application. One solution is to install R alongside the application and call it directly, but that can be difficult, or impossible in the case of mobile apps. (You also need to be careful to comply with R's open-source GPL2 license.)</p>
<p>Oftentimes, an easier way is to install R on a cloud-based server, and call your R function via a remote API. If you manage such a server yourself, one solution is to install <a href="http://deployr.revolutionanalytics.com/" target="_self">DeployR</a> on the server, and publish your function that way. But now there's an even simpler alternative: use the <a href="https://mran.revolutionanalytics.com/package/AzureML/" target="_self">AzureML package</a> (now available on CRAN) to publish your function directly to the Microsoft Azure cloud service, and then call that function using a simple REST call.</p>
<p>To get started, you'll need your Azure Workspace ID <span style="font-family: 'courier new', courier;">wsID</span> and Workspace Authorization Token <span style="font-family: 'courier new', courier;">wsAuth</span> (this <a href="http://blogs.technet.com/b/machinelearning/archive/2015/09/25/build-and-deploy-a-predictive-web-app-using-rstudio-and-azure-ml.aspx" target="_self">Technet blog post</a> by Raymond Laghaeian provides the details, and if you don't yet have an Azure subscription a <a href="https://azure.microsoft.com/en-us/pricing/free-trial/">free trial</a> is available). Then, use the <span style="font-family: 'courier new', courier;">publishWebService</span> function to publish the function to the cloud. Here's the example from the blog post:</p>
<p><span style="font-family: 'courier new', courier;">irisWebService <- publishWebService(</span><br /><span style="font-family: 'courier new', courier;">  "predictSpecies",        # R function to publish</span><br /><span style="font-family: 'courier new', courier;">  "irisSpeciesWebService", # service name<br /></span><span style="font-family: 'courier new', courier;">  list("sep_len"="float", "sep_wid"="float",<br /></span><span style="font-family: 'courier new', courier;">     "pet_len"="float","pet_wid"="float"), # parameters and types<br /></span><span style="font-family: 'courier new', courier;">  list("species"="int"),   # result and type<br /></span><span style="font-family: 'courier new', courier;">  wsID, wsAuth             # authorization ID/token  <br />)</span></p>
<p>All you need to do is specify the function to publish (here, a user-defined R function called <span style="font-family: 'courier new', courier;">predictSpecies</span>), its input parameters (and their types), and the result type (along with a name for your service and authentication info). The AzureML package handles delivering the contents of your function (and any dependencies) to the Azure cloud, and setting up a web service for you according to your specifications. You can then manage this web service as a standard Azure service, test it out, and even monitor how often it's called:</p>
<p><a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb08794a6b970d-pi" style="display: inline;"><img alt="Raypain-6.png-550x0" border="0" class="asset asset-image at-xid-6a010534b1db25970b01bb08794a6b970d image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb08794a6b970d-800wi" title="Raypain-6.png-550x0" /></a></p>
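<p>The body of <span style="font-family: 'courier new', courier;">predictSpecies</span> is not shown in this post. As a rough idea of the kind of function that fits the signature declared above (four numeric inputs, one integer result), here is a hypothetical stand-in of my own: a nearest-centroid classifier on the built-in <span style="font-family: 'courier new', courier;">iris</span> data. The <span style="font-family: 'courier new', courier;">centroids</span> object and the 1/2/3 species coding are assumptions made for this sketch, not the original author's implementation.</p>
<pre class="r"><code># Hypothetical stand-in for predictSpecies (not the function from the Technet post)
# Mean of each measurement per species, from the built-in iris data
centroids <- aggregate(. ~ Species, data = iris, FUN = mean)

predictSpecies <- function(sep_len, sep_wid, pet_len, pet_wid) {
  x <- c(sep_len, sep_wid, pet_len, pet_wid)
  # squared Euclidean distance from the input to each species centroid
  d <- apply(centroids[, -1], 1, function(m) sum((m - x)^2))
  as.integer(which.min(d))   # 1 = setosa, 2 = versicolor, 3 = virginica
}

# Try it locally before publishing
predictSpecies(5.1, 3.5, 1.4, 0.2)   # a typical setosa flower, so this returns 1</code></pre>
<p>Testing the function locally like this first is worthwhile, since the published service simply wraps whatever the function returns.</p>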
<p>Using the AzureML package, you can make any R function you create available to any application connected to the web, as long as the inputs and outputs are simple data types supported by the API. You can find more details in the AzureML package <a href="https://mran.revolutionanalytics.com/web/packages/AzureML/vignettes/AzureML.html" target="_self">vignette</a> and in the blog post linked below.</p>
<p>Technet Machine Learning Blog: <a href="http://blogs.technet.com/b/machinelearning/archive/2015/09/25/build-and-deploy-a-predictive-web-app-using-rstudio-and-azure-ml.aspx" target="_self">Build & Deploy Predictive Web Apps Using RStudio and Azure ML</a></p>developer tipsMicrosoftpackagesRDavid Smith2015-09-28T14:39:44-07:00