While movies have been getting more orange over time, paintings have been going in the other direction. Paintings today are generally bluer than they were a few hundred years ago.
The image above shows the color spectrum of almost 100,000 paintings created since 1800. Martin Bellander used R to create the image, by scraping images from the BBC YourPaintings site with the help of the rvest package. He then extracted the spectrum from each image using the readbitmap and colorspace packages, before combining the data into the time-ordered heatmap above using the plotrix package. (You can find all of the R code on the page linked at the end of this post.)
In an article for Significance magazine, Martin suggests a few possible reasons why paintings are getting bluer with time:
He explores these hypotheses by (for example) looking at just the oil paintings over time, but the result is inconclusive. One possibility that occurs to me is the rising popularity of landscape paintings over time, which might have led to more blue skies being represented in painting. (Any art historians want to chime in?) Check out all the details of Martin's analysis at the link below.
I cannot make bricks without clay: The colors of paintings: blue is the new orange
by Joseph Rickert
The New York Times is quietly changing the practice of science journalism. The Tuesday, April 21, 2015 article, Ebola Lying in Wait, reports on "a growing body of scientific clues - some ambiguous, others substantive" that the Ebola virus may have lain dormant in the West African rain forest for years before igniting last year's outbreak. In the 6th paragraph of the online edition, mention is made of "a detailed prediction of other likely Ebola danger zones" made by a team of scientists. The words "detailed prediction" are unobtrusively given a hyperlink. What I think is extraordinary is that this link points to the scientific paper Mapping the zoonotic niche of Ebola virus disease in Africa by David M. Piggot et al., published on the open science publishing platform eLife. This is the real science, including the measured language of a scientific paper, the lengthy descriptions of the data sets, the innumerable references, and even the reviewers' comments and the authors' responses. I don't think that there is a better way to cultivate a scientific outlook than to make relevant science open and accessible.
The following figure from the paper illustrates one of the low-level tools of open science: the digital object identifier (DOI). A DOI is a character string that uniquely identifies a document, or other digital object, and is meant to persist for the lifetime of the document.
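As a concrete illustration, a DOI splits into a registrant prefix and a publisher-assigned suffix, and any DOI resolves by prepending https://doi.org/ to it. The DOI below is purely illustrative (10.7554 is eLife's registrant prefix):

```r
# Pull a DOI apart into its two components and build its resolver URL
doi <- "10.7554/eLife.04395"            # illustrative eLife-style DOI
prefix <- sub("/.*$", "", doi)          # registrant prefix: "10.7554"
suffix <- sub("^[^/]*/", "", doi)       # publisher suffix: "eLife.04395"
url <- paste0("https://doi.org/", doi)  # persistent resolver URL
```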
The paper by Piggot et al. is replete with DOIs pointing to subsections of the document, figures and other documents.
The next step for data science along these lines is to use DOIs and other tools to make it easy to search eLife, PLOS, Crossref, Entrez and other open science platforms. Toward this goal, the team at rOpenSci is well on its way. With limited resources, they have developed an impressive array of R packages for accessing public data as well as for searching the scientific literature. The code below shows some of my own early efforts to use rOpenSci functions to search the literature. rplos is a mature package available on CRAN. The fulltext package is under development; when finished, it will offer functions for working with multiple open science publishers.
To go further, have a look at the rOpenSci tutorials. Text-mining interests aside, I think we should be grateful for the efforts of PLOS, eLife and the other open science publishers, rOpenSci, and the New York Times.
# Get started with rOpenSci text searches
# Install fulltext and the search backends it uses
devtools::install_github(c("ropensci/rplos", "ropensci/bmc",
                           "ropensci/aRxiv", "emhart/biorxiv"))
devtools::install_github("ropensci/fulltext")
library("fulltext")
library(rplos)

DOI <- ft_search(query="ebola")
DOI
# Query:
#   [ebola]
# Found:
#   [PLoS: 668; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0]
# Returned:
#   [PLoS: 10; BMC: 0; Crossref: 0; Entrez: 0; arxiv: 0; biorxiv: 0]

str(DOI)
# List of 6
#  $ plos    :List of 4
#   ..$ found  : int 668
#   ..$ data   :'data.frame': 10 obs. of 1 variable:
#   .. ..$ id: chr [1:10] "10.1371/journal.pcbi.1004087" "10.1371/journal.pone.0037106" "10.1371/journal.pmed.0010059" "10.1371/journal.pntd.0003706" ...
#   ..$ opts   :List of 2
#   .. ..$ q    : chr "ebola"
#   .. ..$ limit: num 10
#   ..$ license:List of 3
#   .. ..$ type: chr "CC-BY"
#   .. ..$ uri : chr "http://creativecommons.org/licenses/by/4.0/"
#   .. ..$ text: chr "<authors> This is an open-access article distributed under \n the terms of the Creative Commons At"| __truncated__
#   ..- attr(*, "class")= chr "ft_ind"
#   ..- attr(*, "query")= chr "ebola"
#  $ bmc     :List of 3
#   ..$ found: NULL
#   ..$ data : NULL
#   ..$ opts : list()
#   ..- attr(*, "class")= chr "ft_ind"
#   ..- attr(*, "query")= chr "ebola"
#  $ crossref:List of 3
#   ..$ found: NULL
#   ..$ data : NULL
#   ..$ opts : list()
#   ..- attr(*, "class")= chr "ft_ind"
#   ..- attr(*, "query")= chr "ebola"
#  $ entrez  :List of 3
#   ..$ found: NULL
#   ..$ data : NULL
#   ..$ opts : list()
#   ..- attr(*, "class")= chr "ft_ind"
#   ..- attr(*, "query")= chr "ebola"
#  $ arxiv   :List of 3
#   ..$ found: NULL
#   ..$ data : NULL
#   ..$ opts : list()
#   ..- attr(*, "class")= chr "ft_ind"
#   ..- attr(*, "query")= chr "ebola"
#  $ biorxiv :List of 3
#   ..$ found: NULL
#   ..$ data : NULL
#   ..$ opts : list()
#   ..- attr(*, "class")= chr "ft_ind"
#   ..- attr(*, "query")= chr "ebola"
#  - attr(*, "class")= chr "ft"
#  - attr(*, "query")= chr "ebola"

article <- DOI$plos[[2]][[1]][4]   # Fetch the DOI of the 4th PLOS article
article
# "10.1371/journal.pntd.0003706"
URL <- full_text_urls(doi=article) # Fetch the full-text URL
URL
# [1] "http://www.plosntds.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pntd.0003706&representation=XML"
text <- plos_fulltext(doi=article) # Fetch the XML text
text[[1]]
by Ben Ubah
Founder, RPowerLabs
With no disregard to R's contemporaries, R is pioneering the creation of online virtual electric power system laboratories via RPowerLABS. RPowerLABS is a project with the vision of deploying online a vast array of highly demanded power system simulations for teaching and research using R. It started as an attempt to apply R to electric power system simulations, as seen here. This is an important application of R that hopes to assist electrical power engineering academics in developing nations, who are restricted by costly commercial tools, and to give developed nations a new tool with many interactive, online learning and collaborative possibilities.
From what I know about academic laboratory activities, they comprise uniform apparatus for all participants, lab manuals, lab exercises, and interaction among participants and with the lab coordinator.
At RPowerLABS, we are continuously experimenting with better techniques for achieving this vision using the various options offered by R and its extension packages. We have been able to set up several virtual labs preloaded with IEEE power systems (the uniform apparatus), lab manuals, lab quizzes (plugged in from Moodle), and interaction among participants (via real-time chat), all on one web page.
See a sample video on YouTube here:
One can even get a customized lab integrating several related simulations on one web page, e.g. power flow, contingency analysis, faults, transient stability, etc. Not only that: interactive visualization of electric circuits (see Fig. 1) and line diagrams (see Fig. 2) is also possible.
Fig. 1
Fig. 2
Access the Transformers circuit lab and the AGC/LFC lab respectively. It is also possible with RPowerLABS to write and execute your own code within the lab. The idea is to allow students to write and execute small simulation programs in R, and also to write R code that modifies the preloaded lab parameters. Figures 2, 3, and 4 show how the system's nominal frequency of the visual automatic generation control lab can be changed from 60Hz to 50Hz via the code editor.
Fig.3
Fig. 4
Access the N-1 Contingency Analysis code lab to see how functions can be executed online via a code editor (see Fig. 5). This code editing functionality was made possible by the shinyAce package.
Fig.5
If, as a college, university or polytechnic, you need the current offerings of RPowerLABS at your lab or online, it's simple: just send an email to support@rpowerlabs.org with your request. You may also like to refer this to the electrical engineering department at your school. RPowerLABS is free, but we will be happy to receive donations. It is also possible to get a free deployment of Moodle integrated with RPowerLABS, if you like. Donations help us provide more advanced features to users - we hope to cover every major topic in the field of electrical power systems.
If you are a student or individual and you want a private study lab, it's possible with RPowerLABS. RPowerLABS takes away the effort of learning how to operate simulation tools by preloading the software with test power system data and keeping the interface simple.
I foresee a possibility where RPowerLABS could help schools hoping to offer Electric Power Systems by distance learning to gain accreditation and attract students.
View the current offerings of RPowerLABS here, and please send us a review (what you think about the app and its potential) at support@rpowerlabs.org if you can. The application is currently hosted on a VPS running Ubuntu 14 (many thanks to RStudio's IDE and Shiny Server).
If this project is of interest to you and you want to get in touch, do not hesitate, as we are open to feedback and collaboration. Please reach us at info@rpowerlabs.org.
by Sherri Rose
Assistant Professor of Health Care Policy
Harvard Medical School
Targeted learning methods build machine-learning-based estimators of parameters defined as features of the probability distribution of the data, while also providing influence-curve-based or bootstrap-based confidence intervals. The theory offers a general template for creating targeted maximum likelihood estimators for a data structure, nonparametric or semiparametric statistical model, and parameter mapping. These estimators of causal inference parameters are doubly robust and have a variety of other desirable statistical properties.
Targeted maximum likelihood estimation was built on the loss-based "super learning" system so that lower-dimensional parameters could be targeted (e.g., a marginal causal effect) and the remaining bias for the (low-dimensional) target feature of the probability distribution removed. Targeted learning for effect estimation and causal inference allows for the complete integration of machine learning advances in prediction while providing statistical inference for the target parameter(s) of interest. Further details about these methods can be found in the many targeted learning papers as well as the 2011 targeted learning book.
Practical tools for the implementation of targeted learning methods for effect estimation and causal inference have developed alongside the theoretical and methodological advances. While some work has been done to develop computational tools for targeted learning in proprietary programming languages, such as SAS, the majority of the code has been built in R.
Of key importance are the two R packages SuperLearner and tmle. Ensembling with SuperLearner allows us to use many algorithms to generate an ideal prediction function that is a weighted average of all the algorithms considered. The SuperLearner package, authored by Eric Polley (NCI), is flexible, allowing for the integration of dozens of prespecified potential algorithms found in other packages as well as a system of wrappers that provide the user with the ability to design their own algorithms, or include newer algorithms not yet added to the package. The package returns multiple useful objects, including the cross-validated predicted values, final predicted values, vector of weights, and fitted objects for each of the included algorithms, among others.
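The weight-selection idea behind the ensemble can be illustrated without the package. Below is a base-R sketch in which the data, the two hypothetical learners pred_a and pred_b, and the grid search are all invented for illustration; the real package computes its weights by non-negative least squares on cross-validated predictions:

```r
# Toy illustration of ensembling by weighted average (not SuperLearner
# internals): pick the convex weight that minimizes squared-error risk.
set.seed(1)
y <- rbinom(100, 1, 0.5)
pred_a <- pmin(pmax(0.7 * y + runif(100, 0, 0.3), 0), 1)  # informative learner
pred_b <- runif(100)                                      # uninformative learner
w_grid <- seq(0, 1, by = 0.01)
risk <- sapply(w_grid, function(w)
  mean((y - (w * pred_a + (1 - w) * pred_b))^2))
w_best <- w_grid[which.min(risk)]
w_best  # close to 1: almost all weight on the informative learner
```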
Below is sample code with the ensembling prediction package SuperLearner using a small simulated data set.
library(SuperLearner)

## Generate simulated data ##
set.seed(27)
n <- 500
data <- data.frame(W1=runif(n, min = .5, max = 1),
                   W2=runif(n, min = 0, max = 1),
                   W3=runif(n, min = .25, max = .75),
                   W4=runif(n, min = 0, max = 1))
data <- transform(data,  # add W5 dependent on W2, W3
                  W5=rbinom(n, 1, 1/(1+exp(1.5*W2-W3))))
data <- transform(data,  # add Y dependent on W1, W2, W4, W5
                  Y=rbinom(n, 1, 1/(1+exp(-(-.2*W5-2*W1+4*W5*W1-1.5*W2+sin(W4))))))
summary(data)

## Specify a library of algorithms ##
SL.library <- c("SL.nnet", "SL.glm", "SL.randomForest")

## Run the super learner to obtain predicted values for the super learner
## as well as CV risk for algorithms in the library ##
fit.data.SL <- SuperLearner(Y=data[,6], X=data[,1:5], SL.library=SL.library,
                            family=binomial(), method="method.NNLS", verbose=TRUE)

## Run the cross-validated super learner to obtain its CV risk ##
fitSL.data.CV <- CV.SuperLearner(Y=data[,6], X=data[,1:5], V=10,
                                 SL.library=SL.library, verbose=TRUE,
                                 method="method.NNLS", family=binomial())

## Cross-validated risks ##
mean((data[,6]-fitSL.data.CV$SL.predict)^2)  # CV risk for super learner
fit.data.SL                                  # CV risks for algorithms in the library
The final lines of code return the cross-validated risks for the super learner as well as for each algorithm considered within the super learner. While this is a trivial example with a small data set and few covariates, the results demonstrate that the super learner, which takes a weighted average of the algorithms in the library, has the smallest cross-validated risk and outperforms each individual algorithm.
The tmle package, authored by Susan Gruber (Reagan-Udall Foundation), allows for the estimation of both average treatment effects and parameters defined by a marginal structural model in cross-sectional data with a binary intervention. This package also includes the ability to incorporate missingness in the outcome and the intervention, use SuperLearner to estimate the relevant components of the likelihood, and use data with a mediating variable. Additionally, TMLE and collaborative TMLE R code specifically tailored to answer quantitative trait loci mapping questions, such as those discussed in Wang et al 2011, is available in the supplementary material of that paper.
The multiPIM package, authored by Stephan Ritter (Omicia, Inc.), is designed specifically for variable importance analysis, and estimates an attributable-risk-type parameter using TMLE. This package also allows the use of SuperLearner to estimate nuisance parameters and produces additional estimates using estimating-equation-based estimators and g-computation. The package includes its own internal bootstrapping function to calculate standard errors if this is preferred over the use of influence curves, or if influence curves are not valid for the chosen estimator.
Four additional prediction-focused packages are casecontrolSL, cvAUC, subsemble, and h2oEnsemble, all primarily authored by Erin LeDell (Berkeley). The casecontrolSL package relies on SuperLearner and performs subsampling in a case-control design with inverse-probability-of-censoring weighting, which may be particularly useful in settings with rare outcomes. The cvAUC package is a tool kit to evaluate area under the ROC curve estimators when using cross-validation. The subsemble package was developed based on a new approach to ensembling that fits each algorithm on a subset of the data and combines these fits using cross-validation. This technique can be used on data sets of all sizes, but has been demonstrated to be particularly useful in smaller data sets. A new implementation of super learner can be found in the Java-based h2oEnsemble package, which was designed for big data. The package uses the H2O R interface to run super learning in R with a selection of prespecified algorithms.
Another TMLE package is ltmle, primarily authored by Joshua Schwab (Berkeley). This package mainly focuses on parameters in longitudinal data structures, including the treatment-specific mean outcome and parameters defined by a marginal structural model. The package returns estimates for TMLE, g-computation, and estimating-equation-based estimators.
The text above is a modified excerpt from the chapter "Targeted Learning for Variable Importance" by Sherri Rose in the forthcoming Handbook of Big Data (2015) edited by Peter Buhlmann, Petros Drineas, Michael John Kane, and Mark Van Der Laan to be published by CRC Press.
R has something of a reputation for generating, shall we say, obscure error messages like this:
Error in model.frame.default(formula = y ~ female + DNC + SE_region + : could not find function "function (object, ...) \nobject"
One tip for dealing with error messages is to ignore everything between "Error in" and the colon: unless you are running a function that you wrote yourself, only the error message at the end is likely to be useful. If you're still stuck, another tip is to ask for help on Stackoverflow.com using the [r] tag, where you'll find more than 20,000 questions about R error messages.
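One programmatic way to get at just that final message is conditionMessage(), which returns the message without the "Error in ..." call prefix; a small sketch:

```r
# Capture an error and keep only its message, dropping the call prefix
msg <- tryCatch(
  log("a"),  # a call that fails
  error = function(e) conditionMessage(e)
)
msg  # "non-numeric argument to mathematical function"
```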
Noam Ross has analyzed these questions to find the most commonly asked-about R error messages. Naturally, he used the stackr R package to interrogate the StackOverflow API, and downloaded around 10,000 error messages. He then used a regular expression to break the questions down into trigrams (sequences of 3 words) in order to count which were the most common. On that basis, the most common types of error messages were:
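The trigram idea is simple enough to sketch in base R (the toy strings below stand in for question titles):

```r
# Break each string into words, then count every run of three consecutive words
text <- c("could not find function",
          "could not find function",
          "object of type closure is not subsettable")
trigrams <- unlist(lapply(strsplit(text, "\\s+"), function(w) {
  if (length(w) < 3) return(character(0))
  sapply(seq_len(length(w) - 2), function(i) paste(w[i:(i + 2)], collapse = " "))
}))
sort(table(trigrams), decreasing = TRUE)[1:2]  # "could not find" and "not find function" lead
```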
Noam's full analysis is at the link below. In addition to providing insights about R's error messages, the trigram method he uses will be useful to anyone who needs to do frequency analysis on unstructured data.
Noam Ross (github): Common errors in R: An Empirical Investigation
by Herman Jopia
What is Binning?
Binning is the term used in scoring modeling for what is also known in machine learning as discretization: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.
Why Binning?
Though there is some reticence about it [1], the benefits of binning are fairly straightforward:
Unsupervised Discretization
Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition with two options: equal-length intervals and equal-frequency intervals.
Equal length intervals
Table 1. Time on Books and Credit Performance. Bin 6 has no bads, producing indeterminate metrics.
Equal frequency intervals
Table 2. Time on Books and Credit Performance. Different cutpoints may improve the Information Value (0.4969).
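Both unsupervised options are one-liners in base R. The sketch below uses simulated data standing in for a time-on-books variable (the exponential draw is an assumption made purely for illustration):

```r
# Equal-length vs. equal-frequency binning of a continuous characteristic
set.seed(42)
tob <- rexp(1000, rate = 1/24)        # hypothetical months on books
equal_length <- cut(tob, breaks = 6)  # 6 intervals of equal width
equal_freq <- cut(tob, breaks = quantile(tob, probs = seq(0, 1, by = 1/6)),
                  include.lowest = TRUE)  # 6 intervals of roughly equal count
table(equal_length)  # counts pile up in the low bins
table(equal_freq)    # about 167 observations per bin
```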
Supervised Discretization
Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find those cutpoints that maximize the difference between the groups.
In the past, analysts would iteratively move from fine binning to coarse binning, a very time-consuming process of finding the right cutpoints manually and visually (if ever). Nowadays, with algorithms like ChiMerge or recursive partitioning, two out of the several techniques available [2], analysts can find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
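The two metrics are straightforward to compute by hand: WoE is the log ratio of the good and bad distributions per bin, and IV sums the WoE weighted by the distribution gaps. The good/bad counts below are made up for illustration:

```r
# Weight of Evidence and Information Value for a hypothetical binned variable
goods <- c(100, 200, 300)          # good accounts per bin (invented)
bads  <- c( 50,  40,  10)          # bad accounts per bin (invented)
dist_g <- goods / sum(goods)       # distribution of goods across bins
dist_b <- bads / sum(bads)         # distribution of bads across bins
woe <- log(dist_g / dist_b)        # WoE per bin
iv <- sum((dist_g - dist_b) * woe) # Information Value of the characteristic
round(iv, 4)  # 1.0221
```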
An Example With 'smbinning'
Using the 'smbinning' package and its data (chileancredit), whose documentation can be found on its supporting website, the characteristic Time on Books is grouped into bins taking into account the Credit Performance (Good/Bad) to establish the optimal cutpoints and get meaningful, statistically different groups. The R code below, Table 3, and Figure 1 show the result of this application, which clearly surpasses the previous methods with the highest Information Value (0.5353).
# Load package and its data
library(smbinning)
data(chileancredit)

# Training and testing samples
chileancredit.train=subset(chileancredit,FlagSample==1)
chileancredit.test=subset(chileancredit,FlagSample==0)

# Run and save results
result=smbinning(df=chileancredit.train,y="FlagGB",x="TOB",p=0.05)
result$ivtable

# Relevant plots (2x2 Page)
par(mfrow=c(2,2))
boxplot(chileancredit.train$TOB~chileancredit.train$FlagGB,
        horizontal=T, frame=F, col="lightgray", main="Distribution")
mtext("Time on Books (Months)",3)
smbinning.plot(result,option="dist",sub="Time on Books (Months)")
smbinning.plot(result,option="badrate",sub="Time on Books (Months)")
smbinning.plot(result,option="WoE",sub="Time on Books (Months)")
Table 3. Time on Books cutpoints mapped to Credit Performance.
Figure 1. Plots generated by the package.
References
[1] Dinero, T. (1996). Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy, 8(1), 63-72.
[2] Garcia, S. et al. (2013). A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, 25(4), April 2013.
I spent last week at the Strata 2015 Conference in San José, California. As always, Strata made for a wonderful conference to catch up on the latest developments in big data and data science, and to connect with colleagues and friends old and new. Having been to every Strata conference since the first in XXXX, it's been interesting to see the focus change over the years. While past conferences have focused on big data and data science software (and to be sure, Hadoop, Spark, Python and R all got plenty of mentions this year), the focus has shifted more toward the applications and impacts of data science.
If you couldn't attend yourself, many of the keynote presentations are now available online. Follow the links below to watch a few of my favourites:
President Barack Obama introduced DJ Patil, the US Government's new Chief Data Scientist (and even cracked a half-decent stats joke). DJ reviewed the advances in Data Science over the past four years, with a focus on the rise of open data and current and future open government initiatives.
Solomon Hsiang gave an inspirational presentation on using statistical analysis to quantify the influence of climate change on conflict. This research was also the topic of a recent New York Times op-ed. The meta-analysis was conducted with R, and you can find the replication data and scripts here.
Eden Medina shared some lessons learned from a fascinating episode in computer history, when the Chilean government created Project Cybersyn in 1971 to create what we'd today call an economic dashboard, using only an obsolete mainframe and a "network" of Telex machines.
Joseph Sirosh described an interesting (and surprising) data science application in dairy farming: using pedometers on cows to detect when they are in heat, and even to influence the sex of their offspring.
Jeffrey Heer showed some examples of good and not-so-good data visualizations, and how he applied recent research in visual perception to the visualization tools in Trifacta.
Alistair Croll made some thought-provoking predictions about our future technological lives, including that digital agents may one day become the start of a new species.
That's just a sampling of the many keynotes from the conference. You can watch many of the others at the link below.
Strata + Hadoop World: Feb 17-20 2015, San Jose CA
by Joseph Rickert
Learning to effectively use any of the dozens of popular machine learning algorithms requires mastering many details and dealing with all kinds of practical issues. With all of this to consider, it might not be apparent to a person coming to machine learning from a background other than computer science or applied math that there are some accessible and very useful "theoretical" results. In this post, we will look at an upper bound result carefully described in section 2.2.1 of Schapire and Freund's book "Boosting: Foundations and Algorithms". (This book, published in 2012, is destined to be a classic. In the course of developing a thorough treatment of boosting algorithms, Schapire and Freund provide a compact introduction to the foundations of machine learning and relate the boosting algorithm to some exciting ideas from game theory, optimization, and information geometry.)
The following is an example of a probabilistic upper bound result for the generalization error of an arbitrary classifier.
Assume that:
Define E_{t}, the training error, to be the percentage of misclassified samples and E_{g}, the generalization error, to be the probability of misclassifying a single example (x,y) chosen at random from D.
Then, for any d greater than zero, with probability at least 1 - d, the following upper bound holds on the generalization error of h:
E_{g} <= E_{t} + sqrt(log(1/d)/(2m)) where m is the number of random samples (R)
Schapire and Freund approach this result as a coin flipping problem, noting that when a training example (x,y) is selected at random, the probability that h(x) does not equal y can be identified with a flipped coin coming up heads. The probability of getting a head, p, is fixed for all flips. The problem becomes that of determining whether the training error, the fraction of mismatches in a sequence of m flips, is significantly different from p.
The big trick in the proof of the above result is to realize that Hoeffding’s Inequality can be used to bound the binomial expression for the probability of getting at most (p - e)m heads in m trials where e is a small positive number. For our purposes, Hoeffding’s inequality can be stated as follows:
Let X_{1} . . . X_{m} be independent random variables taking values in [0,1]. Let A_{m} denote their average. Then P(A_{m} <= E[A_{m}] - e) <= exp(-2me^{2}).
If the X_{i} are binomial random variables with X_{i} = 1 when h(x) is not equal to y, then the training error E_{t}, as defined above, is equal to A_{m}, the average of the X_{i} (the fraction of misclassifications in the m flips), and E[A_{m}] = p is the generalization error, E_{g}. Hence, the event bounded by Hoeffding's Inequality can be written:
E_{g} >= E_{t} + e.
Now, letting d = exp(-2me^{2}) where d > 0 we get the result ( R ) above. What this says is that with probability at least 1 - d,
E_{t} + sqrt(log(1/d)/(2m)) is an upper bound for the generalization error.
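To get a feel for the numbers, the slack term can be evaluated directly; with m = 10,000 samples and d = 0.1 (the values used in the simulation later in this post) the bound is quite tight:

```r
# Evaluate the slack term sqrt(log(1/d)/(2m)) of the bound
m <- 10000
d <- 0.1
slack <- sqrt(log(1/d) / (2 * m))
slack  # about 0.0107: with probability at least 0.9, the generalization
       # error exceeds the training error by no more than ~1 percentage point
```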
A couple of things to notice about the result are:
The really big assumption is the one that slipped in at the very beginning, that the training samples and test samples are random draws from the same distribution. This is something that would be difficult to verify in practice, but serves the purpose of encouraging one to think about the underlying distributions that might govern the data in a classification problem.
The plot below provides a simple visualization of the result. It was generated by simulating draws from a binomial with a little noise added, where p = .4 and d = .1. This represents a classifier that does a little better than guessing. The red vertical line marks the value of the generalization error among the simulated upper bounds. The green lines focus on the 10% quantile.
As the result predicts, a little more than 90% of the upper bounds are larger than p.
And here is the code.
m <- 10000      # number of samples
p <- .4         # Probability of incorrect classification
N <- 1000       # Number of simulated sampling experiments
delta <- .1     # 1 - delta is the upper probability bound
gamma <- sqrt(log(1/delta)/(2*m))  # Constant term for the upper bound
Am <- vector("numeric",N)          # Allocate vector
for(i in 1:N){
  Am[i] <- sum(rbinom(m,1,p) + rnorm(m,0,.1))/m  # Simulate training error
}
u_bound <- Am + gamma              # Calculate upper bounds
plot(ecdf(u_bound), xlab="Upper Bound", col = "blue", lwd = 3,
     main = "Empir Dist (Binomial with noise)")
abline(v=.4, col = "red")
abline(h=.1, col = "green")
abline(v=quantile(u_bound,.1), col="green")
So, what does all this mean in practice? The result clearly pertains to an idealized situation, but to my mind it provides a rough guide as to how low you ought to be able to reduce your testing error. In some cases, it may even signal that you might want to look for better data.
by Joseph Rickert
Apache Spark, the open-source, cluster computing framework originally developed in the AMPLab at UC Berkeley and now championed by Databricks, is rapidly moving from the bleeding edge of data science to the mainstream. Interest in Spark, demand for training, and overall hype are on a trajectory to match the frenzy surrounding Hadoop in recent years. Next month's Strata + Hadoop World conference, for example, will offer three serious Spark training sessions: Apache Spark Advanced Training, SparkCamp, and Spark developer certification, with additional Spark-related talks on the schedule. It is only a matter of time before Spark becomes a big deal in the R world as well.
If you don't know much about Spark but want to learn more, a good place to start is the video of Reza Zadeh's keynote talk at the ACM Data Science Camp held last October at eBay in San Jose that has been recently posted.
Reza is a gifted speaker, an expert on the subject matter and adept at selecting and articulating the key points that can carry an audience towards comprehension. Reza starts slowly, beginning with the block diagram of the Spark architecture and spends some time emphasizing RDDs, Resilient Distributed Data Sets as the key feature that enables Spark's impressive performance and defines and circumscribes its capabilities.
After the preliminaries, Reza takes the audience on a deep dive into three algorithms in Spark's machine learning library MLlib (gradient descent logistic regression, PageRank, and singular value decomposition) and moves on to discuss some of the new features in Spark release 1.2.0, including All Pairs Similarity.
Reza's discussion of Spark's SVD implementation is a gem of a tutorial on computational linear algebra. The SVD algorithm considers two cases: the "Tall and Skinny" situation, where there are fewer than about 1,000 columns, and the "roughly square" case, where the numbers of rows and columns are about the same. I found it comforting to learn that the code for this latter case is based on highly reliable and "immensely optimized" Fortran77 code. (Some computational problems get solved and stay solved.)
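The tall-and-skinny case can be mimicked in base R: for an m x k matrix with small k, the singular values are the square roots of the eigenvalues of the k x k Gram matrix t(A) %*% A, so after a single pass over the tall data only a tiny eigenproblem remains. A quick check, on random data purely for illustration:

```r
# Singular values via the small Gram matrix vs. a direct SVD
set.seed(7)
A <- matrix(rnorm(5000 * 10), nrow = 5000)  # tall and skinny: 5000 x 10
gram <- crossprod(A)                        # 10 x 10 matrix t(A) %*% A
sv_gram <- sqrt(eigen(gram, symmetric = TRUE)$values)
sv_direct <- svd(A)$d
all.equal(sv_gram, sv_direct)  # TRUE up to numerical tolerance
```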
Reza's discussion of the All Pairs Similarity, based on the DIMSUM (Dimension Independent Matrix Square Using MapReduce) algorithm and a non-intuitive sampling procedure where frequently occurring pairs are sampled less often, is also illuminating.
To get some hands-on experience with Spark your next steps might be to watch the three hour, Databricks video: Intro to Apache Spark Training - Part 1.
From here, the next obvious question is: "How do I use Spark with R?" Spark itself is written in Scala and has bindings for Java, Python and R. Searching for a Spark demo online, however, will most likely turn up either a Scala or Python example. sparkR, the open source project to produce an R binding, is not as far along as the other languages. Indeed, a Cloudera web page refers to SparkR as "promising work". The SparkR GitHub page shows it to be a moderately active project with 410 commits to date from 15 contributors.
In SparkR: Enabling Interactive Data Science at Scale, Zongheng Yang (only a 3rd-year Berkeley undergraduate when he delivered this talk last July) lucidly works through a word count demo and a live presentation using sparkR with RStudio and a number of R packages and functions. Here is the code for his word count example.
SparkR Word count Example
Note the sparkR lapply() function which is an alias for the Spark map and mapPartitions functions.
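The word count pattern itself is easy to mirror in plain R, which is handy for checking the distributed logic on a laptop (the toy lines below are my own; sparkR's flatMap and reduceByKey distribute the same split-and-tally steps across a cluster):

```r
# Plain-R analogue of the map/reduce word count
lines <- c("to be or not to be", "be quick")
words <- unlist(strsplit(lines, " "))  # the flatMap step: lines -> words
counts <- table(words)                 # the reduceByKey step: tally by word
counts[["be"]]  # 3
```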
These are still early days for Spark and R. We would very much like to hear about your experiences with SparkR or any other effort to run R over Spark.
by Nick Elprin
Co-Founder Domino Data Lab
"R Notebooks" use the IPython Notebook UI to run R (rather than Python) in notebook cells, giving you an interactive R environment hosted on scalable servers, accessible through a web browser. This post describes how and why we built our "R Notebooks" feature.
Our product, Domino, is a platform that facilitates the end-to-end analytical lifecycle, from early-stage exploration, through experimentation and refinement, all the way to deploying or "operationalizing" a model. Among other things, Domino makes it easy to move long-running or computationally intensive R tasks onto powerful hardware. In our cloud-hosted environment, you can choose any type of Amazon EC2 machine you want to use; or if you deploy Domino on-premise in your enterprise, you can configure your own hardware tiers.
Domino was working great for users who wanted to run R scripts, but we had many users who also wanted to work interactively in R on a powerful server, without dealing with any infrastructure setup. I'll explain how we built our solution to this problem, but first, I'll describe the solution itself.
We wanted a solution that: (1) let our users work with R interactively; (2) on powerful machines; and (3) without requiring any setup or infrastructure management. For reasons I describe below, we adapted IPython Notebook to fill this need. The result is what we call an R Notebook: an interactive, IPython Notebook environment that works with R code. It even handles plotting and visual output!
So how does it work?
Like any other run in Domino, this will spin up a new machine (on hardware of your choosing), and automatically load it with your project files.
Any R command will work, including ones that load packages, and even calls to the system() function. Since Domino lets you spin up these notebooks on ridiculously powerful machines (e.g., 32 cores, 240GB of memory), let's show off a bit:
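The original screenshots don't survive in this copy, but here is the sort of R you might run in such a session: a bootstrap that fans its resampling out across every available core with the parallel package (the example is illustrative, not from Domino's post, and mclapply's forking requires a Unix-like OS such as the Linux EC2 machines described here):

```r
# Fan a simple bootstrap out across all available cores.
library(parallel)

x <- rnorm(1e6)
boot_means <- mclapply(seq_len(1000),
                       function(i) mean(sample(x, length(x), replace = TRUE)),
                       mc.cores = detectCores())
quantile(unlist(boot_means), c(0.025, 0.975))  # bootstrap interval for the mean
```

On a 32-core machine the 1,000 resamples are split across all 32 workers, so the loop finishes roughly an order of magnitude faster than a plain lapply().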
By interleaving code, comments, and graphics, the Notebook UI provides a great way to create and preserve a narrative about the analysis you're doing. The friendly UI also makes notebooks accessible to less technical users, letting you share your work with a broader audience.
Domino adds other nice features to your notebook sessions: each session is preserved as a snapshot, so you can get back to any past result and reproduce past work. And because Domino hosts all your notebooks (and data, and results) centrally, you can share your work with others just by sending a link.
Our vision for Domino is to be a platform that accelerates work across the entire analytical lifecycle, from early exploration, all the way to packaging and deployment of analytical models. We think we're well on our way toward that goal, and this post is about a recent feature we added to fill a gap in our support for early stages of that lifecycle: interactive work in R.
Analytical ideas move through different phases:
Exploration / Ideation. In the early stages of an idea, it's critical to be able to "play with data" interactively. You are trying different techniques, fixing issues quickly, to figure out what might work.
Refinement. Eventually you have an approach that you want to invest in, and you must refine or "harden" a model. Often this requires many more intensive experiments: for example, running a model over your entire data set with several different parameters, to see what works best.
Packaging and Deployment. Once you have something that works, typically it will be deployed for some ongoing use: either packaged into a UI for people to interact with, or deployed with some API (or web service) so software systems can consume it.
Domino offers solutions for all three phases, in multiple different languages, but we had a gap. For interactive exploratory work, we support IPython Notebooks for work in Python, but we didn't have a good solution for work in R.
Stage of the analytical lifecycle:

| | 1. Explore / Ideate | 2. Experiment / Refine | 3. Deploy / Operationalize |
|---|---|---|---|
| Requirements | Interactive environment | Able to run many experiments in parallel, quickly, and track work and results | Easily create a GUI or web service around your model |
| Our solution for R | Gap to address | Our bread and butter: easily run your scripts on remote machines, as many as you want, and keep them all tracked | Launchers for UI, and RServe powering API publishing |
| Our solution for Python | IPython Notebooks | Our bread and butter: easily run your scripts on remote machines, as many as you want, and keep them all tracked | Launchers for UI, and pyro powering API publishing |
Since we already had support for spinning up IPython Notebook servers inside Docker containers on arbitrary EC2 machines, we opted to use IPython Notebook for our R solution.
A little-known fact about IPython Notebook (likely obscured by its name) is that it can actually run code in a variety of other languages. In particular, its RMagic functionality lets you run R commands inside IPython Notebook cells by prepending your commands with the %R modifier. We adapted this "hack" (thanks, fperez!) to prepend the RMagic modifier automatically to every cell expression. The approach is to create a new IPython profile with a startup script that automatically prepends the %%R magic prefix to every expression you evaluate. The result is an interactive R notebook.
The exact steps were:
1. pip install rpy2
2. ipython profile create rkernel
3. Copy rkernel.py into ~/.ipython/profile_rkernel/startup

where rkernel.py is a slightly modified version of fperez's script. We just had to change the rmagic extension on line 15 to the rpy2.ipython extension, to be compatible with IPython Notebook 2.
"""A "native" IPython R kernel in 15 lines of code.
This isn't a real native R kernel, just a quick and dirty hack to get the
basics running in a few lines of code.
Put this into your startup directory for a profile named 'rkernel' or somesuch,
and upon startup, the kernel will imitate an R one by simply prepending `%%R`
to every cell.
"""
from IPython.core.interactiveshell import InteractiveShell

print '*** Initializing R Kernel ***'

ip = get_ipython()
ip.run_line_magic('load_ext', 'rpy2.ipython')
ip.run_line_magic('config', 'Application.verbose_crash=True')

# Monkey-patch run_cell so every cell is evaluated under the %%R cell magic
old_run_cell = InteractiveShell.run_cell

def run_cell(self, raw_cell, **kw):
    return old_run_cell(self, '%%R\n' + raw_cell, **kw)

InteractiveShell.run_cell = run_cell
Some folks who have used this have asked why we didn't just integrate RStudio Server, so you could spin up an RStudio session in the browser. The honest answer is that using IPython Notebook was much easier, since we already supported it. We are exploring an integration with RStudio Server, though. Please let us know if you would use it.
In the meantime, please try out our new R Notebook functionality and let us know what you think!