by Joseph Rickert

The installr package has some really nice functions for working with the daily package download logs for the RStudio CRAN mirror, which RStudio graciously makes available at http://cran-logs.rstudio.com/. The following code uses the download_RStudio_CRAN_data() function to download a month's worth of .gz compressed daily log files into the test3 directory and then uses the function read_RStudio_CRAN_data() to read all of these files into a data frame. (The portion of the status output provided shows the files being read in one at a time.) Next, the function most_downloaded_packages() calculates that the top six downloads for the month were: Rcpp, stringr, ggplot2, stringi, magrittr and plyr.

    # CODE TO DOWNLOAD LOG FILES FROM RSTUDIO CRAN MIRROR
    # FIND MOST DOWNLOADED PACKAGE AND PLOT DOWNLOADS
    # FOR SELECTED PACKAGES
    # -----------------------------------------------------------------
    library(installr)
    library(ggplot2)
    library(data.table) #for downloading
    # ----------------------------------------------------------------
    # Read data from RStudio site
    RStudio_CRAN_dir <- download_RStudio_CRAN_data(START = '2015-05-15', END = '2015-06-15', log_folder = "C:/DATA/test3")
    # read .gz compressed files from local directory
    RStudio_CRAN_data <- read_RStudio_CRAN_data(RStudio_CRAN_dir)
    #> RStudio_CRAN_data <- read_RStudio_CRAN_data(RStudio_CRAN_dir)
    #Reading C:/DATA/test3/2015-05-15.csv.gz ...
    #Reading C:/DATA/test3/2015-05-16.csv.gz ...
    #Reading C:/DATA/test3/2015-05-17.csv.gz ...
    #Reading C:/DATA/test3/2015-05-18.csv.gz ...
    #Reading C:/DATA/test3/2015-05-19.csv.gz ...
    #Reading C:/DATA/test3/2015-05-20.csv.gz ...
    #Reading C:/DATA/test3/2015-05-21.csv.gz ...
    #Reading C:/DATA/test3/2015-05-22.csv.gz ...
    dim(RStudio_CRAN_data)
    # [1] 8055660 10
    # Find the most downloaded packages
    pkg_list <- most_downloaded_packages(RStudio_CRAN_data)
    pkg_list
    #Rcpp stringr ggplot2 stringi magrittr plyr
    #125529 115282 103921 103727 102083 97183
    lineplot_package_downloads(names(pkg_list), RStudio_CRAN_data)
    # Look at plots for some packages
    barplot_package_users_per_day("checkpoint", RStudio_CRAN_data)
    #$total_installations
    #[1] 359
    barplot_package_users_per_day("Rcpp", RStudio_CRAN_data)
    #$total_installations
    #[1] 23832
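To make the download step less of a black box, here is a minimal sketch that builds one log URL per day for the same date range. It assumes the log files sit at `<year>/<date>.csv.gz` under the site root, and `cran_log_urls()` is a hypothetical helper written for illustration, not a function from installr.

```r
# Build one URL per daily log file between two dates (inclusive).
# Assumption: files live at <base>/<year>/<YYYY-MM-DD>.csv.gz
cran_log_urls <- function(start, end, base = "http://cran-logs.rstudio.com") {
  days <- seq(as.Date(start), as.Date(end), by = "day")
  paste0(base, "/", format(days, "%Y"), "/", format(days, "%Y-%m-%d"), ".csv.gz")
}

urls <- cran_log_urls("2015-05-15", "2015-06-15")
length(urls)   # number of daily files in the range
head(urls, 2)
```

Each URL could then be handed to base R's download.file() to fetch the corresponding file.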

The function lineplot_package_downloads() produces a multiple time series plot for the top six packages:

The barplot_package_users_per_day() function provides download plots. Here we contrast downloads for Revolution Analytics' checkpoint package and Rcpp.

Downloads for the checkpoint package look pretty uniform over the month. checkpoint is a relatively new, specialized package for dealing with reproducibility issues. The download pattern probably represents users discovering it. Rcpp, on the other hand, is essential to an incredible number of other R packages. The right-skewed plot most likely represents the tail end of the download cycle that started after Rcpp was upgraded on 5/1/15.
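The ranking itself is just a frequency table over the package column. The sketch below shows that idea in base R on a toy data frame; it is an approximation of what a function like most_downloaded_packages() has to compute, not installr's actual implementation, and `top_packages()` and the toy rows are made up for illustration.

```r
# Toy stand-in for the log data: one row per download (hypothetical values)
logs <- data.frame(
  date    = as.Date(c("2015-05-15", "2015-05-15", "2015-05-16",
                      "2015-05-16", "2015-05-16", "2015-05-17")),
  package = c("Rcpp", "ggplot2", "Rcpp", "stringr", "Rcpp", "ggplot2"),
  stringsAsFactors = FALSE
)

# Count downloads per package and keep the n most frequent
top_packages <- function(logs, n = 6) {
  counts <- sort(table(logs$package), decreasing = TRUE)
  counts[seq_len(min(n, length(counts)))]
}

top <- top_packages(logs, n = 2)
top
```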

All of this works well for small amounts of data. However, the fact that read_RStudio_CRAN_data() puts everything in a data frame presents a bit of a problem for working with longer time periods with the 6GB of RAM on my laptop. So, after downloading the files representing the period (5/28/14 to 5/28/15) to my laptop,

    # Convert .gz compressed files to .csv files
    in_names <- list.files("C:/DATA/RStudio_logs_1yr_gz", pattern = "\\.csv\\.gz$", full.names = TRUE)
    # Drop the trailing ".gz" to get the output file names
    out_names <- sub("\\.gz$", "", in_names)
    length(in_names)
    for (i in seq_along(in_names)) {
      df <- read.csv(in_names[i])
      write.csv(df, out_names[i], row.names = FALSE)
    }

I used the external memory algorithms in Revolution R Enterprise to work with the data on disk. First, rxImport() brings all of the .csv files into a single .xdf file and stores it on my laptop. (Note that the rxGetInfo() function indicates that the file has over 90 million rows.) Then, the super efficient rxCube() function tabulates the package counts.

    # REVOSCALE R CODE TO IMPORT A YEAR'S WORTH OF DATA
    data_dir <- "C:/DATA/RStudio_logs_1yr"
    in_names <- list.files(data_dir, pattern = "\\.csv\\.gz$", full.names = TRUE)
    out_names <- sub("\\.gz$", "", in_names)
    #----------------------------------------------------
    # Import to .xdf file
    # Establish the column classes for the variables
    colInfo <- list(
      list(name = "date", type = "character"),
      list(name = "time", type = "character"),
      list(name = "size", type = "integer"),
      list(name = "r_version", type = "factor"),
      list(name = "r_arch", type = "factor"),
      list(name = "r_os", type = "factor"),
      list(name = "package", type = "factor"),
      list(name = "version", type = "factor"),
      list(name = "country", type = "factor"),
      list(name = "ip_id", type = "integer"))
    num_files <- length(out_names)
    out_file <- file.path(data_dir, "RStudio_logs_1yr")
    append = FALSE
    for (i in 1:num_files) {
      rxImport(inData = out_names[i], outFile = out_file, colInfo = colInfo, append = append, overwrite = TRUE)
      append = TRUE
    }
    # Look at a summary of the imported data
    rxGetInfo(out_file)

    #File name: C:\DATA\RStudio_logs_1yr\RStudio_logs_1yr.xdf
    #Number of observations: 90200221
    #Number of variables: 10
    # Long form tabulation
    cube1 <- rxCube(~ package, data = out_file)
    # Computation time: 5.907 seconds.
    cube1 <- as.data.frame(cube1)
    sort1 <- rxSort(cube1, decreasing = TRUE, sortByVars = "Counts")
    #Time to sort data file: 0.078 seconds
    write.csv(head(sort1, 100), "Top_100_Packages.csv")
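The external memory idea can be illustrated without RevoScaleR: read one daily file at a time, tabulate it, and keep only the running totals in RAM. The base R sketch below does this on two tiny temporary files; `count_packages()` and the file contents are hypothetical, and this is only an analogue of what rxCube() does, not its implementation.

```r
# Write two small stand-in "daily logs" to a temporary directory
tmp_dir <- tempfile("logs_")
dir.create(tmp_dir)
write.csv(data.frame(package = c("Rcpp", "ggplot2", "Rcpp")),
          file.path(tmp_dir, "2015-05-15.csv"), row.names = FALSE)
write.csv(data.frame(package = c("Rcpp", "plyr")),
          file.path(tmp_dir, "2015-05-16.csv"), row.names = FALSE)

# Accumulate package counts one file at a time, so only one day's
# data plus the running totals is ever held in memory
count_packages <- function(files) {
  totals <- integer(0)
  for (f in files) {
    day <- read.csv(f, stringsAsFactors = FALSE)
    counts <- table(day$package)
    pkgs <- names(counts)
    totals[setdiff(pkgs, names(totals))] <- 0L   # register unseen packages
    totals[pkgs] <- totals[pkgs] + as.integer(counts)
  }
  sort(totals, decreasing = TRUE)
}

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
totals <- count_packages(files)
totals
```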

Here are the download counts for the top 100 packages for the period (5/28/14 to 5/28/15).

You can download this data here: Download Top_100_Packages

by Bill Jacobs, Director Technical Sales, Microsoft Advanced Analytics

In the course of working with our Hadoop users, we are often asked, what's the best way to integrate R with Hadoop?

The answer, in nearly all cases, is: it depends.

Alternatives ranging from open source R on workstations to parallelized commercial products like Revolution R Enterprise, and many steps in between, present themselves. Between these extremes lies a range of options with unique abilities to scale data, performance, capability and ease of use.

And so, the right choice or choices depend on your data size, budget, skill, patience and governance limitations.

In this post, I’ll summarize the alternatives using pure open source R and some of their advantages. In a subsequent post, I’ll describe the options for achieving even greater scale, speed, stability and ease of development by combining open source and commercial technologies.

These two posts are written to help current R users who are novices at Hadoop understand and select solutions to evaluate.

As with most things open source, the first consideration is of course monetary. Isn’t it always? The good news is that there are multiple alternatives that are free, and additional capabilities under development in various open source projects.

We generally see four options for building R-to-Hadoop integration using entirely open source stacks.

The baseline approach, running R on a workstation and pulling data from Hadoop, has simplicity and cost as its greatest advantages. It’s free. End to end free. What else in life is?

Through packages that Revolution contributed to open source, including rhdfs and rhbase, R users can directly ingest data from both the hdfs file system and the hbase database subsystem in Hadoop. Both connectors are part of the RHadoop project created and maintained by Revolution and are a go-to choice.

Additional options exist as well. The RHive package executes Hive’s HQL SQL-like query language directly from R, and provides functions for retrieving metadata from Hive such as database names, table names, column names, etc.

The rhive package, in particular, has the advantage that its data operations allow some work to be pushed down into Hadoop, avoiding data movement and parallelizing operations for big speed increases. Similar “push-down” can be achieved with rhbase as well. However, neither is a particularly rich environment, and invariably, complex analytical problems will reveal some gaps in capability.

Beyond the somewhat limited push-down capabilities, R’s best at working on modest data sampled from hdfs, hbase or hive, and in this way, current R users can get going with Hadoop quickly.

Once you tire of R’s memory barriers on your laptop, the obvious next path is a shared server. With today’s technologies, you can equip a powerful server for only a few thousand dollars, and easily share it between a few users. Using Windows or Linux with 256GB or 512GB of RAM, R can be used to analyze files into the hundreds of gigabytes, albeit not as fast as perhaps you’d like.

Like option 1, R on a shared server can also leverage push-down capabilities of the rhbase and rhive packages to achieve parallelism and avoid data movement. However, as with workstations, the pushdown capabilities of rhive and rhbase are limited.

And of course, while lots of RAM keeps the dreaded out-of-memory exhaustion at bay, it does little for compute performance, and depends on sharing skills learned [or perhaps not learned] in kindergarten. For these reasons, consider a shared server to be a great add-on to R on workstations but not a complete substitute.

Replacing the CRAN download of R with the Revolution R Open (RRO) distribution enhances performance further. RRO is, like R itself, open source, 100% R and free for the download. It accelerates math computations using the Intel Math Kernel Library (MKL) and is 100% compatible with the algorithms in CRAN and other repositories like Bioconductor. No changes are required to R scripts, and the acceleration the MKL offers varies from negligible to an order of magnitude for scripts making intensive use of certain math and linear algebra primitives. You can anticipate that RRO can double your average performance if you’re doing math operations in the language.

As with options 1 and 2, Revolution R Open can be used with connectors like rhdfs, and can connect and push work down into Hadoop through rhbase and rhive.
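A quick way to see whether a faster math library matters for your own workload is to time a dense linear algebra operation under each R distribution. The sketch below is a minimal, assumption-laden benchmark: the matrix size is arbitrary and kept small so it runs anywhere, and the timing you get depends entirely on your hardware and BLAS.

```r
# Time a dense matrix multiply; run the same script under CRAN R and RRO
set.seed(42)
n <- 500                               # arbitrary size chosen for illustration
m <- matrix(rnorm(n * n), n, n)
timing <- system.time(p <- m %*% m)    # the multiply is delegated to the BLAS
timing[["elapsed"]]                    # wall-clock seconds on this machine
```

Scripts dominated by this kind of operation are the ones most likely to benefit; scripts dominated by data munging or I/O may see little change.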

Once you find that your problem set is too big, or your patience is being taxed on a workstation or server and the limitations of rhbase and rhive push down are impeding progress, you’re ready for running R inside of Hadoop.

The open source RHadoop project that includes rhdfs, rhbase and plyrmr also includes a package rmr2 that enables R users to build Hadoop map and reduce operations using R functions. Using mappers, R functions are applied to all of the data blocks that compose an hdfs file, an hbase table or other data sets, and the results can be sent to a reducer, also an R function, for aggregation or analysis. All work is conducted inside of Hadoop but is built in R.

Let’s be clear. Applying R functions on each hdfs file segment is a great way to accelerate computation. But for most, it is the avoidance of moving data that really accentuates performance. To do this, rmr2 applies R functions to the data residing on Hadoop nodes rather than moving the data to where R resides.
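The map and reduce pattern can be tried locally before touching a cluster. The sketch below imitates it in base R: the data are split into blocks, a mapper runs on each block, and a reducer combines the partial results. It is deliberately not the rmr2 API; the block size, `mapper()` and `reducer()` are illustrative stand-ins.

```r
# Local imitation of the map/reduce pattern (this is NOT the rmr2 API)
values <- c(3, 8, 1, 9, 4, 7, 2, 6, 5, 10)
blocks <- split(values, ceiling(seq_along(values) / 4))  # stand-in for hdfs blocks

# Mapper: runs on one block and emits a small partial result
mapper <- function(block) c(sum = sum(block), n = length(block))

# Reducer: combines two partial results into one
reducer <- function(a, b) a + b

partials <- lapply(blocks, mapper)      # on a cluster, this step runs where the data lives
total <- Reduce(reducer, partials)
block_mean <- total[["sum"]] / total[["n"]]
block_mean
```

Only the two-number partial results travel between steps, which is the data-movement saving the paragraph above describes.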

While rmr2 gives essentially unlimited capabilities, as a data scientist or statistician, your thoughts will soon turn to computing entire algorithms in R on large data sets. Using rmr2 in this way complicates development for the R programmer, because he or she must write the entire logic of the desired algorithm or adapt existing CRAN algorithms. She or he must then validate that the algorithm is accurate and reflects the expected mathematical result, and write code for the myriad corner cases such as missing data.

rmr2 requires coding on your part to manage parallelization. This may be trivial for data transformation operations, aggregates, etc., or quite tedious if you’re trying to train predictive models or build classifiers on large data.

While rmr2 can be more tedious than other approaches, it is not untenable, and most R programmers will find rmr2 much easier than resorting to Java-based development of Hadoop mappers and reducers. While somewhat tedious, it a) is fully open source, b) helps to parallelize computation to address larger data sets, c) skips painful data movement, d) is broadly used so you’ll find help available, and e) is free. Not bad.

rmr2 is not the only option in this category: a similar package called rhipe is also available and provides similar capabilities. rhipe is described here and here and is downloadable from GitHub.

The range of open source-based options for using R with Hadoop is expanding. The Apache Spark community, for example, is rapidly improving R integration via the predictably named SparkR. Today, SparkR provides access to Spark from R much as rmr2 and rhipe do for Hadoop MapReduce.

We expect that, in the future, the SparkR team will add support for Spark’s MLlib machine learning algorithm library, providing execution directly from R. Availability dates haven’t been widely published.

Perhaps the most exciting observation is that R has become “table stakes” for platform vendors. Our partners at Cloudera, Hortonworks, MapR and others, along with database vendors and others, are all keenly aware of the dominance of R among the large and growing data science community, and R’s importance as a means to extract insights and value from the burgeoning data repositories built atop Hadoop.

In a subsequent post, I’ll review the options for creating even greater performance, simplicity, portability and scale available to R users by expanding the scope from open source only solutions to those like Revolution R Enterprise for Hadoop.

by Joseph Rickert

Because of its simplicity and good performance over a wide spectrum of classification problems the Naïve Bayes classifier ought to be on everyone's short list of machine learning algorithms. Now, with version 7.4 we have a high performance Naïve Bayes classifier in Revolution R Enterprise too. Like all Parallel External Memory Algorithms (PEMAs) in the RevoScaleR package, rxNaiveBayes is an inherently parallel algorithm that may be distributed across Microsoft HPC, Linux and Hadoop clusters and may be run on data in Teradata databases.

The following example shows how to get started with rxNaiveBayes() on a moderately sized data set in your local environment. It uses the Mortgage data set, which may be downloaded from the Revolution Analytics data set repository. The first block of code imports the .csv files for the years 2000 through 2008 and concatenates them into a single training file in the .XDF binary format. Then, the data for the year 2009 is imported to a test file that will be used for making predictions.

    #-----------------------------------------------
    # Set up the data location information
    bigDataDir <- "C:/Data/Mortgage"
    mortCsvDataName <- file.path(bigDataDir, "mortDefault")
    trainingDataFileName <- "mortDefaultTraining"
    mortCsv2009 <- paste(mortCsvDataName, "2009.csv", sep = "")
    targetDataFileName <- "mortDefault2009.xdf"
    #---------------------------------------
    # Import the data from multiple .csv files into 2 .XDF files
    # One file, the training file containing data from the years
    # 2000 through 2008.
    # The other file, the test file, containing data from the year 2009.
    defaultLevels <- as.character(c(0,1))
    ageLevels <- as.character(c(0:40))
    yearLevels <- as.character(c(2000:2009))
    colInfo <- list(list(name = "default", type = "factor", levels = defaultLevels),
                    list(name = "houseAge", type = "factor", levels = ageLevels),
                    list(name = "year", type = "factor", levels = yearLevels))
    append = FALSE
    for (i in 2000:2008) {
      importFile <- paste(mortCsvDataName, i, ".csv", sep = "")
      rxImport(inData = importFile, outFile = trainingDataFileName, colInfo = colInfo, append = append, overwrite = TRUE)
      append = TRUE
    }

The rxGetInfo() command shows that the training file has 9 million observations with 6 variables and the test file contains 1 million observations. The binary factor variable, default, which indicates whether or not an individual defaulted on the mortgage, will be the target variable in the classification exercise.

    rxGetInfo(trainingDataFileName, getVarInfo=TRUE)
    #File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefaultTraining.xdf
    #Number of observations: 9e+06
    #Number of variables: 6
    #Number of blocks: 18
    #Compression type: zlib
    #Variable information:
    #Var 1: creditScore, Type: integer, Low/High: (432, 955)
    #Var 2: houseAge
    #41 factor levels: 0 1 2 3 4 ... 36 37 38 39 40
    #Var 3: yearsEmploy, Type: integer, Low/High: (0, 15)
    #Var 4: ccDebt, Type: integer, Low/High: (0, 15566)
    #Var 5: year
    #10 factor levels: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
    #Var 6: default
    #2 factor levels: 0 1
    rxImport(inData = mortCsv2009, outFile = targetDataFileName, colInfo = colInfo)
    rxGetInfo(targetDataFileName)
    #> rxGetInfo(targetDataFileName)
    #File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefault2009.xdf
    #Number of observations: 1e+06
    #Number of variables: 6
    #Number of blocks: 2
    #Compression type: zlib

Next, the rxNaiveBayes() function is used to fit a classification model with default as the target variable and year, credit score, years employed and credit card debt as predictors. Note that the smoothingFactor parameter instructs the classifier to perform Laplace smoothing. (Since the conditional probabilities are being multiplied in the model, adding a small number to 0 probabilities precludes missing categories from wiping out the calculation.) Also note that it took about 1.9 seconds to fit the model on my modest Lenovo Thinkpad, which is powered by an Intel i7 5600U processor and equipped with 8GB of RAM.

    # Build the classifier on the training data
    mortNB <- rxNaiveBayes(default ~ year + creditScore + yearsEmploy + ccDebt, data = trainingDataFileName, smoothingFactor = 1)
    #Rows Read: 500000, Total Rows Processed: 8500000, Total Chunk Time: 0.110 seconds
    #Rows Read: 500000, Total Rows Processed: 9000000, Total Chunk Time: 0.125 seconds
    #Computation time: 1.875 seconds.

Looking at the model object we see that conditional probabilities are calculated for all of the factor (categorical) variables and means and standard deviations are calculated for numeric variables. rxNaiveBayes() follows the standard practice of assuming that these variables follow Gaussian distributions.
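The ingredients just described (Laplace-smoothed conditional probabilities for factors, Gaussian densities for numeric predictors, multiplied together with the prior) fit in a few lines of base R. The sketch below uses six made-up rows and hypothetical helpers `laplace_cond_prob()` and `predict_nb()`; it illustrates the textbook calculation under those assumptions and is not the rxNaiveBayes() implementation.

```r
# Toy training data (made-up values); note that year 2002 never occurs
train <- data.frame(
  default = factor(c(0, 0, 0, 0, 1, 1), levels = c(0, 1)),
  year    = factor(c("2000", "2001", "2000", "2001", "2001", "2001"),
                   levels = c("2000", "2001", "2002")),
  ccDebt  = c(4000, 5000, 4500, 5500, 9000, 9500)
)

# Laplace-smoothed P(x | y): add `smoothing` to every cell before normalizing
laplace_cond_prob <- function(x, y, smoothing = 1) {
  counts <- table(y, x)
  (counts + smoothing) / (rowSums(counts) + smoothing * ncol(counts))
}
cond_year <- laplace_cond_prob(train$year, train$default)

# Class priors, plus per-class mean and sd for the numeric predictor
prior <- prop.table(table(train$default))
mu    <- as.numeric(tapply(train$ccDebt, train$default, mean))
sigma <- as.numeric(tapply(train$ccDebt, train$default, sd))

# Posterior class probabilities for one new observation
predict_nb <- function(year, ccDebt) {
  score <- as.numeric(prior) * cond_year[, year] * dnorm(ccDebt, mu, sigma)
  score / sum(score)
}

cond_year                 # no zero cells, even for the unseen year 2002
post <- predict_nb("2001", 9200)
post
```

Without the smoothing term, the 2002 column would contain zeros and any observation from that year would get a product of zero for both classes.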

    #> mortNB
    #
    #Naive Bayes Classifier
    #
    #Call:
    #rxNaiveBayes(formula = default ~ year + creditScore + yearsEmploy +
    #ccDebt, data = trainingDataFileName, smoothingFactor = 1)
    #
    #A priori probabilities:
    #default
    #0 1
    #0.997242889 0.002757111
    #
    #Predictor types:
    #Variable Type
    #1 year factor
    #2 creditScore numeric
    #3 yearsEmploy numeric
    #4 ccDebt numeric
    #
    #Conditional probabilities:
    #$year
    #year
    #default 2000 2001 2002 2003 2004
    #0 1.113034e-01 1.110692e-01 1.112866e-01 1.113183e-01 1.113589e-01
    #1 4.157267e-02 1.262488e-01 4.765549e-02 3.617467e-02 2.151144e-02
    #year
    #default 2005 2006 2007 2008 2009
    #0 1.113663e-01 1.113403e-01 1.111888e-01 1.097681e-01 1.114182e-07
    #1 1.885272e-02 2.823880e-02 8.302449e-02 5.966806e-01 4.028360e-05
    #
    #$creditScore
    #Means StdDev
    #0 700.0839 50.00289
    #1 686.5243 49.71074
    #
    #$yearsEmploy
    #Means StdDev
    #0 5.006873 2.009446
    #1 4.133030 1.969213
    #
    #$ccDebt
    #Means StdDev
    #0 4991.582 1976.716
    #1 9349.423 1459.797

Next, we use the rxPredict() function to predict default values for the test data set. Setting the type = "prob" parameter produced the table of probabilities below. Using the default for type would have produced only the default_Pred column of forecasts. In a multi-value forecast, the probability table would contain entries for all possible values.

    # use the model to predict whether a loan will default on the test data
    mortNBPred <- rxPredict(mortNB, data = targetDataFileName, type="prob")
    #Rows Read: 500000, Total Rows Processed: 500000, Total Chunk Time: 3.876 seconds
    #Rows Read: 500000, Total Rows Processed: 1000000, Total Chunk Time: 2.280 seconds

    names(mortNBPred) <- c("prob_0","prob_1")
    mortNBPred$default_Pred <- as.factor(round(mortNBPred$prob_1))
    #head(mortNBPred)
    #prob_0 prob_1 default_Pred
    #1 0.9968860 0.003114038 0
    #2 0.9569425 0.043057472 0
    #3 0.5725627 0.427437291 0
    #4 0.9989603 0.001039729 0
    #5 0.7372746 0.262725382 0
    #6 0.4142266 0.585773432 1

In this next step, we tabulate the actual vs. predicted values for the test data set to produce the "confusion matrix" and an estimate of the misclassification rate.

    # Tabulate the actual and predicted values
    actual_value <- rxDataStep(targetDataFileName,maxRowsByCols=6000000)[["default"]]
    predicted_value <- mortNBPred[["default_Pred"]]
    results <- table(predicted_value,actual_value)
    #> results
    #actual_value
    #predicted_value 0 1
    #0 877272 3792
    #1 97987 20949
    # Elements 2 and 3 (column-major order) are the off-diagonal, misclassified cells
    pctMisclassified <- sum(results[2:3])/sum(results)*100
    pctMisclassified
    #[1] 10.1779

Since the results object produced above is an ordinary table, we can use the confusionMatrix() function from the caret package to produce additional performance measures.

    # Use confusionMatrix from the caret package to look at the results
    library(caret)
    library(e1071)
    confusionMatrix(results,positive="1")
    #Confusion Matrix and Statistics
    #
    #actual_value
    #predicted_value 0 1
    #0 877272 3792
    #1 97987 20949
    #
    #Accuracy : 0.8982
    #95% CI : (0.8976, 0.8988)
    #No Information Rate : 0.9753
    #P-Value [Acc > NIR] : 1
    #
    #Kappa : NA
    #Mcnemar's Test P-Value : <2e-16
    #
    #Sensitivity : 0.84673
    #Specificity : 0.89953
    #Pos Pred Value : 0.17614
    #Neg Pred Value : 0.99570
    #Prevalence : 0.02474
    #Detection Rate : 0.02095
    #Detection Prevalence : 0.11894
    #Balanced Accuracy : 0.87313
    #
    #'Positive' Class : 1
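The headline figures in that output follow directly from the four cells of the table, and it can be reassuring to recompute them by hand. The sketch below rebuilds the 2x2 table from the counts printed above and derives accuracy, sensitivity, specificity and the misclassification rate with base R only; the variable names are illustrative.

```r
# Rebuild the confusion matrix from the counts shown above
# (rows = predicted, columns = actual; matrix() fills column by column)
results <- matrix(c(877272, 97987, 3792, 20949), nrow = 2,
                  dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

tn <- results["0", "0"]   # predicted 0, actual 0
fp <- results["1", "0"]   # predicted 1, actual 0
fn <- results["0", "1"]   # predicted 0, actual 1
tp <- results["1", "1"]   # predicted 1, actual 1

accuracy          <- (tp + tn) / sum(results)
sensitivity       <- tp / (tp + fn)     # share of actual defaults that were caught
specificity       <- tn / (tn + fp)     # share of non-defaults correctly passed
pct_misclassified <- (fp + fn) / sum(results) * 100

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 5)
pct_misclassified
```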

Finally, we use the hist() function to look at a histogram (not shown) of the actual values to get a feel for how unbalanced the data set is, and then use the rxRocCurve() function to produce the ROC Curve.

    roc_data <- data.frame(mortNBPred$prob_1,as.integer(actual_value)-1)
    names(roc_data) <- c("predicted_value","actual_value")
    head(roc_data)
    hist(roc_data$actual_value)
    rxRocCurve("actual_value","predicted_value",roc_data,title="ROC Curve for Naive Bayes Mortgage Defaults Model")

Here we have a "picture-perfect" representation of how one hopes a classifier will perform.
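For readers without RevoScaleR, the quantities behind an ROC curve are easy to compute directly. The sketch below, on four made-up scores, sweeps the classification threshold to get true and false positive rates and computes the area under the curve from ranks; `roc_points()` and `auc_rank()` are hypothetical helpers, not part of any package mentioned here.

```r
# ROC points: true/false positive rates at each distinct score threshold
roc_points <- function(score, label) {
  thresholds <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) mean(score[label == 1] >= t))
  fpr <- sapply(thresholds, function(t) mean(score[label == 0] >= t))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

score <- c(0.1, 0.4, 0.35, 0.8)   # toy predicted probabilities
label <- c(0, 0, 1, 1)            # toy actual classes
roc <- roc_points(score, label)
auc <- auc_rank(score, label)
roc
auc
```

Plotting roc$fpr against roc$tpr with plot(..., type = "s") gives the familiar step curve.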

For more on the Naïve Bayes classification algorithm have a look at these two papers referenced in the Wikipedia link above.

The first is a prescient 1961 paper by Marvin Minsky that explicitly calls attention to the naïve independence assumption. The second paper provides some theoretical arguments for why the overall excellent performance of the Naïve Bayes Classifier is not accidental.

The latest update to Revolution R Open, RRO 3.2.0, is now available for download from MRAN. In addition to new features, this release tracks the version number of the underlying R engine (so this is the release following RRO 8.0.3).

Revolution R Open 3.2.0 includes:

- The latest R engine, R 3.2.0. This includes many improvements, including faster processing, reduced memory usage, support for bigger in-memory objects, and an improved byte compiler.
- Multi-threaded math processing, reducing the time for some numerical operations on multi-core systems.
- A focus on reproducibility, with access to a fixed CRAN snapshot taken on May 1, 2015. Many new and updated packages are available since the previous release of RRO -- see the latest Package Spotlight for details. CRAN packages released since May 1 can be easily (and reproducibly!) accessed with the checkpoint function.
- Binary downloads for Windows, Mac and Linux systems.
- 100% compatibility with R 3.2.0, RStudio and all other R-based applications.

You can download Revolution R Open now from the link below, and we welcome comments, suggestions and other discussion on the RRO Google Group. If you're new to Revolution R Open, here are some tips to get started, and there are many data sources you can explore with RRO. Thanks go as always to the contributors to the R Project upon which RRO is built.

*by Andrew Ekstrom*

*Recovering physicist, applied mathematician and graduate student in applied Stats and systems engineering*

We know that R is a great system for performing statistical analysis. The price is quite nice too ;-) . As a graduate student, I need a cheap replacement for Matlab and/or Maple. Well, R can do that too. I’m running a large program that benefits from parallel processing. RRO 8.0.2 with the MKL works exceedingly well.

For a project I am working on, I need to generate a really large matrix (10,000x10,000) and raise it to really high powers (like 10^17). This is part of my effort to model chemical kinetics reactions, specifically polymers. I’m using a Markov matrix of 5,000x5,000 and now 10,000x10,000 to simulate polymer chain growth at femtosecond timescales.

At the beginning of this winter semester, I used Maple 18 originally. I was running my program on a Windows 7 Pro computer using an intel I7 – 3700K (3.5GHz) quad core processor with 32GB of DDR3 ram. My full program took, well, WWWWWAAAAAAAYYYYYYYY TTTTTTTTOOOOOOOO LLLLLOOOONNNNGGGGGG!!!!!!!!

After a week, my computer would still be running. I also noticed that my computer would use 12% -13% of the processor power. With that in mind, I went to the local computer parts superstore and consulted with the sales staff. I ended up getting a “Gamer” rig when I purchased a new AMD FX9590 processor (4.7GHz on 8 cores) and dropped it into a new mobo. This new computer ran the same Maple program with slightly better results. It took 4-5 days to complete... assuming no one else used the computer and turned it off.

After searching for a better method (meaning better software) for running my program, I decided to try R. After looking around for a few hours, I was able to rewrite my program using R. YEAH! Using the basic R (version 3.1.2), my new program only took a few days (2-3). A nice feature of R is that its BLAS and LAPACK implementation is an improvement over what Maple 18 uses. Even though R 3.1.2 is faster than Maple 18, R only used 12%-13% of my processor.

Why do I keep bringing up the 12%-13% CPU usage? Well, it means that on my 8 core processor, only 1 core is doing all the work. (1/8 = 0.125) Imagine you go out and buy a new car. This car has a big V8 engine but, only 1 cylinder runs at a time. Even though you have 7 other cylinders in the car, they are NOT used. If that was your car, you would be furious. For a computer program, this is standard protocol. A cure for this type of silliness is to use parallel programming.

Unfortunately, I AM NOT A PROGRAMMER! I make things happen with a minimal amount of typing. I’m very likely to use “default settings” because I’m likely to mistype something and spend an hour trying to figure out, “Is that a colon or a semi colon?” So when I looked around at other websites discussing how to compile and/or install different BLAS and LAPACK libraries for R, I started thinking, “I wish I was taking QED right now. (QED = Quantum Electro-Dynamics)” I also use Windows, and most of the websites I saw discussed doing this in Linux.

That led me to Revolution Analytics RRO. I installed RRO version 8.0.2 and the MKL available from here: http://mran.revolutionanalytics.com/download/#download

RRO uses Intel’s Math Kernel Library, which is updated and upgraded to run certain types of calculations in parallel. Yes, parallel processing in Windows, which is step one of HPC (High Performance Computing) and something many of my comp sci friends and faculty said was difficult to do.

A big part of my project is raising a matrix to a power. This is a highly parallelizable process. By that I mean, calculating element A(n,n) in the new matrix does not depend upon the value of A(x,x) in the new matrix. They only care about what is in the old matrix. Using the old style (series) computing, you calculate A(1,1), then A(1,2), A(1,3) … A(n,n). With parallel programming, on my 8 core AMD processor, I can calculate A(1,1), A(1,2), A(1,3) … A(1,8) at the same time. If these calculations were “perfectly parallel” I would get my results 8 times faster. For those of us that have read other blog posts on RevolutionAnalytics.com, you know that the speed boost for parallel programming is great, but not perfect. (Almost like it follows the laws of thermodynamics.) By using RRO, I was able to run my program in R and get results for all of my calculations in 6-8 hours. That got me thinking.

If parallel processing on 8 cores instead of series processing on 1 core is a major step up, can I boost the parallel processing possibility? Yes. GPU processors like the Tesla and FirePro are nice and all but:

1) Using them with R requires programming and using Linux. Two things I don’t have time to do.

2) Entry level Tesla and Good Firepro GPUs cost a lot of money. Something I don’t have a lot of right now.

The other option is using an Intel Phi coprocessor, or two. Fortunately, when I started looking, I could pick up a Phi coprocessor for cheap. Like $155 cheap for a brand new coprocessor from an authorized retailer. The video card in my computer cost more than my 2 Phi’s. The big issue is getting a motherboard that has the ability to handle the Phi’s. Phi coprocessors have 6+GB of RAM. Most mobo’s can’t handle more than 4GB of RAM through a PCI-E 3.0 slot. So, I bought a second mobo as a “hobby” project computer. This new mobo is intended for “workstations” and has 4 PCI-E 3.0 slots. That gives me enough room for a good video card and 2 Phi’s. This new Workstation PC has an Intel Xeon E5-2620V3 (2.4GHz 6-core, 12-thread) processor, 2 Intel Xeon Phi coprocessors 31S1P (57 cores with 4 threads per core at 1.1GHz per thread for a total of 456 threads) and 48GB DDR4 RAM.

The Intel Phi coprocessors work well with the Intel MKL. The same MKL RRO uses. Which means, if I use RRO with my Phi’s, after they are properly set up, I should be good to go….. Intel doesn’t make this easy. (I cobbled together the information from 6-7 different sources. Each source had a small piece of the puzzle.) The Phi’s are definitely not “Plug and Play”. I used MPSS version 3.4 for Windows 7. I downloaded the drivers from here:

https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#wn34rel

I had to go into the command prompt and follow some of the directions available here. (Helpful hint, use micinfo to check your Phi coprocessors after step 9 in section 2.2.3 “Updating the Flash”.)

http://registrationcenter.intel.com/irc_nas/6252/readme-windows.pdf

After many emails to Revolution Analytics staff, I was able to get the Phi’s up and running! Now, my Phi’s work harmoniously with MKL. Most of the information I needed is available here. https://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf

Following the paper and website above, I needed to create some environment variables. The generic ones are:

MKL_MIC_ENABLE=1

OFFLOAD_DEVICES=*<list>*

MKL_MIC_MAX_MEMORY=2GB

MIC_ENV_PREFIX=MIC

MIC_OMP_NUM_THREADS=###

MIC_KMP_AFFINITY=balanced

Since I have 2 Phi coprocessors, my <list> is 0, 1. (At least this is the list that worked.) I set MKL_MIC_MAX_MEMORY to 8GB. (I have the RAM to do it, so why not.) MIC_OMP_NUM_THREADS = 456.
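For experimentation, the same variables can also be set for a single R session with base R's Sys.setenv(). The sketch below uses the values quoted above, with the device list written as "0,1" (the exact spacing is an assumption). Whether the MKL honors variables set after R has already started is not something this sketch verifies, so setting them at the system level, as described above, remains the safer route.

```r
# Set the offload-related variables for the current R session only.
# Values are those quoted in the text; "0,1" formatting is an assumption.
Sys.setenv(
  MKL_MIC_ENABLE      = "1",
  OFFLOAD_DEVICES     = "0,1",
  MKL_MIC_MAX_MEMORY  = "8GB",
  MIC_ENV_PREFIX      = "MIC",
  MIC_OMP_NUM_THREADS = "456",
  MIC_KMP_AFFINITY    = "balanced"
)

# Confirm what the session now sees
Sys.getenv(c("MKL_MIC_ENABLE", "OFFLOAD_DEVICES", "MIC_OMP_NUM_THREADS"))
```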

Below is a sample program I used to benchmark Maple 2015, R and RRO on my Gamer computer and my Workstation. Between the time I started this project and now, Maple upgraded their program to Maple 2015. The big breakthrough is that Maple now does parallel processing. So, I ran the program below using Maple 2015 to see how it compares to R and RRO. (I uninstalled Maple 18 in anger.) I also ran the same program on my Workstation PC to see how well the Phi coprocessors worked. Once I had everything enabled, I didn’t want to disable anything. So, I just have the one, VERY IMPRESSIVE, time for my workstation.

require("expm")

options(digits=22)

a=10000

b=0.000000001

c=matrix(0,a,a)

for ( i in 1:a){c[i,i] = 1-1.75*b}

for ( i in 2:a){c[i-1,i] = b}

for ( i in 2:a){c[i,i-1] = 0.75*b}

c[1,1]=1-b

c[a,a]=1

c[a,a-1]=0

system.time(e <- c%^%100)
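For readers who want to experiment without the expm package, a matrix power can be written in base R with repeated squaring, which needs on the order of log2(k) matrix multiplications instead of k - 1. The sketch below uses a 5x5 version of the matrix built above, with a larger b so the numbers are visible; `mat_pow()` is a hypothetical helper and no claim is made that it matches how `%^%` is implemented.

```r
# Matrix power by repeated squaring (base R only)
mat_pow <- function(m, k) {
  stopifnot(is.matrix(m), nrow(m) == ncol(m), k >= 0, k == floor(k))
  result <- diag(nrow(m))                 # identity matrix, i.e. m^0
  while (k > 0) {
    if (k %% 2 == 1) result <- result %*% m
    m <- m %*% m                          # square for the next binary digit of k
    k <- k %/% 2
  }
  result
}

# Small version of the chain matrix from the benchmark (a = 5, b = 0.1)
a <- 5
b <- 0.1
p <- matrix(0, a, a)
for (i in 1:a) p[i, i] <- 1 - 1.75 * b
for (i in 2:a) p[i - 1, i] <- b
for (i in 2:a) p[i, i - 1] <- 0.75 * b
p[1, 1] <- 1 - b
p[a, a] <- 1
p[a, a - 1] <- 0

p5 <- mat_pow(p, 5)
rowSums(p5)   # each row of a power of a row-stochastic matrix still sums to 1
```

Every one of those multiplications is a dense matrix product, which is exactly the operation a multi-threaded BLAS speeds up.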

By using RRO instead of R, I got my results 3.12 hours faster. Considering the fact that I have several dozen more calcs like this one, saving 3hrs per calc is wonderful ;-) By using RRO instead of Maple 2015, I saved about 41 mins. By using RRO with the Phi’s on my Workstation PC, I was done in 187.3s. I saved an additional 39 mins over my Gamer Computer! When I ran my full program, it took under an hour. Compared to the days/weeks for my smaller calculations, an hour is awesome!

An interesting note on the Intel MKL: it only uses physical cores, not hyperthreads, on the main processor. I’m not sure how it handles the threads on the Phi coprocessors. So, my Intel Xeon processor only showed 50% usage.

Now, your big question is, “Why should I care?” I ran a 10,000x10,000 matrix and raised it to unbelievably high powers. I used a brute force method to do it. Suppose that you are doing “Big Data” analysis and you have 30 columns by 2,000,000 rows. If you run a linear regression on that data, your software will use a pseudoinverse to calculate the coefficients of your regression. Part of the pseudoinverse involves multiplying your 30x2,000,000 matrix by a 2,000,000x30 matrix, and it’s all parallelizable! Squaring my matrix uses about 1.00x10^{12} operations (10,000^3, assuming I have my Big O calculation correct). The pseudoinverse of your matrix uses a mere 1.80x10^{9} operations (30 x 30 x 2,000,000).
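The two operation counts can be reproduced with a few lines of arithmetic, and the regression step can be tried at a small scale. The sketch below assumes the simple cost model the paragraph uses (n^3 for squaring an n x n matrix, p x p x n for the cross-product of an n x p data matrix) and solves the normal equations directly on simulated data. The variable names and the simulated data are illustrative only. Note that R's own lm() uses a QR decomposition rather than forming this product, so this is just one way to arrive at the coefficients.

```r
# Operation counts under a simple cubic cost model.
n_square <- 10000
ops_square <- n_square^3                   # squaring a 10,000 x 10,000 matrix

n_rows <- 2e6
n_cols <- 30
ops_crossprod <- n_cols * n_cols * n_rows  # (30 x 2e6) times (2e6 x 30)

ops_square / ops_crossprod                 # how much more work the squaring is

# Small simulated regression (sizes reduced so it runs instantly).
set.seed(1)
x <- cbind(1, matrix(rnorm(2000 * 3), 2000, 3))  # intercept + 3 predictors
beta_true <- c(2, -1, 0.5, 3)
y <- drop(x %*% beta_true) + rnorm(2000, sd = 0.1)

# Normal equations: solve (X'X) beta = X'y. X'X is only 4 x 4 here.
beta_hat <- drop(solve(crossprod(x), crossprod(x, y)))

# Compare with R's QR-based least squares fit.
beta_qr <- unname(lm.fit(x, y)$coefficients)
max(abs(beta_hat - beta_qr))
```

The point of the comparison is that the expensive step, crossprod(x), produces a tiny p x p matrix no matter how many rows the data has, and that product is exactly the kind of dense linear algebra a multi-threaded math library speeds up.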

Some of my friends who do these sorts of “Big Data” calculations using the series method built into basic R or SAS tell me that they take hours (1-2) to complete. With my workstation, I have the computational power of 17 servers that use the same Xeon processor as mine. That calculation would take me way less than a minute.

Behold, the power of parallel processing!

*by Bill Jacobs, Director Technical Sales, Microsoft Advanced Analytics*

Without missing a beat, the engineers at Revolution Analytics have brought another strong release to users of Revolution R Enterprise (RRE). Just a few weeks after the acquisition of Revolution Analytics by Microsoft, RRE 7.4 was released to customers on May 15, adding new capabilities, enhanced performance and security, and faster and simpler Hadoop editions.

New features in version 7.4 include:

- Addition of Naïve Bayes Classifiers to the ScaleR library of algorithms
- Optional coefficient tracking for stepwise regressions. Coefficient tracking makes stepwise less of a “black box” by illustrating how model features are selected as the algorithm iterates toward a final reduced model.
- Faster import of wide data sets, faster computation of data summaries, and faster fitting of tree-based models and predictions of decision forests and gradient boosted tree algorithms.
- Support for HDFS file caching on Hadoop that speeds analysis of most files, especially when applying multi-step and iterative algorithms.
- Improved tools for distributing R packages across Cloudera Hadoop clusters.
- An updated edition of R, version 3.1.3.
- Certification of RRE 7.4 on Cloudera CDH 5.2 and 5.3, Hortonworks HDP 2.2 and MapR 4.0.2, along with certification of the much requested CentOS as a supported Linux platform.

For RRE users integrating R with enterprise apps, RRE's included DeployR integration server now includes:

- New R Session Process Controls that provide fine-grained file access controls on Linux.
- Support for external file repositories including git and svn for managing scripts and metadata used by DeployR.
- Strengthened password handling to resist recent attack vectors.
- Updates to the Java, JavaScript and .NET broker frameworks and corresponding client libraries.
- A new DeployR command line tool that will grow in capability with subsequent releases.

More broadly, the release of version 7.4 so shortly after the acquisition of Revolution by Microsoft underscores our commitment to delivering an expanding array of enterprise-capable R platforms. It also demonstrates Microsoft’s growing commitment to advanced analytics facilities that leverage and extend open source technologies such as the R language.

Details of the new features in RRE 7.4 can be found in the release notes here.

Details of improvements to DeployR integration server in RRE 7.4 can be found here.

R is coming to SQL Server. SQL Server 2016 (which will be in public preview this summer) will include new real-time analytics, automatic data encryption, and the ability to run R within the database itself:

For deeper insights into data, SQL Server 2016 expands its scope beyond transaction processing, data warehousing and business intelligence to deliver advanced analytics as an additional workload in SQL Server with proven technology from Revolution Analytics. We want to make advanced analytics more accessible and increase performance for your advanced analytic workloads by bringing R processing closer to the data and building advanced analytic capabilities right into SQL Server. Additionally, we are building PolyBase into SQL Server, expanding the power to extract value from unstructured and structured data using your existing T-SQL skills. With this wave, you can then gain faster insights through rich visualizations on many devices including mobile applications on Windows, iOS and Android.

With this update, data scientists will no longer need to extract data from SQL Server via ODBC to analyze it with R. Instead, you will be able to take your R code to the data, where it will be run inside a sandbox process within SQL Server itself. This eliminates the time and storage required to move the data, and gives you all the power of R and CRAN packages to apply to your database.

At last week's Microsoft Ignite conference in Chicago, SQL Server program managers Lindsey Allen and Borko Novakovic demonstrated a prototype of running R within SQL Server. (A description of the integration begins at 57:00, and the demo at 1:05:00, in the video below.) In the demo, Lindsey applies a Naive Bayes classification model (from the e1071 R package) to the famous Iris data, using the same R code used in this Azure ML Studio experiment.
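To give a feel for what such a model does, here is a base R stand-in for the classifier. It is not the code shown in the session, which used the e1071 package; it is a minimal Gaussian naive Bayes sketch written from scratch so that it runs without any add-on packages. The helper names fit_gaussian_nb and predict_gaussian_nb are made up for this illustration.

```r
# Hypothetical helper: per-class priors, feature means and standard deviations.
fit_gaussian_nb <- function(x, y) {
  classes <- levels(y)
  list(
    classes = classes,
    prior = as.numeric(table(y)[classes]) / length(y),
    mean = sapply(classes, function(k) colMeans(x[y == k, , drop = FALSE])),
    sd = sapply(classes, function(k) apply(x[y == k, , drop = FALSE], 2, sd))
  )
}

# Hypothetical helper: pick the class with the highest log posterior,
# treating features as independent normals within each class.
predict_gaussian_nb <- function(model, x) {
  x <- as.matrix(x)
  log_post <- do.call(cbind, lapply(seq_along(model$classes), function(j) {
    ll <- rep(log(model$prior[j]), nrow(x))
    for (f in seq_len(ncol(x))) {
      ll <- ll + dnorm(x[, f], model$mean[f, j], model$sd[f, j], log = TRUE)
    }
    ll
  }))
  factor(model$classes[max.col(log_post, ties.method = "first")],
         levels = model$classes)
}

x <- as.matrix(iris[, 1:4])
model <- fit_gaussian_nb(x, iris$Species)
pred <- predict_gaussian_nb(model, x)

# Confusion table and accuracy on the training data.
table(predicted = pred, actual = iris$Species)
accuracy <- mean(pred == iris$Species)
accuracy
```

Because the accuracy here is measured on the same rows used for fitting, it is an optimistic estimate; a held-out split would be the next step in a real analysis.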

SQL Server 2016 is the first Microsoft product to integrate Revolution R (and there are more exciting announcements on that front to come -- stay tuned). This also brings R to other Microsoft products via their native SQL Server integration, including Excel and PowerBI. Read more about the features coming to SQL Server 2016, including Revolution R integration, at the link below.

SQL Server Blog: SQL Server 2016 public preview coming this summer

Revolution R Open 8.0.3 is now available for download for Windows, OS X, Red Hat, Ubuntu and OpenSUSE. This release includes several updates: it upgrades RRO to the R 3.1.3 engine, which adds several new features to the R language, adds support for Ubuntu 15.04, and updates the checkpoint package for reproducibility.

RRO is designed to work with multi-threaded math libraries on all platforms, which improve R's performance for some numerically-intensive applications. Note that to enable the multi-threaded performance on Windows and Linux systems you have to download and install the Intel MKL (Math Kernel Library) separately. (Also, if you installed RRO 8.0.2 for Windows, be sure to uninstall the previous MKL first.) On OS X, RRO uses the Mac's native Accelerate framework automatically.

The default CRAN mirror for Revolution R Open 8.0.3 is a snapshot taken on April 1, 2015. (You can always access the latest packages with the checkpoint function.) Many packages have been updated or released since version 8.0.2; check out the highlights in the latest Package Spotlight.

We're already at work on the next release of RRO, which will feature the R 3.2.0 engine. Stay tuned!

MRAN: Download Revolution R Open

Bay Area engineer Vineet Abraham recently ran some benchmarks for Revolution R Open (RRO) running on Mac OS X and on Ubuntu. Thanks to the multi-threaded processing capabilities of RRO, several operations ran much faster than R downloaded from CRAN, without having to change any code:

For the most part, RRO performs significantly faster than standard R both locally and on the server. RRO performs really well on the matrix operations as seen in column group mm (over 90% faster than standard R); this is probably due to the addition of the Intel Math Kernel library.

(In fact, while the Intel MKL is used on Ubuntu, on OS X the standard Accelerate Framework provides the multi-threading capability, with similar results.) As Vineet's benchmarks show, RRO doesn't improve things for every benchmark, but with some mathematically-intensive operations the difference can be dramatic.

On a related note, I've been doing some benchmarks on RRO 8.0.3 (based on R 3.1.3), due to be released very soon. On my 2-core Surface Pro (yes, it runs fine on a Surface), using multi-threading reduced the computation time for the Urbanek benchmarks from 32 seconds to 8 seconds.
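A comparison along these lines is easy to try. The sketch below times a few dense matrix operations in base R with system.time(); running the same script under CRAN R and under RRO would show whatever difference the multi-threaded libraries make on a given machine. This is not the Urbanek benchmark suite, just a minimal illustration, and the matrix size and the helper name time_op are arbitrary choices.

```r
# Hypothetical helper: elapsed seconds for one evaluation of an expression.
time_op <- function(expr) {
  unname(system.time(expr)["elapsed"])
}

set.seed(42)
n <- 500
m <- matrix(rnorm(n * n), n, n)

# Each entry is the wall-clock time for one dense linear algebra operation.
timings <- c(
  multiply  = time_op(m %*% m),
  crossprod = time_op(crossprod(m)),
  solve     = time_op(solve(m))
)
timings
```

A single run at this size is noisy, so for a fair comparison it helps to increase n and repeat each operation several times before comparing the two R builds.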

Numbr Crunch: Benchmarking R/RRO in OSX and Ubuntu on the cloud