Via Gizmodo, this generic template for a AAA movie trailer recalls that generic brand video from a couple of years back.
That's all for us for this week. Have a great weekend, we'll be back on Monday!
Posted by David Smith at 15:39 in random | Permalink | Comments (0)
Making your code run faster is often the primary goal when using parallel programming techniques in R, but sometimes the effort of converting your code to use a parallel framework leads only to disappointment, at least initially. Norman Matloff, author of Parallel Computing for Data Science: With Examples in R, C++ and CUDA, has shared chapter 2 of that book online, and it describes some of the issues that can lead to poor performance, chief among them the overhead (such as communication between processes) that parallel execution itself introduces.
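To see how that overhead can swamp any gains, here is a minimal sketch (not taken from the book or the talk) using base R's parallel package: the task is so small that farming it out to worker processes costs more than simply computing it sequentially.

library(parallel)

# a trivially cheap task: the cost of shipping data to the workers
# and collecting the results dominates the computation itself
x <- 1:1e6
cl <- makeCluster(2)

system.time(lapply(x, sqrt))         # sequential
system.time(parLapply(cl, x, sqrt))  # parallel, and typically slower here

stopCluster(cl)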
The chapter is well worth a read for anyone writing parallel code in R (or indeed any programming language). It's also worth checking out Norm Matloff's keynote from the useR!2017 conference, embedded below.
Norm Matloff: Understanding overhead issues in parallel computation
Posted by David Smith at 12:18 in high-performance computing, R | Permalink | Comments (0)
The first "official" version of R, version 1.0.0, was released on February 29, 2000. But the R Project had already been underway for several years before then. Sharing this tweet, from yesterday, from R Core member Peter Dalgaard:
It was twenty years ago today, Ross Ihaka got the band to play.... #rstats pic.twitter.com/msSpPz2kyA
— Peter Dalgaard (@pdalgd) August 16, 2017
Twenty years ago, on August 16 1997, the R Core Group was formed. Before that date, the committers to R were the project's founders Ross Ihaka and Robert Gentleman along with Martin Maechler, and also Luke Tierney, Heiner Schwarte and Paul Murrell. The email above was the invitation for Kurt Hornik, Peter Dalgaard and Thomas Lumley to join as well. With the sole exception of Schwarte, all of the above remain members of the R Core Group, which has since expanded to 21 members. These are the volunteers who implement the R language and its base packages; document, build, test and release it; and manage all the infrastructure that makes that possible.
Thank you to all the R Core Group members, past and present!
Updated August 19 to add Martin Maechler who was omitted in an editing error. Apologies to Martin!
Posted by David Smith at 14:03 in announcements, R | Permalink | Comments (3)
Microsoft Cognitive Services provides several APIs for image recognition, but if you want to build your own recognizer (or create one that works offline), you can use the new Image Featurizer capabilities of Microsoft R Server.
Training an image recognition system requires LOTS of images — millions and millions of them. It involves feeding those images into a deep neural network, and during that process the network generates "features" from the image. These features might be versions of the image including just the outlines, or maybe the image with only the green parts. You could further boil those features down into a single number, say the length of the outline or the percentage of the image that is green. With enough of these "features", you could use them in a traditional machine learning model to classify the images, or perform other recognition tasks.
But if you don't have millions of images, it's still possible to generate these features from a model that has already been trained on millions of images. ResNet is a very deep neural network model trained for the task of image recognition which has been used to win major computer-vision competitions. With the rxFeaturize
function in Microsoft R Client and Microsoft R Server, you can generate 4096 features from this model on any image you provide. The features themselves are meaningful only to a computer, but that vector of 4096 numbers between zero and one is (ideally) a distillation of the unique characteristics of that image as a human would recognize it. You can then use that features vector to create your own image-recognition system without the burden of training your own neural network on a large corpus of images.
On the Cortana Intelligence and ML blog, Remko de Lange provides a simple example: given a collection of 60-or-so images of chairs like those below, how can you find the image that most looks like a deck chair?
First, you need a representative image of a deck chair:
Then, you calculate the features vector for that image using rxFeaturize.
imageToMatchDF <- rxFeaturize(
    data = data.frame(Image="deck-chair.jpg"),
    mlTransforms = list(
        loadImage(vars = list(Features = "Image")),
        resizeImage(vars = "Features", width = 227, height = 227),
        extractPixels(vars = "Features"),
        featurizeImage(var = "Features", dnnModel = "alexnet")
    ))
Note that when featurizing an image, you need to shrink it down to a specified size (the built-in function resizeImage
handles that). There are also several pretrained models to choose from: three variants of ResNet and also Alexnet, which we use here. This gives us a features vector of 4096 numbers to represent our deck chair image. Then, we just need to use the same process to calculate the features vector for our 60 other images, and find the one that's closest to our deck chair. We can simply use the dist
function in R to do that, for example, and that's exactly what Remko's script on Github does. The image with the closest features vector to our representative image is this one:
So, even with a relatively small collection of images, it's possible to build a capable image recognition system using image featurization and the pretrained neural networks provided with Microsoft R. The complete code and sample images used in this example are available on Github. (Note, you'll need to have a license for Microsoft R Server, or install the free Microsoft R Client with the pretrained models option, to use the image featurization functions in the script.) And for more details on creating this recognizer, check out the blog post below.
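As a rough sketch of what that matching step might look like in code (the chairs/ directory, its file names, and the "Features" column prefix here are assumptions for illustration; Remko's actual script is linked above):

# featurize the candidate images with the same transform pipeline as before
chairs <- data.frame(Image = list.files("chairs", full.names = TRUE),
                     stringsAsFactors = FALSE)
chairsDF <- rxFeaturize(
    data = chairs,
    mlTransforms = list(
        loadImage(vars = list(Features = "Image")),
        resizeImage(vars = "Features", width = 227, height = 227),
        extractPixels(vars = "Features"),
        featurizeImage(var = "Features", dnnModel = "alexnet")
    ))

# distance from the deck chair's features vector to each candidate's
featCols <- grep("^Features", names(chairsDF), value = TRUE)
d <- as.matrix(dist(rbind(imageToMatchDF[, featCols], chairsDF[, featCols])))[1, -1]

# the candidate image closest to the deck chair
chairs$Image[which.min(d)]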
Cortana Intelligence and Machine Learning Blog: Find Images With Images
Posted by David Smith at 14:43 in Microsoft, R | Permalink | Comments (0)
Last year, BuzzFeed broke the story that US law enforcement agencies were using small aircraft to observe points of interest in US cities, thanks to an analysis of public flight-records data. The data journalism team no doubt realized that the Flightradar24 data set held many more stories of public interest, but the challenge lay in separating routine, day-to-day air traffic from the more unusual, covert activity.
So they trained an artificial intelligence model to identify unusual flight paths in the data. The model, implemented in the R programming language, applies a random forest algorithm to flag flight patterns similar to those of the covert aircraft identified in their earlier "Spies in the Skies" story. When that model was applied to the almost 20,000 flights in the Flightradar24 dataset, about 69 planes were flagged as possible surveillance aircraft. Several of those were false positives, but further journalistic inquiry into the provenance of the registrations led to several interesting stories.
Using this model, BuzzFeed News identified several surveillance aircraft in action during a four-month period in late 2015. These included a spy plane operated by US Marshals to hunt drug cartels in Mexico; aircraft covertly registered to US Customs and Border Protection patrolling the US-Mexico border; and a US Navy contractor operating planes circling several points over land in the San Francisco Bay Area, ostensibly for harbor porpoise research.
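BuzzFeed News published their actual feature engineering and model code in the GitHub repository linked below. Purely as a generic illustration of the technique (with made-up feature names and simulated data, not their dataset or model), a random forest classifier in R looks something like this:

library(randomForest)

# hypothetical per-flight features summarising each aircraft's behaviour
set.seed(42)
flights <- data.frame(
    turning_rate = runif(1000),                     # how tightly the track circles
    speed        = rnorm(1000, mean = 120, sd = 30),
    altitude     = rnorm(1000, mean = 5000, sd = 1500),
    duration     = rexp(1000, rate = 1/3),
    surveil      = factor(sample(c("yes", "no"), 1000, replace = TRUE,
                                 prob = c(0.05, 0.95)))
)

# fit the classifier on labelled flights, then score candidates
fit <- randomForest(surveil ~ ., data = flights, ntree = 500)
head(predict(fit, flights, type = "prob"))   # probability of being a surveillance flight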
You can learn more about the stories Buzzfeed News uncovered in the flight data here, and for details on the implementation of the AI model in R, follow the link below.
Github (Buzzfeed): BuzzFeed News Trained A Computer To Search For Hidden Spy Planes. This Is What We Found.
Posted by David Smith at 15:05 in applications, current events, data science, R | Permalink | Comments (2)
Timo Grossenbacher, data journalist with Swiss Radio and TV in Zurich, had a bit of a surprise when he attempted to recreate the results of one of the R Markdown scripts published by SRF Data to accompany their data journalism story about vested interests of Swiss members of parliament. When he re-ran the analysis last week, the results differed from those published in August 2015. Neither the R scripts nor the data had changed in the intervening two years, so what caused the results to be different?
The version of R Timo was using had been updated, but that wasn't the root cause of the problem. What had also changed was the version of the dplyr package used by the script: version 0.5.0 now, versus version 0.4.2 then. For some unknown reason, a change in the dplyr package in the intervening period caused some data rows (shown in red above) to be deleted during the data preparation process, and so the results changed.
Timo was able to recreate the original results by forcing the script to run with package versions as they existed back in August 2015. This is easy to do with the checkpoint package: just add a line like
library(checkpoint); checkpoint("2015-08-11")
to the top of your R script. We have been taking daily snapshots of every R package on CRAN since September 2014 to address exactly this situation, and the checkpoint package makes it super-easy to find and install all of the packages you need to make your script reproducible, without changing your main R installation or affecting any other projects you may have. (The checkpoint package is available on CRAN, and also included with all editions of Microsoft R.)
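In practice, that just means pinning the package versions at the top of the script before any packages are loaded; a minimal sketch (the package names below are only examples):

# pin every CRAN package used in this project to its version as of this date
library(checkpoint)
checkpoint("2015-08-11")

# these now come from the 2015-08-11 snapshot, not your current library
library(dplyr)
library(ggplot2)

# ... the rest of the analysis ...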
I've been including a call to checkpoint
at the top of most of my R scripts for several years now, and it's saved me from failing scripts many times. Likewise, Timo has created a structure and process to support truly reproducible data analysis with R, which advocates using the checkpoint package to manage package versions. You can find a description of the process here: A (truly) reproducible R workflow, and find the template on Github.
By the way, SRF Data — the data journalism arm of the national broadcaster in Switzerland — has published some outstanding stories over the past few years, and has even been nominated for Data Journalism Website of the Year. At the useR!2017 conference earlier this year, Timo presented several fascinating insights into the data journalism process at SRF Data, which you can see in his slides and talk (embedded below):
Timo Grossenbacher: This is what happens when you use different package versions, Larry! and A (truly) reproducible R workflow
Posted by David Smith at 15:04 in applications, data science, packages, R | Permalink | Comments (0)
I haven't seen the Dunkirk movie yet, but the video below makes me want to see it soon. It turns out it contains an auditory illusion: the "Shepard Tone", which sounds like it's continually rising but really isn't. (Many of Christopher Nolan's past movies have included it as well.) There's more explanation in the accompanying article at Vox.
That's it from the blog for this week. Have a great weekend, and see you back here on Monday.
Posted by David Smith at 13:03 in random | Permalink | Comments (0)
In case you missed them, here are some articles from July of particular interest to R users.
A tutorial on using the rsparkling package to apply H2O's algorithms to data in HDInsight.
Several exercises to learn parallel programming with the foreach package.
A presentation on the R6 class system, by Winston Chang.
Introducing "joyplots", a ggplot2 extension for visualizing multiple time series or distributions (with a nod to Joy Division).
SQL Server 2017, with many new R-related capabilities, is nearing release.
Ali Zaidi on using neural embeddings with R and Spark to analyze Github comments.
R ranks #6 in the 2017 IEEE Spectrum Top Programming Languages.
Course materials on "Data Analysis for the Life Sciences", from Rafael Irizarry.
How to securely store API keys in R scripts with the "secret" package.
An in-depth tutorial on implementing neural network algorithms in R.
Recordings from the useR!2017 conference in Brussels are now available. The conference was also livestreamed.
Recording of my lightning talk, "R in Minecraft".
Uwe Ligges' useR!2017 keynote, "20 years of CRAN".
A framework for implementing a credit risk prediction system with Microsoft R.
High school students build a facial-recognition drone using Microsoft Cognitive Services and Raspberry Pi.
The R Consortium is conducting a survey of R users.
The R GUI framework "rattle" now supports XGBoost models.
My presentation at useR!2017, "R, Then and Now", on how perceptions of R have changed over the years.
Seven Microsoft customers using R for production applications.
And there were some general interest stories (not necessarily related to R) as well.
As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.
Posted by David Smith at 08:08 in R, roundups | Permalink | Comments (0)
I’m happy to announce that version 0.10.0 beta of the dplyrXdf package is now available. You can get it from Github:
install_github("RevolutionAnalytics/dplyrXdf", build_vignettes=FALSE)
This is a major update to dplyrXdf. The headline changes, described below, are support for the tidyeval framework introduced in dplyr 0.7, new functions for managing Xdf files, and support for working with data and files in HDFS.
This (pre-)release of dplyrXdf requires Microsoft R Server or Client version 8.0 or higher, and dplyr 0.7 or higher. If you’re using R Server, dplyr 0.7 won’t be in the MRAN snapshot that is your default repo, but you can get it from CRAN:
install.packages("dplyr", repos="https://cloud.r-project.org")
The biggest change is support for dplyr 0.7's new tidyeval framework, which completely changes the way in which dplyr handles standard evaluation. Previously, if you wanted to program with dplyr pipelines, you had to use special versions of the verbs ending with "_": mutate_, select_, and so on. You then provided inputs to these verbs via formulas or strings, in a way that was almost but not quite entirely unlike normal dplyr usage. For example, if you wanted to programmatically carry out a transformation on a given column in a data frame, you did the following:
x <- "mpg" transmute_(mtcars, .dots=list(mpg2=paste0("2 * ", x))) # mpg2 #1 42.0 #2 42.0 #3 45.6 #4 42.8 #5 37.4
This is prone to errors, since it requires creating a string and then parsing it. Worse, it's also insecure, as you can't always guarantee that the input string won't be malicious.
The tidyeval framework replaces all of that. In dplyr 0.7, you call the same functions for both interactive use and programming. The equivalent of the above in the new framework would be:
# the rlang package implements the tidyeval framework used by dplyr
library(rlang)
x_sym <- sym(x)
transmute(mtcars, mpg2=2 * (!!x_sym))
# mpg2
#1 42.0
#2 42.0
#3 45.6
#4 42.8
#5 37.4
Here, the !!
symbol is a special operator that means to get the column name from the variable to its right. The verbs in dplyr 0.7 understand the special rules for working with quoted symbols introduced in the new framework. The same code also works in dplyrXdf 0.10:
# use the new as_xdf function to import to an Xdf file
mtx <- as_xdf(mtcars)
transmute(mtx, mpg2=2 * (!!x_sym)) %>% as.data.frame
# mpg2
#1 42.0
#2 42.0
#3 45.6
#4 42.8
#5 37.4
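The same machinery also makes it easy to wrap a pipeline in a function of your own: capture the column the caller supplies with enquo() and splice it back in with !!. A small sketch (run on a plain data frame, and not taken verbatim from the vignettes):

double_column <- function(data, col) {
    col <- enquo(col)                     # capture the unevaluated column name
    transmute(data, result = 2 * (!!col))
}

head(double_column(mtcars, mpg), 3)
# result
#1   42.0
#2   42.0
#3   45.6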
For more information about tidyeval, see the dplyr vignettes on programming and compatibility.
The following functions let you manipulate Xdf files as files:

copy_xdf and move_xdf copy and move an Xdf file, optionally renaming it as well.
rename_xdf does a strict rename, ie without changing the file's location.
delete_xdf deletes the Xdf file.

The following functions let you transfer files and datasets to and from HDFS, for working with a Spark or Hadoop cluster:
copy_to uploads a dataset (a data frame or data source object) from the native filesystem to HDFS, saving it as an Xdf file.
collect and compute do the reverse, downloading an Xdf file from HDFS.
hdfs_upload and hdfs_download transfer arbitrary files and directories to and from HDFS.

Uploading and downloading works (or should work) both from the edge node and from a remote client. The interface is the same in both cases: no need to remember when to use rxHadoopCopyFromLocal and rxHadoopCopyFromClient. The hdfs_* functions mostly wrap the rxHadoop* functions, but also add extra functionality in some cases (eg vectorised copy/move, test for directory existence, etc).
The following functions are for file management in HDFS, and mirror similar functions in base R for working with the native filesystem:

hdfs_dir lists files in a HDFS directory, like dir() for the native filesystem.
hdfs_dir_exists and hdfs_file_exists test for existence of a directory or file, like dir.exists() and file.exists().
hdfs_file_copy, hdfs_file_move and hdfs_file_remove copy, move and delete files in a vectorised fashion, like file.copy(), file.rename() and unlink().
hdfs_dir_create and hdfs_dir_remove make and delete directories, like dir.create() and unlink(recursive=TRUE).
in_hdfs returns whether a data source is in HDFS or not.

As far as possible, the functions avoid reading the data via rxDataStep and so should be more efficient. The only times when rxDataStep is necessary are when importing from a non-Xdf data source, and converting between standard and composite Xdfs.
as_xdf imports a dataset or data source into an Xdf file, optionally as composite. as_standard_xdf and as_composite_xdf are shortcuts for creating standard and composite Xdfs respectively.
is_xdf and is_composite_xdf return whether a data source is a (composite) Xdf.
local_exec runs an expression in the local compute context: useful for when you want to work with local Xdf files while connected to a remote cluster.

For more information, check out the package vignettes, in particular "Using the dplyrXdf package" and "Working with HDFS".
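As a quick taste of the conversion and inspection functions (a minimal sketch based only on the descriptions above; see the vignettes for the documented usage):

mtc <- as_composite_xdf(mtcars)    # import a data frame as a composite Xdf
is_xdf(mtc)                        # TRUE
is_composite_xdf(mtc)              # TRUE
in_hdfs(mtc)                       # FALSE: this file is in the native filesystem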
dplyrXdf 0.10 is tentatively scheduled for a final release at the same time as the next version of Microsoft R Server, or shortly afterwards. In the meantime, please download this and give it a try; if you run into any bugs, or if you have any feedback, you can email me or log an issue at the Github repo.
Posted by Hong Ooi at 09:30 in big data, Microsoft, open source, packages, R | Permalink | Comments (2)
by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)
Microsoft's Cognitive Toolkit (better known as CNTK) is a commercial-grade, open-source framework for deep learning tasks. At present CNTK does not have a native R interface, but it can be accessed through Keras, a high-level API that wraps various deep learning backends, including CNTK, TensorFlow, and Theano, behind a convenient, modular interface for building deep neural networks. The latest version of CNTK (2.1) supports Keras. The RStudio team has developed an R interface for Keras, making it possible to run different deep learning backends, including CNTK, from within an R session.
This tutorial illustrates how to quickly spin up a Ubuntu-based Azure Data Science Virtual Machine (DSVM) and configure a Keras and CNTK environment. An Azure DSVM is a curated virtual machine image that comes with an extensive collection of pre-installed open source data science tools. The Keras R package can readily be set up on the DSVM to experience the fun of deep learning.
The deployment of a DSVM is also largely simplified by a few R commands run from a local R session (on your own laptop, say), thanks to the AzureSMR and AzureDSVM packages for R. With an Azure subscription (visit Microsoft for a free trial subscription) and an initial set-up of an Azure Active Directory app with the authority to access and manage Azure resources, data scientists can use R to manage and operate selected Azure resources, including easily interacting with Azure DSVMs for data analysis jobs. A deploy-compute-destroy cycle is now trivial to orchestrate within R.
The following code snippets create an Azure resource group (a logical collection of related resources) and a Ubuntu DSVM of a specified size within the new resource group:
library(AzureSMR)
library(AzureDSVM)

# Authentication.
asc <- createAzureContext(tenantID = "<tenant_id>",
                          clientID = "<client_id>",
                          authKey = "<authentication_key>")
azureAuthenticate(asc)

# Create a resource group.
azureCreateResourceGroup(asc,
                         location = "southeastasia",
                         resourceGroup = "dsvmrg")

# Deploy a DSVM with specifications.
deployDSVM(asc,
           resource.group = "dsvmrg",
           size = "Standard_D4_v2",
           location = "southeastasia",
           hostname = "mydsvm",
           username = "myname",
           authen = "Password",
           password = "MyDSVM_123",
           mode = "Sync")
Once this DSVM is deployed it will cost approximately $0.25 per hour (pricing based on size and location). The DSVM can be powered down as required to reduce costs.
Note that the N-series VMs on Azure now include GPU devices. If a DSVM instance is deployed or resized to the N-series, Keras and CNTK will automatically activate GPU-based capabilities to accelerate model training.
Installation and configuration of Keras can be manually performed after a successful deployment of the DSVM. This post-deployment installation process can be accelerated via Azure Virtual Machine Extensions, which is also functionally available in AzureDSVM:
fileurl <- paste0("https://github.com/yueguoguo/Azure-R-Interface",
                  "/blob/master/demos/demo-5/script.sh")

addExtensionDSVM(asc,
                 location = "southeastasia",
                 resource.group = "dsvmrg",
                 hostname = "mydsvm",
                 os = "Ubuntu",
                 fileurl = fileurl,
                 command = "sudo sh script.sh")
The fileurl
and command
arguments point to the URL of a script file to download to the DSVM and the command to execute that script on the DSVM, respectively. A sample script for installing and configuring Keras and its R interface in Ubuntu Linux DSVM can be found here.
Once the script has been successfully run, the remote DSVM can be accessed via RStudio Server or through a remote desktop (e.g., X2Go). Try the following within R on the DSVM to see whether the Keras R interface is installed and CNTK is configured as its backend:
library(keras)
backend()
The output in the console should be "Using CNTK backend", indicating a successful set-up. The Keras R interface provides a set of examples to get started; more information can be found here. Note that it is also feasible to switch the backend from CNTK to another, such as TensorFlow or Theano, by following the instructions here.
A case study on solar power forecasting (reproducing CNTK tutorial 106B) is available here. It illustrates training a time series forecasting model using a Long Short-Term Memory (LSTM) network to predict solar power generation. The LSTM layer captures patterns and long-term dependencies in the historical series of solar power readings in order to predict the maximum total power generated on a specific day. The following chart compares the predictions with the true data.
The tutorial's model is a simplified version for ease of reproduction, but it can be improved further by training for more epochs and using a more sophisticated network topology.
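For a flavour of what such a model looks like in the Keras R interface, here is a minimal sketch (the layer sizes, input shape, and training data are made up for illustration and are not the tutorial's actual code):

library(keras)

# an LSTM that reads a window of 14 past readings (one feature each)
# and predicts a single value for the next day
model <- keras_model_sequential() %>%
    layer_lstm(units = 32, input_shape = c(14, 1)) %>%
    layer_dense(units = 1)

model %>% compile(loss = "mse", optimizer = "adam")

# x_train: array of dim (samples, 14, 1); y_train: numeric vector
# model %>% fit(x_train, y_train, epochs = 10, batch_size = 32)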
Have fun with Keras + CNTK in R!
Posted by Guest Blogger at 09:30 in advanced tips, Microsoft, R | Permalink | Comments (0)