A new report from analyst firm Gartner forecasts that IT organizations will spend $232 billion (US) on hardware, software and services related to Big Data through 2016. Some key findings from the report:
This certainly aligns with what we're seeing with our customers here at Revolution Analytics. Across the board, companies have moved from experimenting with Big Data to making significant investments in data architectures — particularly around Hadoop and Big Data appliances. Companies that have completed (or are at least well along in) the process of setting up the data infrastructure are now turning to solutions to analyze the data, to get value from the millions spent capturing and storing it. Encouragingly, the companies we've been working with are taking a forward-looking attitude to the analytics system: they focus not just on supporting ad-hoc analysis of Big Data, but on building data applications that run within the operational environment, delivering the results of advanced forecasting, estimation and classification analytics to decision makers in real time. It's this investment that drives the growth in software spending that Gartner forecasts in its report.
Gartner Press Releases: Gartner Says Big Data Will Drive $28 Billion of IT Spending in 2012
In a webinar today previewing Spotfire 5 (scheduled for release this November), TIBCO announced that it will include TERR: The Tibco Enterprise Runtime for R. TERR is a closed-source reimplementation of the R language engine, and not based on the GPL-licensed R project from the R Foundation. Here's the relevant slide from the webinar:
By making the TERR engine TIBCO intellectual property (IP), rather than using the open-source R engine, TIBCO claimed in the webinar to have been able to improve performance. Apparently, while some packages will run at about the same speed, others may run at 10x speed or even faster. This performance comes at the expense of compatibility: not all R functions or CRAN packages will work with the TERR engine, and it's not clear whether or at what rate TERR will follow R's development path.
The TERR engine will be included in the Spotfire Professional Client, and some new statistical interfaces (point-and-click regression and classification modeling dialogs) will make use of the engine. But if you want to use TERR-compatible R code in the Spotfire Web Player (to deploy beyond the local desktop), you'll also need a license for Spotfire Statistics Services.
Coincidentally, a second enterprise analytics vendor also announced integration with R today. Teradata's Big Data Appliance, which combines high-performance hardware with open-source Hadoop and Teradata Aster software, will include integration between the Hadoop engine and R. From the data sheet:
The SQL-MapReduce framework, created by Teradata Aster, allows developers to write powerful and highly expressive SQL-MapReduce functions in languages such as Java, C#, Python, C++, and R, and push them into the discovery platform for advanced in-database analytics.
Like the similar RHadoop project, Teradata's R integration works with the open-source R engine.
The fact that more enterprise software vendors are integrating with R is generally a good thing for the R community: it validates the power of R within organizations, and adds more options for bringing advanced analytical methods developed in R to production environments. I just hope that the introduction of a proprietary R language engine doesn't promote the fracturing of the vibrant R community — one of R's greatest strengths — and result in a fork of the R language into incompatible dialects.
If you live in the US, you've probably visited a Williams Sonoma store for gourmet food or quality cookware for the kitchen. And if you've shopped at Pottery Barn or West Elm stores for furniture, those chains are part of the Williams Sonoma stable as well. All three brands have major online stores, all supported by a sophisticated marketing operation.
Well, that marketing operation just got even more sophisticated. Williams Sonoma has teamed up with UpStream Software to implement advanced marketing analytics that can better target prospective customers, and help Williams Sonoma's marketing team better understand the effectiveness of their various marketing channels:
As well as cutting costs by not sending catalogs to unresponsive customers, the new technique is helping the retailer reallocate funds to more effective online marketing channels like e-mails and display ads. “We’ve seen our ability to target with the catalog improve using these techniques on a scale that we haven’t seen with any sort of small technical improvement,” says Mohan Namboodiri, vice president of customer analytics for Williams-Sonoma. “This is a qualitative improvement in our ability to target the right type of customer with the right type of messaging, and it’s not something that we’ve had available up to now.”
The underlying algorithms were developed by statisticians at UpStream using the R language and are deployed to production with Revolution R Enterprise. As described in this Revolution Analytics case study, UpStream draws data from dozens of distinct sources and uses Big Data statistical models to optimize marketing operations for Williams Sonoma and many other retail clients:
UpStream Software’s purpose-built application is a modern, high-performance big data analytics engine, scoring 50 million records per day for each of UpStream’s customers. It marries Revolution Analytics’ Big Data Analytics capabilities with Hadoop’s data management and computational power. Since no two clients are exactly alike, the statistical methods that underlie the analytical models can be customized to meet each client’s exact requirements.
From an analytics perspective, the company borrowed approaches from forward-thinking and more analytically mature industries. For example, UpStream adapted models used in the bioscience sector, where GAM (Generalized Additive Model) survival analysis techniques effectively measure differences in the outcomes in patients under different treatment regimens. However, many of the methods that UpStream wanted to use had not been designed for Big Data analytics. Using Revolution R Enterprise, which is based on the power of the R statistical platform, UpStream built a “big data analytics engine” utilizing multivariate statistics, time-to-event models and GAM survival analysis techniques.
You can see how UpStream implemented their solution with Hadoop and Revolution R Enterprise in this webinar replay, or read more at the links below.
Internet Retailer: Williams-Sonoma targets e-customers with a “treatment” approach
Linear programming is a mathematical technique for finding the values of a set of variables, subject to defined constraints, that maximize (or minimize) some quantity. For example, consider this problem from the FishyOperations blog:
A trading company is looking for a way to maximize profit per transportation of their goods. The company has a train available with 3 wagons. When stocking the wagons they can choose between 4 types of cargo, each with its own specifications. How much of each cargo type should be loaded on which wagon in order to maximize profit?
The quantity we want to optimize here is profit. The variables to consider are the cargo items selected to fill the wagons, each of which has its own volume and profit per tonne (and quantities aren't unlimited, either). The constraints are the weight and space capacities of the three wagons.
In fact, this is a mixed-integer linear programming problem: the cargo comes only in one-tonne lots, and can't be divided into fractional parts. Integer programming problems are notoriously difficult to solve, but the R package lpSolveAPI makes it easy with an R language interface to lp_solve, a powerful (and free!) mixed-integer program solver. (You can find a good overview of the lpSolveAPI package in this vignette.) Simply by defining the objective and constraints in an R language script (for example, that the cargo loaded into wagon 1 can't exceed its weight and space capacities), the solve function finds the optimal solution to the model (expressed graphically below):
The maximum profit of $107,500 comes from shipping 5 tonnes of Cargo 2 and Cargo 3 in wagon 1, 8 tonnes of Cargo 4 in wagon 2, and 12 tonnes of Cargo 4 in wagon 3 (and not shipping any of Cargo 1 at all).
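The general lpSolveAPI pattern is straightforward: build an empty model, set the objective, add constraints, mark variables as integer, and solve. Here's a minimal sketch with made-up numbers (a toy two-variable problem, not the wagon data above):

```
# Toy mixed-integer LP with lpSolveAPI (illustrative numbers only;
# see the linked blog post for the full wagon/cargo model)
library(lpSolveAPI)

# Two integer decision variables x1, x2; maximize profit 3*x1 + 2*x2
lp <- make.lp(nrow = 0, ncol = 2)
lp.control(lp, sense = "max")
set.objfn(lp, c(3, 2))

# Constraints: a "weight" limit and a "space" limit
add.constraint(lp, c(1, 1), "<=", 4)   # x1 + x2   <= 4
add.constraint(lp, c(2, 1), "<=", 5)   # 2*x1 + x2 <= 5

# Cargo comes in whole-tonne lots: require integer solutions
set.type(lp, 1, "integer")
set.type(lp, 2, "integer")

solve(lp)            # returns 0 on success
get.objective(lp)    # the maximum objective value
get.variables(lp)    # the optimal x1, x2
```

The full wagon problem follows the same pattern, just with one variable per (cargo type, wagon) pair and one constraint per wagon capacity and cargo availability.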
For more details of how the lpSolveAPI package can be used to solve mixed-integer linear programming problems (including the R script implementing this example), see the full blog post linked below.
We can add SAP to the list of vendors offering R integration with their products. InformationWeek reports that the new SAP BusinessObjects Predictive Analysis module provides a graphical user interface to R. Created in reaction to "competitive and market forces, including the momentum of open source R", the new module provides in-database processing (presumably by embedding R within the SAP HANA in-memory database appliance, as described in this SAP presentation) and offers "mainstream analytics" (presumably built-in analytic methods, like those included with the IBM Netezza appliance).
SAP BusinessObjects Predictive Analysis is currently available only to a limited set of clients and expected to be generally available later this year.
The past couple of years have seen a dramatic growth in the use of the R language in the enterprise. R has always been pervasive in academia for research and teaching in statistics and data science, and as new graduates trained in R have migrated to the workplace the demand for R in corporations has become more and more intense.
Database vendor Oracle estimates that "R has attracted over two million users since its introduction". James Kobielus, noted Forrester analyst and predictive analytics expert, recently said in PCWorld that "R has become a real ubiquitous force in advanced analytics. It's everywhere. Enterprise adoption of it has been growing steadily. When we ask our customers what they're using for statistical modeling they'll say SAS or [IBM's] SPSS, but they increasingly say R in the same breath."
With the rapid growth of the use of R in the enterprise comes a corresponding increase in demand for enterprise support for R from its users, and demand for integration of R into corporate systems from IT (both areas in which Revolution Analytics provides software and expertise). Most large organizations have a sophisticated infrastructure devoted to data analysis, with an "analytics stack" of software to provide data warehousing and query, predictive analytics, reporting, presentation, and Business Intelligence (BI). As a result, software vendors at every layer of this stack have added functionality to integrate R, accommodating demand from both users and IT and serving the needs of data-driven business decision makers. Let's take a look at some of the applications within the analytics stack that now provide integration with R.
The data layer is where the lifeblood of the analysis — the data — is stored and prepared. Especially for high-performance and big-data applications, analytics based in R can benefit from the infrastructure that the data layer provides. The IBM blog has a great post with an in-depth discussion about integrating R with the data layer.
IBM Netezza, the high-performance data-warehousing appliance, is integrated with R (in partnership with Revolution Analytics). R users can use Revolution R Enterprise to run massively-parallel computations in R within the IBM Netezza appliance and implement high-performance, big-data analytics (with high-frequency financial data, for example). [Update: a free webinar on February 29 will describe the integration between IBM Netezza and Revolution R Enterprise in detail.]
Oracle announced last year a forthcoming connection between R and Oracle, which was made available in February 2012. Oracle R Enterprise is aimed at statisticians who "don't know SQL" and are "not familiar with DBA tasks". It is available as part of the Oracle Advanced Analytics option (priced at around $23,000 per core), and provides a "transparency layer" with functions to connect to Oracle and use R functionality in the Oracle database. Oracle also maintains the open-source ROracle package, which provides similar functionality for open-source R.
Cloudera's Distribution Including Apache Hadoop provides support for R in partnership with Revolution Analytics. This connection makes it possible to manipulate data in Hadoop stores (HDFS and HBase) directly from R, and gives R programmers the ability to write MapReduce jobs in R using Hadoop Streaming.
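With Hadoop Streaming, a MapReduce job's mapper and reducer can be any executable that reads lines from stdin and writes key/tab/value pairs to stdout — which an R script can do directly. A minimal sketch of a word-count mapper in R (illustrative only; the RHadoop/Revolution integration wraps this plumbing for you):

```
#!/usr/bin/env Rscript
# Word-count mapper for Hadoop Streaming: read lines from stdin,
# emit "word<TAB>1" for each word found.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  for (word in strsplit(tolower(line), "[^a-z]+")[[1]]) {
    if (nchar(word) > 0) cat(word, "\t1\n", sep = "")
  }
}
close(con)
```

A script like this (with a matching reducer) would be submitted via the standard Hadoop Streaming jar, with the Streaming framework handling the shuffle and sort between the two stages.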
IBM BigInsights, the Hadoop platform from IBM, is also integrated with R and Revolution R Enterprise. BigInsights queries can make use of the MapReduce construct while running R computations in parallel.
Teradata's Enterprise Data Warehousing platform provides in-database analytics using R via the free teradataR package. This package allows R users to connect to Teradata, create data frames linked to Teradata tables, and call in-database analytic functions. The Teradata Aster MapReduce Platform also provides integration with R.
Sybase RAP, the edition of the Sybase database for financial data, provides integration with R. Providing the R language alongside Sybase RAP allows for faster algorithm development and extensive backward testing on historical data. Sybase also regularly highlights R integration in its financial newsletters and webinars.
The Analytics layer is where the magic happens: statistical modeling, predictive analytics and custom data visualization. Fed with (usually) structured data sourced from the Data Layer, R is widely used here to categorize, predict, and generally provide insight into corporate data stores. In many organizations, older data analysis tools remain in use, and so interfaces to R have been added to provide support for analysts and data scientists who prefer to use R, and to fill in the gaps of these legacy tools with modern, high-performance analytics.
Revolution Analytics is the leading commercial organization focused on software and support for R. Its Revolution R Enterprise software extends open-source R with productivity interfaces, high-performance statistical computing, big-data analytics, and enterprise integration of R.
SAS has been a statistical analysis workhorse since the early 1970s. Now, with so many statistics graduates trained in R instead of SAS, SAS has introduced the ability to call R from SAS/IML. (It's also possible to call R directly from base SAS thanks to a free package developed at Roche Pharmaceuticals.) SAS JMP, the point-and-click data analysis package, now also provides support for R.
IBM SPSS Statistics, the popular desktop data analysis software known simply as SPSS before the company was acquired by IBM in 2009, provides integration with R via the Statistics Programmability Extension module.
RStudio, an open-source software company, provides an integrated development environment for developing code in the R language.
Matlab, a numerical computing language used by engineers, also offers the ability to call R from Matlab on Windows.
Data analysis makes the most impact in the enterprise when it can be readily acted upon by decision makers: often, business executives not steeped in the arcana of data warehousing or statistical analysis. As a result, many reporting and business intelligence tools now make it possible to incorporate the results of analyses generated in R into the presentation layer, in a format tuned to the needs of a business audience.
You might think of Microsoft Excel as more of a spreadsheet than a presentation tool, but it is very widely used on the desktop as a "container" for static and interactive reports based on statistical analysis. While Excel does not have out-of-the-box integration with R, it is possible to integrate R-based computation into Excel spreadsheets via RExcel and Revolution Analytics' RevoDeployR web services API.
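As a sketch of what that integration looks like from the spreadsheet side: RExcel exposes worksheet functions that evaluate an R expression against cell ranges in place (the `RApply` function shown here is RExcel's, but treat the exact usage as an assumption and check the RExcel documentation):

```
=RApply("mean", A1:A10)
=RApply("sd", A1:A10)
```

Each formula ships the range to the running R session, applies the named R function, and returns the result to the cell, so the spreadsheet recalculates using R whenever the underlying data changes.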
As you can see, for organizations that need to create advanced analytics applications, R is integrated throughout the analytics stack: for data access, for presentation of results, and of course for the statistical analysis process itself. This degree of integration by so many companies is indicative of the level of demand for R throughout the enterprise. As the leading provider of commercial software and support for R, Revolution Analytics supports R users throughout the organization and helps IT departments integrate Revolution R Enterprise throughout the analytics stack for high-performance and big-data applications based on the R language.
[Update Mar 15 2012: Added SAP HANA to the list.]
Revolution R Enterprise: production-grade analytics software built upon the powerful open source R statistics language.
At the Oracle OpenWorld conference in San Francisco today, Oracle announced the new Oracle Big Data Appliance, "a new engineered system that includes an open source distribution of Apache™ Hadoop™, Oracle NoSQL Database, Oracle Data Integrator Application Adapter for Hadoop, Oracle Loader for Hadoop, and an open source distribution of R." Oracle's foray into the Hadoop and NoSQL spaces has captured most of the attention, but it's the inclusion of open source R that I find most interesting. Oracle is clearly finding demand from its customers for advanced analytics capabilities for big data, and is looking to R to fill that gap. I agree with Ed Dumbill's assessment:
Big data isn't much use until you can make sense of it, and the inclusion of R in Oracle's big data appliance bears this out. It also sets up R as a new industry standard for analytics: something that will raise serious concern among vendors of established statistical and analytical solutions SAS and SPSS.
Of course, this isn't the first time that R has been embedded into a data warehousing appliance. IBM Netezza's iClass device integrates with Revolution R, and AsterData, the Teradata Data Warehouse Appliance, and Greenplum all provide connections to R as well. Here at Revolution Analytics, we think that such enterprise-level integrations with R serve to grow the R ecosystem and serve as validation of R as a key platform for advanced analytics. As CEO Norman Nie said to GigaOm this weekend,
"Oracle’s announcement to embed R demonstrates validation for the leading statistics language and offers further evidence that R is a key weapon in advanced analytics today", said Dr. Nie. "As the Enterprise R company, Revolution Analytics has seen an enormous demand for R solutions, so it’s no surprise that Oracle and other companies are looking to further evangelize and distribute R among the enterprise."
In today's announcement, Oracle also introduced another new component, "Oracle R Enterprise". Details are sparse so far, other than that it allows you to "run existing R applications and use the R client directly against data stored in Oracle Database", presumably thanks to an R package created by Oracle. Timothy Prickett Morgan at The Register has a couple of extra details:
The Big Data Appliance also includes the R programming language, a popular open source statistical-analysis tool. This R engine will integrate with 11g R2, so presumably if you want to do statistical analysis on unstructured data stored in and chewed by Hadoop, you will have to move it to Oracle after the chewing has subsided.
This approach to R-Hadoop integration is different from that announced last week between Revolution Analytics, the so-called Red Hat for stats that is extending and commercializing the R language and its engine, and Cloudera, which sells a commercial Hadoop setup called CDH3 and which was one of the early companies to offer support for Hadoop.
We look forward to hearing more about support for R from Oracle and other enterprise vendors. As the R ecosystem continues to grow into the enterprise space, this can only mean more investment in the R project as a whole.
Oracle Press Releases: Oracle Unveils the Oracle Big Data Appliance
OpenCPU is a new initiative from R user Jeroen Ooms to make innovations in statistics, visualization and data science more widely applicable. Based on open-source principles, it's a web-based service that lets you upload data visualizations and analyses as R scripts and allows others to run them on demand. For example, you can upload a script to visualize a company's stock performance, which can then be embedded as a live chart, like this:
That chart was generated on-demand by the OpenCPU R server, and should display an up-to-date stock history for IBM. It was generated by a few lines of R code that Jeroen published to OpenCPU, and I called from this blog post using the OpenCPU API. You can publish your own code there as well, and share new data science techniques implemented in R in the same way.
As an open platform, OpenCPU treats everything you do and post as public, so it isn't suitable for anything that contains private or sensitive data. If you're looking for a secure platform for distributing R code, where access to sensitive or resource-intensive code needs to be restricted to authenticated users, then you should take a look at the RevoDeployR platform from Revolution Analytics. (Jeroen gave a presentation about the use of RevoDeployR to address such issues at the useR! 2010 conference.)
Revolution Analytics today announced a partnership with Jaspersoft, the makers of the most widely-used business intelligence software in the world. With this partnership, Revolution R and Jaspersoft software work together to bring the power of analytics coded in R to business users working with Business Intelligence (BI) dashboards and reports.
If you're a Jaspersoft or R developer and would like to try your hand at integrating the results of R scripts into Jaspersoft, you can download the free connection between JasperReports Server and RevoDeployR, the Web Services framework for Revolution R. (If you don't have them, you can download the open-source Jaspersoft distribution from the Jaspersoft website, and RevoDeployR is available for free download to academics.)
Also for developers, we'll be hosting a live webinar on March 16 on Integrating R into 3rd Party and Web Applications Using RevoDeployR which will include a section on using the RevoConnectR for JasperReports Server connector. Complete details and registration for the webinar are available here.
Revolution Analytics Partners: Jaspersoft