The image below alternates between two versions of the same photograph. There is one difference between the two pictures. Can you spot the difference?
(Image below the jump — the flashing can be a bit taxing on the eyes.)
R-core member Peter Dalgaard announced yesterday that R 3.0.3 is now available. This is the final update to the R 3.0 series, and includes several small but handy new features and minor bug fixes. Improvements include support for writing very large tables to disk, better handling of foreign-language calendar dates, and more accuracy when calculating extreme quantiles of the Cauchy distribution.
Binaries of R 3.0.3 are now available for download from your favourite CRAN mirror. The first release of the next series, R 3.1.0, is scheduled for April 10.
R-announce mailing list: R 3.0.3 is released
by Joseph Rickert
In addition to the considerable benefit of being able to meet other, like-minded R users face-to-face, R user groups fill a niche in the world of R education by providing a forum for communicating technical information in an informal and engaging manner. Conferences such as useR!, JSM and countless smaller statistical meetings solicit expert-level talks, and the many online sites do an excellent job of providing introductory material. However, there are few places that adequately address the "middle level" talk, where a speaker can assume an audience has some experience with R and then go on to develop the R code to perform an analysis, illuminate an application, or show how to get started with a new package.
A recent talk on Hidden Markov Models (HMM) that Joe Le Truc gave to the Singapore R User Group (RUGS) provides a very nice example of the kind of mid-level technical presentation I have in mind. I didn’t attend this talk myself, but the organizers were kind enough to post Joe’s slides and code on the RUGS' meetup website.
The general idea of an HMM is easy enough to understand: one observes some time series or stochastic process and imagines that it has been generated by an unobserved or "hidden" Markov process. However, the details of formulating and fitting an HMM involve some specialized knowledge, and the sophisticated tools available to develop an HMM in R can add a further level of complexity. Joe’s presentation helps a beginner to dive right in. He briefly states what HMMs are all about, presents some practical examples, and then goes on to show how to use the functions in the very powerful depmixS4 package to fit an HMM to a time series of S&P 500 returns.
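To make the "hidden process behind an observed series" idea concrete, here is a minimal sketch in base R that simulates a two-state HMM. The state names, transition probabilities and standard deviations are illustrative assumptions, not values from Joe's talk, and the sketch only generates data; it does not fit a model.

```r
# Minimal sketch: simulate a two-state hidden Markov model in base R.
# All numbers below are illustrative assumptions, not estimates from the talk.
set.seed(42)

n_obs <- 500

# Transition matrix: row i holds the probabilities of moving out of state i.
P <- matrix(c(0.95, 0.05,
              0.10, 0.90),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("calm", "volatile"), c("calm", "volatile")))

# State-dependent standard deviations of the observed "returns".
state_sd <- c(calm = 0.005, volatile = 0.03)

# Hidden layer: a Markov chain over the two states, started in "calm".
state <- integer(n_obs)
state[1] <- 1L
for (t in 2:n_obs) {
  state[t] <- sample.int(2, size = 1, prob = P[state[t - 1], ])
}

# Observed layer: normal draws whose spread depends on the hidden state.
returns <- rnorm(n_obs, mean = 0, sd = state_sd[state])

# An analyst would see only `returns`; `state` is what an HMM tries to recover.
table(names(state_sd)[state])
tapply(returns, names(state_sd)[state], sd)
```

Fitting an HMM runs this logic in reverse: given only a series like `returns`, estimate the transition matrix, the state-dependent distributions, and the probability of being in each state at each time.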
The following slide from Joe’s presentation sets the stage for a concrete example
Consider the following plot of the log returns for the S&P500 for the period from 1/1/1950 to 9/9/2012.
The graph shows what looks like a more or less stationary process punctuated by a few spikes of extreme volatility, the most extreme being October 19, 1987. Joe's code shows how to construct a four-state HMM to model this process. The next plot zooms in on the period around the crash of October 1987 and also shows the probabilities of being in the first state of the HMM built with Joe's code.
Note that the model shows 0 probability of being in state 1 during the crash and the other extreme low points. The general idea is that by examining the probabilities associated with the various states and the transition matrix that determines the probabilities of moving from one state to another:
        toS1          toS2          toS3          toS4
fromS1  4.792678e-01  2.361060e-19  5.207322e-01  3.515133e-21
fromS2  7.503595e-01  4.377190e-10  2.496405e-01  2.669647e-24
fromS3  4.806678e-01  6.005592e-02  3.978485e-01  6.142784e-02
fromS4  8.655515e-35  1.923142e-01  2.286245e-48  8.076858e-01
one can gain some insight into the dynamics of the observable time series.
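As one hedged illustration of the kind of insight a transition matrix offers, the sketch below re-enters the printed values in base R and derives two standard Markov-chain summaries: the expected length of a stay in each state and the long-run share of time spent in each state. These calculations are not part of Joe's code; the variable names are my own.

```r
# Transition matrix copied from the printed output (rows: from, columns: to).
P <- matrix(c(4.792678e-01, 2.361060e-19, 5.207322e-01, 3.515133e-21,
              7.503595e-01, 4.377190e-10, 2.496405e-01, 2.669647e-24,
              4.806678e-01, 6.005592e-02, 3.978485e-01, 6.142784e-02,
              8.655515e-35, 1.923142e-01, 2.286245e-48, 8.076858e-01),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("fromS", 1:4), paste0("toS", 1:4)))

# Each row should sum to 1, up to rounding in the printed values.
row_totals <- rowSums(P)

# Expected number of consecutive periods spent in a state once entered:
# the length of a stay is geometric, so its mean is 1 / (1 - p_ii).
expected_stay <- 1 / (1 - diag(P))

# Long-run share of time in each state: solve pi %*% P = pi with sum(pi) = 1
# by replacing one redundant balance equation with the normalisation constraint.
A <- t(diag(4) - P)
A[4, ] <- 1
stationary <- solve(A, c(0, 0, 0, 1))
names(stationary) <- paste0("S", 1:4)

round(row_totals, 6)
round(expected_stay, 2)
round(stationary, 3)
```

For example, state 4 has a self-transition probability of about 0.81, which corresponds to an expected stay of roughly five periods, while the near-zero entries show that some moves (such as state 4 directly to state 1) are effectively ruled out by the fitted model.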
Although Joe's code is only an incremental modification of the example given in the documentation for the depmixS4 package, I believe that it serves the valuable purpose of helping to popularize a package that otherwise might be a bit intimidating to someone who is not an expert in this area. The code to generate the plots shown above may be found here: Download HMM_blog_post.
For more material on HMMs, have a look at the Thinkinator post, the little book of R for bioinformatics, or the very accessible and thorough treatment in Hidden Markov Models for Time Series: An Introduction Using R (Chapman & Hall) by Walter Zucchini and Iain L. MacDonald, which shows how to code HMMs in R from first principles.
Will we always need data scientists, or will the Data Science role be replaced by easy-to-use automated applications? Mikio L. Braun argues — compellingly — that data analysis is hard, and that the fact that it's easy to shoot yourself in the foot with automated tools — and convince yourself that the results are meaningful when they're not — means that human data scientists aren't going anywhere soon. (Via Mike Loukides.)
by Rodney Sparapani, PhD
Rodney is an Assistant Professor in the Institute for Health and Society from the Division of Biostatistics at the Medical College of Wisconsin in Milwaukee and president of the Milwaukee Chapter of the ASA, which is hosting an R workshop on Data Mining in Milwaukee on April 4th.
Emacs Speaks Statistics (ESS) is a GPL software package for GNU Emacs that provides support for several statistical programming languages. This post, which focuses on ESS and R, gives some history of both Emacs and ESS, along with guidance on installing both and the basics of getting started.
Emacs has a long and important history as a programmer's editor. It was created in 1976 by Richard Stallman, AKA RMS. In 1984, RMS released GNU Emacs, the first free software program released by the GNU Project. Originally, Emacs provided intelligent editing for popular programming languages of the time such as Pascal, PL/I and Fortran; each language was supported by a corresponding "major mode", which we will call a mode for short. Now Emacs has modes for the popular programming languages of today such as R (via ESS), C/C++, Java, Perl and Python. Modes are the killer app of Emacs. You can learn one editor, Emacs, which provides an IDE for practically all of the programming languages you are likely to ever need. Emacs also supports a wide variety of markup languages like LaTeX (via AUCTeX) and HTML.
You can find the source code for GNU Emacs online. I highly recommend the latest stable release, v24.3, for its feature richness and stability. MS Windows and Mac OS X users can find binaries online (which come with ESS and AUCTeX included) at Vincent Goulet's web page. If you are using Linux or UNIX, then you may be able to find binaries for Emacs from a repo associated with your distribution. If not, then you can install Emacs from source. However, beware, Emacs has a lot of dependencies; an abbreviated list includes giflib, libpng, libtiff, ispell/aspell, libXaw and ncurses. By default, configure assumes that the install location is /usr/local, but you can override that with the --prefix option:
./configure --prefix=/opt/local --with-x-toolkit=lucid --without-gconf
These options should work on a wide variety of Linux and UNIX distributions.
The Emacs window is called a Frame; we will dissect the Frame from top to bottom. In Figure 0, you can see that the Menu is at the top.
Just below that is the Toolbar with icons for common operations. Next, we come to the buffer area where the file you are editing appears. Below that is an information strip called the mode line. From left to right, the mode line has several items which you can hover over to receive tips on what they represent. In Figure 0, you will see that at the beginning of the mode line, there are 5 characters which each represent file information: the coding system, the end-of-line character, writable or read-only, whether the buffer has been modified and the current directory respectively; all but the last can be modified by clicking on the corresponding character.
Next is the file name. The mode name will follow and be in parentheses. And, finally, at the bottom is the minibuffer which we will see more of.
Besides modes, Emacs is known for its commands bound to key sequences. You can perform a lot of operations from the Menu and the Toolbar that are self-explanatory. However, due to the constant mouse movements you may find these inconvenient; key combinations exist for many common operations.
In the Emacs help notation, C-KEY means hold down the Control key while pressing another KEY. For example, C-h means hold down Control while pressing h. For new Emacs users, C-h is very helpful. C-h is the help key; note that C-h is also assigned to F1 for convenience. The key sequence C-h t or F1 t will launch the Emacs tutorial. C-h k runs the command describe-key. After pressing C-h k you will see the following prompt in the minibuffer "Describe key (or click or menu item):" which will wait until you press a key sequence (or click or pick a menu item). For example, entering C-g after the prompt produces:
C-g runs the command keyboard-quit, which is an interactive compiled Lisp function in `simple.el'.
It is bound to C-g.
Signal a `quit' condition. During execution of Lisp code, this character causes a quit directly. At top-level, as an editor command, this simply beeps.
It is this help at your fingertips which is the self-documenting feature of Emacs. Remember C-g: it can be used to cancel a command in progress if you change your mind or you launched the command in error. A few other useful commands relate to splitting the current buffer: C-x 2 will split the current buffer in half above and below, and C-x 1 will return it to one buffer. Similarly, C-x 3 splits the current buffer left and right, and C-x 1 will restore it.
M-KEY means hold down the Meta key while pressing another KEY. On PC (Mac) keyboards, the Meta key is usually the Alt (Option) key. On UNIX keyboards, Meta keys are usually to the left and right of the spacebar and have a solid diamond symbol. To be sure, use describe-key, i.e. C-h k M-x. Of course, you will not be sure which key is the Meta key, but you will quickly find out. If you don't have a Meta key for some reason, you can press and release the Escape key and then press KEY. You can execute an Emacs command by name as follows: M-x COMMAND Enter. For example, to run describe-key: M-x describe-key Enter.
In the late 1990s, Anthony Rossini led the effort to merge S-mode (developed by David Smith, editor of this blog), SAS-mode and Stata-mode into one package: Emacs Speaks Statistics (ESS). Originally, ESS supported GNU Emacs and XEmacs. XEmacs was a popular fork of Emacs at that time, but the feature sets of Emacs and XEmacs have diverged. Today, ESS only supports GNU Emacs; the current stable release is v13.09-1. However, XEmacs users can still use the slightly older version of ESS (circa 2012), v12.04-4. You can find every release of ESS from 2002 onward in the ESS archive here.
You can find the source of ESS online at the R Project. As already mentioned, you can install Emacs and ESS simultaneously with Vincent Goulet's binaries. You can get the current stable release as well as other releases from the ESS archive. Like all free software, ESS is a work in progress. Between releases, new features and bug fixes appear in the ESS repo. If you need to install the latest development release, then you can grab the source from one of the ESS repos. ESS has two repos: one based on subversion, AKA SVN, and the other based on git. Although the SVN repo is the basis of releases, the two repos are synchronized regularly.
You can check out the latest development release from SVN via the command:
svn checkout https://svn.r-project.org/ESS/trunk /path/to/ESS
Replace /path/to/ESS with the directory on your local system where you want to store ESS. Or, similarly, via the git command:
git clone https://github.com/emacs-ess/ESS.git /path/to/ESS
The steps to install ESS can be found online. Please follow the steps carefully. Note that steps 2 and 3 are optional, but steps 4 and 5 are necessary.
If you have installed ESS (and re-launched Emacs), then you should be ready to go. In Emacs, type M-x ess-version Enter to see if Emacs is running the version of ESS that you installed. As of this writing, the latest released version is v13.09-1 while the latest development version in the repo is v13.09-2.
Now, let's take a look at an example from the Modern Applied Statistics with S (MASS) book.
Type C-x C-f galaxies.R Enter. Into this new file, copy and paste:
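library(MASS)  # provides the galaxies data and width.SJ()
# rescale the velocities from km/s to thousands of km/s
galaxies <- galaxies/1000
# Sheather-Jones bandwidths: direct plug-in, then the default method
c(width.SJ(galaxies, method = "dpi"), width.SJ(galaxies))
# empty plot frame, then two density estimates with different bandwidths
plot(x = c(0, 40), y = c(0, 0.3), type = "n", bty = "l",
     xlab = "velocity of galaxy (1000 km/s)", ylab = "density")
lines(density(galaxies, width = 3.25, n = 200), lty = 1)
lines(density(galaxies, width = 2.56, n = 200), lty = 3)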
With the .R extension, this file will be recognized as an R program. On the mode line, you will see the mode name: "ESS[S] [R db -] ElDoc". Since ESS was derived from S-mode (and R from S), the mode name starts with ESS[S]. The "R" in [R db -] represents the R language. The "db -" stands for ess-tracebug, which provides visual debugging, breakpoints, tracing, etc. For more on ess-tracebug, see its documentation. And, finally, "ElDoc" signifies that ElDoc is turned on. With ElDoc, the minibuffer displays function arguments at point. For example, place the cursor on the "x" in "plot(x" in the galaxies.R buffer and you will see the arguments for the plot() function displayed in the minibuffer; we saw that in Figure 0.
The syntax highlighting for the R language provided by ESS is configurable. In Emacs, syntax highlighting is known as font-locking. You can customize the amount of syntax highlighting that you want to see. At the top of the Emacs window, click on the ESS menu and select "Font Lock". This will display a menu of buttons corresponding to language elements that you can syntax highlight. For example, in Figure 1, you can see that when you have turned off all font-locking, the only things syntax highlighted are strings encased in double quotes.
At the other end of the spectrum, in Figure 2, you can see what it looks like when nearly all of the choices are picked.
You can experiment with the various settings and, once you are satisfied, press "Save to custom" at the bottom. This will save your settings in your Emacs initialization file, ~/.emacs. You will see them in a section that begins with "(custom-set-variables".
Now, let's return to our galaxies example. You can submit the whole buffer to an R process by pressing C-c C-b. If you don't have an R process running in your Emacs session, then one will be created for you in a buffer entitled "*R*", which you will see appear as your buffer is split either above/below or left/right. You can also submit a region by highlighting some code and pressing C-c C-r. You can submit the paragraph in which your cursor resides with C-c C-p (a paragraph is a set of one or more lines of code separated by blank lines). You can submit the line on which your cursor resides with C-c C-j (your cursor can be anywhere in the line; it doesn't have to be at the beginning or the end).
Now, in the *R* buffer, at an R prompt, type ?galaxies. If you press "n", then you will move to the next section of the help buffer; press "n" until you get to Examples. There you will find something similar to the example up above. However, in Figure 3, notice that the R syntax is not highlighted.
When you are using R, you may find yourself editing R code that has embedded C/C++, HTML or LaTeX. Or you may simply be reading a help page. Emacs, generally, has one major mode per buffer. So, the syntax highlighting will not be what the user intended. polymode was developed as a helper mode for ESS to fix this. With polymode, R code in the help pages, as well as embedded code from another language, is syntax highlighted correctly. Look here to get polymode source code and installation tips online.
So, let's return to our galaxies example. In Figure 4, you can see that the R code in the Examples section is now syntax highlighted via polymode.
Welcome to the Emacs and ESS world! Hopefully, this article has inspired you to give it a try. Like all software, Emacs and ESS are not perfect. However, their track record shows that they have served R users well with an intelligent editing environment. To find out more about ESS, look here.
The ESS documentation is a work in progress. However, to be a true zombie, don't be too squeamish to RTFM. For zombies that hunger for historical Emacs brain matter, I recommend EMACS: the extensible, customizable, self-documenting display editor by Richard M. Stallman. And for a wonderful introduction to S and R, read Modern Applied Statistics with S, 4th ed. by Bill Venables and Brian Ripley.
Finally, if you would like to talk more about Emacs, ESS or Zombies, stop by the R Workshop on Classical and Bayesian Data Mining sponsored by the Milwaukee Chapter of the ASA on April 4.
Since we've had quite a few announcements over the last month or so, I thought I'd take a moment to catch up on some of the media reports mentioning Revolution Analytics.
Last week, Gartner revealed Revolution Analytics as a Visionary in the new Magic Quadrant for Advanced Analytics Platforms. Inside BigData noted that "Alteryx, Revolution Analytics, RapidMiner and Knime are the ones to watch in 2014", while SearchBusinessAnalytics also noted that "Revolution Analytics also received high marks". Enterprise Apps Today notes the prominence of open-source vendors in the Quadrant, and quotes me saying "traditional enterprise tools struggle to match the open-source community's ability to innovate, iterate and evolve rapidly". Meanwhile, Datanami noted that while SAS and IBM are "King of [the] Analytics Hill" for now, the question is, "But for how long"? and summarized Revolution Analytics' position in the Magic Quadrant:
Revolution Analytics also fared quite well in Gartner's Magic Quadrant. The Mountain View, California-based company is credited with realizing the power of open source R, and has therefore become "the default choice for organizations without an existing provider seeking an R-based solution." High customer satisfaction and a strong sales pipeline are strengths.
Last month we also announced that Revolution R Enterprise is now available in the Amazon Cloud, with instances on Windows and Linux (including RStudio Server Pro) available on a pay-as-you-go model for just $0.70 per core per hour (even less than the prices listed in Network World's New Product of the Week slideshow). ComputerWorld says this cloud-based service provides "an easy way for individuals and organizations to start and test their big-data-styled analysis projects". IT BusinessEdge notes that "users of the AWS service can run computations on data sets up to 1TB on Windows and Linux servers".
Finally, our director of product management Thomas Dinsmore was honored with an article for Wired Innovation Insights, on building smarter organizations that include data scientists, power analysts, business analysts and analytic consumers.
Since we got some great news the other day, a happiness-filled Because it's Friday post is a must this week. This Pomplamoose remix of Pharrell Williams's Happy with Daft Punk's Get Lucky (featuring very clever use of a standard "beamer" video projector for the visual effects) fits the bill nicely.
And if you haven't seen the 24-hour-long music video for the original Happy, well, it's a great way to while away a few minutes browsing around for your favourite street dancer.
Have a very happy weekend, and we'll see you back here on Monday!
This is the time of year when everyone likes to speculate on the winners of the Academy Awards, to be announced on Sunday. There are plenty of ways to try and predict which movie is going to win Best Picture or who'll win Best Actress. You could look at the various betting markets and see who the speculators are favouring. You could take a look at the predictions from various movie experts. You could base your predictions on the movie "fundamentals": prior awards won, box office receipts, and so forth. If you travel in such circles, you could listen in on the chatter at Hollywood cocktail parties. Or you could even watch all of the nominated movies and decide for yourself.
As Peter Aldhous (a data journalist we've featured in this blog before) reports in Medium, a team of researchers used statistical analysis to evaluate all the possible methods for forecasting the Oscars, by using them to predict the outcomes of the 2013 Academy Awards and comparing the results to the actual outcomes. The conclusion: the predictions from the BetFair betting markets alone are the best indicators of the actual outcomes. BetFair even does better than a Nate Silver-style aggregation of the critics' picks on the day before the actual awards (and way better than a statistical model based on movie fundamentals), as you can see in the chart below.
You can read the details of the analysis in this paper from Microsoft Research. Author David Rothschild let me know that all the computation for the paper was done in the R language, along with many R packages including plyr, reshape, ggplot2 and data.table. Rothschild uses the BetFair predictions (slightly adjusted so that the total probability of all outcomes adds to 100%) as the basis of the Oscar predictions at the PredictWise website. Click through to see the up-to-the-minute predictions, but the forecasts for the top awards as of this writing, along with their predicted chance of winning, are:
| Award | Predicted winner | Chance of winning |
| Best Picture | 12 Years a Slave | 87.4% |
| Best Directing | Alfonso Cuarón (Gravity) | 98.2% |
| Best Actor | Matthew McConaughey (Dallas Buyers Club) | 91.9% |
| Best Actress | Cate Blanchett (Blue Jasmine) | 98.6% |
| Best Supporting Actor | Jared Leto (Dallas Buyers Club) | 97.2% |
| Best Supporting Actress | Lupita Nyong’o (12 Years a Slave) | 59.1% |
| Best Visual Effects | Gravity | 99.8% |
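The "slightly adjusted" step mentioned before the forecasts can be sketched in R as follows. This is only an illustration: the odds are made-up numbers rather than BetFair data, and proportional rescaling is my assumption about one simple way to make market-implied probabilities add to 100%; the post does not say which adjustment the paper actually uses.

```r
# Hypothetical decimal betting odds for one award category (not real BetFair data).
odds <- c(film_a = 1.15, film_b = 9.0, film_c = 26.0, film_d = 51.0)

# Raw implied probability of each outcome is the reciprocal of its decimal odds.
implied <- 1 / odds

# The raw values need not sum to exactly 1, so record the total first.
total <- sum(implied)

# Assumed adjustment: proportional rescaling so the probabilities add to 100%.
adjusted <- implied / total

round(100 * adjusted, 1)
```

Rescaling this way preserves the ranking of the nominees and the ratios between their probabilities, which is why it is a common first choice.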
(On a personal note: I really hope Gravity does match the predictions above. It's easily one of the best films I've seen in the last decade, and I'd give it Best Picture as well if I were an Academy member. See it in 3-D if you can.)
You can see predictions for the other categories in Peter Aldhous's article linked below.
Update March 3: By my count, the PredictWise predictions (based on the Betfair betting markets) correctly predicted 21 of 24 Oscar winners. Not a bad record!
by Joseph Rickert
After an end-of-year slowdown in R user group activity that lasted into mid-January, the Revolution Analytics Community Calendar indicates that R user groups worldwide are back in full swing, with 48 events listed in the short month of February. And, while the total number of active user groups has not grown from this time last year, it is at least holding steady. A recent review of the Revolution Analytics R User Group Directory revealed that new user groups have formed over the last few months to offset those that appear to have gone inactive. Those of you who are familiar with the map we publish from time to time will see even more dots marking R user groups in far-flung places that were previously only white space. From Kansas City to Dakar and then on to Mumbai and Wellington, people are meeting to talk about R.
Our latest count is 123 groups distributed across 41 countries. If we have missed a group, please let us know. User groups founded over the past year include those in the following table:
| R User Group Name | City | Approximate Start Date |
| R-omania Team | Bucharest | 4/4/2013 |
| Grupo de Usuarios de R Argentina | Buenos Aires | 8/1/2013 |
| Cairo Bioinformatics Group | Cairo | 8/29/2013 |
| DakaR R User Group (DRUG) | Dakar | 10/23/2013 |
| R User Group, Gainesville | Gainesville, FL | 5/22/2013 |
| Glasgow R Users Group and Modelling Systems Community | Glasgow | 9/1/2013 |
| Kansas City R Users Group | Kansas City, MO | 12/2/2013 |
| Knoxville R Users Group (KRUG) | Knoxville, TN | 3/28/2013 |
| RLyon | Lyon | 12/19/2013 |
| MumbaiR | Mumbai | 9/15/2013 |
| Pune R User Group | Pune | 7/17/2013 |
| SheffieldR | Sheffield | 3/28/2013 |
| Wellington R Users Group (WRUG) | Wellington | 9/5/2013 |
At Revolution Analytics we believe that user groups are essential to the growth and health of the R Community. While R Core keeps the wheels turning and the thousands of R package authors expand the capabilities of the language, the person-to-person interactions that take place in the R user groups generate the enthusiasm and vitality that fuel creativity. So far this year, Revolution Analytics has provided financial support to 30 user groups. Half of these have qualified at either the Matrix or Array levels! If you are the organizer of an R user group and you think support from Revolution might help you to grow your group, or expand the scope of what you can do, please submit an application for our 2014 user group program as soon as you can. We will be accepting Matrix and Array level applications through the end of March, just a little more than a month away. (Check here to see if you qualify.) We will be accepting Vector level applications until September 20, 2014. So, if you are in an area where there is no R user group, there is still time to search out a few comrades in R, put up a meetup site and fill in some more white space. Good luck, and let us know how we can help.
The entire team at Revolution Analytics is very proud to announce that Gartner has named Revolution Analytics a Visionary in the inaugural Gartner Magic Quadrant for Advanced Analytics Platforms, published February 19, 2014. The report evaluated 16 vendors through a series of stringent criteria related to the ability to execute and completeness of vision.
Revolution Analytics is positioned the furthest for Completeness of Vision and Ability to Execute in the Visionaries Quadrant. We believe this is a validation of the leading-edge innovations of the open-source R community, and that of our own Revolution R Enterprise development team who continues to complement R with scalability, performance, and enterprise readiness. Here's what CEO Dave Rich has to say:
"It's such a pivotal moment for data scientists and the growing open-source R community that Gartner has embarked on its first ever Magic Quadrant for Advanced Analytics Platforms. Gartner estimates advanced analytics to be a $2 billion market that spans a broad array of industries globally, and ‘Gartner predicts business intelligence and analytics will remain top focus for CIOs Through 2017.’ We believe that this new Magic Quadrant puts a spotlight on big data as the great analytics disruptor and we feel highlights the need for solutions like Revolution Analytics' that are built upon a flexible, open platform, and designed for today's Big Data Big Analytics challenges." — Dave Rich, CEO, Revolution Analytics
Hear, hear! You can read more about the new Gartner Magic Quadrant at the link below:
Revolution Analytics press releases: Revolution Analytics Positioned in the “Visionaries” Quadrant of the First Ever Gartner Magic Quadrant for Advanced Analytics Platforms
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.