I confess that I'm more of a dog person than a cat person, which is probably why I found this song and video so amusing (via Sullivan):
If you want more animals behaving badly, check this out. Have a great weekend, and we'll be back on Monday!
Posted by David Smith at 15:52 in random  Permalink  Comments (0)
In June 2013, the conflict between opposition and government forces around the Syrian city of Aleppo had intensified. Rockets struck residential districts, and car bombs exploded near key facilities.
Many people died. But as is common in conflict areas, the reports of the number of dead varied by the source of the information. While some agencies reported a surge in casualties in the Aleppo area around June 2013, others did not.
The true number of casualties in conflicts like the Syrian war seems unknowable, but the mission of the Human Rights Data Analysis Group (HRDAG) is to make sense of such information, clouded as it is by the fog of war. They do this not by nominating one source of information as the "best", but instead with statistical modeling of the differences between sources.
In a fascinating talk at Strata Santa Clara in February, HRDAG's Director of Research Megan Price explained the statistical technique she used to make sense of the conflicting information. Each of the four agencies shown in the chart above published a list of identified victims. By painstakingly linking the records between the different agencies (no simple task, given incomplete information about each victim and variations in capturing names, ages, etc.), HRDAG can get a more complete sense of the total number of casualties. But the real insight comes from recognizing that some victims were reported by no agency at all. By looking at the rates at which known victims were missed by each of the agencies, HRDAG can estimate the number of victims that were identified by nobody, and thereby get a more accurate count of total casualties. (The specific statistical technique used was Random Forests, using the R language. You can read more about the methodology here.)
HRDAG is doing a noble and difficult job of understanding the facts of war from incomplete data. "If we base our conclusions about what's happening in Syria on the observed data — on the reporting rates — we get those questions wrong", said Megan in her Strata talk. "When we estimate what is missing, we have a much more accurate estimate of reality."
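The estimation idea is easiest to see in its simplest two-list form. HRDAG's actual analysis links four lists and models reporting patterns with random forests in R; the sketch below, in Python with made-up numbers, shows only the core capture-recapture logic (the classic Lincoln-Petersen estimator), not HRDAG's method.

```python
# Two-list capture-recapture (Lincoln-Petersen), the simplest version of
# the idea: use the overlap between two casualty lists to estimate how
# many victims appear on neither list. All numbers are hypothetical.

def lincoln_petersen(n1, n2, overlap):
    """Estimate the total population size from two overlapping lists.

    n1, n2: counts of identified victims on each list;
    overlap: victims matched (via record linkage) to both lists.
    """
    if overlap == 0:
        raise ValueError("no matched records: the estimate is undefined")
    return n1 * n2 / overlap

# If agency A documents 400 victims, agency B documents 300, and record
# linkage matches 200 victims to both lists, then B "captured" half of
# A's victims, suggesting B sees about half of all victims -- so the
# estimated total is 600, of which 100 were reported by neither agency.
print(lincoln_petersen(400, 300, 200))  # 600.0
```

The hard part in practice, as the talk emphasizes, is the record linkage step that produces the overlap counts in the first place.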
Strata: Record Linkage and Other Statistical Models for Quantifying Conflict Casualties in Syria
Posted by David Smith at 10:06 in applications, current events, R, statistics  Permalink  Comments (1)
by Joseph Rickert
I am a book person. I collect books on all sorts of subjects that interest me, and consequently I have a fairly extensive collection of R books, many of which I find to be of great value. Nevertheless, when I am asked to recommend an R book to someone new to R I am usually flummoxed. R is growing at a fantastic rate, and people coming to R for the first time span a wide range of sophistication. And besides, owning a book is kind of personal. It is one thing to go out and buy a technical book because it is required for a course, but quite another to make a commitment to a book all on your own. Not only must it have the right content, at the right level for you, and be written in a way that you will actually read it, but a book must also feel right, be typeset to appeal to your sense of aesthetics, have diagrams and illustrations to draw you in, and contain enough white space to seem approachable. Furthermore, there is a burden to owning a book. There is nothing worse than making a poor selection and having a totally incomprehensible text stare at you from a shelf. Moreover, even an old friend can impose obligations. I have read deeply from The Elements of Statistical Learning, but not everything, so there it sits: admonishing me.
Recently, however, while crawling around CRAN, it occurred to me that there is a tremendous amount of high-quality material on a wide range of topics in the Contributed Documentation page that would make a perfect introduction to R for all sorts of people. Maybe all it needs is a little marketing and reorganization. So, from among this treasure cache (and a few other online sources), I have assembled an R “meta” book in the following table that might be called: An R Based Introduction to Probability and Statistics with Applications.
     Content                             Document    Author
 1   Basic Probability and Statistics                G. Jay Kerns
 2   Fitting Probability Distributions               Vito Ricci
 3                                                   Julian J. Faraway
 4   Experimental Design                             Vikneswaran
 5   Survival Analysis                               John Fox
 6   Generalized Linear Models                       Virasakdi Chongsuvivatwong
 7
 8   Time Series                                     McLeod, Yu and Mahdi
 9                                                   Kim Seefeld and Ernst Linder
10   Machine Learning                                Yanchang Zhao
11   Bioinformatics                                  Wim P. Krijnen
12   Forecasting                                     Hyndman and Athanasopoulos
13   Structural Equation Models                      John Fox
14   Credit Scoring                                  Dhruv Sharma
The content column lists the topics that I think ought to be included in a good introductory probability and statistics textbook. With a little searching, you will be able to find a discussion of each topic in the document listed to its right. Obviously, there is a lot of overlap among the documents listed, since most of them are substantial works that cover much more than the few topics that I have listed.
Finally, I don’t mean to imply that the documents in my table are the best of those assembled in the Contributed Documentation page. The table just represents my idiosyncratic way of organizing some of the material in a way that I hope newcomers will find useful. I think that collectively the contributed documents have everything one might look for in a first date with R. They are available, approachable, contain superb content written by R experts, and are replete with examples and R code. And, with a little effort, a casual first encounter could lead to a long-term relationship.
Posted by Joseph Rickert at 08:30 in beginner tips, open source, R, statistics  Permalink  Comments (4)
The worldwide R user conference, useR! 2014, will take place in Los Angeles, June 30 to July 3. If you're an R user, or just interested in learning about what R can do firsthand from members of the R community, this is the conference to attend. To get an idea of what to expect, check out our roundups of prior conferences in Warwick (UK), Nashville (USA), and Albacete (Spain).
Revolution Analytics is proud to be a sponsor once again this year, but useR! is primarily a community-driven conference, organized and presented by R users from around the world. The quality of the presentations has always been high thanks to the contributions of R users themselves, and this year is no different. If you're planning to attend, we strongly encourage you to submit a talk proposal and share your experiences and knowledge about using R with other R users.
I'd like to make a special request to anyone using R in any real-world application to contribute to the "Business and Enterprise Applications Track". The goal of this section is to highlight some of the ways that R is involved in today's data-driven businesses, whether number-crunching behind the scenes or front-and-center in interactive applications. For some food for thought, check out the applications section of this blog for practical examples of R in action. Revolution Analytics is sponsoring the video recording of this session, so your contributions will be preserved and shared for everyone to see.
The deadline for talk proposals has been extended from March 30 to April 10, so don't delay!
useR! 2014: Abstract Submission
Posted by David Smith at 15:29 in announcements, events, R, user groups  Permalink  Comments (0)
In case you missed them, here are some articles from February of particular interest to R users:
A statistical analysis of various forecasting methods (using R) leads to correct predictions for 21 of 24 Oscar awards.
There are now 123 R User Groups worldwide, and applications for Revolution Analytics sponsorship grants are open until March 31.
Revolution Analytics was named by Gartner as a “Visionary” in the new Magic Quadrant for Advanced Analytics Software.
Dan Hanson simulates financial market returns in R using the Generalized Lambda Distribution.
I discussed the next phase of “Big Data” — driven by the rise of R — in an interview with theCUBE.
An R-based analysis by Joshua Katz is the basis of an interactive “Dialect Quiz” that broke traffic records at the New York Times.
Rattle creator Graham Williams’ “One Page R: A Survival Guide to Data Science with R” is actually the entry point to several in-depth R tutorials.
An example of using R for topological data analysis: sampling points from the surface of a torus.
A recent feature on Artificial Intelligence in The Atlantic includes a brief mention of R.
James Peruvankal on modeling “contagion” in social networks using R.
Replay of our webinar with Alteryx, including a demo of accessing R using the drag-and-drop workflow GUI.
Plush toys in the shape of statistical distributions, based on patterns created with R.
Currency arbitrage with Bitcoin, using exchange rates from the now-defunct MtGox exchange, downloaded with the quandl package.
A guide to creating 3D perspective plots with R.
Revolution R Enterprise is now available in the cloud, in Amazon’s AWS Marketplace.
According to the 2014 Dice Salary Survey, R is the highest-paid IT skill.
A Shiny app that displayed real-time results from the Sochi Olympics, and now shows the final medal tally.
The R package weatherData makes it easy to download weather data into R.
In addition to free licenses to individuals in academia, Revolution Analytics now offers $999 site licenses to IT departments at universities and nonprofits working for the public good.
R is #15 of all programming languages in the latest RedMonk rankings.
Some nonR stories in the past month included: special effects using just an office projector, and clips from the movie Dynamic Earth.
As always, thanks for the comments and please send any suggestions to me at [email protected]. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.
Posted by David Smith at 07:45 in R, roundups  Permalink  Comments (0)
by Derek McCrae Norton, Senior Sales Engineer
I often get questions in the course of a day about various algorithms. Some of those are already in Revolution R Enterprise, and some just haven't been written yet. One such algorithm is Ridge Regression. For those who are wondering "What is that?": put simply, it is a type of regression that can help deal with multicollinearity. It is part of a broader class of models called Penalized Regression, which also includes the LASSO. For those who want more than that, I recommend your friendly neighborhood search engine. Here is one Wikipedia link to get you started.
Since this algorithm is not yet implemented in RevoScaleR, I set out to determine how hard such a task might be... A nice surprise was that it is not difficult at all. After browsing one of my old textbooks, I found that you can calculate a ridge regression using correlation matrices. Jackpot! RevoScaleR has that implemented, so then it was just a matter of working through a bit of code.
Initially I was stumped when I looked at lm.ridge, since the coefficients don't match, but after delving into that code I was able to see that it simply uses a different method of scaling. So, here is some code to do ridge regression in RevoScaleR. The two models match the example in Kutner, Nachtsheim and Neter.
rxRidgeReg <- function(formula, data, lambda, ...) {
    myTerms <- all.vars(formula)
    newForm <- as.formula(paste("~", paste(myTerms, collapse = "+")))
    myCor <- rxCovCor(newForm, data = data, type = "Cor", ...)
    n <- myCor$valid.obs
    k <- nrow(myCor$CovCor) - 1
    bridgeprime <- do.call(rbind,
        lapply(lambda, function(l)
            qr.solve(myCor$CovCor[-1, -1] + l * diag(k),
                     myCor$CovCor[-1, 1])))
    bridge <- myCor$StdDevs[1] * sweep(bridgeprime, 2, myCor$StdDevs[-1], "/")
    bridge <- cbind(t(myCor$Means[1] - tcrossprod(myCor$Means[-1], bridge)),
                    bridge)
    rownames(bridge) <- format(lambda)
    return(bridge)
}

bodyfat <- read.table("http://calcnet.mth.cmich.edu/org/spss/V16_materials/DataSets_v16/BodyFatTxtFormat.txt")
names(bodyfat) <- c("X1", "X2", "X3", "Y")

library(ridge)
rr1 <- rxRidgeReg(Y ~ X1 + X2 + X3, data = bodyfat, lambda = 0.02,
                  reportProgress = 0)
rr2 <- linearRidge(Y ~ X1 + X2 + X3, data = bodyfat, lambda = 0.02)

rr1
##                  X1     X2      X3
## 0.02 -7.403 0.5554 0.3681 -0.1916

coef(rr2)
## (Intercept)          X1          X2          X3
##     -7.4034      0.5554      0.3681     -0.1916

# It also works with multiple lambdas
myLambda <- c(0, 0.02, 1, 5, 10)
rxRidgeReg(Y ~ X1 + X2 + X3, data = bodyfat, lambda = myLambda,
           reportProgress = 0)
##                       X1       X2        X3
## 0.00  117.085  4.33409 -2.85685 -2.186060
## 0.02   -7.403  0.55535  0.36814 -0.191627
## 1.00   -2.249  0.28438  0.30246 -0.008316
## 5.00   10.242  0.12188  0.12457  0.017907
## 10.00  14.340  0.07122  0.07206  0.013252
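For readers outside the RevoScaleR ecosystem, the same correlation-matrix route can be sketched in a few lines of numpy. This is a minimal illustration of the formula only (solve (R + λI)b′ = r on the standardized scale, then rescale), with made-up data and hypothetical names, not the RevoScaleR implementation:

```python
import numpy as np

def ridge_from_correlation(X, y, lam):
    """Ridge coefficients via the correlation matrix: solve
    (R_XX + lambda*I) b' = r_Xy on the standardized scale, then
    rescale to original units and recover the intercept."""
    n, k = X.shape
    xm, ym = X.mean(axis=0), y.mean()
    xs, ys = X.std(axis=0, ddof=1), y.std(ddof=1)
    Z, w = (X - xm) / xs, (y - ym) / ys        # standardized data
    Rxx = (Z.T @ Z) / (n - 1)                  # predictor correlation matrix
    rxy = (Z.T @ w) / (n - 1)                  # correlations with response
    b_std = np.linalg.solve(Rxx + lam * np.eye(k), rxy)
    b = ys * b_std / xs                        # back to original units
    return np.concatenate(([ym - xm @ b], b))  # prepend the intercept

# With lambda = 0 this reproduces ordinary least squares exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3 + rng.normal(size=50)
print(ridge_from_correlation(X, y, 0.02))
```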
Posted by Derek Norton at 08:30 in announcements  Permalink  Comments (0)
In 1990, 87% of Americans could be uniquely identified given only their gender, date of birth and 5-digit ZIP code. You can check how easily you can be identified using those three data points here, and vastly more data is available about individuals today than 24 years ago. In this brave new world of social sharing, open data and data security revelations, data privacy is a big issue for consumers and businesses alike.
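A quick back-of-envelope calculation (with assumed round figures, not the study's actual data) shows why those three fields are so identifying: there are far more possible combinations than there are people, so most combinations that occur at all occur exactly once.

```python
# Rough, assumed counts: two genders, ~100 years' worth of birth dates,
# and ~42,000 5-digit ZIP codes in use.
genders, birth_dates, zip_codes = 2, 365 * 100, 42_000
combinations = genders * birth_dates * zip_codes
us_population_1990 = 250_000_000  # approximate

# Over 3 billion combinations for roughly 250 million people: about a
# dozen "slots" per person on average, so sharing all three values with
# someone else is the exception, not the rule.
print(f"{combinations:,} combinations vs {us_population_1990:,} people")
```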
Statisticians have a unique perspective when it comes to data, yet searching for "privacy" or "ethics" at the websites of the major statistical societies yields little of relevance. Why aren't more statisticians playing leading roles in data privacy and ethics issues? That's the topic I raise in an op-ed at StatsLife, the online magazine of the Royal Statistical Society. I encourage you to add your voice to the conversation in the comments.
StatsLife: Why aren’t more statisticians involved in data privacy?
Posted by David Smith at 07:33 in current events  Permalink  Comments (0)
The image below alternates between two versions of the same photograph. There is one difference between the two pictures. Can you spot the difference?
(Image below the jump — the flashing can be a bit taxing on the eyes.)
Continue reading "Because it's Friday: Spot the Difference" »
Posted by David Smith at 11:31  Permalink  Comments (1)
R-core member Peter Dalgaard announced yesterday that R 3.0.3 is now available. This is the final update to the R 3.0 series, and includes several small but handy new features and minor bug fixes. Improvements include support for writing very large tables to disk, better handling of foreign-language calendar dates, and more accuracy when calculating extreme quantiles of the Cauchy distribution.
Binaries of R 3.0.3 are now available for download from your favourite CRAN mirror. The first release of the next series, R 3.1.0, is scheduled for April 10.
R-announce mailing list: R 3.0.3 is released
Posted by David Smith at 11:02 in announcements, R  Permalink  Comments (0)
by Joseph Rickert
In addition to the considerable benefit of being able to meet other, like-minded R users face-to-face, R user groups fill a niche in the world of R education by providing a forum for communicating technical information in an informal and engaging manner. Conferences such as useR!, JSM and countless smaller statistical meetings solicit expert-level talks, and the many online sites do an excellent job of providing introductory material. However, there are few places that adequately address the "middle level" talk, where a speaker can assume the audience has some experience with R and then go on to develop the R code to perform an analysis, illuminate an application, or show how to get started with a new package.
A recent talk on Hidden Markov Models (HMM) that Joe Le Truc gave to the Singapore R User Group (RUGS) provides a very nice example of the kind of midlevel technical presentation I have in mind. I didn’t attend this talk myself, but the organizers were kind enough to post Joe’s slides and code on the RUGS' meetup website.
The general idea of an HMM is easy enough to understand: one observes some time series or stochastic process and imagines that it has been generated by an unobserved or "hidden" Markov process. However, the details of formulating and fitting an HMM involve some specialized knowledge, and the sophisticated tools available for developing an HMM in R can add an additional level of complexity. Joe’s presentation helps a beginner to dive right in. He briefly states what HMMs are all about, presents some practical examples, and then goes on to show how to use the functions in the very powerful depmixS4 package to fit an HMM to a time series of S&P 500 returns.
The following slide from Joe’s presentation sets the stage for a concrete example:
Consider the following plot of the log returns for the S&P 500 for the period from 1/1/1950 to 9/9/2012.
The graph shows what looks like a more or less stationary process punctuated by a few spikes of extreme volatility, the most extreme being October 19, 1987. Joe's code shows how to construct a four-state HMM to model this process. The next plot zooms in on the period around the crash of October 1987 and also shows the probabilities of being in the first state of the HMM built with Joe's code.
Note that the model shows zero probability of being in state 1 during the crash and at the other extreme low points. The general idea is that by examining the probabilities associated with the various states, and the transition matrix that determines the probabilities of moving from one state to another:
Transition matrix
               toS1         toS2         toS3         toS4
fromS1 4.792678e-01 2.361060e-19 5.207322e-01 3.515133e-21
fromS2 7.503595e-01 4.377190e-10 2.496405e-01 2.669647e-24
fromS3 4.806678e-01 6.005592e-02 3.978485e-01 6.142784e-02
fromS4 8.655515e-35 1.923142e-01 2.286245e-48 8.076858e-01
one can gain some insight into the dynamics of the observable time series.
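The state probabilities in the plot above come from exactly this kind of machinery. As a language-neutral illustration (Joe's analysis uses R's depmixS4; this sketch, with made-up parameters, is not his model), here is the forward-backward recursion that turns a series of observed returns into per-period state probabilities, for a toy two-state Gaussian HMM:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def state_probabilities(obs, pi, A, mus, sigmas):
    """Forward-backward: P(state_t = i | all observations) for a
    Gaussian-emission HMM. pi: initial distribution, A: transition
    matrix, mus/sigmas: per-state emission parameters."""
    T, S = len(obs), len(pi)
    B = np.array([[gaussian_pdf(o, mus[s], sigmas[s]) for s in range(S)]
                  for o in obs])                      # emission likelihoods
    alpha, beta = np.zeros((T, S)), np.ones((T, S))
    alpha[0] = pi * B[0]
    alpha[0] /= alpha[0].sum()                        # scaled forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):                    # scaled backward pass
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta                              # smoothed posteriors
    return gamma / gamma.sum(axis=1, keepdims=True)

# Hypothetical "calm vs volatile" model for daily log returns:
A = np.array([[0.98, 0.02], [0.10, 0.90]])            # sticky calm state
pi = np.array([0.5, 0.5])
mus, sigmas = [0.0005, -0.001], [0.008, 0.03]
obs = [0.001, -0.002, 0.0, -0.08, 0.05, -0.04, 0.001]
probs = state_probabilities(obs, pi, A, mus, sigmas)
# probs[:, 1] spikes around the large swings (the "volatile" state).
```

depmixS4 handles exactly this smoothing (plus parameter estimation) for models with more states and with covariates, which is why Joe's code is so short.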
Although Joe's code is only an incremental modification of the example given in the documentation for the depmixS4 package, I believe that it serves the valuable purpose of helping to popularize a package that otherwise might be a bit intimidating to someone who is not an expert in this area. The code to generate the plots shown above may be found here: Download HMM_blog_post.
For more material on HMMs, have a look at the Thinkinator post, the little book of R for bioinformatics, or the very accessible and thorough treatment in Hidden Markov Models for Time Series: An Introduction Using R (Chapman & Hall) by Walter Zucchini and Iain L. MacDonald, which shows how to code HMMs in R from first principles.
Posted by Joseph Rickert at 08:58 in packages, R, statistics, user groups  Permalink  Comments (2)