by Joseph Rickert

Pasha Roberts, Chief Scientist at Talent Analytics, is writing a series of articles on employee churn for the Predictive Analytics Times that together form a really instructive and valuable example of using R to do some basic predictive modeling. So far, Pasha has published Employee Churn 201, in which he makes a case for the importance of modeling employee churn, and Employee Churn 202, where he builds a fairly sophisticated interactive model from first principles using only RStudio and basic R functions. And, while the series is not even complete, I think it is going to be unique because it works well on multiple levels.

In Churn 201, Pasha uses R almost incidentally, to produce the following plot that illustrates the concepts involved in understanding the costs and benefits contributed by a single employee.

At the lowest level, this is a nice example of what might be called a “programming literate essay”. R clearly isn’t necessary just to create a graphic. (Note the use of ggplot's annotate() capability.) But, if you look at the R code behind the scenes, you will see that Pasha has gone a bit further. In a few lines of annotated code he has sketched out a self-documenting model that someone else could use to get “back of the envelope” results for their business. The exercise is roughly at the level of what a business analyst might attempt in an Excel spreadsheet.

In Employee Churn 202, Pasha goes still further, moving the series from essays alone to a modeling effort. He uses basic survival analysis ideas and simple R functions to create a sophisticated decision model that computes several performance measures, including something he calls Expected Cumulative Net Benefit. This measures the net benefit to the corporation of employees who leave for both “good” and “bad” reasons.
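To make the idea concrete, here is a toy sketch of my own (not Pasha's model; every number below is invented) showing how a survival curve and per-period net benefit combine into a cumulative expected value:

```r
# Toy illustration (not the churn202 model): expected cumulative net benefit
# is the sum over periods of P(employee still present) times that period's
# net benefit. All numbers here are made up for illustration.
months <- 1:36
hazard <- 0.03                                   # assumed constant monthly attrition rate
surv <- (1 - hazard)^months                      # P(employee still present at month t)
net_benefit <- c(rep(-4000, 6), rep(2000, 30))   # onboarding cost, then steady payoff
ecnb <- cumsum(surv * net_benefit)               # expected cumulative net benefit
tail(ecnb, 1)                                    # expected net value over 3 years
```

With these invented parameters, the curve starts negative during onboarding and turns positive once the employee's contribution outweighs the ramp-up cost.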

The following figure shows the simulation running in RStudio complete with interactive tools built with the manipulate() function to perform "what if" analyses and display the results.

Running the simulation is easy. All of the code is available on GitHub, where the file churn202.md provides details on how things work. Once you have run the code in churn202.R, or issued the command source("churn202.R") from the console, running the function manipSim202() will produce the simulation. (Note that it might be necessary to click on the “gear” icon in the upper left-hand corner of the Plots panel to make the slider controls appear.) The function runSensitivityTests() varies each of the parameters in the simulation through a reasonable range of values, while holding the other parameters fixed, to show the sensitivity of Expected Cumulative Net Benefit to each parameter. The function runHistograms() produces histograms of the synthetic data that drive the simulation and hints at the data collection effort that would be required to run the simulation for real.

By placing the code on GitHub and inviting feedback, comments and pull requests, Pasha has raised his literary efforts to the status of an open source employee churn project without compromising the clarity of his exposition. I, for one, am looking forward to the rest of this series.

by Daniel Hanson

Last time, we looked at the four-parameter Generalized Lambda Distribution as a method of incorporating skew and kurtosis into an estimated distribution of market returns, capturing the typical fat tails that the normal distribution cannot. Having said that, however, the Normal distribution can be useful in constructing Monte Carlo simulations, and it is still commonly found in applications such as calculating the Value at Risk (VaR) of a portfolio, pricing options, and estimating the liabilities in variable annuity contracts.

We will start here with a simple example using R, focusing on a single security. Although perhaps seemingly trivial, this lays the foundation for more complex cases such as multiple correlated securities and stochastic interest rates. Discussion of these topics is planned for articles to come, as well as topics in option pricing.

*Single Security Example*

Under the oft-used assumption of Brownian Motion dynamics, the return of a single security (e.g., an equity) over a period of time Δt is approximately [See Pelsser for example.]

μΔt + σZ·sqrt(Δt)    (*)

where μ is the mean annual return of the equity (also called the drift), and σ is its annualized volatility (i.e., standard deviation). Z is a standard Normal random variable, which makes the second term in the expression stochastic. The time t is measured in units of years, so for quarterly returns, for example, Δt = 0.25.

As μ, σ, and Δt are all known values, generating a simulated distribution of returns is a simple task. As an example, suppose we are interested in constructing a distribution of quarterly returns, where μ = 10% and σ= 15%. In order to get a reasonable approximation of the distribution, we will generate n = 10,000 returns.

n <- 10000

# Fixing the seed gives us a consistent set of simulated returns
set.seed(106)
z <- rnorm(n)   # mean = 0 and sd = 1 are defaults

mu <- 0.10
sd <- 0.15
delta_t <- 0.25

# Apply expression (*) above
qtr_returns <- mu*delta_t + sd*z*sqrt(delta_t)

Note that R is “smart enough” here to add the scalar mu*delta_t to each element of the vector in the second term, giving us a set of 10,000 simulated returns. Finally, let’s check our results. First, we plot a histogram:

hist(qtr_returns, breaks = 100, col = "green")

This gives us the following:

The symmetric bell shape of the histogram is consistent with the Normal assumption. Checking the annualized mean and variance of the simulated returns,

stats <- c(mean(qtr_returns) * 4, sd(qtr_returns) * 2)  # annualize: x4 for the mean, x sqrt(4) = 2 for volatility
names(stats) <- c("mean", "volatility")
stats

We get:

      mean volatility
0.09901252 0.14975805

which is very close to our original parameter settings of μ = 10% and σ= 15%.

Again, this is a rather simple example, but in future discussions we will see how it extends to using Monte Carlo simulation for option pricing and risk management models.
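As a small taste of the risk-management direction, a one-quarter 95% Value at Risk can be read directly off the simulated distribution. This is a sketch of my own using the same parameters as above (defining VaR as the 5th-percentile loss is my assumption, not part of the original example):

```r
# Sketch: one-quarter 95% VaR (as a fraction of portfolio value) read off
# the simulated return distribution from the example above
set.seed(106)
z <- rnorm(10000)
qtr_returns <- 0.10 * 0.25 + 0.15 * z * sqrt(0.25)
VaR_95 <- -quantile(qtr_returns, 0.05)   # loss exceeded only 5% of the time
VaR_95
```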

If you're still working on your March Madness brackets or fantasy teams, Rodrigo Zamith has updated his NCAA Data Visualizer with the latest teams, players and results. Just choose the two teams you want to compare and the metric to compare them on, and this R-based app will show you the results instantly.

Rodrigo Zamith: Visualizing Season Performance by NCAA Tournament Teams (2014)

In June 2013, the conflict between opposition and government forces around the Syrian city of Aleppo had intensified. Rockets struck residential districts, and car-bombs exploded near key facilities.

Many people died. But as is common in conflict areas, the reports of the *number* of dead varied by the source of the information. While some agencies reported a surge in casualties in the Aleppo area around June 2013, others did not.

The true number of casualties in conflicts like the Syrian war seems unknowable, but the mission of the Human Rights Data Analysis Group (HRDAG) is to make sense of such information, clouded as it is by the fog of war. They do this not by nominating one source of information as the "best", but instead with statistical modeling of the *differences* between sources.

In a fascinating talk at Strata Santa Clara in February, HRDAG's Director of Research Megan Price explained the statistical technique she used to make sense of the conflicting information. Each of the four agencies shown in the chart above published a list of identified victims. By painstakingly linking the records between the different agencies (no simple task, given incomplete information about each victim and variations in capturing names, ages etc.), HRDAG can get a more complete sense of the total number of casualties. But the real insight comes from recognizing that **some victims were reported by no agency at all**. By looking at the rates at which some known victims were not reported by *all* of the agencies, HRDAG can estimate the number of victims that were identified by *nobody*, and thereby get a more accurate count of total casualties. (The specific statistical technique used was Random Forests, using the R language. You can read more about the methodology here.)

HRDAG is doing a noble and difficult job of understanding the facts of war from incomplete data. "If we base our conclusions about what's happening in Syria on the observed data — on the reporting rates — we get those questions wrong", said Megan in her Strata talk. "When we estimate what is missing, we have a much more accurate estimate of reality."

Strata: Record Linkage and Other Statistical Models for Quantifying Conflict Casualties in Syria

This is the time of year when everyone likes to speculate on the winners of the Academy Awards, to be announced on Sunday. There are plenty of ways to try and predict which movie is going to win Best Picture or who'll win Best Actress. You could look at the various betting markets and see who the speculators are favouring. You could take a look at the predictions from various movie experts. You could base your predictions on the movie "fundamentals": prior awards won, box office receipts, and so forth. If you travel in such circles, you could listen in on the chatter at Hollywood cocktail parties. Or you could even watch all of the nominated movies and decide for yourself.

As Peter Aldhous (a data journalist we've featured in this blog before) reports in Medium, a team of researchers used statistical analysis to evaluate all the possible methods for forecasting the Oscars, by using them to predict the outcomes of the 2013 Academy Awards and comparing the results to the actual outcomes. The conclusion: the predictions from the BetFair betting markets — alone — are the best indicators of the actual outcomes. BetFair even does better than a Nate Silver-style aggregation of the critics' picks on the day before the actual awards (and *way* better than a statistical model based on movie fundamentals), as you can see in the chart below.

You can read the details of the analysis in this paper from Microsoft Research. Author David Rothschild let me know that all the computation for the paper was done in the R language, along with many R packages including plyr, reshape, ggplot2 and data.table. Rothschild uses the BetFair predictions (slightly adjusted so that the total probability of all outcomes adds to 100%) as the basis of the Oscar predictions at the PredictWise website. Click through to see the up-to-the-minute predictions, but the forecasts for the top awards as of this writing, along with their predicted chance of winning, are:

| Category | Predicted winner | Chance of winning |
| --- | --- | --- |
| Best Picture | 12 Years a Slave | 87.4% |
| Best Directing | Alfonso Cuarón (Gravity) | 98.2% |
| Best Actor | Matthew McConaughey (Dallas Buyers Club) | 91.9% |
| Best Actress | Cate Blanchett (Blue Jasmine) | 98.6% |
| Best Supporting Actor | Jared Leto (Dallas Buyers Club) | 97.2% |
| Best Supporting Actress | Lupita Nyong’o (12 Years a Slave) | 59.1% |
| Best Visual Effects | Gravity | 99.8% |
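The "slight adjustment" mentioned above is just a rescaling: divide each raw market-implied probability by their sum so a category's outcomes total 100%. A hypothetical sketch (the numbers are invented):

```r
# Hypothetical raw market-implied probabilities for one category
# (invented numbers; betting odds typically sum to slightly over 1)
raw <- c("12 Years a Slave" = 0.88, "Gravity" = 0.10, "Other" = 0.06)
probs <- raw / sum(raw)   # rescale so the outcomes sum to exactly 1
round(probs, 3)
```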

(On a personal note: I really hope *Gravity* does match the predictions above. It's easily one of the best films I've seen in the last decade, and I'd give it Best Picture as well if I were an Academy member. See it in 3-D if you can.)

You can see predictions for the other categories in Peter Aldhous's article linked below.

**Update March 3**: By my count, the PredictWise predictions (based on the Betfair betting markets) correctly predicted 21 of 24 Oscar winners. Not a bad record!

by Daniel Hanson, QA Data Scientist, Revolution Analytics

As most readers are well aware, market return data tends to have heavier tails than can be captured by a normal distribution; furthermore, skewness will not be captured either. For this reason, a four-parameter distribution such as the Generalized Lambda Distribution (GLD) can give us a more realistic representation of the behavior of market returns, and a source from which to draw random samples to simulate returns.

The R package GLDEX provides a fairly straightforward and well-documented solution for fitting a GLD to market return data. Documentation in PDF form may be found here. An accompanying paper by the author of the GLDEX package, Steven Su, is also available for download; it is a very well-presented overview of the GLD, along with details and examples on using the GLDEX package.

The four parameters of the GLD are, not surprisingly, λ1, λ2, λ3, and λ4. Without going into theoretical details, suffice it to say that λ1 and λ2 are measures of location and scale respectively, and the skewness and kurtosis of the distribution are determined by λ3 and λ4.

Furthermore, there are two forms of the GLD that are implemented in GLDEX, namely those of Ramberg and Schmeiser (1974), and Freimer, Mudholkar, Kollia, and Lin (1988). These are commonly abbreviated as RS and FMKL.

A more detailed theoretical discussion may be found in the paper by Su noted above.
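For readers who want a peek at the math: the RS form is defined through its quantile function, Q(u) = λ1 + (u^λ3 − (1 − u)^λ4)/λ2, which makes inverse-transform sampling trivial. A minimal sketch of my own (not code from the GLDEX package):

```r
# RS-form GLD quantile function: Q(u) = l1 + (u^l3 - (1-u)^l4) / l2
gld_rs_quantile <- function(u, l1, l2, l3, l4) {
  l1 + (u^l3 - (1 - u)^l4) / l2
}

# Inverse-transform sampling: apply Q to uniform draws
set.seed(1)
samples <- gld_rs_quantile(runif(5), l1 = 0, l2 = 1, l3 = 0.1, l4 = 0.1)
samples
```

When λ3 = λ4 the distribution is symmetric about λ1; unequal values introduce skew, and their magnitudes control the tail weight.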

To demonstrate fitting a GLD to financial returns, let’s fetch 20 years of daily closing prices for the SPY ETF which tracks the S&P 500, and then calculate the corresponding daily log returns. Before starting, be sure that you have installed both the GLDEX and quantmod packages; quantmod is used for obtaining the market data.

require(GLDEX)
require(quantmod)

getSymbols("SPY", from = "1994-02-01")
SPY.Close <- SPY[,4]   # Closing prices
SPY.vector <- as.vector(SPY.Close)

# Calculate log returns
sp500 <- diff(log(SPY.vector), lag = 1)
sp500 <- sp500[-1]   # Remove the NA in the first position

Next, let’s use the fun.moments.r(.) function in GLDEX to compute the first four moments of the data:

# Set normalise = "Y" so that kurtosis is calculated with
# reference to kurtosis = 0 under the Normal distribution
fun.moments.r(sp500, normalise = "Y")

which gives us the following:

        mean     variance     skewness     kurtosis
0.0002659639 0.0001539469 -0.0954589371 9.4837879957

Now, let’s fit a GLD to the return data by using the fun.data.fit.mm(.) function:

spLambdaDist <- fun.data.fit.mm(sp500)

Remark: running the above command will result in the following message:

"There were 50 or more warnings (use warnings() to see the first 50)"

where the individual warning messages look like this:

Warning messages:

1: In beta(a, b) : NaNs produced

2: In beta(a, b) : NaNs produced

3: In beta(a, b) : NaNs produced

…

These warnings may be safely ignored. Now, let’s look at the contents of the spLambdaDist object:

> spLambdaDist

RPRS RMFMKL

[1,] 3.753968e-04 3.234195e-04

[2,] -4.660455e+01 2.031910e+02

[3,] -1.673535e-01 -1.694597e-01

[4,] -1.638032e-01 -1.613267e-01

What this gives is the set of estimated lambda parameters λ1 through λ4 for both the RS version (1st column) and the FMKL version (2nd column) of the GLD.

There is also a convenient plotting function in the package that will display the histogram of the data along with the density curves for the RS and FMKL fits:

fun.plot.fit(fit.obj = spLambdaDist, data = sp500, nclass = 100, param = c("rs", "fmkl"), xlab = "Returns")

where fit.obj is our fitted GLD object (containing the lambda parameters), data represents the returns data sp500, nclass is the number of partitions to use for the histogram, param tells the function which models to use (we have chosen both RS and FMKL here), and xlab is the label to use for the horizontal axis.

The resulting plot is as follows here:

Note that, as in the case of the lambda parameters, RPRS refers to the RS representation, and RMFMKL to that of the FMKL.

Now that we have fitted the model, we can generate simulated returns using the rgl(.) function, which will select a random sample for a given parametrization.

In order to use this function, we need to separate out the individual lambda parameters for the RS and FMKL versions of our fitted distributions; the rgl(.) function requires individual input for each lambda parameter, as we will soon see.

lambda_params_rs <- spLambdaDist[, 1]
lambda1_rs <- lambda_params_rs[1]
lambda2_rs <- lambda_params_rs[2]
lambda3_rs <- lambda_params_rs[3]
lambda4_rs <- lambda_params_rs[4]

lambda_params_fmkl <- spLambdaDist[, 2]
lambda1_fmkl <- lambda_params_fmkl[1]
lambda2_fmkl <- lambda_params_fmkl[2]
lambda3_fmkl <- lambda_params_fmkl[3]
lambda4_fmkl <- lambda_params_fmkl[4]

Now, let’s generate a set of simulated returns with approximately the same moments as what we found with our market data. To do this, we need a large number of draws using the rgl(.) function; through some trial and error, n = 10,000,000 gets us about as close as we can with each version (RS and FMKL):

# RS version:
set.seed(100)   # Set seed to obtain a reproducible set
rs_sample <- rgl(n = 10000000, lambda1 = lambda1_rs, lambda2 = lambda2_rs,
                 lambda3 = lambda3_rs, lambda4 = lambda4_rs, param = "rs")

# Moments of simulated returns using RS method:
fun.moments.r(rs_sample, normalise = "Y")

# Moments calculated from market data:
fun.moments.r(sp500, normalise = "Y")

# FMKL version:
set.seed(100)   # Set seed to obtain a reproducible set
fmkl_sample <- rgl(n = 10000000, lambda1 = lambda1_fmkl, lambda2 = lambda2_fmkl,
                   lambda3 = lambda3_fmkl, lambda4 = lambda4_fmkl, param = "fmkl")

# Moments of simulated returns using FMKL method:
fun.moments.r(fmkl_sample, normalise = "Y")

# Moments calculated from market data:
fun.moments.r(sp500, normalise = "Y")

Comparing results for the RS version vs S&P 500 market data, we get:

> fun.moments.r(rs_sample, normalise="Y")
        mean     variance     skewness     kurtosis
2.660228e-04 8.021569e-05 -1.035707e-01 9.922937e+00

> fun.moments.r(sp500, normalise="Y")
        mean     variance     skewness     kurtosis
0.0002659639 0.0001539469 -0.0954589371 9.4837879957

And for FMKL vs S&P 500 market data, we get:

> fun.moments.r(fmkl_sample, normalise="Y")
        mean     variance     skewness     kurtosis
0.0002660137 0.0001537927 -0.1042857096 9.9498480532

> fun.moments.r(sp500, normalise="Y")
        mean     variance     skewness     kurtosis
0.0002659639 0.0001539469 -0.0954589371 9.4837879957

So, while we are reasonably close for the mean, skewness, and kurtosis in each case, we get better results for variance with the FMKL version.

By fitting a four-parameter Generalized Lambda Distribution to market data, we are able to preserve skewness and kurtosis of the observed data; this would not be possible using a normal distribution with only the first two moments available as parameters. Kurtosis, in particular, is critical, as this captures the fat-tailed characteristics present in market data, allowing risk managers to better assess the risk of market downturns and “black swan” events.

We were able to use the GLDEX package to construct a large set of simulated returns having approximately the same four moments as that of the observed market data, from which return scenarios may be drawn for risk and pricing models, for example.
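For instance (a sketch of my own, not from the original post), a one-day 99% VaR can be read from any such simulated return vector. To keep this self-contained, normal draws with the observed S&P mean and variance stand in here for the GLD samples generated above, which would capture the fat tails better:

```r
# Sketch: empirical one-day 99% VaR from a vector of simulated returns.
# Normal draws with the observed mean/variance stand in here for the
# GLD samples generated above (which would capture the fat tails better).
set.seed(42)
sim_returns <- rnorm(100000, mean = 0.000266, sd = sqrt(0.000154))
VaR_99 <- -quantile(sim_returns, 0.01)   # daily loss exceeded 1% of the time
VaR_99
```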

The example shown above only scratches the surface of how the GLD can be utilized in computational finance. For a more in-depth discussion, this paper by Chalabi, Scott, and Wurtz is one good place to start.

In the video below from *The Atlantic*, the differences in the way US citizens describe or pronounce various things is illustrated in a series of phone calls (via Sullivan):

If you're wondering how your dialect fits in, you can try the New York Times Dialect Quiz. Answer 25 questions, and it will identify the 3 US cities that most closely match your dialect. I'm not a native US English speaker (I grew up in Australia, and spent many years in the UK before moving to the US West Coast in 2000), so I basically flunked the quiz. My words for a freshwater crayfish ("yabbie") or that area of grass by the road ("nature strip") weren't on the list, and so I got placed somewhere in southern Florida (which I guess is at least as far South as you can go!). Judging from the responses from some of my friends on Facebook, though, the quiz can be uncannily accurate if you were brought up in the US.

By the way, the background images in the video and the NYT dialect quiz are both based on the work of Joshua Katz, and developed in the R language. The analysis was originally developed at NC State University, and Joshua Katz is now a graphics editor at the New York Times. (**Update March 10**: The Dialect Quiz broke traffic records at the New York Times website.)

That's all for this week — enjoy your weekend!

The BitCoin cryptocurrency has been much in the news of late. What, you don't have BitCoins? (Don't worry, neither do I.) Unless you have a supercomputer in your back yard and a cheap source of power, it's no longer really feasible to mine them yourself. But if you want some, several online exchanges will let you buy BitCoins for real money. But be warned: the price of BitCoins has been wildly volatile over the last year or so, so it's not really clear whether buying BitCoins would be a good long term investment.

But what if you could make money with BitCoins, without having to hold *any* over the long term? Here's one way you could do it, right now:

- Start with $100 US Dollars (USD)
- Convert your USD$100 to 10609 Japanese Yen (JPY)
- Buy 0.82963 BitCoins with your JPY10609 on the MtGox exchange
- Sell your 0.82963 BitCoins for $113.42 USD
- Profit!! (to the tune of $13.42)
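The round trip above is simple arithmetic. Here is a sketch using per-unit prices implied by the example's numbers (the rates are illustrative, not live data):

```r
# Per-unit prices implied by the example above (illustrative, not live data)
usd_jpy <- 106.09               # JPY per USD
btc_jpy <- 10609 / 0.82963      # JPY per BTC on the JPY exchange
btc_usd <- 113.42 / 0.82963     # USD per BTC on the USD exchange

usd_start <- 100
jpy <- usd_start * usd_jpy      # 10609 JPY
btc <- jpy / btc_jpy            # 0.82963 BTC
usd_end <- btc * btc_usd        # 113.42 USD
usd_end - usd_start             # 13.42 profit
```

An arbitrage exists whenever the product of the three exchange rates around the loop exceeds 1; the script described below searches for exactly that condition in real time.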

Now, exchange rates (especially BitCoin exchange rates) vary all the time, so you'll need to do a real-time arbitrage analysis to find a profitable sequence of trades at any given moment. R programmer Tom Johnson showed at the Bay Area R User Group an R script he wrote to do exactly that, pulling real-time foreign-exchange rates from Quandl (with the Quandl package for R) and solving the necessary equations to find the arbitrage opportunity. He's wrapped this R script into an easy-to-use Shiny app, so you can find BitCoin arbitrage opportunities via JPY and USD at any time:

So why isn't everyone rushing to MtGox to profit from the "Sushi-Burger Shuffle"? Well, as with most free-money schemes, real-life problems intervene. The main issue is that BitCoin currency exchanges routinely take upwards of 10 minutes to clear, by which time the exchange rates may have changed — eliminating the profit opportunity, or even leading to a loss. And that's on a *good* day: in recent days, it's been difficult to access BitCoin exchanges at all. So this is more of an interesting puzzle than a real money-making opportunity. Nonetheless, it's also a great example of real-time financial analysis using R.

Tom Johnson: Best 2-Currency Arbitrage with Bitcoin

By Jay Emerson and Mike Kane

We’re very happy to announce our recent publication with Steve Weston in the Journal of Statistical Software (JSS), “Scalable Strategies for Computing with Massive Data”, JSS Volume 55 Issue 14. In a nutshell:

This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the **foreach** package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP) machine, or in cluster environments without platform-specific code. Second, the **bigmemory** package implements memory- and file-mapped data structures that provide (a) access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b) data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware.
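The foreach pattern described in the first point looks like this in practice. This is a minimal sketch of my own, assuming the foreach package (and, optionally, doParallel) is installed:

```r
library(foreach)

# With no parallel backend registered, %dopar% falls back to sequential
# execution (with a warning); %do% is always sequential.
squares <- foreach(i = 1:5, .combine = c) %do% i^2
squares   # 1 4 9 16 25

# Registering a backend makes the very same loop run in parallel:
# library(doParallel)
# registerDoParallel(cores = 2)
# squares <- foreach(i = 1:5, .combine = c) %dopar% i^2
```

The point of the design is that the loop body never changes: only the registered backend determines whether it runs sequentially, on SMP cores, or on a cluster.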

We also welcome Pete Haverty from Genentech as an author on the **bigmemory** package. Pete and his colleagues at Genentech have made some substantial improvements to the package and are some of the heaviest users of these extensions (at least, to the best of our knowledge).

Secondly, we’d like to announce a new package, **BH**, with lead author and maintainer Dirk Eddelbuettel (he of **Rcpp** fame, also the first user and constructive critic of **bigmemory**). **BH** contains a subset of Boost headers used by **bigmemory** and other packages, some in active development and not yet on CRAN:

Boost provides free peer-reviewed portable C++ source libraries. A large part of Boost is provided as C++ template code, which is resolved entirely at compile-time without linking. This package aims to provide the most useful subset of Boost libraries for template use among CRAN packages. By placing these libraries in this package, we offer a more efficient distribution system for CRAN as replication of this code in the sources of other packages is avoided.
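Because the headers are resolved at compile time, a package picks them up simply by listing BH in its DESCRIPTION file; a sketch of the relevant fields (the surrounding package is hypothetical):

```
Imports: Rcpp
LinkingTo: Rcpp, BH
```

No entry in Depends or Imports is needed for BH itself, since there is no compiled code or R namespace to load at run time.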

New libraries from Boost may be included upon request (though we limit it to headers only, with no compiled code). Please visit our new Github site for more information.

Finally, we’d like to call attention to a change in JSS software license policy. With the publication of “Scalable Strategies for Computing with Massive Data” JSS now accepts software licensed under either GPL-2 or GPL-3. GPL-3 in turn is compatible with Apache-2.0, and all these licenses are compatible with Boost’s very permissive BSL-1.0 license. This should help to broaden the software contributions documented and reviewed in JSS, and we are grateful to the Editors of JSS for this shift in policy.

by Michael Helbraun

*Michael is a member of Revolution Analytics' Sales Support team. In the following post, he shows how to synthesize a probability distribution from the opinions of multiple experts: an excellent way to construct a Bayesian prior.*

There are lots of different ways to forecast. Depending on whether there’s historical data, trend, or seasonality, you might choose to start with a particular technique. Assuming good domain expertise, one effective method is to combine expert opinion via Monte Carlo simulation to generate a stochastic forecast. While this example is set up to combine 3 different people’s perspectives of what the number might be, this technique could also be used to combine domain expertise with traditional analytic techniques like time series, regression, neural networks, etc.
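The technique can also be sketched in plain base R (using the triangle package and invented expert inputs; Michael's actual code, which uses RevoScaleR for parallelism, appears at the end of the post):

```r
library(triangle)   # rtriangle(n, a = min, b = max, c = mode)

set.seed(1)
trials <- 1000

# Invented inputs: each expert gives min, most-likely, max, and a weight
experts <- data.frame(Min        = c(80, 90, 70),
                      MostLikely = c(100, 110, 95),
                      Max        = c(130, 140, 120),
                      Weighting  = c(2, 1, 1))

# One triangular sample per expert per trial (trials x 3 matrix)
draws <- sapply(1:nrow(experts), function(i)
  rtriangle(trials, a = experts$Min[i], b = experts$Max[i],
            c = experts$MostLikely[i]))

# For each trial, keep one expert's draw, chosen with probability
# proportional to that expert's weight
pick <- sample(1:nrow(experts), trials, replace = TRUE,
               prob = experts$Weighting / sum(experts$Weighting))
merged <- draws[cbind(1:trials, pick)]
summary(merged)
```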

First we grab some estimates from our three experts:

Next we generate triangular distributions based on each of our expert’s opinions; we then randomly select one value from each trial:

The end result – a nicely merged stochastic estimate:

*Michael's code (below) uses Revolution's RevoScaleR library. Notice that the rxSetComputeContext() function instructs the computer to set up for parallel computation using the resources on the local machine, and the rxExec() function executes the rtriangle() function in parallel. By just changing the compute context, this same code could run in parallel using all of the resources of an LSF or Hadoop cluster.*

###############################################################################
##                                                                           ##
##   Revolution R Enterprise - MCS Forecasting, combining expert opinion    ##
##                                                                           ##
###############################################################################

# Clear out memory for a fresh run and load required packages
rm(list = ls())
library(triangle); library(distr); library(ggplot2)

# read input parameters
bigDataDir <- "C:/Data/Demos/Datasets"
bigDataDir <- "C:/..."
inDataFile <- file.path(bigDataDir, "/Expert Estimates.csv")
expertOpinion <- rxImport(inData = inDataFile)
View(expertOpinion)

# Set simulation parameters
trials <- 1000
rxOptions(numCoresToUse = -1)
rxSetComputeContext("localpar")

# create individual triangular distributions
orderedTri <- function(expertNum, trials) {
  revoFcast <- rxExec(FUN = rtriangle, timesToRun = 1, n = trials,
                      a = expertOpinion$Min[expertNum],
                      b = expertOpinion$Max[expertNum],
                      c = expertOpinion$MostLikely[expertNum],
                      packagesToLoad = "triangle")
  return(revoFcast)
}

# create a distribution for each of our experts
revoFcast <- NA
for (i in 1:nrow(expertOpinion)) {
  if (is.na(revoFcast)) {
    revoFcast <- orderedTri(i, trials)
  } else {
    revoFcast <- c(revoFcast, orderedTri(i, trials))
  }
}

# prepare the results
revoFcast <- data.frame(revoFcast)
names(revoFcast) <- paste("Expert", 1:nrow(expertOpinion), sep = "")

# ensure that the results are uncorrelated
cor(revoFcast)

# create a combined probability distribution and select a forecast value
# from the probability-weighted distribution
combinedDist <- function(trialNum) {
  cDist <- DiscreteDistribution(supp = as.double(revoFcast[trialNum, ]),
                                prob = expertOpinion$Weighting / sum(expertOpinion$Weighting))
  rD <- r(cDist)   # function to generate values from the distribution
  return(rD(1))    # generate/select 1 value
}
merged <- rxExec(FUN = combinedDist, trialNum = rxElemArg(c(1:trials)),
                 execObjects = c("revoFcast", "expertOpinion"),
                 packagesToLoad = "distr")

# add the forecast to our working data set
merged <- data.frame(merged)
names(merged) <- NULL
revoFcast$merged <- t(merged)

# chart the output
View(revoFcast)   # Look at our combined data set

# restructure the data for plotting
histVals <- data.frame(Value = c(revoFcast$Expert1, revoFcast$Expert2,
                                 revoFcast$Expert3, revoFcast$merged),
                       Source = rep(c("Expert1", "Expert2", "Expert3",
                                      "Merged Opinion"), each = trials))
names(histVals) <- c("Value", "Source")

# draw our combined plot
ggplot(histVals, aes(Value, fill = Source)) +
  geom_density(alpha = 0.25) +
  ggtitle("Combined Expert Opinion")


*Download Expert Estimates, the small data file used to drive Michael's simulation.*