*by Subhadeep (Deep) Mukhopadhyay and Douglas Fletcher, Department of Statistical Science, Temple University*

Bayesians and Frequentists have long been ambivalent toward each other. The concept of a “prior” remains the center of this 250-year-old tug-of-war: frequentists view the prior as a weakness that can cloud the final inference, whereas Bayesians view it as a strength for incorporating expert knowledge into the data analysis. So the question naturally arises: how can we develop a Bayes-frequentist consolidated data-analysis workflow that enjoys the best of both worlds?

To develop a “defendable and defensible” Bayesian learning model, we have to go beyond blindly ‘turning the crank’ on a “go-as-you-like” [approximate guess] prior. A lackluster attitude toward prior modeling can lead to disastrous inference, impacting fields from clinical drug development to presidential election forecasting. The real questions are: How can we uncover the blind spots of a conventional wisdom-based prior? How can we develop a science of prior model-building that combines both data and science [DS-prior] in a *testable* manner – a double-yolk Bayesian egg? Unfortunately, these questions fall outside the scope of the business-as-usual Bayesian modus operandi and require new ideas, which is the goal of the paper “Bayesian Modeling via Goodness of Fit”. In the following, we demonstrate how to prepare the “Bayesian omelet” — the operational part — using the R package `BayesGOF`.

Our model-building approach proceeds sequentially as follows:

- it starts with a scientific (or empirical) parametric prior \(g(\theta;\alpha,\beta)\),
- inspects the adequacy and the remaining uncertainty of the elicited prior using a graphical exploratory tool,
- estimates the necessary “correction” for assumed \(g\) by looking at the data,
- generates the final statistical estimate \(\hat \pi(\theta)\), and
- executes macro and micro-level inference.

Our algorithmic solution yields answers to all five phases using *one single* algorithm, which we will now demonstrate for the rat tumor data. The `rat` tumor data consist of observations of endometrial stromal polyp incidence in \(k=70\) groups of rats. For each group, \(y_i\) is the number of rats with polyps and \(n_i\) is the total number of rats in the experiment. The dataset is available in the R package `BayesGOF`.

The Rat-data model: \(y_i\,\overset{{\rm ind}}{\sim}\,\mbox{Binomial}( n_i, \theta_i)\), \(i=1,\ldots,k\), where the unobserved parameters \(\theta=(\theta_1,\ldots,\theta_k)\) are independent realizations from the *unknown* \(\pi(\theta)\).
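To see this two-level sampling scheme concretely, the following sketch (illustrative only, not package code) simulates data from the model, with a hypothetical \(\text{Beta}(2, 14)\) density standing in for the unknown \(\pi(\theta)\) and made-up group sizes:

```r
# Illustrative simulation of the two-level rat-tumor model.
# A hypothetical Beta(2, 14) density stands in for the unknown pi(theta).
set.seed(1)
k     <- 70
n     <- rep(20, k)                          # made-up group sizes
theta <- rbeta(k, shape1 = 2, shape2 = 14)   # theta_i drawn from the stand-in prior
y     <- rbinom(k, size = n, prob = theta)   # y_i ~ Binomial(n_i, theta_i)
```

The inferential task below runs in the opposite direction: we observe only the \((y_i, n_i)\) pairs and must recover \(\pi(\theta)\).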

**Step 1.** We begin by finding the starting parameter values for the parametric conjugate \(g \equiv \text{Beta}(\alpha, \beta)\):

```
library(BayesGOF)
set.seed(8697)
data(rat)
### Use MLE to determine starting values
rat.start <- gMLE.bb(rat$y, rat$n)$estimate
```

We use these starting parameter values to run the main `DS.prior` function:

`rat.ds <- DS.prior(rat, max.m = 6, rat.start, family = "Binomial")`

Next we will discuss how to interpret and use this `rat.ds` object for exploratory Bayes modeling and prior uncertainty quantification.

**Step 2.** We display the U-function to quantify and characterize the uncertainty of the a priori selected \(g\):

`plot(rat.ds, plot.type = "Ufunc")`

The deviations from the uniform distribution (the red dashed line) indicate that our initial selection for \(g\), \(\text{Beta}(\alpha = 2.3,\ \beta = 14.1)\), is incompatible with the observed data and requires repair; the data indicate that there are, in fact, two distinct groups of tumor incidence among the rats.

**Step 3a.** Extract the parameters for the nonparametrically *corrected* prior \(\hat{\pi}\):

`rat.ds`

```
## $g.par
## alpha beta
## 2.304768 14.079707
##
## $LP.coef
## LP1 LP2 LP3
## 0.0000000 0.0000000 -0.5040361
```

Therefore, our estimated DS\((G,m)\) prior is given by: \[\hat{\pi}(\theta) = g(\theta; \alpha,\beta)\Big[1 - 0.50\,T_3(\theta;G) \Big].\]

The DS-prior has a unique two-component structure that combines the parametric \(g\) with a nonparametric correction \(d\) (which we call the *U-function*). Here \(T_j(\theta;G)\), \(j = 1,\ldots,m\), form a specialized orthonormal basis given by \(\text{Leg}_j[G(\theta)]\), members of the LP class of rank polynomials. Note that \({\rm DS}(G,m=0) \equiv g(\theta;\alpha,\beta)\). The truncation point \(m\) reflects the *concentration* of the true unknown \(\pi\) around the pre-selected \(g\).
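To make this structure concrete, here is a minimal hand computation of the corrected density (a sketch that assumes \(T_3\) is the standard orthonormal shifted Legendre polynomial on \([0,1]\) and plugs in the values reported in Step 3a; the package handles all of this internally):

```r
# Sketch: evaluate the DS(G, m) prior density by hand.
# Assumes T_3 is the orthonormal shifted Legendre polynomial on [0, 1];
# alpha, beta, and the coefficient -0.504 are taken from the Step 3a output.
alpha <- 2.304768; beta <- 14.079707
T3 <- function(u) sqrt(7) * (20 * u^3 - 30 * u^2 + 12 * u - 1)
ds.prior <- function(theta) {
  u <- pbeta(theta, alpha, beta)                 # u = G(theta)
  dbeta(theta, alpha, beta) * (1 - 0.504 * T3(u))
}
# Sanity checks: T3 integrates to 0 over [0, 1], so ds.prior still
# integrates to 1; setting the coefficient to 0 recovers g itself.
```

Because the correction term integrates to zero against \(G\), the repaired \(\hat{\pi}\) remains a proper density.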

**Step 3b.** Plot the estimated DS prior \(\hat{\pi}\) along with the original parametric \(g\):

`plot(rat.ds, plot.type = "DSg")`

### MacroInference

The term "MacroInference" aims to answer the following question: how can we combine the \(k\) binomial parameters to arrive at an overall, macro-level aggregated statistical behavior of \(\theta_1,\ldots,\theta_k\)? This is often important in applied analysis, as the limited sample size of a single study rarely provides adequate evidence for a definitive conclusion.

**Step 4.** Here we are interested in the *overall* macro-level inference by combining the \(k=70\) parallel studies. The group-specific modes along with their SEs can be computed as follows:

```
rat.macro.md <- DS.macro.inf(rat.ds, num.modes = 2, iters = 25, method = "mode")
rat.macro.md
```

```
## 1SD Lower Limit Mode 1SD Upper Limit
## [1,] 0.0161 0.0340 0.0520
## [2,] 0.1442 0.1562 0.1681
```

`plot(rat.macro.md)`

### MicroInference

"MicroInference" refers to the process of using information from historical studies to improve the estimates of one or more studies of particular interest; this is known as “borrowing strength” in the Bayesian inference literature. It is worth noting that classical Stein shrinkage does not work for the `rat` data due to the presence of multiple partially exchangeable studies. Our adaptive (or selective) shrinkage technology selectively borrows strength from ‘similar’ experiments in an automated manner, by answering the important question: *where to shrink?*
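For contrast, here is what purely parametric conjugate shrinkage looks like: under \(g\) alone, every study's posterior mean is pulled toward the single prior mean \(\alpha/(\alpha+\beta)\), regardless of which incidence group it belongs to (a minimal sketch using the fitted values from Step 3a; this is not the package's adaptive method):

```r
# Uniform conjugate shrinkage under g = Beta(alpha, beta) alone:
# the posterior mean of theta_i is (y_i + alpha) / (n_i + alpha + beta),
# shrunk toward the common prior mean alpha / (alpha + beta).
alpha <- 2.304768; beta <- 14.079707
post.mean <- function(y, n) (y + alpha) / (n + alpha + beta)
post.mean(4, 14)          # a study with y = 4 of n = 14: about 0.21
alpha / (alpha + beta)    # the single shrinkage target: about 0.14
```

This single-target shrinkage ignores the bimodality uncovered in Step 2, which is precisely what an adaptive scheme needs to account for.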

**Step 5.** In addition to the earlier \(k=70\) studies for the rat tumor data, we have a current experimental study that shows \(y_{71}=4\) out of \(n_{71}=14\) rats developed tumors. The following code performs the desired microinference for \(\theta_{71}\) (posterior distribution along with its mean and mode):

```
rat.y71.micro <- DS.micro.inf(rat.ds, y.0 = 4, n.0 = 14)
rat.y71.micro
```

```
## Posterior summary for y = 4, n = 14:
## Posterior Mean = 0.1897
## Posterior Mode = 0.1833
## Use plot(x) to generate posterior plot
```

The left plot (a) compares the posterior distributions for the parametric \(g\) (blue) and the DS posterior (red). The right plot (b) compares our adaptive shrinkage with Stein’s estimates. The vertical red triangles indicate the modes of the DS prior, while the blue triangle is the mode of the parametric \(g\). For additional real-data examples, please see the package vignette.

### Conclusion

Almost all modern scientific research utilizes domain knowledge and data to come up with breakthrough results. But the fundamental problem of how to fuse this “approximate” scientific prior knowledge with the data at hand has not been settled even 250 years after the discovery of Bayes’ law (for more details, see “Bayes’ Theorem in the 21st Century” and “Statistical thinking for 21st century scientists”). Bayesian modeling via goodness-of-fit technology, synthesized in the R package `BayesGOF`, allows us to determine a *scientific* prior that is consistent with the *data* at hand, in a systematic and principled way.

Thank you for this wonderful article. I read Bradley Effort’s article “Bayes’ Theorem in the 21st Century” that you recommend, but I had a problem following his reasoning and would greatly appreciate your help.

Mr. Effort presents a case where 6,033 genes are examined and 28 of them have a z-score higher than 3.4. He then makes the following statements: **“The determining fact here is that if indeed all the genes were Null, we would expect only 2.8 z-values exceeding 3.40, that is, only 10% of the actual number observed. This brings us back to Bayes. Another interpretation of the FDR algorithm is that the Bayesian probability of nullness given a z-value exceeding 3.40 is 10%.”**

To begin with, the probability of z > 3.4 is about 0.00034 under the standard normal distribution. Therefore we would expect on average 6,033 × P(Z > 3.4) ≈ 2.03 genes out of the 6,033 to have z > 3.4, assuming all of them were Null. Why does Mr. Effort raise this number to 2.8?

Now let’s assume that 2.8 is the correct value and continue with Mr. Effort’s argument. Mr. Effort says that the “Bayesian probability of nullness given a z-value exceeding 3.40 is 10%.” How does he define the Bayesian probability in this case? In general, Bayes’ Theorem states that P(A|B) = P(A) P(B|A)/P(B), where A and B are events. How does he define A and B in this context, and how does he assign probabilities to them?

I took a different tack and reasoned as follows: given the Null to be true, the probability of an observation with z > 3.4 is p = 0.00034. Then the number of observations with z > 3.4 under the Null, when the sample size is 6,033, can be viewed as the number of successes in 6,033 trials where each success has probability 0.00034. This number is modeled by the Binomial distribution with p = 0.00034 and size = 6,033. Using the Binomial distribution one can show that the probability of observing 28 or more cases with z > 3.4 in a sample of 6,033 genes is practically zero and certainly much less than 10%.
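A quick numerical check of this reasoning in R, using the one-sided tail of the theoretical N(0,1) null as in the comment above:

```r
# Expected number of null genes with z > 3.4, and the binomial tail
# probability of seeing 28 or more such genes if all 6,033 were null.
p        <- pnorm(3.4, lower.tail = FALSE)   # P(Z > 3.4), about 3.4e-04
expected <- 6033 * p                          # about 2.03
tail.prob <- pbinom(27, size = 6033, prob = p, lower.tail = FALSE)
c(expected = expected, tail.prob = tail.prob) # tail.prob is essentially 0
```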

I would greatly appreciate your response.

Posted by: Alexander Konstantinidis | March 07, 2018 at 01:55

I am sorry, I have misspelled Mr. Efron's name as Effort.

Posted by: Alexander Konstantinidis | March 07, 2018 at 01:57

Dear Alexander, to make a long story short: the local fdr value near z = 3.4 is `locfdr(z, nulltype = 0)$fdr` ≈ 0.2; thus, the tail-area Fdr value is half of it (see Exercise 2.3: http://statweb.stanford.edu/~ckirby/brad/LSI/chapter2.pdf).
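In code, the tail-area Fdr behind the quoted 10% figure is just the ratio of expected null exceedances to observed exceedances (taking Prof. Efron's 2.8 as given):

```r
# Tail-area false discovery rate: expected null exceedances over observed.
expected.null <- 2.8   # Efron's expected count of null z-values above 3.40
observed      <- 28    # z-values actually observed above 3.40
Fdr <- expected.null / observed
Fdr                    # 0.1, i.e. Prob(null | z > 3.40) is about 10%
```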

However, the main message of Prof. Efron, I guess, is something much deeper than just computing some FDR numbers (for mere thresholding!). There is something much more serious at stake: How can we operationalize Bayes’ theorem for data analysis in the 21st century? How can we distill a sensible prior that we can defend? How can we uncover the blind spots of a conventional wisdom-based prior? A casual “go-as-you-like” attitude in prior-building can potentially undermine the statistical findings as a whole (compare the accuracy of Nate Silver’s [Bayesian] presidential election forecasts in 2008 and 2016).

Posted by: Deep Mukhopadhyay | March 08, 2018 at 06:56