Bayesian Statistics

In this chapter, we give a concise and necessarily incomplete overview of Bayesian statistics. Our intention is to cover the bare minimum and to introduce those quantities that appear in BAT. As a toolkit, BAT allows the user complete freedom in modeling. There is a vast literature on the theory, applications, and modeling; see, for example, [8], [7], [14], [10], [3], [9], [5] for comprehensive overviews.

Basic terminology

Bayesian statistics is a framework for quantitative inductive or plausible reasoning; i.e., the optimal processing of incomplete information. The basic idea is to associate a degree of belief with logical propositions; e.g., "the mass of the Higgs boson is 125 GeV". From fundamental axioms about reasoning, one can then show that the calculus of degrees of belief is simply the ordinary calculus of probability theory; see [8] for a thorough discussion. Deductive reasoning is included as the limiting case in which the degree of belief is either 0 or 1.

From now on, we will use the symbol \(P(A)\) to denote both the degree of belief in proposition \(A\) and the probability of \(A\). In our applications below, \(A\) is often one value out of a continuum, so we use \(P(A)\) also to denote the probability density of \(A\). The conditional probability of \(A\) given \(B\) is \(P(A \cond B)\).

The two central tasks of the natural sciences are to learn about nature from data and to make predictions for (future) experiments. A model \(M\) is a proxy for all discrete pieces of information relevant to calculating the degree of belief. The model can contain parameters \(\vecth\) that take on values in a continuum, perhaps subject to constraints such as \(\scath_1 \geq 0\). Bayesian reasoning provides an update rule to adjust the degree of belief based on new information available in the form of observed data \(D\). This update rule is the celebrated Bayes' theorem:

\begin{align} \label{eq:bayes-thm} \boxed{ P(\vecth \cond D, M) \propto P(D|\vecth, M) P(\vecth \cond M)} \end{align}

\(P(\vecth \cond M)\) is the prior density. \(P(D|\vecth, M)\) is called the probability of the data when treated as a function of \(D\) for fixed \(\vecth\), and the likelihood when considered as a function of \(\vecth\) for fixed \(D\). The model-dependent normalization constant is known as the evidence or marginal likelihood:

\begin{equation} \label{eq:evidence} Z = \int \rmdx{ \vecth} P(D|\vecth, M) P(\vecth \cond M). \end{equation}

Finally, the left-hand side \(P(\vecth | D, M)\) is the posterior density. Prior and posterior (the word density is usually omitted) represent the state of knowledge about the parameter \(\vecth\) before and after seeing the data, respectively. Note that \(\vecth\) appears on opposite sides of the conditioning bar | in \(P(D|\vecth,M)\) and \(P(\vecth|D,M)\); this is why Bayes' theorem is also known as the theorem of inverse probability.
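As a minimal illustration (the specific choice of likelihood and prior here is ours, not BAT's), consider a single measurement \(D\) of a quantity \(\scath\) with known Gaussian uncertainty \(\sigma\) and a flat prior on the range \([\scath_{\min}, \scath_{\max}]\):

\begin{align} P(D|\scath, M) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(D-\scath)^2/(2\sigma^2)}, \qquad P(\scath \cond M) = \frac{1}{\scath_{\max} - \scath_{\min}}, \qquad P(\scath \cond D, M) \propto e^{-(\scath - D)^2/(2\sigma^2)}. \end{align}

The posterior is a Gaussian centered on the observed value, truncated to the prior range; the evidence \(Z\) is obtained by integrating the product of likelihood and prior over \([\scath_{\min}, \scath_{\max}]\), as in the equation above.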

Marginalization

Suppose there are two parameters, \(\vecth = (\scath_1, \scath_2)\), and \(\scath_1\) is the parameter of interest whereas \(\scath_2\) is a nuisance parameter. In Bayes' theorem, there is no fundamental distinction between parameters of interest and nuisance parameters; they are all just parameters. But often the goal of the analysis is to extract the posterior of \(\scath_1\), while \(\scath_2\) is only needed at an intermediate stage, for example in order to correctly model the measurement process of \(D\). Starting from the joint posterior \(P(\scath_1, \scath_2 | D)\), we remove the dependence on \(\scath_2\) by integration and obtain the marginalized posterior

\begin{align} \label{eq:marginal} P(\scath_1 | D) = \int \rmdx{\scath_2} P(\scath_1, \scath_2 | D). \end{align}
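For instance (a purely illustrative case), if the joint posterior happens to be a bivariate Gaussian with mean \(\vec{\mu} = (\mu_1, \mu_2)\) and covariance matrix \(\Sigma\) built from standard deviations \(\sigma_1, \sigma_2\) and correlation \(\rho\), the integral can be done analytically and yields again a Gaussian,

\begin{align} P(\scath_1 | D) = \int \rmdx{\scath_2}\, \mathcal{N}\!\left((\scath_1, \scath_2); \vec{\mu}, \Sigma\right) = \mathcal{N}(\scath_1; \mu_1, \sigma_1^2), \end{align}

independent of \(\rho\). In realistic models the integral is rarely tractable analytically, which is why BAT estimates marginal distributions numerically, e.g. with Markov chain Monte Carlo.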

Model comparison

If there is only a single model under consideration, and no potential for confusion, the model label \(M\) is implied and usually omitted from the equations. But suppose that there are two competing models, \(M_1\) and \(M_2\), with parameters \(\vecth_1\) and \(\vecth_2\), that quantitatively predict the outcome \(D\) of an experiment. The task is to find the model with the higher degree of belief. Using Bayes' theorem, the posterior odds of the models are easily found as

\begin{equation} \label{eq:post-odds} \frac{P(M_1|D)}{P(M_2|D)} = B_{12} \cdot \frac{P(M_1)}{P(M_2)}, \end{equation}

where the Bayes factor of \(M_1\) versus \(M_2\), \(B_{12}\), is just the ratio of the evidences

\begin{equation} \label{eq:Bayes-factor} B_{12}= \frac{P(D|M_1)}{P(D|M_2)} = \frac{Z_1}{Z_2} = \frac{\int \rmdx{\vecth_1} P(D|\vecth_1, M_1) P(\vecth_1 \cond M_1)} {\int \rmdx{ \vecth_2} P(D|\vecth_2, M_2) P(\vecth_2 \cond M_2)}. \end{equation}

The prior odds \(P(M_1)/P(M_2)\) represent the relative degree of belief in the models, independent of the data. The Bayes factor quantifies the relative shift of the degree of belief induced by the data. In general, \(\dim \vecth_1 \ne \dim \vecth_2\), and without loss of generality let \(\dim \vecth_1 < \dim \vecth_2\). The Bayes factor automatically penalizes \(M_2\) for its larger complexity, as the prior mass is spread out over a higher-dimensional volume. However, this can be compensated if the likelihood \(P(D|\vecth_2, M_2)\) is significantly higher in regions of reasonably high prior density; i.e., the Bayes factor implements Occam's razor: the simplest model that describes the observations is preferred.
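As a toy illustration with made-up numbers (not a BAT example): let \(D\) be the observation of 5 heads in 10 coin tosses, let \(M_1\) fix the head probability at \(1/2\) (no free parameter), and let \(M_2\) assign a flat prior to the head probability \(p\) on \([0,1]\). Then

\begin{align} Z_1 = \binom{10}{5} \left(\frac{1}{2}\right)^{10}, \qquad Z_2 = \binom{10}{5} \int_0^1 \rmdx{p}\, p^5 (1-p)^5 = \binom{10}{5} \frac{5!\,5!}{11!}, \qquad B_{12} = \frac{Z_1}{Z_2} = \frac{11!}{5!\,5!\,2^{10}} \approx 2.7, \end{align}

so the simpler model is mildly favored even though the data are perfectly compatible with both: \(M_2\) pays for spreading its prior over values of \(p\) that do not describe the data well.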

Goodness of fit

In the Bayesian approach, there is, however, no straightforward answer to the following question: if there is only one model at hand, how does one decide whether that model is sufficient to explain the data, or whether the search for a better model needs to continue? The standard procedure to tackle this problem of evaluating the goodness of fit is to define a test statistic \(T=T(D)\) and to evaluate the following tail-area probability, that is, the \(p\) value

\begin{align} \label{eq:pvalue} p \equiv \int_{T>T_{\rm obs}} \rmdx{T} P(T|M). \end{align}
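A common (but not mandatory) choice for \(N\) independent measurements \(d_i\) with Gaussian uncertainties \(\sigma_i\) and model prediction \(f(x_i; \vecth)\) is the \(\chi^2\) evaluated at a fixed parameter point, e.g. the posterior mode \(\hat{\vecth}\),

\begin{align} T(D) = \sum_{i=1}^{N} \frac{\left(d_i - f(x_i; \hat{\vecth})\right)^2}{\sigma_i^2}, \end{align}

for which \(P(T|M)\) is approximately a \(\chi^2\) distribution; the appropriate number of degrees of freedom requires some care [1].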

Care has to be taken in the usage, computation, and interpretation of \(p\) values. An introduction to a Bayesian interpretation of \(p\) values with applications in BAT is available in [1]; see also the references therein.

Representation in BAT

In BAT, a model \(M\) is represented as a C++ subclass of BCModel. The crucial parts are to define the likelihood \(P(D \cond \vecth, M)\) in BCModel::LogLikelihood, the prior \(P(\vecth \cond M)\) in BCModel::LogAPrioriProbability, and the parameters \(\vecth\). From these quantities, BAT can compute the unnormalized posterior \(P(\vecth \cond D, M)\) via BCModel::LogProbabilityNN. To avoid numerical overflow, BAT operates on the log scale whenever possible. The key methods of BCModel that a user has to implement are

virtual double LogLikelihood(const std::vector<double>& params)
virtual double LogAPrioriProbability(const std::vector<double>& params)
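To make this concrete, the following is a minimal sketch of such a subclass, assuming one parameter, a Gaussian likelihood for a handful of made-up measurements, and a flat prior; the class name GaussModel, the data values, and the uncertainty are purely illustrative and not part of BAT:

#include <BAT/BCModel.h>

#include <cmath>
#include <string>
#include <vector>

class GaussModel : public BCModel
{
public:
    GaussModel(const std::string& name)
        : BCModel(name),
          fData{4.8, 5.1, 5.3},  // hypothetical measurements
          fSigma(0.5)            // assumed known uncertainty
    {
        // one real, continuous parameter with support [0, 10]
        AddParameter("mu", 0., 10., "#mu", "[arb. units]");
    }

    // log of the likelihood P(D | mu, M): sum of log-Gaussian terms
    double LogLikelihood(const std::vector<double>& pars)
    {
        const double mu = pars[0];
        double logL = 0.;
        for (double d : fData)
            logL += -0.5 * std::pow((d - mu) / fSigma, 2)
                    - std::log(fSigma * std::sqrt(2 * M_PI));
        return logL;
    }

    // log of the prior P(mu | M): flat on the parameter range [0, 10]
    double LogAPrioriProbability(const std::vector<double>& /*pars*/)
    {
        return -std::log(10.);  // log of 1 / (range width)
    }

private:
    std::vector<double> fData;
    double fSigma;
};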

The parameter values are passed to the likelihood and the prior simply as numbers; all parameters are assumed to be real and continuous. Discrete parameters are not supported. The support of \(\vecth\) is a hyperrectangle whose bounds are given by the bounds of the individual parameters, specified when they are added to the model with BCModel::AddParameter

bool BCModel::AddParameter(const std::string& name, double min, double max,
const std::string& latexname = "",
const std::string& unitstring = "")

The optional latexname and unitstring are used only for labeling plot axes; their intended purpose is to pretty up plots. For example, a parameter theta representing a time measured in seconds is defined as

AddParameter("theta", 0, 1, "#theta", "[s]");

and whenever theta appears on the axis of a plot, it will appear as \(\theta\) [s]. Note that the plots are created with ROOT, so the latexname has to be in ROOT syntax, which is basically LaTeX syntax with the backslash \ replaced by the hash #.
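To illustrate where these labels end up, here is a hedged sketch of a typical driver program for the GaussModel sketch above; it follows the usual BAT workflow (open a log, marginalize with MCMC, print the marginal distributions), but the exact method signatures should be checked against the BAT version in use:

#include <BAT/BCLog.h>

int main()
{
    BCLog::OpenLog("log.txt");

    GaussModel m("gauss");                // the sketch model from above
    m.MarginalizeAll();                   // run MCMC to estimate the marginal posteriors
    m.PrintAllMarginalized("plots.pdf");  // axis labels use latexname and unitstring

    BCLog::CloseLog();
    return 0;
}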