Tutorial

This tutorial demonstrates a simple application of BAT.jl: A Bayesian fit of a histogram with two Gaussian peaks.

You can also download this tutorial as a Jupyter notebook and a plain Julia source file.

Note: This tutorial is somewhat verbose, as it aims to be easy to follow for users who are new to Julia. For the same reason, we deliberately avoid making use of Julia features like closures, anonymous functions, broadcasting syntax, performance annotations, etc.

Input Data Generation

First, let's generate some synthetic data to fit. We'll need the Julia standard-library packages "Random", "LinearAlgebra" and "Statistics", as well as the packages "Distributions" and "StatsBase":

using Random, LinearAlgebra, Statistics, Distributions, StatsBase

As the underlying truth of our input data/histogram, let us choose a non-normalized probability density composed of two Gaussian peaks with peak areas of 500 and 1000, means of -1.0 and 2.0, and a common standard deviation of 0.5:

data = vcat(
    rand(Normal(-1.0, 0.5), 500),
    rand(Normal( 2.0, 0.5), 1000)
)
1500-element Array{Float64,1}:
 -0.9965862777739484
 -1.100958528042784 
 -1.7684412893372783
 -0.972773414600595 
 -1.0240778676982736
 -1.148930690209156 
 -1.1706315740626705
 -1.8827381081469277
 -0.9635572488963757
 -2.0403454960823817
  ⋮                 
  1.9504228798132748
  1.8020337765914531
  2.173248713879491 
  2.432215976389577 
  1.5707604522562988
  1.61357134454585  
  2.437960355666802 
  2.652206120574844 
  2.055663468776079 

resulting in a vector of floating-point numbers:

typeof(data) == Vector{Float64}
true

Next, we'll create a histogram of that data. This histogram will serve as the input for the Bayesian fit:

hist = append!(Histogram(-2:0.1:4), data)
StatsBase.Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
  -2.0:0.1:4.0
weights: [8, 4, 9, 13, 28, 21, 27, 42, 40, 35  …  6, 4, 4, 1, 3, 1, 0, 0, 0, 0]
closed: left
isdensity: false

Using the Julia "Plots" package

using Plots

we can plot the histogram:

plot(
    normalize(hist, mode=:density),
    st = :steps, label = "Data",
    title = "Data"
)
savefig("tutorial-data.pdf")

[Figure: Data]

Let's define our fit function - the function that we expect to describe the data histogram, at each x-axis position x, depending on a given set p of model parameters:

function fit_function(p::NamedTuple{(:a, :mu, :sigma)}, x::Real)
    p.a[1] * pdf(Normal(p.mu[1], p.sigma), x) +
    p.a[2] * pdf(Normal(p.mu[2], p.sigma), x)
end

The fit parameters (model parameters) a (peak areas) and mu (peak means) are vectors; the parameter sigma (peak width) is a scalar, since we assume it is the same for both Gaussian peaks.

The true values for the model/fit parameters are the values we used to generate the data:

true_par_values = (a = [500, 1000], mu = (-1.0, 2.0), sigma = 0.5)
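
As a quick sanity check, we can evaluate the fit function at the position of the second peak. Since the peaks are well separated, the result should be close to 1000 * pdf(Normal(2.0, 0.5), 2.0) ≈ 798:

fit_function(true_par_values, 2.0)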

Let's visually compare the histogram and the fit function, using these true parameter values, to make sure everything is set up correctly:

plot(
    normalize(hist, mode=:density),
    st = :steps, label = "Data",
    title = "Data and True Statistical Model"
)
plot!(
    -4:0.01:4, x -> fit_function(true_par_values, x),
    label = "Truth"
)
savefig("tutorial-data-and-truth.pdf")

[Figure: Data and True Statistical Model]

Bayesian Fit

Now we'll perform a Bayesian fit of the generated histogram, using BAT, to infer the model parameters from the data histogram.

In addition to the Julia packages loaded above, we need BAT itself, as well as IntervalSets:

using BAT, IntervalSets

Likelihood Definition

First, we need to define the likelihood (function) for our problem.

BAT represents densities like likelihoods and priors as subtypes of BAT.AbstractDensity. Custom likelihoods can be defined by creating a new subtype of AbstractDensity and implementing (at minimum) BAT.density_logval for that type - this may become necessary in complex use cases. Typically, however, it is sufficient to define a custom likelihood as a simple function that returns the log-likelihood value for a given set of parameters. BAT will automatically convert such a log-likelihood function into a subtype of AbstractDensity.
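
For illustration only, such a subtype might be sketched roughly like this (a hypothetical MyLikelihood type, assuming the density_logval interface described above; we will not use this approach in this tutorial):

struct MyLikelihood <: AbstractDensity end

function BAT.density_logval(likelihood::MyLikelihood, params)
    # Compute and return the log-likelihood value for `params` here:
    return 0.0
end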

For performance reasons, functions should not access global variables directly. So we'll use an anonymous function inside a let statement to capture the value of the global variable hist in a local variable h (and to shorten the function name fit_function to f, purely for convenience):

log_likelihood = let h = hist, f = fit_function
    params -> begin
        # Histogram counts for each bin as an array:
        counts = h.weights

        # Histogram binning, has length (length(counts) + 1):
        binning = h.edges[1]

        # sum log-likelihood value over bins:
        ll_value::Float64 = 0.0
        for i in eachindex(counts)
            # Get information about current bin:
            bin_left, bin_right = binning[i], binning[i+1]
            bin_width = bin_right - bin_left
            bin_center = (bin_right + bin_left) / 2

            observed_counts = counts[i]

            # Simple mid-point rule integration of fit function `f` over bin:
            expected_counts = bin_width * f(params, bin_center)

            # Add log of Poisson probability for bin:
            ll_value += logpdf(Poisson(expected_counts), observed_counts)
        end

        return ll_value
    end
end
#3 (generic function with 1 method)

BAT makes use of Julia's parallel programming facilities if possible, e.g. to run multiple Markov chains in parallel. Therefore, log-likelihood (and other user) code must be thread-safe. Mark non-thread-safe code with @critical (provided by the Julia package ParallelProcessingTools).
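
For example, a non-thread-safe operation inside likelihood code could be protected like this (a minimal sketch; write_debug_log is a hypothetical placeholder for some operation that must not run concurrently):

using ParallelProcessingTools

# Hypothetical sketch: only one thread at a time may enter this section:
@critical write_debug_log("evaluating likelihood")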

BAT requires Julia v1.3 or newer to use multi-threading. Support for automatic parallelization across multiple (local and remote) Julia processes is planned, but not implemented yet.

Note that Julia currently starts only a single thread by default. Set the environment variable JULIA_NUM_THREADS to specify the desired number of Julia threads.
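
You can check the number of threads available to the current Julia session from within Julia:

Threads.nthreads()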

We can evaluate log_likelihood, e.g. for the true parameter values:

log_likelihood(true_par_values)
-150.9069276690886

Prior Definition

Next, we need to choose a sensible prior for the fit:

prior = NamedTupleDist(
    a = [0.0..10.0^4, 0.0..10.0^4],
    mu = [-2.0..0.0, 1.0..3.0],
    sigma = Truncated(Normal(0.4, 2), 0.3, 0.7)
)

In general, BAT allows instances of any subtype of AbstractDensity to be used as a prior, as long as a sampler is defined for it. This way, users may implement complex application-specific priors. You can also use convert(AbstractDensity, distribution) to convert any continuous multivariate Distributions.Distribution to a BAT.AbstractDensity that can be used as a prior (or likelihood).
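
For example, a standard bivariate normal distribution could be converted like this (a minimal sketch, not used in this tutorial):

normal_density = convert(AbstractDensity, MvNormal([0.0, 0.0], [1.0 0.0; 0.0 1.0]))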

The prior also implies the shapes of the parameters:

using ValueShapes

parshapes = valshape(prior)
ValueShapes.NamedTupleShape{(:a, :mu, :sigma),Tuple{ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real,1}},ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real,1}},ValueShapes.ValueAccessor{ValueShapes.ScalarShape{Real}}}}((a = ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real,1}}(ValueShapes.ArrayShape{Real,1}((2,)), 0, 2), mu = ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real,1}}(ValueShapes.ArrayShape{Real,1}((2,)), 2, 2), sigma = ValueShapes.ValueAccessor{ValueShapes.ScalarShape{Real}}(ValueShapes.ScalarShape{Real}(), 4, 1)), 5)

These will come in handy later on, e.g. to access (the posterior distribution of) individual parameter values.
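
For example, applying the shape to a flat, real-valued parameter vector yields the corresponding named parameters (a minimal sketch with made-up values, assuming shapes can be applied as functions, as also done in the table examples below):

x_flat = [500.0, 1000.0, -1.0, 2.0, 0.5]
x_named = parshapes(x_flat)
x_named.sigma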

Bayesian Model Definition

Given the likelihood and prior definition, a BAT.PosteriorDensity is simply defined via

posterior = PosteriorDensity(log_likelihood, prior)

Parameter Space Exploration via MCMC

We can now use Markov chain Monte Carlo (MCMC) to explore the space of possible parameter values for the histogram fit.

To increase the verbosity level of BAT logging output, you may want to set the Julia logging level for BAT to debug via ENV["JULIA_DEBUG"] = "BAT".

Let's use 4 MCMC chains and require 10^4 unique samples from each chain (after tuning/burn-in):

nsamples = 10^4
nchains = 4

Now we can generate a set of MCMC samples via bat_sample:

samples, stats = bat_sample(posterior, (nsamples, nchains), MetropolisHastings())
[ Info: Trying to generate 4 viable MCMC chain(s).
[ Info: Selected 4 MCMC chain(s).
[ Info: Begin tuning of 4 MCMC chain(s).
[ Info: MCMC Tuning cycle 1 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 2 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 3 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 4 finished, 4 chains, 2 tuned, 0 converged.
[ Info: MCMC Tuning cycle 5 finished, 4 chains, 2 tuned, 0 converged.
[ Info: MCMC Tuning cycle 6 finished, 4 chains, 2 tuned, 0 converged.
[ Info: MCMC Tuning cycle 7 finished, 4 chains, 2 tuned, 0 converged.
[ Info: MCMC Tuning cycle 8 finished, 4 chains, 2 tuned, 4 converged.
[ Info: MCMC Tuning cycle 9 finished, 4 chains, 4 tuned, 0 converged.
[ Info: MCMC Tuning cycle 10 finished, 4 chains, 4 tuned, 4 converged.
[ Info: MCMC tuning of 4 chains successful after 10 cycle(s).

Let's look at some posterior statistics, using the stats object returned by bat_sample, and print the results:

println("Truth: $true_par_values")
println("Mode: $(stats.mode)")
println("Mean: $(stats.mean)")
println("Covariance: $(stats.cov)")
Truth: (a = [500, 1000], mu = (-1.0, 2.0), sigma = 0.5)
Mode: [497.0033008718692, 994.9439541242684, -0.9729832967289617, 1.9910596819357194, 0.5011324043786379]
Mean: [499.61178944258944, 1001.0639273713308, -0.9724145751698856, 1.9919996024736981, 0.5006856971094287]
Covariance: [519.7073181996485 -8.764628270490316 -0.05310194632972877 -0.0022147984018210665 0.012118266466568751; -8.76462827049033 939.722888184015 -0.0004791285056013602 -0.0007097942852209447 -0.0006982890818712869; -0.05310194632972876 -0.0004791285056013712 0.0006083468017295504 1.851571229041153e-5 -2.8544261954095032e-5; -0.0022147984018210587 -0.0007097942852209576 1.8515712290411537e-5 0.0002606024001735707 -2.928833037171619e-6; 0.01211826646656874 -0.0006982890818712793 -2.8544261954095005e-5 -2.92883303717161e-6 9.761424623436907e-5]

We can also, e.g., compute the Pearson correlation matrix of the parameters:

cor(samples.params, FrequencyWeights(samples.weight))
5×5 Array{Float64,2}:
  1.0         -0.0125416    -0.0944399    -0.00601819   0.0538028 
 -0.0125416    1.0          -0.000633689  -0.00143431  -0.00230557
 -0.0944399   -0.000633689   1.0           0.0465024   -0.117135  
 -0.00601819  -0.00143431    0.0465024     1.0         -0.0183632 
  0.0538028   -0.00230557   -0.117135     -0.0183632    1.0       

Visualization of Results

BAT.jl comes with an extensive set of plotting recipes for "Plots.jl". We can plot the marginalized distribution for a single parameter (e.g. parameter 3, i.e. μ₁):

par_names = ["a_1", "a_2", "mu_1", "mu_2", "sigma"]
plot(
    samples, 3,
    mean = true, std_dev = true, globalmode = true, localmode = true,
    nbins = 50, xlabel = par_names[3], ylabel = "P($(par_names[3]))",
    title = "Marginalized Distribution for mu_1"
)
savefig("tutorial-single-par.pdf")

[Figure: Marginalized Distribution for mu_1]

or plot the marginalized distribution for a pair of parameters (e.g. parameters 3 and 5, i.e. μ₁ and σ), including information from the parameter stats:

plot(
    samples, (3, 5),
    mean = true, std_dev = true, globalmode = true, localmode = true,
    nbins = 50, xlabel = par_names[3], ylabel = par_names[5],
    title = "Marginalized Distribution for mu_1 and sigma"
)
plot!(MCMCBasicStats(samples), (3, 5))
savefig("tutorial-param-pair.pdf")

[Figure: Marginalized Distribution for mu_1 and sigma]

We can also create an overview plot of the marginalized distribution for all pairs of parameters:

plot(
    samples,
    mean = false, std_dev = false, globalmode = true, localmode = false,
    nbins = 50
)
savefig("tutorial-all-params.pdf")

[Figure: Pairwise Correlation between Parameters]

Integration with Tables.jl

BAT.jl supports the Tables.jl interface, so we can also convert the vector of MCMC samples to a table, e.g. using TypedTables.jl (http://blog.roames.com/TypedTables.jl/stable/):

using TypedTables

tbl = Table(samples)
Table with 5 columns and 40000 rows:
      params                log_posterior  log_prior  weight  ⋯
    ┌──────────────────────────────────────────────────────────
 1  │ [512.278, 1019.35, …  -170.28        -18.8891   2       ⋯
 2  │ [514.811, 1012.33, …  -169.974       -18.8891   2       ⋯
 3  │ [492.27, 1003.27, -…  -171.441       -18.8893   4       ⋯
 4  │ [486.36, 991.03, -1…  -172.216       -18.8894   7       ⋯
 5  │ [472.63, 1027.36, -…  -172.013       -18.8891   4       ⋯
 6  │ [473.823, 1022.13, …  -172.162       -18.889    2       ⋯
 7  │ [474.37, 1014.7, -0…  -170.51        -18.8891   7       ⋯
 8  │ [486.645, 985.702, …  -169.692       -18.8892   2       ⋯
 9  │ [472.88, 976.48, -0…  -171.144       -18.8887   5       ⋯
 10 │ [474.681, 979.198, …  -170.468       -18.8888   1       ⋯
 11 │ [462.692, 1002.06, …  -170.771       -18.8888   2       ⋯
 12 │ [447.036, 1001.08, …  -172.218       -18.8888   2       ⋯
 13 │ [452.303, 979.876, …  -171.828       -18.8888   1       ⋯
 14 │ [471.593, 1039.02, …  -170.878       -18.8889   3       ⋯
 15 │ [478.782, 1033.2, -…  -171.557       -18.8889   8       ⋯
 16 │ [481.404, 1017.95, …  -172.558       -18.8887   2       ⋯
 17 │ [504.774, 1047.43, …  -172.804       -18.8891   4       ⋯
 ⋮  │          ⋮                  ⋮            ⋮        ⋮     ⋱

Using the parameter shapes, we can generate a table with named parameters, instead of flat real-valued parameter vectors:

tbl_named = parshapes.(samples)
Table with 5 columns and 40000 rows:
      params                log_posterior  log_prior  weight  ⋯
    ┌──────────────────────────────────────────────────────────
 1  │ (a = [512.278, 1019…  -170.28        -18.8891   2       ⋯
 2  │ (a = [514.811, 1012…  -169.974       -18.8891   2       ⋯
 3  │ (a = [492.27, 1003.…  -171.441       -18.8893   4       ⋯
 4  │ (a = [486.36, 991.0…  -172.216       -18.8894   7       ⋯
 5  │ (a = [472.63, 1027.…  -172.013       -18.8891   4       ⋯
 6  │ (a = [473.823, 1022…  -172.162       -18.889    2       ⋯
 7  │ (a = [474.37, 1014.…  -170.51        -18.8891   7       ⋯
 8  │ (a = [486.645, 985.…  -169.692       -18.8892   2       ⋯
 9  │ (a = [472.88, 976.4…  -171.144       -18.8887   5       ⋯
 10 │ (a = [474.681, 979.…  -170.468       -18.8888   1       ⋯
 11 │ (a = [462.692, 1002…  -170.771       -18.8888   2       ⋯
 12 │ (a = [447.036, 1001…  -172.218       -18.8888   2       ⋯
 13 │ (a = [452.303, 979.…  -171.828       -18.8888   1       ⋯
 14 │ (a = [471.593, 1039…  -170.878       -18.8889   3       ⋯
 15 │ (a = [478.782, 1033…  -171.557       -18.8889   8       ⋯
 16 │ (a = [481.404, 1017…  -172.558       -18.8887   2       ⋯
 17 │ (a = [504.774, 1047…  -172.804       -18.8891   4       ⋯
 ⋮  │          ⋮                  ⋮            ⋮        ⋮     ⋱

We can now, e.g., find the sample with the maximum posterior value (i.e. the mode):

mode_log_posterior, mode_idx = findmax(tbl_named.log_posterior)
(-168.96941803101575, 9664)

And get row mode_idx of the table, with all information about the sample at the mode:
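
tbl_named[mode_idx]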

Comparison of Truth and Best Fit

As a final step, we retrieve the parameter values at the mode, representing the best-fit parameters:

fit_par_values = tbl_named[mode_idx].params
(a = [497.0033008718692, 994.9439541242684], mu = [-0.9729832967289617, 1.9910596819357194], sigma = 0.5011324043786379)

And plot the truth, data, and best fit:

plot(
    normalize(hist, mode=:density),
    st = :steps, label = "Data",
    title = "Data, True Model and Best Fit"
)
plot!(-4:0.01:4, x -> fit_function(true_par_values, x), label = "Truth")
plot!(-4:0.01:4, x -> fit_function(fit_par_values, x), label = "Best fit")
savefig("tutorial-data-truth-bestfit.pdf")

[Figure: Data, True Model and Best Fit]

Fine-grained control

BAT provides fine-grained control over the MCMC algorithm options, the MCMC chain initialization, tuning/burn-in strategy, and convergence testing. All option values used in the following are the default values; any or all may be omitted.

We'll sample using the Metropolis-Hastings MCMC algorithm. By default, BAT uses a multivariate t-distribution (ν = 1) as the proposal function:

algorithm = MetropolisHastings(MvTDistProposal(1.0))

BAT requires a counter-based random number generator (RNG), since it partitions the RNG space over the MCMC chains. This way, a single RNG seed is sufficient for all chains and results are reproducible even under parallel execution. By default, BAT uses a Philox4x RNG initialized with a random seed drawn from the system entropy pool:

using Random123
rng = Philox4x()

Other default parameters are:

tuning = AdaptiveMetropolisTuning(
    λ = 0.5,
    α = 0.15..0.35,
    β = 1.5,
    c = 1e-4..1e2
)

convergence = BrooksGelmanConvergence(
    threshold = 1.1,
    corrected = false
)

init = MCMCInitStrategy(
    ninit_tries_per_chain = 8..128,
    max_nsamples_pretune = 25,
    max_nsteps_pretune = 250,
    max_time_pretune = Inf
)

burnin = MCMCBurninStrategy(
    max_nsamples_per_cycle = 1000,
    max_nsteps_per_cycle = 10000,
    max_time_per_cycle = Inf,
    max_ncycles = 30
)

To generate MCMC samples with explicit control over all options, use

samples, stats = bat_sample(
    rng, posterior, (nsamples, nchains), algorithm,
    max_nsteps = 10 * nsamples,
    max_time = Inf,
    tuning = tuning,
    init = init,
    burnin = burnin,
    convergence = convergence,
    strict = false,
    filter = true
)
[ Info: Trying to generate 4 viable MCMC chain(s).
[ Info: Selected 4 MCMC chain(s).
[ Info: Begin tuning of 4 MCMC chain(s).
[ Info: MCMC Tuning cycle 1 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 2 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 3 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 4 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 5 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 6 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 7 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 8 finished, 4 chains, 4 tuned, 4 converged.
[ Info: MCMC tuning of 4 chains successful after 8 cycle(s).

However, in many use cases, simply using the default options via

samples, stats = bat_sample(posterior, (nsamples, nchains), MetropolisHastings())

will be sufficient.

This page was generated using Literate.jl.