Tutorial

This tutorial demonstrates a simple application of BAT.jl: A Bayesian fit of a histogram with two Gaussian peaks.

You can also download this tutorial as a Jupyter notebook and a plain Julia source file.

Note: This tutorial is somewhat verbose, as it aims to be easy to follow for users who are new to Julia. For the same reason, we mostly avoid Julia features like closures, anonymous functions, broadcasting syntax, performance annotations, etc., and use them only where they are hard to do without.

Input Data Generation

First, let's generate some synthetic data to fit. We'll need the Julia standard-library packages "Random", "LinearAlgebra" and "Statistics", as well as the packages "Distributions" and "StatsBase":

using Random, LinearAlgebra, Statistics, Distributions, StatsBase

As the underlying truth of our input data/histogram, let us choose the expected count to follow the sum of two Gaussian peaks with peak areas of 500 and 1000, means of -1.0 and 2.0 and a common standard deviation of 0.5. Then we generate the data with

data = vcat(
    rand(Normal(-1.0, 0.5), 500),
    rand(Normal( 2.0, 0.5), 1000)
)
1500-element Vector{Float64}:
 -1.5689960594543395
 -1.455124138665964
 -1.8170936102020376
 -1.4413237007320125
 -2.2084024736765784
 -0.7046783390954405
 -1.5057031810132524
 -0.8796262212327188
 -1.4430223465175618
 -0.8179963861383126
  ⋮
  2.4300188826969635
  2.49645322283789
  1.4396217769023347
  2.878638853756991
  2.5317572380540248
  0.9825789980042452
  2.186136852435033
  2.1072135399888463
  2.475497762510633

resulting in a vector of floating-point numbers:

typeof(data) == Vector{Float64}
true
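Because the data is drawn at random, the exact numbers differ from run to run. The following is a minimal sketch of how the same data generation could be made reproducible by passing an explicitly seeded random number generator. The helper name generate_data and the seed 1234 are assumptions made for illustration and are not part of the tutorial.

```julia
using Random, Distributions

# Hypothetical helper: the same two-peak data as above, drawn from an explicit RNG.
function generate_data(rng)
    vcat(
        rand(rng, Normal(-1.0, 0.5), 500),
        rand(rng, Normal( 2.0, 0.5), 1000)
    )
end

# Two generators with the same (arbitrary) seed produce identical data:
data_a = generate_data(MersenneTwister(1234))
data_b = generate_data(MersenneTwister(1234))

println("Number of values: ", length(data_a))
println("Identical data:   ", data_a == data_b)
```

The rest of the tutorial does not depend on a particular seed, so the numbers shown in the outputs should be read as one possible realization.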

Next, we'll create a histogram of that data. This histogram will serve as the input for the Bayesian fit:

hist = append!(Histogram(-2:0.1:4), data)
StatsBase.Histogram{Int64, 1, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}}
edges:
  -2.0:0.1:4.0
weights: [3, 13, 11, 24, 34, 29, 25, 32, 35, 46  …  9, 4, 3, 3, 1, 1, 0, 0, 0, 0]
closed: left
isdensity: false
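To make the meaning of "closed: left" and "isdensity: false" concrete, here is a small sketch in plain Julia that counts values in left-closed bins and then converts the counts to a count density. It is not how StatsBase implements histograms. The function bin_counts and the toy values are hypothetical, and the sketch simply drops values outside the bin edges.

```julia
# Count how many values fall into each left-closed bin [edges[i], edges[i+1]).
function bin_counts(values, edges)
    counts = zeros(Int, length(edges) - 1)
    for v in values
        # Values outside the binning range are dropped in this sketch:
        if first(edges) <= v < last(edges)
            i = searchsortedlast(edges, v)
            counts[i] += 1
        end
    end
    return counts
end

edges = -2:0.1:4
toy_values = [-1.95, -1.0, -1.0, 2.03, 3.99, 4.0, 5.0]

counts = bin_counts(toy_values, edges)

# Count density: counts divided by the (constant) bin width.
density = counts ./ step(edges)

println("Values inside the binning range: ", sum(counts))
```

A count density of this kind is what makes the histogram directly comparable to a function that describes expected counts per unit of x.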

Using the Julia "Plots" package

using Plots

we can plot the histogram:

plot(
    normalize(hist, mode=:density),
    st = :steps, label = "Data",
    title = "Data"
)
savefig("tutorial-data.pdf")

Data

Let's define our fit function: the function that we expect to describe the data histogram at each x-axis position x, depending on a given set p of model parameters:

function fit_function(p::NamedTuple{(:a, :mu, :sigma)}, x::Real)
    p.a[1] * pdf(Normal(p.mu[1], p.sigma), x) +
    p.a[2] * pdf(Normal(p.mu[2], p.sigma), x)
end

The fit parameters (model parameters) a (peak areas) and mu (peak means) are vectors. The parameter sigma (peak width) is a scalar, since we assume it's the same for both Gaussian peaks.

The true values for the model/fit parameters are the values we used to generate the data:

true_par_values = (a = [500, 1000], mu = [-1.0, 2.0], sigma = 0.5)
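As a quick sanity check on the role of the peak areas, the sketch below integrates a two-peak model of the same form numerically. If the parameters a really are peak areas, the integral should be close to 500 + 1000 = 1500. To keep it free of package dependencies, the sketch uses a hand-written Gaussian density instead of Distributions. The names gauss_pdf, two_peak_model, integrate_model and p_true are hypothetical.

```julia
# Hand-written Gaussian density, so that the sketch needs no packages:
gauss_pdf(x, mu, sigma) = exp(-0.5 * ((x - mu) / sigma)^2) / (sigma * sqrt(2 * pi))

# Same functional form as the fit function: a sum of two Gaussian peaks.
function two_peak_model(p, x)
    p.a[1] * gauss_pdf(x, p.mu[1], p.sigma) +
    p.a[2] * gauss_pdf(x, p.mu[2], p.sigma)
end

# Simple Riemann sum over an equidistant grid xs:
function integrate_model(p, xs)
    total = 0.0
    for x in xs
        total += two_peak_model(p, x) * step(xs)
    end
    return total
end

p_true = (a = [500, 1000], mu = [-1.0, 2.0], sigma = 0.5)

# The grid covers ten standard deviations on either side of both peaks:
total_area = integrate_model(p_true, -6:0.01:7)
println("Integrated expected counts: ", total_area)
```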

Let's visually compare the histogram and the fit function, using these true parameter values, to make sure everything is set up correctly:

plot(
    normalize(hist, mode=:density),
    st = :steps, label = "Data",
    title = "Data and True Statistical Model"
)
plot!(
    -4:0.01:4, x -> fit_function(true_par_values, x),
    label = "Truth"
)
savefig("tutorial-data-and-truth.pdf")

Data and True Statistical Model

Bayesian Fit

Now we'll perform a Bayesian fit of the generated histogram, using BAT, to infer the model parameters from the data histogram.

In addition to the Julia packages loaded above, we need BAT itself, as well as DensityInterface and IntervalSets:

using BAT, DensityInterface, IntervalSets

Likelihood Definition

First, we need to define the likelihood for our problem.

BAT expects likelihoods to implement the DensityInterface API. We can simply wrap a log-likelihood function with DensityInterface.logfuncdensity to make it compatible.

For performance reasons, functions should not access global variables directly. So we'll use an anonymous function inside of a let-statement to capture the value of the global variable hist in a local variable h (and to shorten the function name fit_function to f, purely for convenience). DensityInterface.logfuncdensity then turns the log-likelihood function into a DensityInterface density object.

likelihood = let h = hist, f = fit_function
    # Histogram counts for each bin as an array:
    observed_counts = h.weights

    # Histogram binning:
    bin_edges = h.edges[1]
    bin_edges_left = bin_edges[1:end-1]
    bin_edges_right = bin_edges[2:end]
    bin_widths = bin_edges_right - bin_edges_left
    bin_centers = (bin_edges_right + bin_edges_left) / 2

    logfuncdensity(function (params)
        # Log-likelihood for a single bin:
        function bin_log_likelihood(i)
            # Simple mid-point rule integration of fit function `f` over bin:
            expected_counts = bin_widths[i] * f(params, bin_centers[i])
            # Avoid zero expected counts for numerical stability:
            logpdf(Poisson(expected_counts + eps(expected_counts)), observed_counts[i])
        end

        # Sum log-likelihood over bins:
        idxs = eachindex(observed_counts)
        ll_value = bin_log_likelihood(idxs[1])
        for i in idxs[2:end]
            ll_value += bin_log_likelihood(i)
        end

        return ll_value
    end)
end
LogFuncDensity(Main.var"#3#4"{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, Vector{Int64}, typeof(Main.fit_function)}(-1.95:0.1:3.95, StepRangeLen(0.1, 0.0, 60), [3, 13, 11, 24, 34, 29, 25, 32, 35, 46  …  9, 4, 3, 3, 1, 1, 0, 0, 0, 0], Main.fit_function))
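Written as a formula, the log-likelihood implemented above is a sum of Poisson log-probabilities over the histogram bins. The symbols are notation introduced here for illustration: k_i is the observed count in bin i, \Delta x_i the bin width, x_i the bin center and \lambda_i the expected count obtained from the mid-point rule.

```latex
\log \mathcal{L}(p)
  = \sum_i \log \mathrm{Poisson}(k_i \mid \lambda_i(p))
  = \sum_i \left[ k_i \log \lambda_i(p) - \lambda_i(p) - \log(k_i!) \right],
\qquad
\lambda_i(p) = \Delta x_i \, f(p, x_i)
```

The code additionally adds a tiny offset to each \lambda_i, so that the logarithm stays finite in bins where the model predicts an expected count of zero.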

BAT makes use of Julia's parallel programming facilities if possible, e.g. to run multiple Markov chains in parallel. Therefore, log-likelihood (and other) code must be thread-safe. Mark non-thread-safe code with @critical (provided by Julia package ParallelProcessingTools).

Support for automatic parallelization across multiple (local and remote) Julia processes is planned, but not implemented yet.

Note that Julia currently starts only a single thread by default. Set the environment variable JULIA_NUM_THREADS to specify the desired number of Julia threads.
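The sketch below shows how one might check the number of threads from within Julia, and what protecting shared state looks like. It uses a lock from Julia's Base library as a stand-in for the idea behind @critical, which is an assumption made to keep the example free of extra packages. The names counter and counter_lock are hypothetical.

```julia
using Base.Threads

# Number of threads this Julia session was started with:
println("Julia threads available: ", nthreads())

# Shared state that several threads update must be protected:
counter = Ref(0)
counter_lock = ReentrantLock()

@threads for i in 1:1000
    # Only one thread at a time may run the body of the lock block:
    lock(counter_lock) do
        counter[] += 1
    end
end

println("Counter value: ", counter[])
```

Without the lock, concurrent updates of counter could be lost when more than one thread is active. A log-likelihood that only reads its captured data, like the one defined above, needs no such protection.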

We can evaluate likelihood, e.g. at the true parameter values:

logdensityof(likelihood, true_par_values)
-156.30498160595735
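The pattern of wrapping a log-density function and then evaluating it can be seen in isolation with a toy example. The following sketch uses only DensityInterface. The function toy_logdensity is a hypothetical stand-in and has nothing to do with the histogram likelihood.

```julia
using DensityInterface

# Hypothetical toy log-density: a standard normal, up to an additive constant.
toy_logdensity(x) = -0.5 * x^2

# Wrap the function so that it implements the DensityInterface API:
toy_density = logfuncdensity(toy_logdensity)

# Evaluate the log-density at a point:
value = logdensityof(toy_density, 2.0)
println("Log-density at 2.0: ", value)
```

The likelihood object in this tutorial works the same way, except that its argument is a NamedTuple of parameters instead of a single number.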

Prior Definition

Next, we need to choose a sensible prior for the fit:

prior = distprod(
    a = [Weibull(1.1, 5000), Weibull(1.1, 5000)],
    mu = [-2.0..0.0, 1.0..3.0],
    sigma = Weibull(1.2, 2)
)

BAT supports most Distributions.Distribution types, and combinations of them, as priors.
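To get a feeling for what these Weibull priors express, one can inspect them directly with Distributions. This is only an exploratory sketch. The variable names area_prior and width_prior are hypothetical, and the choice of which summary values to print is arbitrary.

```julia
using Distributions

# The priors used for the peak areas and the peak width:
area_prior = Weibull(1.1, 5000)
width_prior = Weibull(1.2, 2)

println("Area prior mean:          ", mean(area_prior))
println("Area prior median:        ", median(area_prior))
println("Area prior 95% quantile:  ", quantile(area_prior, 0.95))
println("Log-density at a = 500:   ", logpdf(area_prior, 500))
println("sigma = 0.5 in support:   ", insupport(width_prior, 0.5))
```

Both priors are broad compared to the true values used to generate the data, so they constrain the fit only weakly.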

Bayesian Model Definition

Given the likelihood and prior definition, a BAT.PosteriorMeasure is simply defined via

posterior = PosteriorMeasure(likelihood, prior)
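In formula form, the posterior combines likelihood and prior according to Bayes' theorem. The notation is introduced here for illustration: \theta stands for the parameters (a, mu, sigma), D for the histogram data, \pi for the prior density and Z for the normalization constant (the evidence).

```latex
p(\theta \mid D) = \frac{\mathcal{L}(D \mid \theta)\, \pi(\theta)}{Z},
\qquad
\log p(\theta \mid D) = \log \mathcal{L}(D \mid \theta) + \log \pi(\theta) - \log Z
```

MCMC sampling only needs the posterior density up to a constant factor, so Z does not have to be known in order to draw samples.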

Parameter Space Exploration via MCMC

We can now use Markov chain Monte Carlo (MCMC) to explore the space of possible parameter values for the histogram fit.

To increase the verbosity level of BAT logging output, you may want to set the Julia logging level for BAT to debug via ENV["JULIA_DEBUG"] = "BAT".

Now we can generate a set of MCMC samples via bat_sample. We'll use 4 MCMC chains with 10^5 MC steps in each chain (after tuning/burn-in):

samples = bat_sample(posterior, MCMCSampling(mcalg = MetropolisHastings(), nsteps = 10^5, nchains = 4)).result
[ Info: Setting new default BAT context BATContext{Float64}(Random123.Philox4x{UInt64, 10}(0xb8be6d4ca0ff546a, 0x490bd0999c16a1dc, 0x28626d81bfc0fb8a, 0x1d51f28f66704e1f, 0xc91c9b11862880d0, 0x13d9179a5bc592d1, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0), HeterogeneousComputing.CPUnit(), BAT._NoADSelected())
[ Info: MCMCChainPoolInit: trying to generate 4 viable MCMC chain(s).
[ Info: Selected 4 MCMC chain(s).
[ Info: Begin tuning of 4 MCMC chain(s).
[ Info: MCMC Tuning cycle 1 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 2 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 3 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 4 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 5 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 6 finished, 4 chains, 0 tuned, 4 converged.
[ Info: MCMC Tuning cycle 7 finished, 4 chains, 0 tuned, 4 converged.
[ Info: MCMC Tuning cycle 8 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 9 finished, 4 chains, 2 tuned, 4 converged.
[ Info: MCMC Tuning cycle 10 finished, 4 chains, 3 tuned, 4 converged.
[ Info: MCMC Tuning cycle 11 finished, 4 chains, 4 tuned, 4 converged.
[ Info: MCMC tuning of 4 chains successful after 11 cycle(s).
[ Info: Running post-tuning stabilization steps for 4 MCMC chain(s).
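For readers unfamiliar with the Metropolis-Hastings algorithm, the following is a deliberately minimal sketch in plain Julia: a random-walk sampler for a one-dimensional toy target. It is not BAT's implementation and has no tuning, no burn-in and only a single chain. Instead of storing repeated states, it stores each state once together with a repetition count. Whether this matches BAT's internals in detail is an assumption. All names (log_target, metropolis_sketch, states, reps) are hypothetical.

```julia
using Random

# Hypothetical target: log-density of a standard normal, up to a constant.
log_target(x) = -0.5 * x^2

function metropolis_sketch(rng, nsteps; x0 = 0.0, stepsize = 1.0)
    states = Float64[x0]   # consecutive distinct states of the chain
    reps = Int[1]          # number of steps the chain spent at each state
    x = x0
    logp = log_target(x)
    for _ in 2:nsteps
        # Symmetric random-walk proposal:
        x_new = x + stepsize * randn(rng)
        logp_new = log_target(x_new)
        # Accept with probability min(1, p_new / p_old):
        if log(rand(rng)) < logp_new - logp
            x, logp = x_new, logp_new
            push!(states, x)
            push!(reps, 1)
        else
            # Rejected: the current state counts once more.
            reps[end] += 1
        end
    end
    return states, reps
end

rng = MersenneTwister(42)
states, reps = metropolis_sketch(rng, 10^5)

weighted_mean = sum(states .* reps) / sum(reps)
println("Stored states: ", length(states), ", total weight: ", sum(reps))
println("Weighted mean: ", weighted_mean)
```

In this scheme the number of stored states is smaller than the number of steps, while the total weight equals the number of steps. Any statistic computed from the stored states has to take the weights into account.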

Let's calculate some statistics on the posterior samples:

println("Truth: $true_par_values")
println("Mode: $(mode(samples))")
println("Mean: $(mean(samples))")
println("Stddev: $(std(samples))")
Truth: (a = [500, 1000], mu = [-1.0, 2.0], sigma = 0.5)
Mode: (a = [508.1437600117665, 1000.4543344985999], mu = [-1.0360951351800272, 1.9842576309107223], sigma = 0.508178264111625)
Mean: (a = [509.3665686119413, 1002.4492974571698], mu = [-1.0352770044691757, 1.983766054391005], sigma = 0.5092521316104394)
Stddev: (a = [23.328405699795475, 32.45609254853062], mu = [0.02591628747582982, 0.016099382634551372], sigma = 0.010147142587731497)

Internally, BAT often needs to represent variates as flat real-valued vectors:

unshaped_samples, f_flatten = bat_transform(Vector, samples)
(result = DensitySampleVector(length = 117419, varshape = ValueShapes.ArrayShape{Float64, 1}((5,))), trafo = Base.Fix2{typeof(ValueShapes.unshaped), ValueShapes.NamedTupleShape{(:a, :mu, :sigma), Tuple{ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real, 1}}, ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real, 1}}, ValueShapes.ValueAccessor{ValueShapes.ScalarShape{Real}}}, NamedTuple}}(ValueShapes.unshaped, NamedTupleShape((a = ValueShapes.ArrayShape{Real, 1}((2,)), mu = ValueShapes.ArrayShape{Real, 1}((2,)), sigma = ValueShapes.ScalarShape{Real}()))), optargs = (algorithm = BAT.UnshapeTransformation(), context = BATContext{Float64}(Random123.Philox4x{UInt64, 10}(0x487f0fb1bfbbe09e, 0x38610b9ec047cffd, 0xc63e4665b673955c, 0x4ab58e42e6bc9b0f, 0xc91c9b11862880d0, 0x13d9179a5bc592d1, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x8000020100000000, 0), HeterogeneousComputing.CPUnit(), BAT._NoADSelected())))

The statistics above (mode, mean and std-dev) are presented in shaped form. However, it's not possible to represent statistics with matrix shape, e.g. the parameter covariance matrix, this way. So the covariance has to be accessed in unshaped form:

par_cov = cov(unshaped_samples)
println("Covariance: $par_cov")
Covariance: [544.2145124942574 6.1923411106082895 -0.06215096856124491 -0.001967743655622638 0.015562896005884501; 6.1923411106082895 1053.3979435188048 -0.018309861357045317 -0.005282915094924627 0.00911048313352157; -0.06215096856124491 -0.018309861357045317 0.0006716539565298537 1.5813057301773816e-5 -4.251671494406835e-5; -0.001967743655622638 -0.005282915094924627 1.5813057301773816e-5 0.000259190121213694 -1.5871844629512957e-6; 0.015562896005884501 0.00911048313352157 -4.251671494406835e-5 -1.5871844629512957e-6 0.00010296450269575466]
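Covariances are hard to read when the parameters have very different scales, so it can help to convert them into correlation coefficients. The sketch below does this for the 2x2 sub-block of the covariance matrix above that belongs to mu[1] and sigma (flat indices 3 and 5), with the values rounded by hand. The variable names are hypothetical.

```julia
using LinearAlgebra

# Sub-block of the covariance matrix above for mu[1] and sigma, rounded:
cov_sub = [
    6.7165e-4  -4.2517e-5
   -4.2517e-5   1.0296e-4
]

# Standard deviations are the square roots of the diagonal entries:
std_devs = sqrt.(diag(cov_sub))

# Correlation matrix: each covariance divided by the product of the two std-devs.
cor_sub = cov_sub ./ (std_devs * std_devs')

println("Correlation between mu[1] and sigma: ", cor_sub[1, 2])
```

For this run, the result is a mild negative correlation between the position of the first peak and the common peak width.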

Use bat_report to generate an overview of the sampling result and parameter estimates (based on the marginal distributions):

bat_report(samples)

Sampling result

  • Total number of samples: 117419

  • Total weight of samples: 399994

  • Effective sample size: between 1362 and 8888

Marginals

Parameter  Mean      Std. dev.  Global mode  Marg. mode  Cred. interval        Histogram
a[1]       509.367   23.3284    508.144      510.0       484.81 .. 531.667     420[⠀⠀⠀⠀⠀⠀⠀⠀▁▁▂▃▄▅▆▇█████▇▆▅▄▃▂▁▁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀[613
a[2]       1002.45   32.4561    1000.45      1010.0      968.318 .. 1033.04    880[⠀⠀⠀⠀⠀⠀⠀⠀▁▁▂▃▃▄▆▇▇████▇▆▅▄▃▂▁▁▁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀[1.14e+03
mu[1]      -1.03528  0.0259163  -1.0361      -1.03       -1.06272 .. -1.0107   -1.14[⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀▁▂▃▃▅▅▇▇████▇▆▅▄▃▂▁▁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀[-0.924
mu[2]      1.98377   0.0160994  1.98426      1.985       1.9684 .. 2.00068     1.92[⠀⠀⠀⠀⠀⠀⠀⠀⠀▁▂▂▃▄▅▆▇█████▇▆▅▄▃▂▁▁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀[2.05
sigma      0.509252  0.0101471  0.508178     0.5075      0.498555 .. 0.518831  0.465[⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀▁▂▂▃▅▆▇█████▇▅▄▃▂▁▁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀[0.555

Visualization of Results

BAT.jl comes with an extensive set of plotting recipes for "Plots.jl". We can plot the marginalized distribution for a single parameter (e.g. parameter 3, i.e. μ[1]):

plot(
    samples, :(mu[1]),
    mean = true, std = true, globalmode = true, marginalmode = true,
    nbins = 50, title = "Marginalized Distribution for mu[1]"
)
savefig("tutorial-single-par.pdf")

Marginalized Distribution for mu_1

or plot the marginalized distribution for a pair of parameters (e.g. parameters 3 and 5, i.e. μ[1] and σ), including information from the parameter stats:

plot(
    samples, (:(mu[1]), :sigma),
    mean = true, std = true, globalmode = true, marginalmode = true,
    nbins = 50, title = "Marginalized Distribution for mu[1] and sigma"
)
plot!(BAT.MCMCBasicStats(samples), (3, 5))
savefig("tutorial-param-pair.png")

Marginalized Distribution for mu_1 and sigma

We can also create an overview plot of the marginalized distribution for all pairs of parameters:

plot(
    samples,
    mean = false, std = false, globalmode = true, marginalmode = false,
    nbins = 50
)
savefig("tutorial-all-params.png")

Pairwise Correlation between Parameters

Integration with Tables.jl

DensitySampleVector supports the Tables.jl interface, so it is a table itself. We can also convert it to other table types, e.g. a TypedTables.Table:

using TypedTables

tbl = Table(samples)
Table with 5 columns and 117419 rows:
      v                       logd      weight  info                    aux
    ┌──────────────────────────────────────────────────────────────────────────
 1  │ (a = [548.033, 1005.0…  -176.678  1       MCMCSampleID(1, 14, 0…  nothing
 2  │ (a = [546.271, 1003.8…  -176.467  9       MCMCSampleID(1, 14, 1…  nothing
 3  │ (a = [542.193, 1001.8…  -176.019  16      MCMCSampleID(1, 14, 1…  nothing
 4  │ (a = [551.712, 1005.3…  -176.389  6       MCMCSampleID(1, 14, 2…  nothing
 5  │ (a = [556.575, 1007.8…  -177.293  1       MCMCSampleID(1, 14, 3…  nothing
 6  │ (a = [550.753, 1008.4…  -178.085  4       MCMCSampleID(1, 14, 3…  nothing
 7  │ (a = [555.371, 1012.1…  -178.378  3       MCMCSampleID(1, 14, 3…  nothing
 8  │ (a = [553.784, 1014.2…  -177.394  1       MCMCSampleID(1, 14, 4…  nothing
 9  │ (a = [553.49, 1007.3]…  -177.331  19      MCMCSampleID(1, 14, 4…  nothing
 10 │ (a = [553.802, 1003.5…  -178.748  4       MCMCSampleID(1, 14, 6…  nothing
 11 │ (a = [546.972, 1007.2…  -177.445  1       MCMCSampleID(1, 14, 6…  nothing
 12 │ (a = [562.842, 984.15…  -178.241  1       MCMCSampleID(1, 14, 6…  nothing
 13 │ (a = [560.343, 984.65…  -180.905  5       MCMCSampleID(1, 14, 6…  nothing
 14 │ (a = [550.945, 997.18…  -180.354  3       MCMCSampleID(1, 14, 7…  nothing
 15 │ (a = [547.339, 994.24…  -180.693  1       MCMCSampleID(1, 14, 7…  nothing
 16 │ (a = [551.799, 991.47…  -178.668  1       MCMCSampleID(1, 14, 7…  nothing
 17 │ (a = [549.75, 986.093…  -180.044  7       MCMCSampleID(1, 14, 7…  nothing
 ⋮  │           ⋮                ⋮        ⋮               ⋮                ⋮

or a DataFrames.DataFrame, etc.
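Once the samples are in table form, the columns can be used like ordinary vectors. The sketch below builds a tiny stand-in table with made-up rows that mimics the columns v, logd and weight shown above, and computes a weighted mean from it. The table contents and the function weighted_sigma_mean are hypothetical and serve only to show the access pattern.

```julia
using TypedTables

# Hypothetical stand-in for a few rows of a samples table (made-up values):
toy_tbl = Table(
    v = [(sigma = 0.50,), (sigma = 0.51,), (sigma = 0.52,)],
    logd = [-176.7, -176.5, -176.0],
    weight = [1, 9, 16]
)

# Weighted mean of the parameter sigma, iterating over table rows:
function weighted_sigma_mean(tbl)
    weighted_sum = 0.0
    for row in tbl
        weighted_sum += row.weight * row.v.sigma
    end
    return weighted_sum / sum(tbl.weight)
end

println("Total weight:           ", sum(toy_tbl.weight))
println("Weighted mean of sigma: ", weighted_sigma_mean(toy_tbl))
```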

Comparison of Truth and Best Fit

As a final step, we retrieve the parameter values at the mode, representing the best-fit parameters:

samples_mode = mode(samples)
(a = [508.1437600117665, 1000.4543344985999], mu = [-1.0360951351800272, 1.9842576309107223], sigma = 0.508178264111625)

Like the samples themselves, the result can be viewed in both shaped and unshaped form. samples_mode is presented as a NamedTuple; this representation preserves the shape information:

samples_mode isa NamedTuple
true

samples_mode is only an estimate of the mode of the posterior distribution. It can be further refined using bat_findmode:

using Optim

findmode_result = bat_findmode(
    posterior,
    OptimAlg(optalg = Optim.NelderMead(), init = ExplicitInit([samples_mode]))
)

fit_par_values = findmode_result.result
(a = [507.51195919612974, 999.9344743442189], mu = [-1.034452167714924, 1.98417519259745], sigma = 0.5083294545477082)
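To illustrate what the Nelder-Mead refinement does in principle, here is a sketch that uses Optim directly on a toy problem. It does not go through BAT. Since Optim minimizes, the log-density is negated before it is passed to the optimizer. The function toy_logdensity and the starting point are hypothetical.

```julia
using Optim

# Hypothetical toy log-density with its maximum at (1, -2):
toy_logdensity(x) = -((x[1] - 1.0)^2 + (x[2] + 2.0)^2)

# Optim minimizes, so we pass the negative log-density.
# The start value plays the role of the rough mode estimate from the samples.
start_value = [0.0, 0.0]
result = optimize(x -> -toy_logdensity(x), start_value, NelderMead())

best = Optim.minimizer(result)
println("Refined mode estimate: ", best)
```

Starting the optimizer from a point that is already close to the mode, as done above with samples_mode, usually keeps the number of required function evaluations small.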

Let's plot the data and the fit function, given the true parameters and the MCMC samples:

plot(-4:0.01:4, fit_function, samples)

plot!(
    normalize(hist, mode=:density),
    color=1, linewidth=2, fillalpha=0.0,
    st = :steps, fill=false, label = "Data",
    title = "Data, True Model and Best Fit"
)

plot!(-4:0.01:4, x -> fit_function(true_par_values, x), color=4, label = "Truth")
savefig("tutorial-data-truth-bestfit.pdf")

Data, True Model and Best Fit

Fine-grained control

BAT provides fine-grained control over the MCMC algorithm options, the MCMC chain initialization, tuning/burn-in strategy and convergence testing. All option values used in the following are the default values; any or all may be omitted.

We'll sample using the Metropolis-Hastings MCMC algorithm:

mcmcalgo = MetropolisHastings(
    weighting = RepetitionWeighting(),
    tuning = AdaptiveMHTuning()
)
MetropolisHastings{Distributions.TDist{Float64}, RepetitionWeighting{Int64}, AdaptiveMHTuning}
  proposal: Distributions.TDist{Float64}
  weighting: RepetitionWeighting{Int64} RepetitionWeighting{Int64}()
  tuning: AdaptiveMHTuning

BAT requires a counter-based random number generator (RNG), since it partitions the RNG space over the MCMC chains. This way, a single RNG seed is sufficient for all chains and results are reproducible even under parallel execution. By default, BAT uses a Philox4x RNG initialized with a random seed drawn from the system entropy pool:

using Random123
rng = Philox4x()
# Pass the RNG created above to the BAT context:
context = BATContext(rng = rng)

By default, MCMC sampling with MetropolisHastings() uses the following options.

For Markov chain initialization:

init = MCMCChainPoolInit()
MCMCChainPoolInit
  init_tries_per_chain: IntervalSets.ClosedInterval{Int64}
  nsteps_init: Int64 1000
  initval_alg: InitFromTarget InitFromTarget()

For the MCMC burn-in procedure:

burnin = MCMCMultiCycleBurnin()
MCMCMultiCycleBurnin
  nsteps_per_cycle: Int64 10000
  max_ncycles: Int64 30
  nsteps_final: Int64 1000

For convergence testing:

convergence = BrooksGelmanConvergence()
BrooksGelmanConvergence
  threshold: Float64 1.1
  corrected: Bool false
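The idea behind this kind of convergence test is to compare the variance between chains with the variance within chains. The sketch below computes the classic Gelman-Rubin statistic for a single parameter in plain Julia. It is a simplified relative of the Brooks-Gelman test and not BAT's implementation. The function gelman_rubin and the toy chains are hypothetical.

```julia
using Random, Statistics

# Classic Gelman-Rubin statistic for several chains of equal length.
function gelman_rubin(chains)
    n = length(chains[1])
    chain_means = [mean(c) for c in chains]
    chain_vars = [var(c) for c in chains]
    W = mean(chain_vars)            # average within-chain variance
    B = n * var(chain_means)        # between-chain variance
    var_plus = (n - 1) / n * W + B / n
    return sqrt(var_plus / W)
end

rng = MersenneTwister(1)

# Four toy chains that all sample the same distribution:
good_chains = [randn(rng, 1000) for _ in 1:4]

# Four toy chains, two of which are stuck around a different value:
bad_chains = [randn(rng, 1000) .+ shift for shift in (0.0, 0.0, 3.0, 3.0)]

println("R for consistent chains:   ", gelman_rubin(good_chains))
println("R for inconsistent chains: ", gelman_rubin(bad_chains))
```

Values close to 1 indicate that the chains agree with each other. A threshold such as the 1.1 shown above separates the two toy cases clearly.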

To generate MCMC samples with explicit control over all options, use something like

samples = bat_sample(
    posterior,
    MCMCSampling(
        mcalg = mcmcalgo,
        nchains = 4,
        nsteps = 10^5,
        init = init,
        burnin = burnin,
        convergence = convergence,
        strict = true,
        store_burnin = false,
        nonzero_weights = true,
        callback = (x...) -> nothing
    ),
    context
).result
[ Info: MCMCChainPoolInit: trying to generate 4 viable MCMC chain(s).
[ Info: Selected 4 MCMC chain(s).
[ Info: Begin tuning of 4 MCMC chain(s).
[ Info: MCMC Tuning cycle 1 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 2 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 3 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 4 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 5 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 6 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 7 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 8 finished, 4 chains, 2 tuned, 4 converged.
[ Info: MCMC Tuning cycle 9 finished, 4 chains, 3 tuned, 0 converged.
[ Info: MCMC Tuning cycle 10 finished, 4 chains, 3 tuned, 0 converged.
[ Info: MCMC Tuning cycle 11 finished, 4 chains, 4 tuned, 4 converged.
[ Info: MCMC tuning of 4 chains successful after 11 cycle(s).
[ Info: Running post-tuning stabilization steps for 4 MCMC chain(s).

Saving result data to files

The package FileIO.jl (in conjunction with JLD2.jl) offers a convenient way to store results like posterior samples to a file:

using FileIO
import JLD2
FileIO.save("results.jld2", Dict("samples" => samples))

JLD2 persists the full information (including value shapes), so you can reload exactly the same data into memory in a new Julia session via

using FileIO
import JLD2
samples = FileIO.load("results.jld2", "samples")

provided you use compatible versions of BAT and its dependencies. Note that JLD2 is not a long-term stable file format. Also note that this functionality is provided by FileIO.jl and JLD2.jl and is not part of the BAT API itself.

BAT.jl itself can write samples to standard HDF5 files in a form suitable for long-term storage (via HDF5.jl):

import HDF5
bat_write("results.h5", samples)

The resulting files have an intuitive HDF5 layout and can be read with the standard HDF5 libraries, so they are easily accessible from other programming languages as well. Not all value shape information can be preserved, though. To read BAT.jl HDF5 sample data, use

using BAT
import HDF5
samples = bat_read("results.h5").result

BAT.jl's HDF5 file format may evolve over time, but future versions of BAT.jl will be able to read HDF5 sample data written by this version of BAT.jl.


This page was generated using Literate.jl.