Tutorial
This tutorial demonstrates a simple application of BAT.jl: A Bayesian fit of a histogram with two Gaussian peaks.
You can also download this tutorial as a Jupyter notebook and a plain Julia source file.
Table of contents:
Note: This tutorial is somewhat verbose, as it aims to be easy to follow for users who are new to Julia. For the same reason, we deliberately avoid making use of Julia features like closures, anonymous functions, broadcasting syntax, performance annotations, etc.
Input Data Generation
First, let's generate some synthetic data to fit. We'll need the Julia standard-library packages "Random", "LinearAlgebra" and "Statistics", as well as the packages "Distributions" and "StatsBase":
using Random, LinearAlgebra, Statistics, Distributions, StatsBase
As the underlying truth of our input data/histogram, let us choose the expected count to follow the sum of two Gaussian peaks with peak areas of 500 and 1000, a mean of -1.0 and 2.0 and a standard error of 0.5. Then
data = vcat(
rand(Normal(-1.0, 0.5), 500),
rand(Normal( 2.0, 0.5), 1000)
)
1500-element Vector{Float64}:
-1.5689960594543395
-1.455124138665964
-1.8170936102020376
-1.4413237007320125
-2.2084024736765784
-0.7046783390954405
-1.5057031810132524
-0.8796262212327188
-1.4430223465175618
-0.8179963861383126
⋮
2.4300188826969635
2.49645322283789
1.4396217769023347
2.878638853756991
2.5317572380540248
0.9825789980042452
2.186136852435033
2.1072135399888463
2.475497762510633
resulting in a vector of floating-point numbers:
typeof(data) == Vector{Float64}
true
Next, we'll create a histogram of that data, this histogram will serve as the input for the Bayesian fit:
hist = append!(Histogram(-2:0.1:4), data)
StatsBase.Histogram{Int64, 1, Tuple{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}}
edges:
-2.0:0.1:4.0
weights: [3, 13, 11, 24, 34, 29, 25, 32, 35, 46 … 9, 4, 3, 3, 1, 1, 0, 0, 0, 0]
closed: left
isdensity: false
Using the Julia "Plots" package
using Plots
we can plot the histogram:
plot(
normalize(hist, mode=:density),
st = :steps, label = "Data",
title = "Data"
)
savefig("tutorial-data.pdf")
Let's define our fit function - the function that we expect to describe the data histogram, at each x-Axis position x
, depending on a given set p
of model parameters:
function fit_function(p::NamedTuple{(:a, :mu, :sigma)}, x::Real)
p.a[1] * pdf(Normal(p.mu[1], p.sigma), x) +
p.a[2] * pdf(Normal(p.mu[2], p.sigma), x)
end
The fit parameters (model parameters) a
(peak areas) and mu
(peak means) are vectors, parameter sigma
(peak width) is a scalar, we assume it's the same for both Gaussian peaks.
The true values for the model/fit parameters are the values we used to generate the data:
true_par_values = (a = [500, 1000], mu = [-1.0, 2.0], sigma = 0.5)
Let's visually compare the histogram and the fit function, using these true parameter values, to make sure everything is set up correctly:
plot(
normalize(hist, mode=:density),
st = :steps, label = "Data",
title = "Data and True Statistical Model"
)
plot!(
-4:0.01:4, x -> fit_function(true_par_values, x),
label = "Truth"
)
savefig("tutorial-data-and-truth.pdf")
Bayesian Fit
Now we'll perform a Bayesian fit of the generated histogram, using BAT, to infer the model parameters from the data histogram.
In addition to the Julia packages loaded above, we need BAT itself, as well as IntervalSets:
using BAT, DensityInterface, IntervalSets
Likelihood Definition
First, we need to define the likelihood for our problem.
BAT expects likelihoods to implements the DensityInterface
API. We can simply wrap a log-likelihood function with DensityInterface.logfuncdensity
to make it compatible.
For performance reasons, functions should not access global variables directly. So we'll use an anonymous function inside of a let-statement to capture the value of the global variable hist
in a local variable h
(and to shorten function name fit_function
to f
, purely for convenience). DensityInterface.logfuncdensity
then turns the log-likelihood function into a DensityInterface
density object.
likelihood = let h = hist, f = fit_function
# Histogram counts for each bin as an array:
observed_counts = h.weights
# Histogram binning:
bin_edges = h.edges[1]
bin_edges_left = bin_edges[1:end-1]
bin_edges_right = bin_edges[2:end]
bin_widths = bin_edges_right - bin_edges_left
bin_centers = (bin_edges_right + bin_edges_left) / 2
logfuncdensity(function (params)
# Log-likelihood for a single bin:
function bin_log_likelihood(i)
# Simple mid-point rule integration of fit function `f` over bin:
expected_counts = bin_widths[i] * f(params, bin_centers[i])
# Avoid zero expected counts for numerical stability:
logpdf(Poisson(expected_counts + eps(expected_counts)), observed_counts[i])
end
# Sum log-likelihood over bins:
idxs = eachindex(observed_counts)
ll_value = bin_log_likelihood(idxs[1])
for i in idxs[2:end]
ll_value += bin_log_likelihood(i)
end
return ll_value
end)
end
LogFuncDensity(Main.var"#3#4"{StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, Vector{Int64}, typeof(Main.fit_function)}(-1.95:0.1:3.95, StepRangeLen(0.1, 0.0, 60), [3, 13, 11, 24, 34, 29, 25, 32, 35, 46 … 9, 4, 3, 3, 1, 1, 0, 0, 0, 0], Main.fit_function))
BAT makes use of Julia's parallel programming facilities if possible, e.g. to run multiple Markov chains in parallel. Therefore, log-likelihood (and other) code must be thread-safe. Mark non-thread-safe code with @critical
(provided by Julia package ParallelProcessingTools
).
Support for automatic parallelization across multiple (local and remote) Julia processes is planned, but not implemented yet.
Note that Julia currently starts only a single thread by default. Set the the environment variable JULIA_NUM_THREADS
to specify the desired number of Julia threads.
We can evaluate likelihood
, e.g. at the true parameter values:
logdensityof(likelihood, true_par_values)
-156.30498160595735
Prior Definition
Next, we need to choose a sensible prior for the fit:
prior = distprod(
a = [Weibull(1.1, 5000), Weibull(1.1, 5000)],
mu = [-2.0..0.0, 1.0..3.0],
sigma = Weibull(1.2, 2)
)
BAT supports most Distributions.Distribution
types, and combinations of them, as priors.
Bayesian Model Definition
Given the likelihood and prior definition, a BAT.PosteriorMeasure
is simply defined via
posterior = PosteriorMeasure(likelihood, prior)
Parameter Space Exploration via MCMC
We can now use Markov chain Monte Carlo (MCMC) to explore the space of possible parameter values for the histogram fit.
To increase the verbosity level of BAT logging output, you may want to set the Julia logging level for BAT to debug via ENV["JULIA_DEBUG"] = "BAT"
.
Now we can generate a set of MCMC samples via bat_sample
. We'll use 4 MCMC chains with 10^5 MC steps in each chain (after tuning/burn-in):
samples = bat_sample(posterior, MCMCSampling(mcalg = MetropolisHastings(), nsteps = 10^5, nchains = 4)).result
[ Info: Setting new default BAT context BATContext{Float64}(Random123.Philox4x{UInt64, 10}(0xb8be6d4ca0ff546a, 0x490bd0999c16a1dc, 0x28626d81bfc0fb8a, 0x1d51f28f66704e1f, 0xc91c9b11862880d0, 0x13d9179a5bc592d1, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0), HeterogeneousComputing.CPUnit(), BAT._NoADSelected())
[ Info: MCMCChainPoolInit: trying to generate 4 viable MCMC chain(s).
[ Info: Selected 4 MCMC chain(s).
[ Info: Begin tuning of 4 MCMC chain(s).
[ Info: MCMC Tuning cycle 1 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 2 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 3 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 4 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 5 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 6 finished, 4 chains, 0 tuned, 4 converged.
[ Info: MCMC Tuning cycle 7 finished, 4 chains, 0 tuned, 4 converged.
[ Info: MCMC Tuning cycle 8 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 9 finished, 4 chains, 2 tuned, 4 converged.
[ Info: MCMC Tuning cycle 10 finished, 4 chains, 3 tuned, 4 converged.
[ Info: MCMC Tuning cycle 11 finished, 4 chains, 4 tuned, 4 converged.
[ Info: MCMC tuning of 4 chains successful after 11 cycle(s).
[ Info: Running post-tuning stabilization steps for 4 MCMC chain(s).
Let's calculate some statistics on the posterior samples:
println("Truth: $true_par_values")
println("Mode: $(mode(samples))")
println("Mean: $(mean(samples))")
println("Stddev: $(std(samples))")
Truth: (a = [500, 1000], mu = [-1.0, 2.0], sigma = 0.5)
Mode: (a = [508.1437600117665, 1000.4543344985999], mu = [-1.0360951351800272, 1.9842576309107223], sigma = 0.508178264111625)
Mean: (a = [509.3665686119413, 1002.4492974571698], mu = [-1.0352770044691757, 1.983766054391005], sigma = 0.5092521316104394)
Stddev: (a = [23.328405699795475, 32.45609254853062], mu = [0.02591628747582982, 0.016099382634551372], sigma = 0.010147142587731497)
Internally, BAT often needs to represent variates as flat real-valued vectors:
unshaped_samples, f_flatten = bat_transform(Vector, samples)
(result = DensitySampleVector(length = 117419, varshape = ValueShapes.ArrayShape{Float64, 1}((5,))), trafo = Base.Fix2{typeof(ValueShapes.unshaped), ValueShapes.NamedTupleShape{(:a, :mu, :sigma), Tuple{ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real, 1}}, ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real, 1}}, ValueShapes.ValueAccessor{ValueShapes.ScalarShape{Real}}}, NamedTuple}}(ValueShapes.unshaped, NamedTupleShape((a = ValueShapes.ArrayShape{Real, 1}((2,)), mu = ValueShapes.ArrayShape{Real, 1}((2,)), sigma = ValueShapes.ScalarShape{Real}()))), optargs = (algorithm = BAT.UnshapeTransformation(), context = BATContext{Float64}(Random123.Philox4x{UInt64, 10}(0x487f0fb1bfbbe09e, 0x38610b9ec047cffd, 0xc63e4665b673955c, 0x4ab58e42e6bc9b0f, 0xc91c9b11862880d0, 0x13d9179a5bc592d1, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x8000020100000000, 0), HeterogeneousComputing.CPUnit(), BAT._NoADSelected())))
The statisics above (mode, mean and std-dev) are presented in shaped form. However, it's not possible to represent statistics with matrix shape, e.g. the parameter covariance matrix, this way. So the covariance has to be accessed in unshaped form:
par_cov = cov(unshaped_samples)
println("Covariance: $par_cov")
Covariance: [544.2145124942574 6.1923411106082895 -0.06215096856124491 -0.001967743655622638 0.015562896005884501; 6.1923411106082895 1053.3979435188048 -0.018309861357045317 -0.005282915094924627 0.00911048313352157; -0.06215096856124491 -0.018309861357045317 0.0006716539565298537 1.5813057301773816e-5 -4.251671494406835e-5; -0.001967743655622638 -0.005282915094924627 1.5813057301773816e-5 0.000259190121213694 -1.5871844629512957e-6; 0.015562896005884501 0.00911048313352157 -4.251671494406835e-5 -1.5871844629512957e-6 0.00010296450269575466]
Use bat_report
to generate an overview of the sampling result and parameter estimates (based on the marginal distributions):
bat_report(samples)
Sampling result
Total number of samples: 117419
Total weight of samples: 399994
Effective sample size: between 1362 and 8888
Marginals
Parameter | Mean | Std. dev. | Gobal mode | Marg. mode | Cred. interval | Histogram |
---|---|---|---|---|---|---|
a[1] | 509.367 | 23.3284 | 508.144 | 510.0 | 484.81 .. 531.667 | ⠀⠀⠀⠀⠀420[⠀⠀⠀⠀⠀⠀⠀⠀▁▁▂▃▄▅▆▇█████▇▆▅▄▃▂▁▁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀[613⠀⠀⠀⠀⠀ |
a[2] | 1002.45 | 32.4561 | 1000.45 | 1010.0 | 968.318 .. 1033.04 | ⠀⠀⠀⠀⠀880[⠀⠀⠀⠀⠀⠀⠀⠀▁▁▂▃▃▄▆▇▇████▇▆▅▄▃▂▁▁▁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀[1.14e+03 |
mu[1] | -1.03528 | 0.0259163 | -1.0361 | -1.03 | -1.06272 .. -1.0107 | ⠀⠀⠀-1.14[⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀▁▂▃▃▅▅▇▇████▇▆▅▄▃▂▁▁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀[-0.924⠀⠀ |
mu[2] | 1.98377 | 0.0160994 | 1.98426 | 1.985 | 1.9684 .. 2.00068 | ⠀⠀⠀⠀1.92[⠀⠀⠀⠀⠀⠀⠀⠀⠀▁▂▂▃▄▅▆▇█████▇▆▅▄▃▂▁▁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀[2.05⠀⠀⠀⠀ |
sigma | 0.509252 | 0.0101471 | 0.508178 | 0.5075 | 0.498555 .. 0.518831 | ⠀⠀⠀0.465[⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀▁▂▂▃▅▆▇█████▇▅▄▃▂▁▁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀[0.555⠀⠀⠀ |
Visualization of Results
BAT.jl comes with an extensive set of plotting recipes for "Plots.jl". We can plot the marginalized distribution for a single parameter (e.g. parameter 3, i.e. μ[1]):
plot(
samples, :(mu[1]),
mean = true, std = true, globalmode = true, marginalmode = true,
nbins = 50, title = "Marginalized Distribution for mu[1]"
)
savefig("tutorial-single-par.pdf")
or plot the marginalized distribution for a pair of parameters (e.g. parameters 3 and 5, i.e. μ[1] and σ), including information from the parameter stats:
plot(
samples, (:(mu[1]), :sigma),
mean = true, std = true, globalmode = true, marginalmode = true,
nbins = 50, title = "Marginalized Distribution for mu[1] and sigma"
)
plot!(BAT.MCMCBasicStats(samples), (3, 5))
savefig("tutorial-param-pair.png")
We can also create an overview plot of the marginalized distribution for all pairs of parameters:
plot(
samples,
mean = false, std = false, globalmode = true, marginalmode = false,
nbins = 50
)
savefig("tutorial-all-params.png")
Integration with Tables.jl
DensitySamplesVector
supports the Tables.jl interface, so it is a table itself. We can also convert it to other table types, e.g. a TypedTables.Table
:
using TypedTables
tbl = Table(samples)
Table with 5 columns and 117419 rows:
v logd weight info aux
┌──────────────────────────────────────────────────────────────────────────
1 │ (a = [548.033, 1005.0… -176.678 1 MCMCSampleID(1, 14, 0… nothing
2 │ (a = [546.271, 1003.8… -176.467 9 MCMCSampleID(1, 14, 1… nothing
3 │ (a = [542.193, 1001.8… -176.019 16 MCMCSampleID(1, 14, 1… nothing
4 │ (a = [551.712, 1005.3… -176.389 6 MCMCSampleID(1, 14, 2… nothing
5 │ (a = [556.575, 1007.8… -177.293 1 MCMCSampleID(1, 14, 3… nothing
6 │ (a = [550.753, 1008.4… -178.085 4 MCMCSampleID(1, 14, 3… nothing
7 │ (a = [555.371, 1012.1… -178.378 3 MCMCSampleID(1, 14, 3… nothing
8 │ (a = [553.784, 1014.2… -177.394 1 MCMCSampleID(1, 14, 4… nothing
9 │ (a = [553.49, 1007.3]… -177.331 19 MCMCSampleID(1, 14, 4… nothing
10 │ (a = [553.802, 1003.5… -178.748 4 MCMCSampleID(1, 14, 6… nothing
11 │ (a = [546.972, 1007.2… -177.445 1 MCMCSampleID(1, 14, 6… nothing
12 │ (a = [562.842, 984.15… -178.241 1 MCMCSampleID(1, 14, 6… nothing
13 │ (a = [560.343, 984.65… -180.905 5 MCMCSampleID(1, 14, 6… nothing
14 │ (a = [550.945, 997.18… -180.354 3 MCMCSampleID(1, 14, 7… nothing
15 │ (a = [547.339, 994.24… -180.693 1 MCMCSampleID(1, 14, 7… nothing
16 │ (a = [551.799, 991.47… -178.668 1 MCMCSampleID(1, 14, 7… nothing
17 │ (a = [549.75, 986.093… -180.044 7 MCMCSampleID(1, 14, 7… nothing
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
or a DataFrames.DataFrame
, etc.
Comparison of Truth and Best Fit
As a final step, we retrieve the parameter values at the mode, representing the best-fit parameters
samples_mode = mode(samples)
(a = [508.1437600117665, 1000.4543344985999], mu = [-1.0360951351800272, 1.9842576309107223], sigma = 0.508178264111625)
Like the samples themselves, the result can be viewed in both shaped and unshaped form. samples_mode
is presented as a 0-dimensional array that contains a NamedTuple, this representation preserves the shape information:
samples_mode isa NamedTuple
true
samples_mode
is only an estimate of the mode of the posterior distribution. It can be further refined using bat_findmode
:
using Optim
findmode_result = bat_findmode(
posterior,
OptimAlg(optalg = Optim.NelderMead(), init = ExplicitInit([samples_mode]))
)
fit_par_values = findmode_result.result
(a = [507.51195919612974, 999.9344743442189], mu = [-1.034452167714924, 1.98417519259745], sigma = 0.5083294545477082)
Let's plot the data and fit function given the true parameters and MCMC samples
plot(-4:0.01:4, fit_function, samples)
plot!(
normalize(hist, mode=:density),
color=1, linewidth=2, fillalpha=0.0,
st = :steps, fill=false, label = "Data",
title = "Data, True Model and Best Fit"
)
plot!(-4:0.01:4, x -> fit_function(true_par_values, x), color=4, label = "Truth")
savefig("tutorial-data-truth-bestfit.pdf")
Fine-grained control
BAT provides fine-grained control over the MCMC algorithm options, the MCMC chain initialization, tuning/burn-in strategy and convergence testing. All option value used in the following are the default values, any or all may be omitted.
We'll sample using the The Metropolis-Hastings MCMC algorithm:
mcmcalgo = MetropolisHastings(
weighting = RepetitionWeighting(),
tuning = AdaptiveMHTuning()
)
MetropolisHastings{Distributions.TDist{Float64}, RepetitionWeighting{Int64}, AdaptiveMHTuning}
proposal: Distributions.TDist{Float64}
weighting: RepetitionWeighting{Int64} RepetitionWeighting{Int64}()
tuning: AdaptiveMHTuning
BAT requires a counter-based random number generator (RNG), since it partitions the RNG space over the MCMC chains. This way, a single RNG seed is sufficient for all chains and results are reproducible even under parallel execution. By default, BAT uses a Philox4x RNG initialized with a random seed drawn from the system entropy pool:
using Random123
rng = Philox4x()
context = BATContext(rng = Philox4x())
By default, MetropolisHastings()
uses the following options.
For Markov chain initialization:
init = MCMCChainPoolInit()
MCMCChainPoolInit
init_tries_per_chain: IntervalSets.ClosedInterval{Int64}
nsteps_init: Int64 1000
initval_alg: InitFromTarget InitFromTarget()
For the MCMC burn-in procedure:
burnin = MCMCMultiCycleBurnin()
MCMCMultiCycleBurnin
nsteps_per_cycle: Int64 10000
max_ncycles: Int64 30
nsteps_final: Int64 1000
For convergence testing:
convergence = BrooksGelmanConvergence()
BrooksGelmanConvergence
threshold: Float64 1.1
corrected: Bool false
To generate MCMC samples with explicit control over all options, use something like
samples = bat_sample(
posterior,
MCMCSampling(
mcalg = mcmcalgo,
nchains = 4,
nsteps = 10^5,
init = init,
burnin = burnin,
convergence = convergence,
strict = true,
store_burnin = false,
nonzero_weights = true,
callback = (x...) -> nothing
),
context
).result
[ Info: MCMCChainPoolInit: trying to generate 4 viable MCMC chain(s).
[ Info: Selected 4 MCMC chain(s).
[ Info: Begin tuning of 4 MCMC chain(s).
[ Info: MCMC Tuning cycle 1 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 2 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 3 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 4 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 5 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 6 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 7 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 8 finished, 4 chains, 2 tuned, 4 converged.
[ Info: MCMC Tuning cycle 9 finished, 4 chains, 3 tuned, 0 converged.
[ Info: MCMC Tuning cycle 10 finished, 4 chains, 3 tuned, 0 converged.
[ Info: MCMC Tuning cycle 11 finished, 4 chains, 4 tuned, 4 converged.
[ Info: MCMC tuning of 4 chains successful after 11 cycle(s).
[ Info: Running post-tuning stabilization steps for 4 MCMC chain(s).
Saving result data to files
The package FileIO.jl(in conjunction with JLD2.jl) offers a convenient way to store results like posterior samples to file:
using FileIO
import JLD2
FileIO.save("results.jld2", Dict("samples" => samples))
JLD2 persists the full information (including value shapes), so you can reload exactly the same data into memory in a new Julia session via
using FileIO
import JLD2
samples = FileIO.load("results.jld2", "samples")
provided you use compatible versions of BAT and it's dependencies. Note that JLD2 is not a long-term stable file format. Also note that this functionality is provided by FileIO.jl and JLD2.jl and not part of the BAT API itself.
BAT.jl itself can write samples to standard HDF5 files in a form suitable for long-term storage (via HDF5.jl):
import HDF5
bat_write("results.h5", samples)
The resulting files have an intuitive HDF5 layout and can be read with the standard HDF5 libraries, so they are easily accessible from other programming languages as well. Not all value shape information can be preserved, though. To read BAT.jl HDF5 sample data, use
using BAT
import HDF5
samples = bat_read("results.h5").result
BAT.jl's HDF5 file format may evolve over time, but future versions of BAT.jl will be able to read HDF5 sample data written by this version of BAT.jl.
This page was generated using Literate.jl.