Tutorial
This tutorial demonstrates a simple application of BAT.jl: A Bayesian fit of a histogram with two Gaussian peaks.
You can also download this tutorial as a Jupyter notebook and a plain Julia source file.
Note: This tutorial is somewhat verbose, as it aims to be easy to follow for users who are new to Julia. For the same reason, we deliberately avoid making use of Julia features like closures, anonymous functions, broadcasting syntax, performance annotations, etc.
Input Data Generation
First, let's generate some synthetic data to fit. We'll need the Julia standard-library packages "Random", "LinearAlgebra" and "Statistics", as well as the packages "Distributions" and "StatsBase":
using Random, LinearAlgebra, Statistics, Distributions, StatsBase
As the underlying truth of our input data/histogram, let us choose a non-normalized probability density composed of two Gaussian peaks with peak areas of 500 and 1000, means of -1.0 and 2.0, and a standard deviation of 0.5.
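Written out, with φ(x; μ, σ) denoting the normal probability density, this truth density is

f(x) = 500 · φ(x; -1.0, 0.5) + 1000 · φ(x; 2.0, 0.5)

Drawing the corresponding random samples yields our synthetic data: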
data = vcat(
rand(Normal(-1.0, 0.5), 500),
rand(Normal( 2.0, 0.5), 1000)
)
1500-element Array{Float64,1}:
-0.9965862777739484
-1.100958528042784
-1.7684412893372783
-0.972773414600595
-1.0240778676982736
-1.148930690209156
-1.1706315740626705
-1.8827381081469277
-0.9635572488963757
-2.0403454960823817
⋮
1.9504228798132748
1.8020337765914531
2.173248713879491
2.432215976389577
1.5707604522562988
1.61357134454585
2.437960355666802
2.652206120574844
2.055663468776079
resulting in a vector of floating-point numbers:
typeof(data) == Vector{Float64}
true
Next, we'll create a histogram of that data; this histogram will serve as the input for the Bayesian fit:
hist = append!(Histogram(-2:0.1:4), data)
StatsBase.Histogram{Int64,1,Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}
edges:
-2.0:0.1:4.0
weights: [8, 4, 9, 13, 28, 21, 27, 42, 40, 35 … 6, 4, 4, 1, 3, 1, 0, 0, 0, 0]
closed: left
isdensity: false
Using the Julia "Plots" package
using Plots
we can plot the histogram:
plot(
normalize(hist, mode=:density),
st = :steps, label = "Data",
title = "Data"
)
savefig("tutorial-data.pdf")
Let's define our fit function - the function that we expect to describe the data histogram, at each x-axis position x, depending on a given set p of model parameters:
function fit_function(p::NamedTuple{(:a, :mu, :sigma)}, x::Real)
p.a[1] * pdf(Normal(p.mu[1], p.sigma), x) +
p.a[2] * pdf(Normal(p.mu[2], p.sigma), x)
end
The fit parameters (model parameters) a (peak areas) and mu (peak means) are vectors; the parameter sigma (peak width) is a scalar, as we assume it is the same for both Gaussian peaks.
The true values for the model/fit parameters are the values we used to generate the data:
true_par_values = (a = [500, 1000], mu = (-1.0, 2.0), sigma = 0.5)
Let's visually compare the histogram and the fit function, using these true parameter values, to make sure everything is set up correctly:
plot(
normalize(hist, mode=:density),
st = :steps, label = "Data",
title = "Data and True Statistical Model"
)
plot!(
-4:0.01:4, x -> fit_function(true_par_values, x),
label = "Truth"
)
savefig("tutorial-data-and-truth.pdf")
Bayesian Fit
Now we'll perform a Bayesian fit of the generated histogram, using BAT, to infer the model parameters from the data histogram.
In addition to the Julia packages loaded above, we need BAT itself, as well as IntervalSets:
using BAT, IntervalSets
Likelihood Definition
First, we need to define the likelihood (function) for our problem.
BAT represents densities like likelihoods and priors as subtypes of BAT.AbstractDensity. A custom likelihood can be defined by creating a new subtype of AbstractDensity and implementing (at minimum) BAT.density_logval for that type - this may become necessary in complex use cases. Typically, however, it is sufficient to define a custom likelihood as a simple function that returns the log-likelihood value for a given set of parameters; BAT will automatically convert such a log-likelihood function into a subtype of AbstractDensity.
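For illustration, here is a minimal sketch (not part of the tutorial) of what such a custom density type could look like. The type name HistogramLikelihood is hypothetical, and we assume the two-argument form BAT.density_logval(density, params) mentioned above; the actual likelihood computation is elided:

struct HistogramLikelihood <: BAT.AbstractDensity
    hist::Histogram
end

function BAT.density_logval(likelihood::HistogramLikelihood, params)
    # Compute and return the log-likelihood of `params` given `likelihood.hist`,
    # e.g. via the same per-bin Poisson calculation used in the function below.
end

We will not need this here and instead use a plain log-likelihood function.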
For performance reasons, functions should not access global variables directly. So we'll use an anonymous function inside a let statement to capture the value of the global variable hist in a local variable h (and to shorten the function name fit_function to f, purely for convenience):
log_likelihood = let h = hist, f = fit_function
params -> begin
# Histogram counts for each bin as an array:
counts = h.weights
# Histogram binning, has length (length(counts) + 1):
binning = h.edges[1]
# sum log-likelihood value over bins:
ll_value::Float64 = 0.0
for i in eachindex(counts)
# Get information about current bin:
bin_left, bin_right = binning[i], binning[i+1]
bin_width = bin_right - bin_left
bin_center = (bin_right + bin_left) / 2
observed_counts = counts[i]
# Simple mid-point rule integration of fit function `f` over bin:
expected_counts = bin_width * f(params, bin_center)
# Add log of Poisson probability for bin:
ll_value += logpdf(Poisson(expected_counts), observed_counts)
end
return ll_value
end
end
#3 (generic function with 1 method)
BAT makes use of Julia's parallel programming facilities if possible, e.g. to run multiple Markov chains in parallel. Therefore, log-likelihood (and other) code must be thread-safe. Mark non-thread-safe code with @critical (provided by the Julia package ParallelProcessingTools).
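As a hedged illustration (not part of the tutorial, and assuming @critical simply wraps its body in a global lock): if the likelihood needed to modify some shared state - say, pushing every evaluated parameter set to a hypothetical global vector evaluated_params - that modification should be protected like this:

using ParallelProcessingTools

evaluated_params = []  # hypothetical shared state, not safe to modify from several threads at once

log_likelihood_recording = let ll = log_likelihood, recorded = evaluated_params
    params -> begin
        # Concurrent push! on a shared vector is not thread-safe, so guard it:
        @critical begin
            push!(recorded, params)
        end
        ll(params)
    end
end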
BAT requires Julia v1.3 or newer to use multi-threading. Support for automatic parallelization across multiple (local and remote) Julia processes is planned, but not implemented yet.
Note that Julia currently starts only a single thread by default. Set the environment variable JULIA_NUM_THREADS to specify the desired number of Julia threads.
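For example, after starting Julia with JULIA_NUM_THREADS=4 set in the environment (a hedged example - choose a number that suits your machine; the variable must be set before Julia starts), you can verify the number of available threads from within Julia:

Threads.nthreads()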
We can evaluate log_likelihood, e.g. for the true parameter values:
log_likelihood(true_par_values)
-150.9069276690886
Prior Definition
Next, we need to choose a sensible prior for the fit:
prior = NamedTupleDist(
a = [0.0..10.0^4, 0.0..10.0^4],
mu = [-2.0..0.0, 1.0..3.0],
sigma = Truncated(Normal(0.4, 2), 0.3, 0.7)
)
In general, BAT allows instances of any subtype of AbstractDensity to be used as a prior, as long as a sampler is defined for it. This way, users may implement complex application-specific priors. You can also use convert(AbstractDensity, distribution) to convert any continuous multivariate Distributions.Distribution to a BAT.AbstractDensity that can be used as a prior (or likelihood).
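As a hedged example (not part of the tutorial; the MvNormal distribution below is an arbitrary choice purely for illustration), such a conversion could look like this:

some_dist = MvNormal([0.0, 0.0], Matrix(1.0I, 2, 2))
some_density = convert(BAT.AbstractDensity, some_dist)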
The prior also implies the shapes of the parameters:
using ValueShapes
parshapes = valshape(prior)
ValueShapes.NamedTupleShape{(:a, :mu, :sigma),Tuple{ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real,1}},ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real,1}},ValueShapes.ValueAccessor{ValueShapes.ScalarShape{Real}}}}((a = ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real,1}}(ValueShapes.ArrayShape{Real,1}((2,)), 0, 2), mu = ValueShapes.ValueAccessor{ValueShapes.ArrayShape{Real,1}}(ValueShapes.ArrayShape{Real,1}((2,)), 2, 2), sigma = ValueShapes.ValueAccessor{ValueShapes.ScalarShape{Real}}(ValueShapes.ScalarShape{Real}(), 4, 1)), 5)
These will come in handy later on, e.g. to access (the posterior distribution of) individual parameter values.
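For a quick illustration (a hedged sketch, not part of the tutorial, assuming the shape can be applied directly to a flat parameter vector of matching length - 5 entries here, as reported by the shape above):

x_flat = [500.0, 1000.0, -1.0, 2.0, 0.5]  # flat parameter vector: a..., mu..., sigma
x_shaped = parshapes(x_flat)
# x_shaped.a, x_shaped.mu and x_shaped.sigma then give access to the individual parameters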
Bayesian Model Definition
Given the likelihood and prior definition, a BAT.PosteriorDensity is simply defined via
posterior = PosteriorDensity(log_likelihood, prior)
Parameter Space Exploration via MCMC
We can now use Markov chain Monte Carlo (MCMC) to explore the space of possible parameter values for the histogram fit.
To increase the verbosity level of BAT logging output, you may want to set the Julia logging level for BAT to debug via ENV["JULIA_DEBUG"] = "BAT".
Let's use 4 MCMC chains and require 10^4 unique samples from each chain (after tuning/burn-in):
nsamples = 10^4
nchains = 4
Now we can generate a set of MCMC samples via bat_sample:
samples, stats = bat_sample(posterior, (nsamples, nchains), MetropolisHastings())
[ Info: Trying to generate 4 viable MCMC chain(s).
[ Info: Selected 4 MCMC chain(s).
[ Info: Begin tuning of 4 MCMC chain(s).
[ Info: MCMC Tuning cycle 1 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 2 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 3 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 4 finished, 4 chains, 2 tuned, 0 converged.
[ Info: MCMC Tuning cycle 5 finished, 4 chains, 2 tuned, 0 converged.
[ Info: MCMC Tuning cycle 6 finished, 4 chains, 2 tuned, 0 converged.
[ Info: MCMC Tuning cycle 7 finished, 4 chains, 2 tuned, 0 converged.
[ Info: MCMC Tuning cycle 8 finished, 4 chains, 2 tuned, 4 converged.
[ Info: MCMC Tuning cycle 9 finished, 4 chains, 4 tuned, 0 converged.
[ Info: MCMC Tuning cycle 10 finished, 4 chains, 4 tuned, 4 converged.
[ Info: MCMC tuning of 4 chains successful after 10 cycle(s).
Let's look at some posterior statistics, using the stats object returned by bat_sample, and print the results:
println("Truth: $true_par_values")
println("Mode: $(stats.mode)")
println("Mean: $(stats.mean)")
println("Covariance: $(stats.cov)")
Truth: (a = [500, 1000], mu = (-1.0, 2.0), sigma = 0.5)
Mode: [497.0033008718692, 994.9439541242684, -0.9729832967289617, 1.9910596819357194, 0.5011324043786379]
Mean: [499.61178944258944, 1001.0639273713308, -0.9724145751698856, 1.9919996024736981, 0.5006856971094287]
Covariance: [519.7073181996485 -8.764628270490316 -0.05310194632972877 -0.0022147984018210665 0.012118266466568751; -8.76462827049033 939.722888184015 -0.0004791285056013602 -0.0007097942852209447 -0.0006982890818712869; -0.05310194632972876 -0.0004791285056013712 0.0006083468017295504 1.851571229041153e-5 -2.8544261954095032e-5; -0.0022147984018210587 -0.0007097942852209576 1.8515712290411537e-5 0.0002606024001735707 -2.928833037171619e-6; 0.01211826646656874 -0.0006982890818712793 -2.8544261954095005e-5 -2.92883303717161e-6 9.761424623436907e-5]
We can also, e.g., get the Pearson auto-correlation of the parameters:
cor(samples.params, FrequencyWeights(samples.weight))
5×5 Array{Float64,2}:
1.0 -0.0125416 -0.0944399 -0.00601819 0.0538028
-0.0125416 1.0 -0.000633689 -0.00143431 -0.00230557
-0.0944399 -0.000633689 1.0 0.0465024 -0.117135
-0.00601819 -0.00143431 0.0465024 1.0 -0.0183632
0.0538028 -0.00230557 -0.117135 -0.0183632 1.0
Visualization of Results
BAT.jl comes with an extensive set of plotting recipes for "Plots.jl". We can plot the marginalized distribution for a single parameter (e.g. parameter 3, i.e. μ₁):
par_names = ["a_1", "a_2", "mu_1", "mu_2", "sigma"]
plot(
samples, 3,
mean = true, std_dev = true, globalmode = true, localmode = true,
nbins = 50, xlabel = par_names[3], ylabel = "P($(par_names[3]))",
title = "Marginalized Distribution for mu_1"
)
savefig("tutorial-single-par.pdf")
or plot the marginalized distribution for a pair of parameters (e.g. parameters 3 and 5, i.e. μ₁ and σ), including information from the parameter stats:
plot(
samples, (3, 5),
mean = true, std_dev = true, globalmode = true, localmode = true,
nbins = 50, xlabel = par_names[3], ylabel = par_names[5],
title = "Marginalized Distribution for mu_1 and sigma"
)
plot!(MCMCBasicStats(samples), (3, 5))
savefig("tutorial-param-pair.pdf")
We can also create an overview plot of the marginalized distribution for all pairs of parameters:
plot(
samples,
mean = false, std_dev = false, globalmode = true, localmode = false,
nbins = 50
)
savefig("tutorial-all-params.pdf")
Integration with Tables.jl
BAT.jl supports the Tables.jl interface, so we can also convert the vector of MCMC samples to a table, e.g. using TypedTables.jl (http://blog.roames.com/TypedTables.jl/stable/):
using TypedTables
tbl = Table(samples)
Table with 5 columns and 40000 rows:
params log_posterior log_prior weight ⋯
┌──────────────────────────────────────────────────────────
1 │ [512.278, 1019.35, … -170.28 -18.8891 2 ⋯
2 │ [514.811, 1012.33, … -169.974 -18.8891 2 ⋯
3 │ [492.27, 1003.27, -… -171.441 -18.8893 4 ⋯
4 │ [486.36, 991.03, -1… -172.216 -18.8894 7 ⋯
5 │ [472.63, 1027.36, -… -172.013 -18.8891 4 ⋯
6 │ [473.823, 1022.13, … -172.162 -18.889 2 ⋯
7 │ [474.37, 1014.7, -0… -170.51 -18.8891 7 ⋯
8 │ [486.645, 985.702, … -169.692 -18.8892 2 ⋯
9 │ [472.88, 976.48, -0… -171.144 -18.8887 5 ⋯
10 │ [474.681, 979.198, … -170.468 -18.8888 1 ⋯
11 │ [462.692, 1002.06, … -170.771 -18.8888 2 ⋯
12 │ [447.036, 1001.08, … -172.218 -18.8888 2 ⋯
13 │ [452.303, 979.876, … -171.828 -18.8888 1 ⋯
14 │ [471.593, 1039.02, … -170.878 -18.8889 3 ⋯
15 │ [478.782, 1033.2, -… -171.557 -18.8889 8 ⋯
16 │ [481.404, 1017.95, … -172.558 -18.8887 2 ⋯
17 │ [504.774, 1047.43, … -172.804 -18.8891 4 ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱
Using the parameter shapes, we can generate a table with named parameters, instead of flat real-valued parameter vectors:
tbl_named = parshapes.(samples)
Table with 5 columns and 40000 rows:
params log_posterior log_prior weight ⋯
┌──────────────────────────────────────────────────────────
1 │ (a = [512.278, 1019… -170.28 -18.8891 2 ⋯
2 │ (a = [514.811, 1012… -169.974 -18.8891 2 ⋯
3 │ (a = [492.27, 1003.… -171.441 -18.8893 4 ⋯
4 │ (a = [486.36, 991.0… -172.216 -18.8894 7 ⋯
5 │ (a = [472.63, 1027.… -172.013 -18.8891 4 ⋯
6 │ (a = [473.823, 1022… -172.162 -18.889 2 ⋯
7 │ (a = [474.37, 1014.… -170.51 -18.8891 7 ⋯
8 │ (a = [486.645, 985.… -169.692 -18.8892 2 ⋯
9 │ (a = [472.88, 976.4… -171.144 -18.8887 5 ⋯
10 │ (a = [474.681, 979.… -170.468 -18.8888 1 ⋯
11 │ (a = [462.692, 1002… -170.771 -18.8888 2 ⋯
12 │ (a = [447.036, 1001… -172.218 -18.8888 2 ⋯
13 │ (a = [452.303, 979.… -171.828 -18.8888 1 ⋯
14 │ (a = [471.593, 1039… -170.878 -18.8889 3 ⋯
15 │ (a = [478.782, 1033… -171.557 -18.8889 8 ⋯
16 │ (a = [481.404, 1017… -172.558 -18.8887 2 ⋯
17 │ (a = [504.774, 1047… -172.804 -18.8891 4 ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱
We can now, e.g., find the sample with the maximum posterior value (i.e. the mode):
mode_log_posterior, mode_idx = findmax(tbl_named.log_posterior)
(-168.96941803101575, 9664)
And get row mode_idx of the table, with all information about the sample at the mode:
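# Table row at the mode:
tbl_named[mode_idx]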
Comparison of Truth and Best Fit
As a final step, we retrieve the parameter values at the mode, representing the best-fit parameters:
fit_par_values = tbl_named[mode_idx].params
(a = [497.0033008718692, 994.9439541242684], mu = [-0.9729832967289617, 1.9910596819357194], sigma = 0.5011324043786379)
And plot the truth, data, and best fit:
plot(
normalize(hist, mode=:density),
st = :steps, label = "Data",
title = "Data, True Model and Best Fit"
)
plot!(-4:0.01:4, x -> fit_function(true_par_values, x), label = "Truth")
plot!(-4:0.01:4, x -> fit_function(fit_par_values, x), label = "Best fit")
savefig("tutorial-data-truth-bestfit.pdf")
Fine-grained control
BAT provides fine-grained control over the MCMC algorithm options, the MCMC chain initialization, the tuning/burn-in strategy and the convergence testing. All option values used in the following are the default values; any or all may be omitted.
We'll sample using the Metropolis-Hastings MCMC algorithm. By default, BAT uses a multivariate t-distribution (ν = 1) as the proposal function:
algorithm = MetropolisHastings(MvTDistProposal(1.0))
BAT requires a counter-based random number generator (RNG), since it partitions the RNG space over the MCMC chains. This way, a single RNG seed is sufficient for all chains and results are reproducible even under parallel execution. By default, BAT uses a Philox4x RNG initialized with a random seed drawn from the system entropy pool:
using Random123
rng = Philox4x()
Other default parameters are:
tuning = AdaptiveMetropolisTuning(
λ = 0.5,
α = 0.15..0.35,
β = 1.5,
c = 1e-4..1e2
)
convergence = BrooksGelmanConvergence(
threshold = 1.1,
corrected = false
)
init = MCMCInitStrategy(
ninit_tries_per_chain = 8..128,
max_nsamples_pretune = 25,
max_nsteps_pretune = 250,
max_time_pretune = Inf
)
burnin = MCMCBurninStrategy(
max_nsamples_per_cycle = 1000,
max_nsteps_per_cycle = 10000,
max_time_per_cycle = Inf,
max_ncycles = 30
)
To generate MCMC samples with explicit control over all options, use
samples, stats = bat_sample(
rng, posterior, (nsamples, nchains), algorithm,
max_nsteps = 10 * nsamples,
max_time = Inf,
tuning = tuning,
init = init,
burnin = burnin,
convergence = convergence,
strict = false,
filter = true
)
[ Info: Trying to generate 4 viable MCMC chain(s).
[ Info: Selected 4 MCMC chain(s).
[ Info: Begin tuning of 4 MCMC chain(s).
[ Info: MCMC Tuning cycle 1 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 2 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 3 finished, 4 chains, 0 tuned, 0 converged.
[ Info: MCMC Tuning cycle 4 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 5 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 6 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 7 finished, 4 chains, 1 tuned, 4 converged.
[ Info: MCMC Tuning cycle 8 finished, 4 chains, 4 tuned, 4 converged.
[ Info: MCMC tuning of 4 chains successful after 8 cycle(s).
However, in many use cases, simply using the default options via
samples, stats = bat_sample(posterior, (nsamples, nchains), MetropolisHastings())
will often be sufficient.
This page was generated using Literate.jl.