Overview: Why Bayesian Neural Networks?

Standard deep learning models are often overconfident. Even when presented with out-of-distribution data or noisy inputs, a deterministic neural network outputs a point estimate with no indication of how much that estimate should be trusted. This "confident ignorance" is unacceptable in safety-critical domains like medicine, autonomous driving, or finance.

Bayesian Neural Networks (BNNs) address this by placing a probability distribution over the network's weights rather than learning a single fixed set of parameters. This project provides a suite of reference implementations for several Bayesian inference techniques, allowing us to quantify both epistemic (model) uncertainty and aleatoric (data) uncertainty.

Monte Carlo Dropout (MC Dropout)

Dropout is traditionally used only as a regularization technique during training to prevent overfitting. However, Yarin Gal and Zoubin Ghahramani showed that keeping dropout active during inference can be interpreted as approximate Bayesian inference in deep Gaussian processes.

By running multiple stochastic forward passes for the same input, each with a different random dropout mask, the network generates an ensemble of predictions. The mean of these predictions serves as the final output, while their variance provides an estimate of epistemic uncertainty.
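The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's implementation: the tiny one-hidden-layer network, the fixed weights in `W_HIDDEN` and `W_OUT`, and the function names are all hypothetical stand-ins for a trained model.

```python
import random
import statistics

def forward_with_dropout(x, w_hidden, w_out, drop_rate, rng):
    """One stochastic forward pass of a tiny one-hidden-layer ReLU network."""
    keep = 1.0 - drop_rate
    hidden = []
    for w, b in w_hidden:
        h = max(0.0, w * x + b)  # ReLU activation
        # Inverted dropout: zero the unit with probability drop_rate,
        # otherwise rescale so the expected activation is unchanged.
        h = h / keep if rng.random() < keep else 0.0
        hidden.append(h)
    return sum(w * h for w, h in zip(w_out, hidden))

def mc_dropout_predict(x, w_hidden, w_out, drop_rate=0.2, n_passes=200, seed=0):
    """Keep dropout active at inference and summarize repeated passes."""
    rng = random.Random(seed)
    preds = [forward_with_dropout(x, w_hidden, w_out, drop_rate, rng)
             for _ in range(n_passes)]
    # Mean = point prediction, variance = epistemic uncertainty estimate.
    return statistics.fmean(preds), statistics.pvariance(preds)

# Hypothetical fixed weights standing in for a trained network.
W_HIDDEN = [(0.5, 0.1), (-0.3, 0.2), (0.8, -0.1), (0.2, 0.0)]
W_OUT = [1.0, -0.5, 0.7, 0.3]

mean, var = mc_dropout_predict(2.0, W_HIDDEN, W_OUT)
print(f"prediction mean={mean:.3f}, variance={var:.3f}")
```

Setting `drop_rate=0.0` recovers the ordinary deterministic forward pass, in which case the variance collapses to zero and no uncertainty signal is available.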

Practical application: In an autonomous vehicle's vision system, if an object is partially obscured by fog, MC Dropout tends to yield highly varied predictions across forward passes. This high variance acts as a red flag, allowing the system to hand control back to a human driver instead of confidently acting on a hallucinated classification.

Notebook: MC Dropout Notebook

Variational Inference (Bayes by Backprop)

Exact Bayesian inference requires calculating the true posterior distribution of the network's weights, which is computationally intractable for deep networks. Variational Inference (VI) turns this integration problem into an optimization problem by fitting a simpler, parameterized distribution to the posterior.

Using the "Bayes by Backprop" algorithm, instead of learning a single weight value, the network learns the parameters (mean and standard deviation) of a Gaussian distribution for every weight. During training, it maximizes the Evidence Lower Bound (ELBO), balancing how well the model fits the data against how closely the learned weight distributions match a prior distribution.
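To make the two ELBO terms concrete, here is a hedged sketch for a one-weight model `y = w * x` with Gaussian noise. It assumes a zero-mean Gaussian prior so that the KL term has a closed form, and it parameterizes the standard deviation through a softplus of `rho`. The function names, `prior_sigma`, and `noise_sigma` are illustrative choices, not names from the notebook.

```python
import math
import random

def softplus(rho):
    """Map an unconstrained parameter rho to a positive standard deviation."""
    return math.log1p(math.exp(rho))

def sample_weight(mu, rho, rng):
    """Reparameterization trick: w = mu + sigma * eps with eps ~ N(0, 1)."""
    return mu + softplus(rho) * rng.gauss(0.0, 1.0)

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) )."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def negative_elbo(data, mu, rho, prior_sigma=1.0, noise_sigma=0.5,
                  n_samples=8, seed=0):
    """Monte Carlo estimate of -ELBO = expected NLL + KL(posterior || prior)."""
    rng = random.Random(seed)
    log_norm = math.log(noise_sigma * math.sqrt(2.0 * math.pi))
    nll = 0.0
    for _ in range(n_samples):
        w = sample_weight(mu, rho, rng)  # one draw from the weight posterior
        for x, y in data:
            nll += 0.5 * ((y - w * x) / noise_sigma) ** 2 + log_norm
    nll /= n_samples  # data-fit term, averaged over weight samples
    kl = kl_gaussian(mu, softplus(rho), 0.0, prior_sigma)  # complexity term
    return nll + kl

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(negative_elbo(data, mu=2.0, rho=-3.0))   # posterior centered on the true slope
print(negative_elbo(data, mu=-2.0, rho=-3.0))  # poor fit, larger loss
```

In a real network the same loss would be minimized by gradient descent on every `mu` and `rho`; the sketch only evaluates it.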

Practical application: In medical imaging, if a model encounters a rare pathology it hasn't seen during training, the wide predictive intervals generated by the weight distributions will trigger an uncertainty alert, routing the scan to a specialist for human review.

Notebook: Variational Inference Notebook

Probabilistic Backpropagation (PBP)

While MCMC methods and Bayes by Backprop rely on Monte Carlo sampling, Probabilistic Backpropagation (PBP) offers a deterministic alternative. It propagates probability distributions analytically through the network layers and updates the weights using Assumed Density Filtering.

In PBP, the forward pass propagates the mean and variance of the activations layer by layer, and the backward pass uses gradients of the log marginal likelihood to update the Gaussian marginal distributions of the weights. Because it avoids Monte Carlo sampling entirely, it is computationally efficient while still delivering useful uncertainty estimates.
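The forward pass can be illustrated with the two moment-matching steps it repeats at every layer. The sketch below is a simplification under stated assumptions: weights and inputs are treated as independent, the pre-activation is approximated as Gaussian, and the input-size scaling used in the full algorithm is omitted. The function names are hypothetical.

```python
import math

def linear_moments(in_mean, in_var, w_mean, w_var):
    """Mean and variance of z = sum_i w_i * x_i for independent w_i and x_i."""
    z_mean = sum(mw * mx for mw, mx in zip(w_mean, in_mean))
    # Var(w * x) = Var(w) E[x]^2 + E[w]^2 Var(x) + Var(w) Var(x)
    z_var = sum(vw * mx ** 2 + mw ** 2 * vx + vw * vx
                for mw, vw, mx, vx in zip(w_mean, w_var, in_mean, in_var))
    return z_mean, z_var

def relu_moments(z_mean, z_var):
    """Mean and variance of max(0, z) when z ~ N(z_mean, z_var)."""
    if z_var <= 0.0:
        return max(0.0, z_mean), 0.0
    std = math.sqrt(z_var)
    alpha = z_mean / std
    cdf = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))           # Phi(alpha)
    pdf = math.exp(-0.5 * alpha ** 2) / math.sqrt(2.0 * math.pi)   # phi(alpha)
    mean = z_mean * cdf + std * pdf
    second_moment = (z_mean ** 2 + z_var) * cdf + z_mean * std * pdf
    return mean, max(0.0, second_moment - mean ** 2)

# Deterministic input (zero variance), uncertain weights.
z_m, z_v = linear_moments([1.0, 2.0], [0.0, 0.0], [0.5, -1.0], [0.1, 0.2])
a_m, a_v = relu_moments(z_m, z_v)
print(z_m, z_v, a_m, a_v)
```

No random numbers appear anywhere: the uncertainty in the output comes entirely from the weight variances carried through these formulas.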

Practical application: In algorithmic trading or high-frequency sensor forecasting, PBP allows models to continuously update their predictive confidence bands in real time as market volatility or signal noise shifts instantaneously.

Notebook: PBP Notebook

Hamiltonian Monte Carlo (HMC)

Hamiltonian Monte Carlo is widely considered the gold standard for sampling-based Bayesian inference. Rather than taking random walks through the parameter space like random-walk Metropolis, HMC uses concepts from classical physics, specifically Hamiltonian dynamics, to explore the posterior distribution.

It treats the network's weights as the position of a particle and the negative log-posterior as a potential energy field. By simulating frictionless motion across this landscape using gradient information, HMC can take large steps across the parameter space while keeping acceptance rates high. With a well-tuned step size and trajectory length, and enough samples, the resulting draws closely approximate the true posterior.
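The particle analogy maps directly onto code. The following is a minimal sketch on a one-dimensional standard normal target rather than a network posterior. The target, the step size, and the number of leapfrog steps are illustrative assumptions.

```python
import math
import random

def potential(q):
    """Potential energy: negative log-density of N(0, 1), up to a constant."""
    return 0.5 * q * q

def grad_potential(q):
    return q

def leapfrog(q, p, step_size, n_steps):
    """Simulate frictionless motion with the leapfrog integrator."""
    p -= 0.5 * step_size * grad_potential(q)      # initial half step for momentum
    for i in range(n_steps):
        q += step_size * p                        # full step for position
        if i < n_steps - 1:
            p -= step_size * grad_potential(q)    # full step for momentum
    p -= 0.5 * step_size * grad_potential(q)      # final half step for momentum
    return q, p

def hmc_sample(n_samples, step_size=0.2, n_steps=10, seed=0):
    rng = random.Random(seed)
    q, samples, accepted = 0.0, [], 0
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)                   # resample the momentum
        h_old = potential(q) + 0.5 * p * p        # total energy before the move
        q_new, p_new = leapfrog(q, p, step_size, n_steps)
        h_new = potential(q_new) + 0.5 * p_new * p_new
        # Metropolis step corrects for the integrator's discretization error.
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q, accepted = q_new, accepted + 1
        samples.append(q)
    return samples, accepted / n_samples

samples, accept_rate = hmc_sample(2000)
print(f"acceptance rate: {accept_rate:.2f}")
```

Because the leapfrog integrator nearly conserves total energy, almost every proposal is accepted even though each one travels a long way from its starting point.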

Practical application: While too computationally expensive for real-time edge computing, HMC is well suited to offline, heavily regulated applications like pharmacological dosage-risk modeling, where an asymptotically exact probabilistic safety margin matters more than inference speed.

Notebook: HMC Notebook

Approximate Bayesian Computation with Subset Simulation (ABC-SS)

Most Bayesian methods require access to the model's gradients (via backpropagation) and a tractable likelihood function. Approximate Bayesian Computation with Subset Simulation (ABC-SS) is a likelihood-free method designed for scenarios where these are unavailable.

ABC-SS uses a population-based approach, iteratively selecting and mutating candidate weight configurations based solely on forward-pass error thresholds. It progressively narrows down the parameter space to approximate the posterior without gradients.
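The select-and-mutate loop can be sketched as follows. This is a deliberately simplified illustration, not the full ABC-SS algorithm: it fits a single weight in `y = w * x`, uses a plain Metropolis move with a proposal width taken from the spread of the survivors, and assumes a zero-mean Gaussian prior. All names and default values are hypothetical.

```python
import math
import random
import statistics

def forward_error(w, data):
    """RMSE of the forward pass y_hat = w * x. No gradients are used."""
    return math.sqrt(sum((y - w * x) ** 2 for x, y in data) / len(data))

def abc_subset_sketch(data, n_particles=200, keep_frac=0.2, n_levels=6,
                      prior_sd=5.0, seed=0):
    rng = random.Random(seed)

    def log_prior(w):
        return -0.5 * (w / prior_sd) ** 2

    n_keep = max(1, round(n_particles * keep_frac))
    chain_len = n_particles // n_keep
    particles = [rng.gauss(0.0, prior_sd) for _ in range(n_particles)]
    thresholds = []
    for _ in range(n_levels):
        # Selection: keep the candidates with the smallest forward-pass error.
        particles.sort(key=lambda w: forward_error(w, data))
        seeds = particles[:n_keep]
        tol = forward_error(seeds[-1], data)   # error threshold for this level
        thresholds.append(tol)
        step_sd = statistics.pstdev(seeds)     # proposal width from survivors
        # Mutation: grow a short chain from each seed, staying under tol.
        particles = []
        for w in seeds:
            particles.append(w)
            for _ in range(chain_len - 1):
                cand = w + rng.gauss(0.0, step_sd)
                prior_ratio = math.exp(min(0.0, log_prior(cand) - log_prior(w)))
                if rng.random() < prior_ratio and forward_error(cand, data) <= tol:
                    w = cand                   # accept the mutated candidate
                particles.append(w)
    return particles, thresholds

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
particles, thresholds = abc_subset_sketch(data)
print("error thresholds per level:", [round(t, 4) for t in thresholds])
```

The final `particles` approximate the posterior over `w` conditioned on the forward-pass error being below the last threshold, and the only model access required is the ability to evaluate `forward_error`.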

Practical application: When deep learning models are coupled with proprietary, black-box physics simulators or legacy engineering software where backpropagation is impossible, ABC-SS can still evolve candidate networks and extract posterior uncertainty.

Notebook: ABC-SS Notebook

Probabilistic Evaluation Metrics

  • Root Mean Square Error (RMSE): Evaluates the standard point-estimate accuracy by comparing the mean of the predictive distribution against the observed ground truth.
  • Prediction Interval Coverage Probability (PICP): Measures reliability. It calculates the fraction of true observations that actually fall within the model's 95% prediction intervals. A well-calibrated model should have a PICP close to 0.95.
  • Mean Prediction Interval Width (MPIW): Measures sharpness. If a model predicts an interval of negative infinity to positive infinity, its PICP will be 1.0, but the interval will be useless. MPIW checks that the intervals are as tight and actionable as possible.
  • Negative Log-Likelihood (NLL): A strictly proper scoring rule that evaluates the entire predictive distribution, balancing both accuracy (closeness to truth) and calibration (confidence).
  • Winkler Score: An interval score that adds the interval width to a penalty proportional to how far the true value falls outside the interval. It rewards narrow intervals but severely penalizes misses, making it well suited to risk-averse applications.
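The metrics above are short enough to write out directly. The sketch below assumes per-observation lists of lower and upper interval bounds, and for NLL it assumes a Gaussian predictive distribution. The function names are illustrative and need not match those used in the project.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error of the predictive mean."""
    return math.sqrt(sum((y - m) ** 2 for y, m in zip(y_true, y_pred)) / len(y_true))

def picp(y_true, lower, upper):
    """Fraction of observations that fall inside their prediction interval."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)

def mpiw(lower, upper):
    """Mean width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

def gaussian_nll(y_true, mean, std):
    """Average negative log-likelihood under N(mean, std^2) predictions."""
    total = 0.0
    for y, m, s in zip(y_true, mean, std):
        total += 0.5 * math.log(2.0 * math.pi * s ** 2) + (y - m) ** 2 / (2.0 * s ** 2)
    return total / len(y_true)

def winkler(y_true, lower, upper, alpha=0.05):
    """Average Winkler score for (1 - alpha) intervals: width plus miss penalty."""
    total = 0.0
    for y, lo, hi in zip(y_true, lower, upper):
        score = hi - lo
        if y < lo:
            score += (2.0 / alpha) * (lo - y)
        elif y > hi:
            score += (2.0 / alpha) * (y - hi)
        total += score
    return total / len(y_true)

y = [1.0, 2.0, 3.0]
lower = [0.0, 1.5, 3.5]
upper = [2.0, 2.5, 4.5]
print(picp(y, lower, upper), mpiw(lower, upper), winkler(y, lower, upper))
```

In this toy example the third observation lies 0.5 below its interval, so its Winkler contribution jumps from a width of 1 to 21, which illustrates how heavily a miss is penalized at `alpha=0.05`.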

Hyperparameter Tuning with Optuna

  • Bayesian-Specific Search Spaces: Custom hyperparameter definitions for each method, such as tuning prior standard deviations for VI, dropout rates and length-scales for MC Dropout, or trajectory lengths for HMC.
  • Tree-structured Parzen Estimator (TPE): Used for single-objective optimization, searching for configurations that minimize proper scoring rules like NLL or Winkler Score.
  • NSGA-II Multi-Objective Optimization: Used to find the Pareto-optimal front between competing objectives, for instance minimizing prediction error (RMSE) while simultaneously minimizing interval width (MPIW), to balance correctness against sharpness.
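A hedged sketch of how the two samplers might be wired up with Optuna is shown below. The `toy_validation_metrics` function is a made-up stand-in for training and scoring a BNN, and the search ranges for `prior_sigma` and `dropout_rate` are illustrative assumptions, not the project's actual search spaces. Optuna is imported inside `run_studies` so the objectives themselves have no dependency on it.

```python
import math

def toy_validation_metrics(prior_sigma, dropout_rate):
    """Hypothetical stand-in for training a BNN and scoring it on validation data."""
    nll = (math.log10(prior_sigma) + 1.0) ** 2 + (dropout_rate - 0.2) ** 2
    rmse = 1.0 / (1.0 + prior_sigma) + dropout_rate   # toy: wider prior, lower error
    mpiw = prior_sigma * (1.0 + dropout_rate)         # toy: wider prior, wider intervals
    return nll, rmse, mpiw

def suggest_params(trial):
    """Illustrative search space; the ranges are assumptions."""
    prior_sigma = trial.suggest_float("prior_sigma", 1e-3, 10.0, log=True)
    dropout_rate = trial.suggest_float("dropout_rate", 0.05, 0.5)
    return prior_sigma, dropout_rate

def objective_nll(trial):
    """Single objective for TPE: minimize validation NLL."""
    nll, _, _ = toy_validation_metrics(*suggest_params(trial))
    return nll

def objective_rmse_mpiw(trial):
    """Two objectives for NSGA-II: minimize RMSE and minimize MPIW."""
    _, rmse, mpiw = toy_validation_metrics(*suggest_params(trial))
    return rmse, mpiw

def run_studies(n_trials=30):
    import optuna  # imported lazily; requires `pip install optuna`

    tpe_study = optuna.create_study(
        direction="minimize",
        sampler=optuna.samplers.TPESampler(seed=0),
    )
    tpe_study.optimize(objective_nll, n_trials=n_trials)

    nsga_study = optuna.create_study(
        directions=["minimize", "minimize"],
        sampler=optuna.samplers.NSGAIISampler(seed=0),
    )
    nsga_study.optimize(objective_rmse_mpiw, n_trials=n_trials)

    # best_trials holds the Pareto-optimal trials of the multi-objective study.
    return tpe_study.best_params, nsga_study.best_trials
```

Calling `run_studies()` returns the best single-objective parameters and the list of Pareto-optimal trials. Each trial in that list exposes its objective values through `values`, so the RMSE and MPIW trade-off can be inspected or plotted.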

Interesting Papers