Background: Logarithmic posterior probability

The general setting is the same as in Background: Posterior probability. The starting point is

\[p(\lambda \mid \{x_i\}_{i=1}^N, \{y_i\}_{i=1}^N, m) \propto p_0(\lambda\mid \{x_i\}_{i=1}^N, m) \prod_{i=1}^N \ell_i(y_i\mid \lambda, x_i,m)\ ,\]

assuming the likelihoods $\ell_i$ are known.
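
To make this concrete, the following sketch evaluates such an unnormalized posterior directly on the original scale. The model, prior and data are hypothetical placeholders (a normal likelihood with unknown mean $\lambda$ and known unit variance, a wide normal prior, and made-up observations ys); the inputs $x_i$ are omitted for brevity:

using Distributions

ys = [2.1, 1.9, 2.3, 2.0, 1.8]             # hypothetical observations y_i
prior(λ) = pdf(Normal(0, 10), λ)           # p_0: wide normal prior on λ
likelihood(y, λ) = pdf(Normal(λ, 1), y)    # ℓ_i: normal likelihood with mean λ

# Unnormalized posterior: prior times the product of the likelihoods.
unnormalized_posterior(λ) = prior(λ) * prod(likelihood(y, λ) for y in ys)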

Product of small numbers

In the formula for the posterior probability, it can happen that many small numbers (close to zero) need to be multiplied together. Because floating point numbers can only represent values down to a certain smallest magnitude, such products, though theoretically non-zero, tend to underflow and are rounded to zero.

For example, consider the following array as the likelihood values:

using Distributions, BenchmarkTools
small_values = [pdf(Normal(0,1),10+i) for i in 1:10]
10-element Vector{Float64}:
 2.1188192535093538e-27
 2.1463837356630605e-32
 7.998827757006813e-38
 1.0966065593889713e-43
 5.530709549844416e-50
 1.0261630727919036e-56
 7.004182134318583e-64
 1.758749542595104e-71
 1.624636036773608e-79
 5.520948362159764e-88

Although the values are non-zero, the product is rounded to zero:

prod(small_values)
0.0

One could use floating point types with higher precision:

small_values_high_precision = BigFloat.(small_values)
prod(small_values_high_precision)
2.501536784247610370239770054628823917126174421661076782600307193509415451541111e-544

However, this comes at the cost of a huge performance loss and increased memory usage:

@benchmark prod(small_values)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  11.148 ns … 1.760 μs   GC (min … max): 0.00% … 97.37%
 Time  (median):     12.811 ns               GC (median):    0.00%
 Time  (mean ± σ):   14.342 ns ± 24.389 ns   GC (mean ± σ):  2.58% ±  1.66%

     ▃▃██▅                                                    
  ▁▂▄██████▆▃▂▂▂▁▁▁▂▂▂▃▄▅▄▃▄▄▄▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  11.1 ns         Histogram: frequency by time          23 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.
@benchmark prod(small_values_high_precision)
BenchmarkTools.Trial: 10000 samples with 246 evaluations.
 Range (min … max):  307.423 ns … 5.995 μs   GC (min … max): 0.00% … 91.68%
 Time  (median):     346.510 ns                GC (median):    0.00%
 Time  (mean ± σ):   369.997 ns ± 263.214 ns   GC (mean ± σ):  4.77% ±  6.22%

   ▁  █▃  ▄▆                                                     
  ▃██▇██▇███▅▄▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▂▁▂ ▃
  307 ns           Histogram: frequency by time          629 ns <

 Memory estimate: 936 bytes, allocs estimate: 18.

Logarithmic scale

Since posterior probabilities are often only known up to a normalization constant, one is usually not interested in the particular values, but only in how the values compare to each other. Such comparisons are preserved by any strictly monotonically increasing function, so one is free to apply such a function before comparing values. A convenient choice for such a strictly monotonically increasing function is the natural logarithm.
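
As a small illustration, applying the logarithm does not change which of two (hypothetical) unnormalized posterior values is larger:

a, b = 1.0e-30, 3.0e-30   # two hypothetical unnormalized posterior values
a < b                     # true
log(a) < log(b)           # still true: the ordering is preserved on the log scale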

Note that because of proportionality, there is a constant $\alpha > 0$ such that

\[p(\lambda \mid \{x_i\}_{i=1}^N, \{y_i\}_{i=1}^N, m) = \alpha \cdot p_0(\lambda\mid \{x_i\}_{i=1}^N, m) \prod_{i=1}^N \ell_i(y_i\mid \lambda, x_i,m)\ .\]

Applying the natural logarithm leads to:

\[\begin{aligned} \ln (p(\lambda \mid \{x_i\}_{i=1}^N, \{y_i\}_{i=1}^N, m)) &= \ln \left(\alpha \cdot p_0(\lambda\mid \{x_i\}_{i=1}^N, m) \prod_{i=1}^N \ell_i(y_i\mid \lambda, x_i,m) \right) \\ &= \ln(p_0(\lambda\mid \{x_i\}_{i=1}^N, m)) + \sum_{i=1}^N \ln(\ell_i(y_i\mid \lambda, x_i,m)) + \ln(\alpha) \end{aligned}\]

Using the logarithm exchanges the multiplication of small numbers for an addition on the logarithmic scale, at the cost of having to calculate the logarithm of every value. However, explicitly taking the logarithm of an already computed density is the worst case. In many cases it is possible, and often even easier, to implement the logarithm of the involved densities directly (e.g. for the normal distribution, the Laplace distribution, etc.).
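
For instance, the logarithm of the normal density has a simple closed form, so it can be computed without ever evaluating the (possibly underflowing) density itself. The following hand-written version is a sketch of this idea; the function name is illustrative:

# Log-density of Normal(μ, σ) at y, written down directly from the closed-form expression.
log_normal_pdf(y, μ, σ) = -0.5 * ((y - μ) / σ)^2 - log(σ) - 0.5 * log(2π)

log_normal_pdf(11, 0, 1)   # ≈ -61.4189, the same as log(pdf(Normal(0,1), 11))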

To shorten the notation, we write $L_p = \ln \circ\ p$ for the posterior, $L_i= \ln \circ\ \ell_i$ for the likelihoods and $L_0 =\ln \circ\ p_0$ for the prior. Then the log-posterior reads

\[L_p(\lambda \mid \{x_i\}_{i=1}^N, \{y_i\}_{i=1}^N, m) = L_0(\lambda\mid \{x_i\}_{i=1}^N, m) + \sum_{i=1}^N L_i(y_i\mid \lambda, x_i,m) + \text{const.}\]
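
In code, the log-posterior of the hypothetical model sketched above then becomes a sum instead of a product. The logpdf function used here is provided by Distributions.jl and is discussed further below:

using Distributions

ys = [2.1, 1.9, 2.3, 2.0, 1.8]                    # same hypothetical observations as above
log_prior(λ) = logpdf(Normal(0, 10), λ)           # L_0
log_likelihood(y, λ) = logpdf(Normal(λ, 1), y)    # L_i

# Log-posterior up to the additive constant ln(α).
log_posterior(λ) = log_prior(λ) + sum(log_likelihood(y, λ) for y in ys)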

Effect of the logarithmic scale

Using the logarithm has two effects. First of all, the product becomes a sum. This alone would already prevent the rounding to zero, since adding small numbers, unlike multiplying them, does not drive the result below the smallest representable value:

sum(small_values)
2.1188407174266987e-27

In addition, the logarithm compresses the number scale for numbers larger than 1 and stretches it out for numbers between 0 and 1:

log_values = log.(small_values)
10-element Vector{Float64}:
  -61.418938533204674
  -72.91893853320467
  -85.41893853320467
  -98.91893853320467
 -113.41893853320467
 -128.91893853320468
 -145.41893853320468
 -162.91893853320468
 -181.41893853320468
 -200.91893853320468

Of course, the sum is still non-zero:

sum(log_values)
-1251.6893853320469
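
As a consistency check, the logarithm of the high-precision BigFloat product from above agrees with this sum up to floating-point rounding:

log(prod(small_values_high_precision))   # ≈ -1251.6894, matching sum(log_values)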

While the use of higher-precision floating-point numbers (BigFloat) leads to a huge performance loss, the log-scale method does not impair performance:

@benchmark sum(log_values)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  12.566 ns … 628.090 ns   GC (min … max): 0.00% … 91.52%
 Time  (median):     14.248 ns                GC (median):    0.00%
 Time  (mean ± σ):   15.603 ns ±   9.610 ns   GC (mean ± σ):  0.89% ±  1.57%

   ▃▅▆███▆▅▅▄▂▁ ▂▂▁▃▄▃▂▂▂▂▁▁ ▁▁ ▁  ▁▁▃▂                      ▂
  █████████████▇█████████████████▇███████▇▇█▆▇▆▆▆▆▅▅▆▅▆▄▆▄▅▅ █
  12.6 ns       Histogram: log(frequency) by time      27.6 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

Logarithmic densities and Distributions.jl

As mentioned in Logarithmic scale, it can be easier to implement some density functions directly on a logarithmic scale. This is not only true for implementations from scratch, but also for Distributions.jl. Observe that "far" away from the mean, the pdf of a normal distribution is rounded to zero:

pdf(Normal(0,1),100)
0.0

Taking the logarithm of this value only yields -Inf, so nothing is gained. However, Distributions.jl offers a logpdf function that computes the logarithm of the density directly:

logpdf(Normal(0,1),100)
-5000.918938533205

This allows values even further away from the mean:

logpdf(Normal(0,1),10^20)
-3.0157549656954986e37
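
Combining logpdf with the summation from above gives a log-likelihood that stays finite even when the corresponding product of densities would underflow. The data below are again hypothetical placeholders:

observations = collect(10.0:0.1:20.0)                 # hypothetical data far from the mean
prod(pdf(Normal(0, 1), y) for y in observations)      # 0.0: the product underflows
sum(logpdf(Normal(0, 1), y) for y in observations)    # finite log-likelihood (a large negative number)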