Background: Logarithmic posterior probability
The general setting is the same as in Background: Posterior probability. The starting point is
\[p(\lambda \mid \{x_i\}_{i=1}^N, \{y_i\}_{i=1}^N, m) \propto p_0(\lambda\mid \{x_i\}_{i=1}^N, m) \prod_{i=1}^N \ell_i(y_i\mid \lambda, x_i,m)\ ,\]
assuming the likelihoods $\ell_i$ are known.
Product of small numbers
In the formula for the posterior probability, it can happen that many small numbers (close to zero) need to be multiplied together. Because floating point numbers can only represent non-zero values down to a smallest positive magnitude, such products, though mathematically non-zero, underflow and are rounded to zero.
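As a minimal illustration of where this cutoff lies, the two lines below (independent of the likelihood example that follows) query the smallest positive Float64 and show a product that falls below it:

nextfloat(0.0)     # 5.0e-324, the smallest positive Float64; anything smaller rounds to 0.0
1e-200 * 1e-200    # 0.0: the exact product 1e-400 lies far below that, so it underflows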
For example, consider the following array of likelihood values:
using Distributions, BenchmarkTools
small_values = [pdf(Normal(0,1),10+i) for i in 1:10]
10-element Vector{Float64}:
2.1188192535093538e-27
2.1463837356630605e-32
7.998827757006813e-38
1.0966065593889713e-43
5.530709549844416e-50
1.0261630727919036e-56
7.004182134318583e-64
1.758749542595104e-71
1.624636036773608e-79
5.520948362159764e-88
Although the values are non-zero, the product is rounded to zero:
prod(small_values)
0.0
One could use a floating point type with higher precision and a much larger exponent range:
small_values_high_precision = BigFloat.(small_values)
prod(small_values_high_precision)
2.501536784247610370239770054628823917126174421661076782600307193509415451541111e-544
However, this comes with a huge performance loss and increased memory usage:
@benchmark prod(small_values)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 11.148 ns … 1.760 μs ┊ GC (min … max): 0.00% … 97.37%
Time (median): 12.811 ns ┊ GC (median): 0.00%
Time (mean ± σ): 14.342 ns ± 24.389 ns ┊ GC (mean ± σ): 2.58% ± 1.66%
▃▃██▅█
▁▂▄███████▆▃▂▂▂▁▁▁▁▂▂▂▃▄▅▄▃▄▄▄▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
11.1 ns Histogram: frequency by time 23 ns <
Memory estimate: 16 bytes, allocs estimate: 1.
@benchmark prod(small_values_high_precision)
BenchmarkTools.Trial: 10000 samples with 246 evaluations.
Range (min … max): 307.423 ns … 5.995 μs ┊ GC (min … max): 0.00% … 91.68%
Time (median): 346.510 ns ┊ GC (median): 0.00%
Time (mean ± σ): 369.997 ns ± 263.214 ns ┊ GC (mean ± σ): 4.77% ± 6.22%
▁ █▃ ▄▆
▃██▇██▇▆███▅▄▄▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▂▁▂ ▃
307 ns Histogram: frequency by time 629 ns <
Memory estimate: 936 bytes, allocs estimate: 18.
Logarithmic scale
Since posterior probabilities are often only known up to a normalization constant, one is not interested in the particular values, but only in how they compare to each other. Any strictly monotonically increasing function preserves such comparisons, so it can be applied before comparing values. A convenient choice for such a strictly monotonically increasing function is the natural logarithm.
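For instance, taking the logarithm of two made-up unnormalized posterior values leaves their comparison unchanged:

p1, p2 = 4.0e-30, 1.0e-30          # two made-up unnormalized posterior values
(p1 > p2, log(p1) > log(p2))       # (true, true): the ordering is preserved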
Note that because of proportionality, there is a constant $\alpha > 0$ such that
\[p(\lambda \mid \{x_i\}_{i=1}^N, \{y_i\}_{i=1}^N, m) = \alpha \cdot p_0(\lambda\mid \{x_i\}_{i=1}^N, m) \prod_{i=1}^N \ell_i(y_i\mid \lambda, x_i,m)\ .\]
Applying the natural logarithm leads to:
\[\begin{aligned} \ln (p(\lambda \mid \{x_i\}_{i=1}^N, \{y_i\}_{i=1}^N, m)) &= \ln \left(\alpha \cdot p_0(\lambda\mid \{x_i\}_{i=1}^N, m) \prod_{i=1}^N \ell_i(y_i\mid \lambda, x_i,m) \right) \\ &= \ln(p_0(\lambda\mid \{x_i\}_{i=1}^N, m)) + \sum_{i=1}^N \ln(\ell_i(y_i\mid \lambda, x_i,m)) + \ln(\alpha) \end{aligned}\]
Using the logarithm exchanges the multiplication of small numbers for an addition on the logarithmic scale, at the cost of having to calculate the logarithm of every value. However, calculating the logarithm of an already evaluated density is only the worst case. In many cases it is possible, and often even easier, to implement the logarithm of the involved densities directly (e.g. for the normal distribution, the Laplace distribution, etc.).
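As a small sketch of this point, the log-density of the normal distribution can be written down directly, without ever evaluating an exponential; the helper name log_normal_density is purely illustrative and not an existing API:

using Distributions

# Hand-written log-density of Normal(μ, σ): no call to exp, hence no underflow.
log_normal_density(x, μ, σ) = -0.5 * ((x - μ) / σ)^2 - log(σ) - 0.5 * log(2π)

# Agrees with the logpdf provided by Distributions.jl:
log_normal_density(100.0, 0.0, 1.0) ≈ logpdf(Normal(0, 1), 100.0)   # true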
To shorten the notation, we write $L_p = \ln \circ\ p$ for the posterior, $L_i= \ln \circ\ \ell_i$ for the likelihoods and $L_0 =\ln \circ\ p_0$ for the prior. Then the log-posterior reads
\[L_p(\lambda \mid \{x_i\}_{i=1}^N, \{y_i\}_{i=1}^N, m) = L_0(\lambda\mid \{x_i\}_{i=1}^N, m) + \sum_{i=1}^N L_i(y_i\mid \lambda, x_i,m) + \text{const.}\]
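To sketch how this formula can look in code, the snippet below assumes a toy model in which $y_i$ is normally distributed around $\lambda x_i$ with unit variance and $\lambda$ has a normal prior; the data, the prior choice and the names L0, Li and Lp are assumptions made only for this illustration:

using Distributions

xs = collect(1.0:10.0)             # illustrative inputs
ys = 2.0 .* xs .+ randn(10)        # illustrative observations

L0(λ) = logpdf(Normal(0, 5), λ)            # log-prior (assumed Normal(0, 5) prior on λ)
Li(λ, x, y) = logpdf(Normal(λ * x, 1), y)  # log-likelihood of one observation

# Log-posterior up to the additive constant ln(α):
Lp(λ) = L0(λ) + sum(Li(λ, x, y) for (x, y) in zip(xs, ys))

Lp(2.0)   # a finite value, even where the corresponding product of densities could underflow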
Effect of the logarithmic scale
Using the logarithm has two effects. First of all, the product becomes a sum. This alone would suffice to prevent the rounding to zero problem:
sum(small_values)
2.1188407174266987e-27
In addition, the logarithm has the effect of compressing the number scale for numbers larger than 1 and stretching it out for numbers between 0 and 1:
log_values = log.(small_values)
10-element Vector{Float64}:
-61.418938533204674
-72.91893853320467
-85.41893853320467
-98.91893853320467
-113.41893853320467
-128.91893853320468
-145.41893853320468
-162.91893853320468
-181.41893853320468
-200.91893853320468
Of course, the sum is still non-zero:
sum(log_values)
-1251.6893853320469
While the use of higher precision floating point numbers (BigFloat) leads to a huge performance loss, the log-scale method does not impair performance:
@benchmark sum(log_values)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 12.566 ns … 628.090 ns ┊ GC (min … max): 0.00% … 91.52%
Time (median): 14.248 ns ┊ GC (median): 0.00%
Time (mean ± σ): 15.603 ns ± 9.610 ns ┊ GC (mean ± σ): 0.89% ± 1.57%
▃▅▆███▇▆▅▅▄▃▂▁ ▂▂▁▃▄▃▂▂▂▂▁▁ ▁▁ ▁ ▁▁▃▂ ▂
███████████████▇█████████████████▇███████▇▇█▆▇▆▆▆▆▅▅▆▅▆▄▆▄▅▅ █
12.6 ns Histogram: log(frequency) by time 27.6 ns <
Memory estimate: 16 bytes, allocs estimate: 1.
Logarithmic densities and Distributions.jl
As mentioned in Logarithmic scale, it can be easier to implement some density functions on the logarithmic scale. This is not only true for implementations from scratch, but also for Distributions.jl. Observe that "far" away from the mean, the pdf of a normal distribution is rounded to zero:
pdf(Normal(0,1),100)
0.0
Obviously, taking the logarithm of this rounded value is pointless (it yields -Inf). However, Distributions.jl offers a logpdf function:
logpdf(Normal(0,1),100)
-5000.918938533205
This allows values even further away from the mean:
logpdf(Normal(0,1), 1e20)  # a float literal avoids silent Int64 overflow of 10^20
-5.0e39
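Building on logpdf, the term $\sum_{i=1}^N L_i$ of the log-posterior can be computed for a whole data set with a broadcasted sum, or with the loglikelihood convenience function that Distributions.jl provides; the data vector below is just a made-up example:

using Distributions

data = [10.0, 11.0, 12.0]          # made-up example observations

sum(logpdf.(Normal(0, 1), data))   # sum of log-densities, i.e. the Σ Lᵢ term
loglikelihood(Normal(0, 1), data)  # equivalent convenience function from Distributions.jl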