Floating point, Julia, and why computers are bad at math

Purpose

The purpose of this notebook is to cover the quirks of floating point arithmetic, how computers do math, and things to look out for when implementing algorithms.

Takeaways

  • Real numbers are not real
  • Computers make mistakes when they do math
  • The code you write isn't the code that gets executed
  • Algorithms are not just O(operation), they are also O(memory)
  • There are different types of memory!
  • CPUs learn!
In [1]:
using LinearAlgebra
using Plots
using Random
using Test
Random.seed!(1234);

Section 1: Julia

In [2]:
1 + 1
Out[2]:
2
In [3]:
println("Hello, World!, said the 🐱π.")
Hello, World!, said the 🐱π.
In [4]:
x = [1, 2, 3]
Out[4]:
3-element Array{Int64,1}:
 1
 2
 3
In [5]:
A = [1 2; 3 4]
Out[5]:
2×2 Array{Int64,2}:
 1  2
 3  4
In [6]:
eigvals(A)
Out[6]:
2-element Array{Float64,1}:
 -0.3722813232690143
  5.372281323269014 

Why is Julia fast?

In [7]:
f(x) = 2x + 1
Out[7]:
f (generic function with 1 method)
In [8]:
@code_warntype f(1)
Body::Int64
1 ─ %1 = (Base.mul_int)(2, x)::Int64
│   %2 = (Base.add_int)(%1, 1)::Int64
└──      return %2
In [9]:
@code_warntype f(1.0im + 1)
Body::Complex{Float64}
1 ─ %1 = (Base.getfield)(x, :re)::Float64
│   %2 = (Base.mul_float)(2.0, %1)::Float64
│   %3 = (Base.getfield)(x, :im)::Float64
│   %4 = (Base.mul_float)(2.0, %3)::Float64
│   %5 = (Base.add_float)(1.0, %2)::Float64
│   %6 = %new(Complex{Float64}, %5, %4)::Complex{Float64}
└──      return %6
In [10]:
using Zygote  # A package for automatic differentiation

f'(3)
Out[10]:
2
In [11]:
@code_native f'(6)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ interface.jl:50 within `#38'
	movl	$2, %eax
	retl
	nopw	%cs:(%eax,%eax)
; └

What's an integer?

In [12]:
bitstring(Int8(1))
Out[12]:
"00000001"
In [13]:
length(bitstring(Int8(1)))
Out[13]:
8
In [14]:
bitstring(123456)
Out[14]:
"0000000000000000000000000000000000000000000000011110001001000000"
In [15]:
"""
    string_to_int(s::String; is_big_endian::Bool = true)

Return the integer corresponding to a string `s` 
representing the bits of an integer.

If `is_big_endian`, assumes bits are stored from most-significant
bit to least. If `!is_big_endian`, assumes bits are stored in
little-endian order, i.e., least-significant bit to most.
"""
function string_to_int(s::String; is_big_endian::Bool = true)
    y = 0
    for (i, x) in enumerate(is_big_endian ? reverse(s) : s)
        if x == '1'
            y += 2^(i-1)
        end
    end
    return y
end

@testset "big_endian" begin 
    for i = 0:1024
        @test string_to_int(bitstring(i)) == i
    end
end

@testset "little_endian" begin 
    @test string_to_int(bitstring(0); is_big_endian = false) == 0
    for i = 1:1024
        @test string_to_int(bitstring(i); is_big_endian = false) != i
    end
end
Test Summary: | Pass  Total
big_endian    | 1025   1025
Test Summary: | Pass  Total
little_endian | 1025   1025
Out[15]:
Test.DefaultTestSet("little_endian", Any[], 1025, false)

What do floats float on anyway?

In [16]:
0.1 + 0.2 ≈ 0.3
Out[16]:
true
In [17]:
√eps(Float64)
Out[17]:
1.4901161193847656e-8
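
For contrast (a quick check that is not one of the original cells): exact == comparison of 0.1 + 0.2 and 0.3 fails, which is why the ≈ (isapprox) comparison is used above.

0.1 + 0.2 == 0.3   # false
0.1 + 0.2          # 0.30000000000000004
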
In [18]:
?isapprox
search: isapprox

Out[18]:
isapprox(x, y; rtol::Real=atol>0 ? 0 : √eps, atol::Real=0, nans::Bool=false, norm::Function)

Inexact equality comparison: true if norm(x-y) <= max(atol, rtol*max(norm(x), norm(y))). The default atol is zero and the default rtol depends on the types of x and y. The keyword argument nans determines whether or not NaN values are considered equal (defaults to false).

For real or complex floating-point values, if an atol > 0 is not specified, rtol defaults to the square root of eps of the type of x or y, whichever is bigger (least precise). This corresponds to requiring equality of about half of the significand digits. Otherwise, e.g. for integer arguments or if an atol > 0 is supplied, rtol defaults to zero.

x and y may also be arrays of numbers, in which case norm defaults to vecnorm but may be changed by passing a norm::Function keyword argument. (For numbers, norm is the same thing as abs.) When x and y are arrays, if norm(x-y) is not finite (i.e. ±Inf or NaN), the comparison falls back to checking whether all elements of x and y are approximately equal component-wise.

The binary operator ≈ is equivalent to isapprox with the default arguments, and x ≉ y is equivalent to !isapprox(x,y).

Note that x ≈ 0 (i.e., comparing to zero with the default tolerances) is equivalent to x == 0 since the default atol is 0. In such cases, you should either supply an appropriate atol (or use norm(x) ≤ atol) or rearrange your code (e.g. use x ≈ y rather than x - y ≈ 0). It is not possible to pick a nonzero atol automatically because it depends on the overall scaling (the "units") of your problem: for example, in x - y ≈ 0, atol=1e-9 is an absurdly small tolerance if x is the radius of the Earth in meters, but an absurdly large tolerance if x is the radius of a Hydrogen atom in meters.

Examples

julia> 0.1 ≈ (0.1 - 1e-10)
true

julia> isapprox(10, 11; atol = 2)
true

julia> isapprox([10.0^9, 1.0], [10.0^9, 2.0])
true

julia> 1e-10 ≈ 0
false

julia> isapprox(1e-10, 0, atol=1e-8)
true

isapprox(x::FixedPoint, y::FixedPoint; rtol=0, atol=max(eps(x), eps(y)))

For FixedPoint numbers, the default criterion is that x and y differ by no more than eps, the separation between adjacent fixed-point numbers.

In [19]:
my_float = 0.1
Out[19]:
0.1
In [20]:
bitstring(my_float)
Out[20]:
"0011111110111001100110011001100110011001100110011001100110011010"

Floats have three components:

$$sign * significand * 2^{exponent}$$
In [21]:
Base.sign(my_float)
Out[21]:
1.0
In [22]:
Base.significand(my_float)
Out[22]:
1.6
In [23]:
Base.exponent(my_float)
Out[23]:
-4
In [24]:
1.0 * (1.6) * (2)^(-4)
Out[24]:
0.1
In [25]:
BigFloat(0.1)
Out[25]:
0.1000000000000000055511151231257827021181583404541015625

Floats are stored as a string of bits. A Float64 has 64 bits:

In [26]:
my_float_bits = bitstring(my_float)
Out[26]:
"0011111110111001100110011001100110011001100110011001100110011010"

The first bit is the sign bit:

In [27]:
s_my_sign = my_float_bits[1]
Out[27]:
'0': ASCII/Unicode U+0030 (category Nd: Number, decimal digit)

Sign bits are decoded as: $$(-1)^{bit}$$

In [28]:
my_sign = (-1) ^ parse(Int, s_my_sign)
Out[28]:
1

Bits 2 through 12 (i.e., 11 bits) are the exponent:

In [29]:
s_my_exponent = my_float_bits[2:12]
Out[29]:
"01111111011"

Exponents are stored with a bias, which depends on the number of bits in the floating point number. For Float64, the bias is 1023.
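
As an aside (a quick sketch of the general IEEE 754 rule, not one of the original cells): a format with E exponent bits uses a bias of 2^(E-1) - 1, so the 11 exponent bits of Float64 give the 1023 used below.

exponent_bias(E) = 2^(E - 1) - 1  # illustrative helper, not defined in the notebook
exponent_bias(11)  # Float64 -> 1023
exponent_bias(8)   # Float32 -> 127
exponent_bias(5)   # Float16 -> 15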

In [30]:
my_exponent = string_to_int(s_my_exponent) - 1023
Out[30]:
-4

The remaining bits (13--64) (i.e., 52 bits) encode the significand.

In [31]:
s_my_significand = my_float_bits[13:end]
Out[31]:
"1001100110011001100110011001100110011001100110011010"

Given a bitstring s, the significand is stored as follows: $$significand = 1 + \sum_i s_i \times 2^{-i}. $$

In [32]:
1//2
Out[32]:
1//2
In [33]:
"""
    string_to_inv_rational(s::String)

Return Σᵢ sᵢ * 2⁻ⁱ. 

Note that bits are still stored from most-significant to least
(big-endian), but because the coefficient is decreasing, we walk
forwards through the string.
"""
function string_to_inv_rational(s::String)
    y = 0 // 1
    for (i, x) in enumerate(s)
        if x == '1'
            y += 1 // 2^i
        end
    end
    return y
end

@test string_to_inv_rational("000") == 0 // 8
@test string_to_inv_rational("001") == 1 // 8
@test string_to_inv_rational("010") == 2 // 8
@test string_to_inv_rational("011") == 3 // 8
@test string_to_inv_rational("100") == 4 // 8
@test string_to_inv_rational("101") == 5 // 8
@test string_to_inv_rational("110") == 6 // 8
@test string_to_inv_rational("111") == 7 // 8
Out[33]:
Test Passed
In [34]:
my_significand = 1 + string_to_inv_rational(s_my_significand)
Out[34]:
3602879701896397//2251799813685248
In [35]:
BigFloat(my_significand)
Out[35]:
1.600000000000000088817841970012523233890533447265625

Putting this all together gives:

In [36]:
my_sign * my_significand * 2.0^my_exponent
Out[36]:
0.1
In [37]:
BigFloat(my_sign * my_significand * 2.0^my_exponent)
Out[37]:
0.1000000000000000055511151231257827021181583404541015625

Because of the way they are stored, floats cannot represent all real numbers. Instead, they approximate the real numbers by a finite set of values; the spacing between adjacent values depends on where the value sits on the number line.

In [38]:
typemin(Float64)
Out[38]:
-Inf
In [39]:
nextfloat(typemin(Float64))
Out[39]:
-1.7976931348623157e308
In [40]:
x = nextfloat(typemin(Float64))
nextfloat(x) - x
Out[40]:
1.99584030953472e292
In [41]:
nextfloat(x)
Out[41]:
-1.7976931348623155e308
In [42]:
x = Float64(0.0)
nextfloat(x) - x
Out[42]:
5.0e-324
In [43]:
x = typemin(Float16)
y = Float64[x]
while x < typemax(Float16)
    x = nextfloat(x)
    push!(y, x)
end
y
Out[43]:
63489-element Array{Float64,1}:
   -Inf  
 -65504.0
 -65472.0
 -65440.0
 -65408.0
 -65376.0
 -65344.0
 -65312.0
 -65280.0
 -65248.0
 -65216.0
 -65184.0
 -65152.0
      ⋮  
  65184.0
  65216.0
  65248.0
  65280.0
  65312.0
  65344.0
  65376.0
  65408.0
  65440.0
  65472.0
  65504.0
    Inf  
In [44]:
plot(
    y[2:end], diff(y),
    title = "Float16",
    xlabel = "Float16(x)",
    ylabel = "Gap between two nearest values",
    legend = false,
    w = 2,
    size = (600, 300)
)
Out[44]:
[Figure: "Float16"; x-axis: Float16(x), y-axis: gap between two nearest values]
In [45]:
plot(
    y[2:end], abs.(diff(y) ./ y[2:end]),
    title = "Float16",
    xlabel = "Float16(x)",
    ylabel = "Relative gap between two nearest values",
    legend = false,
    w = 2,
    ylims=(0, 0.001),
    size = (600, 300)
)
Out[45]:
[Figure: "Float16"; x-axis: Float16(x), y-axis: relative gap between two nearest values]

For Float64, the relative gap is on the order of 1e-16.

In [46]:
rel_float(x) = log10(big(nextfloat(Float64(x))) / Float64(x))
@show rel_float(1.0)
@show rel_float(1e6)
@show rel_float(1e100)
rel_float(1.0) = 9.643274665532870036582915124251528998657489145735719797126908901529315556125987e-17
rel_float(1.0e6) = 5.055853187842897632763057245199072744713910974049673247098526072324069356667919e-17
rel_float(1.0e100) = 8.436903800587370737577946320094566189509738835395676919057547237940321570207997e-17
Out[46]:
8.436903800587370737577946320094566189509738835395676919057547237940321570207997e-17

What happens if the result of a computation falls in one of the "missing" intervals of the real line?

In [47]:
bitstring(Float16(0.1, RoundNearest))
Out[47]:
"0010111001100110"
In [48]:
bitstring(Float16(0.1, RoundDown))
Out[48]:
"0010111001100110"
In [49]:
bitstring(Float16(0.1, RoundUp))
Out[49]:
"0010111001100111"

Takeaways

  • Floating point arithmetic has relative accuracy.
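
For example (a quick check in the spirit of the cells above, not one of the original cells): the absolute spacing between adjacent Float64 values grows with magnitude, but the relative spacing stays near eps(Float64) ≈ 2.2e-16.

for x in (1.0, 1e6, 1e12)
    println("x = ", x, ", eps(x) = ", eps(x), ", eps(x) / x = ", eps(x) / x)
end
# eps(x) grows from ~2.2e-16 to ~1.2e-4, but eps(x) / x stays ~1e-16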

Benchmarking

In [50]:
using BenchmarkTools

const sum_x = rand(2^16)
Out[50]:
65536-element Array{Float64,1}:
 0.5321679144582991 
 0.7440687931830725 
 0.17420924655027847
 0.6095413423260092 
 0.5811121916495186 
 0.4618600032955602 
 0.8548072895438457 
 0.6479584788429675 
 0.7848631306989902 
 0.08043609303278632
 0.5322608515836502 
 0.569827952518936  
 0.5581170907968327 
 ⋮                  
 0.29092695640622646
 0.9273910898003137 
 0.38793845448930275
 0.9398412289343301 
 0.21916729753000674
 0.37539994947327227
 0.1842597010979654 
 0.8896881002123835 
 0.44235584572264686
 0.7111890556918798 
 0.040817896891157  
 0.5373686333531669 
In [51]:
sum_1(x::Vector{T}) where {T} = sum(x)
@btime sum_1($sum_x)
  8.968 μs (0 allocations: 0 bytes)
Out[51]:
32737.35345461974
In [52]:
function sum_2(x::Vector{T}) where {T}
    y = zero(T)
    for i = 1:length(x)
        y += x[i]
    end
    return y
end
@btime sum_2($sum_x)
  69.193 μs (0 allocations: 0 bytes)
Out[52]:
32737.353454619715
In [53]:
function sum_3(x::Vector{T}) where {T}
    @assert mod(length(x), 4) == 0
    y = zero(T)
    i = 1
    while i < length(x)
        y += x[i] + x[i + 1] + x[i + 2] + x[i + 3]
        i += 4
    end
    return y
end
@btime sum_3($sum_x)
  26.008 μs (0 allocations: 0 bytes)
Out[53]:
32737.353454619926
In [54]:
function sum_4(x::Vector{T}) where {T}
    @assert mod(length(x), 4) == 0
    y = zero(T)
    i = 1
    @inbounds while i < length(x)
        y += x[i] + x[i + 1] + x[i + 2] + x[i + 3]
        i += 4
    end
    return y
end
@btime sum_4($sum_x)
  17.825 μs (0 allocations: 0 bytes)
Out[54]:
32737.353454619926

Has anyone noticed something weird about the sum totals?

In [55]:
@show sum_1(sum_x) - sum_2(sum_x)
@show sum_1(sum_x) - sum_3(sum_x)
@show sum_1(sum_x) - sum_4(sum_x)
sum_1(sum_x) - sum_2(sum_x) = 2.546585164964199e-11
sum_1(sum_x) - sum_3(sum_x) = -1.8553691916167736e-10
sum_1(sum_x) - sum_4(sum_x) = -1.8553691916167736e-10
Out[55]:
-1.8553691916167736e-10

What's going on?

In [56]:
1 + 1e-16 == 1
Out[56]:
true
In [57]:
run(`python -c 'print(1.0 + 1e-16 == 1.0)'`)
True
Out[57]:
Process(`python -c 'print(1.0 + 1e-16 == 1.0)'`, ProcessExited(0))

Floating point numbers are bad at doing math when they are of very different magnitudes!

In [58]:
function KahanSum(x)
    y = 0.0
    c = 0.0
    for i = 1:length(x)
        z = x[i] - c
        t = y + z
        c = (t - y) - z
        y = t
    end
    return y
end

sum(sum_x) - KahanSum(sum_x)
Out[58]:
3.637978807091713e-12

Explanation

  • In each iteration:
    • y (the running sum) is large compared to the next element x[i]
    • x[i] and c are of similar, small magnitude, so z = x[i] - c is computed accurately
    • y and t are of similar, large magnitude, so t - y is computed accurately
  • If t > y + z (in exact arithmetic), then we have "overshot" the sum.
    • (t - y) - z > 0
    • ⟹ c is positive, and subtracting it on the next iteration corrects for the overshoot
  • If t < y + z, then we have "undershot" the sum.
    • (t - y) - z < 0
    • ⟹ c is negative, and subtracting it on the next iteration corrects for the undershoot (see the worked trace below)
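
A minimal worked trace of one iteration (illustrative values, not taken from the benchmark above): a tiny element is absorbed by a large running sum, and c captures exactly what was lost.

y = 1.0          # running sum (big)
c = 0.0          # running compensation
z = 1e-16 - c    # next element minus the compensation
t = y + z        # rounds to 1.0: the tiny element is absorbed (an undershoot)
c = (t - y) - z  # = (1.0 - 1.0) - 1e-16 = -1e-16: the lost part, with opposite sign
y = t            # y is still 1.0, but c remembers the missing 1e-16
# On the next iteration, z = x[i] - c adds the lost 1e-16 back in.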

Another solution: Arbitrary precision arithmetic

In [59]:
big(0.1)
Out[59]:
0.1000000000000000055511151231257827021181583404541015625
In [60]:
precision(big(0.1))
Out[60]:
256
In [61]:
sum(big.(sum_x)) - sum_2(big.(sum_x))
Out[61]:
0.0
In [62]:
@btime sum(big.(sum_x))
@btime sum(sum_x)
  5.034 ms (131076 allocations: 7.50 MiB)
  8.955 μs (0 allocations: 0 bytes)
Out[62]:
32737.35345461974
In [63]:
f(k::T) where {T} = abs.(k * T(0.1) - sum(T(0.1) for i = 1:k))
x32 = Float32(10).^(1:8)
x64 = Float64(10).^(1:8)
plot(
    xlabel = "log10(k)",
    ylabel = "log10(f(k))",
    xscale = :log10, 
    yscale = :log10,
    legend = :bottomright
)
plot!(x64, f.(x32), label = "Float32", w = 3)
plot!(x64, f.(x64), label = "Float64", w = 3)
Out[63]:
[Figure: log10(k) vs. log10(f(k)) for Float32 and Float64]

Take-aways

  • Real numbers aren't real
  • Computers are good at math when numbers are of similar orders of magnitude
  • Float64 can compare across ~16 orders of magnitude
  • It's easy to lose a few bits of precision when doing calculations
    • Let's say 4 for safety
  • If you have a tolerance of 1e-6, don't use numbers larger than 1e+6! (A rough check of this rule is sketched below.)
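
A rough check of this rule of thumb (not one of the original cells): compare the representable spacing eps(x) against a 1e-6 tolerance at different magnitudes.

eps(1e6)    # ≈ 1.2e-10: comfortably below a 1e-6 tolerance
eps(1e10)   # ≈ 1.9e-6: the spacing alone already exceeds a 1e-6 tolerance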

Performance: Part II

In [64]:
function sum_matrix_col_row(X)
    I, J = size(X)
    y = 0.0
    for i = 1:I
        for j = 1:J
            y += X[i, j]
        end
    end
    return y
end

function sum_matrix_row_col(X)
    I, J = size(X)
    y = 0.0
    for j = 1:J
        for i = 1:I
            y += X[i, j]
        end
    end
    return y
end
Out[64]:
sum_matrix_row_col (generic function with 1 method)
In [65]:
N = 10_000
X = rand(N, N)
@btime sum_matrix_col_row($X)
@btime sum_matrix_row_col($X)
  809.678 ms (0 allocations: 0 bytes)
  122.252 ms (0 allocations: 0 bytes)
Out[65]:
4.9999472317076415e7

What's happening here?

  • Computers don't just have RAM
  • They have caches as well
    • Caches are closer to the CPU, but have much less capacity
    • L1 ~ 32 KB ~ 4096 Float64s
    • L2 ~ 256 KB ~ 32,768 Float64s
    • L3 ~ 4 MB ~ 524,288 Float64s
  • When we ask for x[i], the computer expects that we will probably want to access x[i+1], x[i+2], ..., so it fetches a block of memory once, rather than one-by-one.
  • Julia stores matrices in column-major order (i.e., column by column), so when we iterate down a column, the CPU has already pulled the next values into the cache (see the sketch below).
  • If we iterate row-by-row, the next value isn't in the cache, and we have to perform a new memory fetch.
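
A small sketch of the column-major layout (illustrative, not one of the original cells): the elements of each column are adjacent in memory, so iterating down columns walks through memory in order.

A = [1 2 3; 4 5 6]   # a 2×3 matrix
vec(A)               # in-memory order: [1, 4, 2, 5, 3, 6], i.e., column by column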

Want to see something crazy?

Computers can learn! Not Julia, but the actual CPU.

In the following benchmark, we look at the simple challenge of returning all the indices at which an element of an array is true:

In [66]:
x = rand(Bool, 5)
Out[66]:
5-element Array{Bool,1}:
 false
  true
 false
  true
  true
In [67]:
findall(x)
Out[67]:
3-element Array{Int64,1}:
 2
 4
 5

Specifically, we're going to time how fast the computer is when the array has a periodic structure, e.g., [true, false, true, false, ...] or [true, true, false, true, true, false, ...].

As we will see, my MacBook can learn really long periods!

In [68]:
using Printf, BenchmarkTools

const CPU_SPEED_GHZ = Sys.cpu_info()[1].speed / 1_000
const CPU_MODEL = Sys.cpu_info()[1].model;

function print_line(n, time, reference_time, cycle_time, penalty)
    @printf(
        "Period %5d: %7.2f us = %7.2f cycles per idx. Relative time %6.2f%%\n", 
        n, 
        time * 10^6, 
        time * cycle_time, 
        time / reference_time * 100
#         100 * (time - reference_time) * cycle_time / penalty 
    )
end

begin 
    # Calibrate performance with 30,000 values.
    N = 30_000
    # list = [true, false, true, false, ..., true, false]
    list = fill(false, N)
    list[1:2:end] .= true 
    true_false_time = @belapsed findall($list)
    # Now make list totally random
    list .= rand(Bool, N)
    random_time = @belapsed findall($list)
    # Adjustment to convert time to CPU cycles
    time_to_cycle = 10^9 / N * CPU_SPEED_GHZ
    # Compute how many cycles it takes if we mis-predict a branch
    penalty = 2 * (random_time - true_false_time) * time_to_cycle
    @printf(
        "\n\n%s; branch-miss penalty: %4.1f ns = %4.1f cycles\n\n", 
        CPU_MODEL, penalty / CPU_SPEED_GHZ, penalty
    )

    print_line(30_000, random_time, random_time, time_to_cycle, penalty)
    for n in [10_000, 5_000, 3_000, 2_500, 1_000, 500, 100, 2]
        # Generate a random Boolean pattern of length n, so now 
        #     list = [pattern, pattern, ..., pattern]
        pattern = rand(Bool, n)
        for i = 1:n:N 
            list[i:(i + n - 1)].= pattern
        end
        # Time how long to find all `true`s
        time = @belapsed findall($list)

        print_line(n, time, random_time, time_to_cycle, penalty)
    end
end

Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz; branch-miss penalty:  6.1 ns = 14.1 cycles

Period 30000:  118.55 us =    9.09 cycles per idx. Relative time 100.00%
Period 10000:  110.94 us =    8.51 cycles per idx. Relative time  93.58%
Period  5000:   98.21 us =    7.53 cycles per idx. Relative time  82.84%
Period  3000:   79.36 us =    6.08 cycles per idx. Relative time  66.94%
Period  2500:   65.48 us =    5.02 cycles per idx. Relative time  55.23%
Period  1000:   27.55 us =    2.11 cycles per idx. Relative time  23.24%
Period   500:   26.50 us =    2.03 cycles per idx. Relative time  22.35%
Period   100:   26.39 us =    2.02 cycles per idx. Relative time  22.26%
Period     2:   31.09 us =    2.38 cycles per idx. Relative time  26.22%
In [69]:
using Statistics
function run_experiment(N, n, K)
    list = fill(false, N)
    pattern = rand(Bool, n)
    for i = 1:n:N
        list[i:(i + n - 1)] .= pattern
    end
    list
    GC.gc()
    times = [@elapsed findall(list) for _ in 1:K]
    plot!(
        times, 
        ylims=(0, 0.0002),
        legend = false
    )
end


plot()
for i = 1:30
    run_experiment(30000, 100, 50)
end
plot!(xlabel = "Repetition", ylabel = "Execution time")
Out[69]:
[Figure: execution time vs. repetition for 30 runs of the period-100 findall experiment]

Takeaways

  • Real numbers are not real
  • Computers make mistakes when they do math
  • The code you write isn't the code that gets executed
  • Algorithms are not just O(operation), they are also O(memory)
  • There are different types of memory!
  • CPUs learn!