Floating point, Julia, and why computers are bad at math

Purpose

The purpose of this notebook is to cover the quirks of floating point arithmetic, how computers do math, and things to look out for when implementing algorithms.

Takeaways

  • Real numbers are not real
  • Computers make mistakes when they do math
  • The code you write isn't the code that gets executed
  • Algorithms are not just O(operation), they are also O(memory)
  • There are different types of memory!
  • CPUs learn!
In [1]:
using LinearAlgebra
using Plots
using Random
using Test
Random.seed!(1234);

Section 1: Julia

In [2]:
1 + 1
Out[2]:
2
In [3]:
println("Hello, World!, said the 🐱π.")
Hello, World!, said the 🐱π.
In [4]:
x = [1, 2, 3]
Out[4]:
3-element Array{Int64,1}:
 1
 2
 3
In [5]:
A = [1 2; 3 4]
Out[5]:
2×2 Array{Int64,2}:
 1  2
 3  4
In [6]:
eigvals(A)
Out[6]:
2-element Array{Float64,1}:
 -0.3722813232690143
  5.372281323269014 

Why is Julia fast?

In [7]:
f(x) = 2x + 1
Out[7]:
f (generic function with 1 method)
In [8]:
@code_warntype f(1)
Body::Int64
1 ─ %1 = (Base.mul_int)(2, x)::Int64
│   %2 = (Base.add_int)(%1, 1)::Int64
└──      return %2
In [9]:
@code_warntype f(1.0im + 1)
Body::Complex{Float64}
1 ─ %1 = (Base.getfield)(x, :re)::Float64
│   %2 = (Base.mul_float)(2.0, %1)::Float64
│   %3 = (Base.getfield)(x, :im)::Float64
│   %4 = (Base.mul_float)(2.0, %3)::Float64
│   %5 = (Base.add_float)(1.0, %2)::Float64
│   %6 = %new(Complex{Float64}, %5, %4)::Complex{Float64}
└──      return %6
In [10]:
using Zygote  # A package for automatic differentiation

f'(3)
Out[10]:
2
In [11]:
@code_native f'(6)
	.section	__TEXT,__text,regular,pure_instructions
; ┌ @ interface.jl:50 within `#38'
	movl	$2, %eax
	retl
	nopw	%cs:(%eax,%eax)
; └

What's an integer?

In [12]:
bitstring(Int8(1))
Out[12]:
"00000001"
In [13]:
length(bitstring(Int8(1)))
Out[13]:
8
In [14]:
bitstring(123456)
Out[14]:
"0000000000000000000000000000000000000000000000011110001001000000"
In [15]:
"""
    string_to_int(s::String; is_big_endian::Bool = true)

Return the integer corresponding to a string `s` 
representing the bits of an integer.

If `is_big_endian`, assumes bits are stored from most-significant
bit to least. If `!is_big_endian`, assumes bits are stored in
little-endian order, i.e., least-significant bit to most.
"""
function string_to_int(s::String; is_big_endian::Bool = true)
    y = 0
    for (i, x) in enumerate(is_big_endian ? reverse(s) : s)
        if x == '1'
            y += 2^(i-1)
        end
    end
    return y
end

@testset "big_endian" begin 
    for i = 0:1024
        @test string_to_int(bitstring(i)) == i
    end
end

@testset "little_endian" begin 
    @test string_to_int(bitstring(0); is_big_endian = false) == 0
    for i = 1:1024
        @test string_to_int(bitstring(i); is_big_endian = false) != i
    end
end
Test Summary: | Pass  Total
big_endian    | 1025   1025
Test Summary: | Pass  Total
little_endian | 1025   1025
Out[15]:
Test.DefaultTestSet("little_endian", Any[], 1025, false)

What do floats float on anyway?

In [16]:
0.1 + 0.2 ≈ 0.3
Out[16]:
true
In [17]:
√eps(Float64)
Out[17]:
1.4901161193847656e-8
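
For contrast (a quick check that is not one of the original cells): exact == comparison of 0.1 + 0.2 and 0.3 fails, which is why the ≈ (isapprox) comparison is used above.

0.1 + 0.2 == 0.3   # false
0.1 + 0.2          # 0.30000000000000004
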
In [18]:
?isapprox
search: isapprox

Out[18]:
isapprox(x, y; rtol::Real=atol>0 ? 0 : √eps, atol::Real=0, nans::Bool=false, norm::Function)

Inexact equality comparison: true if norm(x-y) <= max(atol, rtol*max(norm(x), norm(y))). The default atol is zero and the default rtol depends on the types of x and y. The keyword argument nans determines whether or not NaN values are considered equal (defaults to false).

For real or complex floating-point values, if an atol > 0 is not specified, rtol defaults to the square root of eps of the type of x or y, whichever is bigger (least precise). This corresponds to requiring equality of about half of the significand digits. Otherwise, e.g. for integer arguments or if an atol > 0 is supplied, rtol defaults to zero.

x and y may also be arrays of numbers, in which case norm defaults to vecnorm but may be changed by passing a norm::Function keyword argument. (For numbers, norm is the same thing as abs.) When x and y are arrays, if norm(x-y) is not finite (i.e. ±Inf or NaN), the comparison falls back to checking whether all elements of x and y are approximately equal component-wise.

The binary operator ≈ is equivalent to isapprox with the default arguments, and x ≉ y is equivalent to !isapprox(x,y).

Note that x ≈ 0 (i.e., comparing to zero with the default tolerances) is equivalent to x == 0 since the default atol is 0. In such cases, you should either supply an appropriate atol (or use norm(x) ≤ atol) or rearrange your code (e.g. use x ≈ y rather than x - y ≈ 0). It is not possible to pick a nonzero atol automatically because it depends on the overall scaling (the "units") of your problem: for example, in x - y ≈ 0, atol=1e-9 is an absurdly small tolerance if x is the radius of the Earth in meters, but an absurdly large tolerance if x is the radius of a Hydrogen atom in meters.

Examples

julia> 0.1 ≈ (0.1 - 1e-10)
true

julia> isapprox(10, 11; atol = 2)
true

julia> isapprox([10.0^9, 1.0], [10.0^9, 2.0])
true

julia> 1e-10 ≈ 0
false

julia> isapprox(1e-10, 0, atol=1e-8)
true

isapprox(x::FixedPoint, y::FixedPoint; rtol=0, atol=max(eps(x), eps(y)))

For FixedPoint numbers, the default criterion is that x and y differ by no more than eps, the separation between adjacent fixed-point numbers.

In [19]:
my_float = 0.1
Out[19]:
0.1
In [20]:
bitstring(my_float)
Out[20]:
"0011111110111001100110011001100110011001100110011001100110011010"

Floats have three components:

$$sign * significand * 2^{exponent}$$
In [21]:
Base.sign(my_float)
Out[21]:
1.0
In [22]:
Base.significand(my_float)
Out[22]:
1.6
In [23]:
Base.exponent(my_float)
Out[23]:
-4
In [24]:
1.0 * (1.6) * (2)^(-4)
Out[24]:
0.1
In [25]:
BigFloat(0.1)
Out[25]:
0.1000000000000000055511151231257827021181583404541015625

Floats are stored as a string of bits. A Float64 has 64 bits:

In [26]:
my_float_bits = bitstring(my_float)
Out[26]:
"0011111110111001100110011001100110011001100110011001100110011010"

The first bit is the sign bit:

In [27]:
s_my_sign = my_float_bits[1]
Out[27]:
'0': ASCII/Unicode U+0030 (category Nd: Number, decimal digit)

Sign bits are decoded as: $$(-1)^{bit}$$

In [28]:
my_sign = (-1) ^ parse(Int, s_my_sign)
Out[28]:
1

Bits 2 through 12 (i.e., 11 bits) are the exponent:

In [29]:
s_my_exponent = my_float_bits[2:12]
Out[29]:
"01111111011"

Exponents are stored with a bias, which depends on the number of bits in the floating point number. For Float64, the bias is 1023.
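
As an aside (a quick sketch of the general IEEE 754 rule, not one of the original cells): a format with E exponent bits uses a bias of 2^(E-1) - 1, so the 11 exponent bits of Float64 give the 1023 used below.

exponent_bias(E) = 2^(E - 1) - 1  # illustrative helper, not defined in the notebook
exponent_bias(11)  # Float64 -> 1023
exponent_bias(8)   # Float32 -> 127
exponent_bias(5)   # Float16 -> 15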

In [30]:
my_exponent = string_to_int(s_my_exponent) - 1023
Out[30]:
-4

The remaining bits (13--64) (i.e., 52 bits) encode the significand.

In [31]:
s_my_significand = my_float_bits[13:end]
Out[31]:
"1001100110011001100110011001100110011001100110011010"

Given a bitstring s, the significand is stored as follows: $$significand = 1 + \sum_i s_i \times 2^{-i}. $$

In [32]:
1//2
Out[32]:
1//2
In [33]:
"""
    string_to_inv_rational(s::String)

Return Σᵢ sᵢ * 2⁻ⁱ. 

Note that bits are still stored from most-significant to least
(big-endian), but because the coefficient is decreasing, we walk
forwards through the string.
"""
function string_to_inv_rational(s::String)
    y = 0 // 1
    for (i, x) in enumerate(s)
        if x == '1'
            y += 1 // 2^i
        end
    end
    return y
end

@test string_to_inv_rational("000") == 0 // 8
@test string_to_inv_rational("001") == 1 // 8
@test string_to_inv_rational("010") == 2 // 8
@test string_to_inv_rational("011") == 3 // 8
@test string_to_inv_rational("100") == 4 // 8
@test string_to_inv_rational("101") == 5 // 8
@test string_to_inv_rational("110") == 6 // 8
@test string_to_inv_rational("111") == 7 // 8
Out[33]:
Test Passed
In [34]:
my_significand = 1 + string_to_inv_rational(s_my_significand)
Out[34]:
3602879701896397//2251799813685248
In [35]:
BigFloat(my_significand)
Out[35]:
1.600000000000000088817841970012523233890533447265625

Putting this all together gives:

In [36]:
my_sign * my_significand * 2.0^my_exponent
Out[36]:
0.1
In [37]:
BigFloat(my_sign * my_significand * 2.0^my_exponent)
Out[37]:
0.1000000000000000055511151231257827021181583404541015625

Because of the way they are stored, floats cannot represent all real numbers. Instead, they approximate the real numbers by a finite set of values; the spacing between adjacent values depends on where the value sits on the number line.

In [38]:
typemin(Float64)
Out[38]:
-Inf
In [39]:
nextfloat(typemin(Float64))
Out[39]:
-1.7976931348623157e308
In [40]:
x = nextfloat(typemin(Float64))
nextfloat(x) - x
Out[40]:
1.99584030953472e292
In [41]:
nextfloat(x)
Out[41]:
-1.7976931348623155e308
In [42]:
x = Float64(0.0)
nextfloat(x) - x
Out[42]:
5.0e-324
In [43]:
x = typemin(Float16)
y = Float64[x]
while x < typemax(Float16)
    x = nextfloat(x)
    push!(y, x)
end
y
Out[43]:
63489-element Array{Float64,1}:
   -Inf  
 -65504.0
 -65472.0
 -65440.0
 -65408.0
 -65376.0
 -65344.0
 -65312.0
 -65280.0
 -65248.0
 -65216.0
 -65184.0
 -65152.0
      ⋮  
  65184.0
  65216.0
  65248.0
  65280.0
  65312.0
  65344.0
  65376.0
  65408.0
  65440.0
  65472.0
  65504.0
    Inf  
In [44]:
plot(
    y[2:end], diff(y),
    title = "Float16",
    xlabel = "Float16(x)",
    ylabel = "Gap between two nearest values",
    legend = false,
    w = 2,
    size = (600, 300)
)
Out[44]:
[Figure: "Float16"; x-axis: Float16(x), y-axis: gap between two nearest values]
In [45]:
plot(
    y[2:end], abs.(diff(y) ./ y[2:end]),
    title = "Float16",
    xlabel = "Float16(x)",
    ylabel = "Relative gap between two nearest values",
    legend = false,
    w = 2,
    ylims=(0, 0.001),
    size = (600, 300)
)
Out[45]:
[Figure: "Float16"; x-axis: Float16(x), y-axis: relative gap between two nearest values]

For Float64, the relative gap is on the order of 1e-16.

In [46]:
rel_float(x) = log10(big(nextfloat(Float64(x))) / Float64(x))
@show rel_float(1.0)
@show rel_float(1e6)
@show rel_float(1e100)
rel_float(1.0) = 9.643274665532870036582915124251528998657489145735719797126908901529315556125987e-17
rel_float(1.0e6) = 5.055853187842897632763057245199072744713910974049673247098526072324069356667919e-17
rel_float(1.0e100) = 8.436903800587370737577946320094566189509738835395676919057547237940321570207997e-17
Out[46]:
8.436903800587370737577946320094566189509738835395676919057547237940321570207997e-17

What happens if the result of a computation falls in one of the "missing" intervals of the real line?

In [47]:
bitstring(Float16(0.1, RoundNearest))
Out[47]:
"0010111001100110"
In [48]:
bitstring(Float16(0.1, RoundDown))
Out[48]:
"0010111001100110"
In [49]:
bitstring(Float16(0.1, RoundUp))
Out[49]:
"0010111001100111"

Takeaways

  • Floating point arithmetic has relative accuracy.
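
For example (a quick check in the spirit of the cells above, not one of the original cells): the absolute spacing between adjacent Float64 values grows with magnitude, but the relative spacing stays near eps(Float64) ≈ 2.2e-16.

for x in (1.0, 1e6, 1e12)
    println("x = ", x, ", eps(x) = ", eps(x), ", eps(x) / x = ", eps(x) / x)
end
# eps(x) grows from ~2.2e-16 to ~1.2e-4, but eps(x) / x stays ~1e-16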

Benchmarking

In [50]:
using BenchmarkTools

const sum_x = rand(2^16)
Out[50]:
65536-element Array{Float64,1}:
 0.5321679144582991 
 0.7440687931830725 
 0.17420924655027847
 0.6095413423260092 
 0.5811121916495186 
 0.4618600032955602 
 0.8548072895438457 
 0.6479584788429675 
 0.7848631306989902 
 0.08043609303278632
 0.5322608515836502 
 0.569827952518936  
 0.5581170907968327 
 ⋮                  
 0.29092695640622646
 0.9273910898003137 
 0.38793845448930275
 0.9398412289343301 
 0.21916729753000674
 0.37539994947327227
 0.1842597010979654 
 0.8896881002123835 
 0.44235584572264686
 0.7111890556918798 
 0.040817896891157  
 0.5373686333531669 
In [51]:
sum_1(x::Vector{T}) where {T} = sum(x)
@btime sum_1($sum_x)
  8.968 μs (0 allocations: 0 bytes)
Out[51]:
32737.35345461974
In [52]:
function sum_2(x::Vector{T}) where {T}
    y = zero(T)
    for i = 1:length(x)
        y += x[i]
    end
    return y
end
@btime sum_2($sum_x)
  69.193 μs (0 allocations: 0 bytes)
Out[52]:
32737.353454619715
In [53]:
function sum_3(x::Vector{T}) where {T}
    @assert mod(length(x), 4) == 0
    y = zero(T)
    i = 1
    while i < length(x)
        y += x[i] + x[i + 1] + x[i + 2] + x[i + 3]
        i += 4
    end
    return y
end
@btime sum_3($sum_x)
  26.008 μs (0 allocations: 0 bytes)
Out[53]:
32737.353454619926
In [54]:
function sum_4(x::Vector{T}) where {T}
    @assert mod(length(x), 4) == 0
    y = zero(T)
    i = 1
    @inbounds while i < length(x)
        y += x[i] + x[i + 1] + x[i + 2] + x[i + 3]
        i += 4
    end
    return y
end
@btime sum_4($sum_x)
  17.825 μs (0 allocations: 0 bytes)
Out[54]:
32737.353454619926

Has anyone noticed something weird about the sum totals?

In [55]:
@show sum_1(sum_x) - sum_2(sum_x)
@show sum_1(sum_x) - sum_3(sum_x)
@show sum_1(sum_x) - sum_4(sum_x)
sum_1(sum_x) - sum_2(sum_x) = 2.546585164964199e-11
sum_1(sum_x) - sum_3(sum_x) = -1.8553691916167736e-10
sum_1(sum_x) - sum_4(sum_x) = -1.8553691916167736e-10
Out[55]:
-1.8553691916167736e-10

What's going on?

In [56]:
1 + 1e-16 == 1
Out[56]:
true
In [57]:
run(`python -c 'print(1.0 + 1e-16 == 1.0)'`)
True
Out[57]:
Process(`python -c 'print(1.0 + 1e-16 == 1.0)'`, ProcessExited(0))

Floating point numbers are bad at doing math when they are of very different magnitudes!

In [58]:
function KahanSum(x)
    y = 0.0
    c = 0.0
    for i = 1:length(x)
        z = x[i] - c
        t = y + z
        c = (t - y) - z
        y = t
    end
    return y
end

sum(sum_x) - KahanSum(sum_x)
Out[58]:
3.637978807091713e-12

Explanation

  • In each iteration:
    • y (the running sum) is large compared to the next element x[i]
    • x[i] and c are of similar, small magnitude, so z = x[i] - c is computed accurately
    • y and t are of similar, large magnitude, so t - y is computed accurately
  • If t > y + z (in exact arithmetic), then we have "overshot" the sum.
    • (t - y) - z > 0
    • ⟹ c is positive, and subtracting it on the next iteration corrects for the overshoot
  • If t < y + z, then we have "undershot" the sum.
    • (t - y) - z < 0
    • ⟹ c is negative, and subtracting it on the next iteration corrects for the undershoot (see the worked trace below)
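
A minimal worked trace of one iteration (illustrative values, not taken from the benchmark above): a tiny element is absorbed by a large running sum, and c captures exactly what was lost.

y = 1.0          # running sum (big)
c = 0.0          # running compensation
z = 1e-16 - c    # next element minus the compensation
t = y + z        # rounds to 1.0: the tiny element is absorbed (an undershoot)
c = (t - y) - z  # = (1.0 - 1.0) - 1e-16 = -1e-16: the lost part, with opposite sign
y = t            # y is still 1.0, but c remembers the missing 1e-16
# On the next iteration, z = x[i] - c adds the lost 1e-16 back in.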

Another solution: Arbitrary precision arithmetic

In [59]:
big(0.1)
Out[59]:
0.1000000000000000055511151231257827021181583404541015625
In [60]:
precision(big(0.1))
Out[60]:
256
In [61]:
sum(big.(sum_x)) - sum_2(big.(sum_x))
Out[61]:
0.0
In [62]:
@btime sum(big.(sum_x))
@btime sum(sum_x)
  5.034 ms (131076 allocations: 7.50 MiB)
  8.955 μs (0 allocations: 0 bytes)
Out[62]:
32737.35345461974
In [63]:
f(k::T) where {T} = abs.(k * T(0.1) - sum(T(0.1) for i = 1:k))
x32 = Float32(10).^(1:8)
x64 = Float64(10).^(1:8)
plot(
    xlabel = "log10(k)",
    ylabel = "log10(f(k))",
    xscale = :log10, 
    yscale = :log10,
    legend = :bottomright
)
plot!(x64, f.(x32), label = "Float32", w = 3)
plot!(x64, f.(x64), label = "Float64", w = 3)
Out[63]:
[Figure: log10(k) vs. log10(f(k)) for Float32 and Float64]

Take-aways

  • Real numbers aren't real
  • Computers are good at math when numbers are of similar orders of magnitude
  • Float64 can compare across ~16 orders of magnitude
  • It's easy to lose a few bits of precision when doing calculations
    • Let's say 4 for safety
  • If you have a tolerance of 1e-6, don't use numbers larger than 1e+6! (A rough check of this rule is sketched below.)
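
A rough check of this rule of thumb (not one of the original cells): compare the representable spacing eps(x) against a 1e-6 tolerance at different magnitudes.

eps(1e6)    # ≈ 1.2e-10: comfortably below a 1e-6 tolerance
eps(1e10)   # ≈ 1.9e-6: the spacing alone already exceeds a 1e-6 tolerance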

Performance: Part II

In [64]:
function sum_matrix_col_row(X)
    I, J = size(X)
    y = 0.0
    for i = 1:I
        for j = 1:J
            y += X[i, j]
        end
    end
    return y
end

function sum_matrix_row_col(X)
    I, J = size(X)
    y = 0.0
    for j = 1:J
        for i = 1:I
            y += X[i, j]
        end
    end
    return y
end
Out[64]:
sum_matrix_row_col (generic function with 1 method)
In [65]:
N = 10_000
X = rand(N, N)
@btime sum_matrix_col_row($X)
@btime sum_matrix_row_col($X)
  809.678 ms (0 allocations: 0 bytes)
  122.252 ms (0 allocations: 0 bytes)
Out[65]:
4.9999472317076415e7

What's happening here?

  • Computers don't just have RAM
  • They have caches as well
    • Caches are closer to the CPU, but have much less capacity
    • L1 ~ 32 KB ~ 4096 Float64s
    • L2 ~ 256 KB ~ 32,768 Float64s
    • L3 ~ 4 MB ~ 524,288 Float64s
  • When we ask for x[i], the computer expects that we will probably want to access x[i+1], x[i+2], ..., so it fetches a block of memory once, rather than one-by-one.
  • Julia stores matrices in column-major order (i.e., column by column), so when we iterate down a column, the CPU has already pulled the next values into the cache (see the sketch below).
  • If we iterate row-by-row, the next value isn't in the cache, and we have to perform a new memory fetch.
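
A small sketch of the column-major layout (illustrative, not one of the original cells): the elements of each column are adjacent in memory, so iterating down columns walks through memory in order.

A = [1 2 3; 4 5 6]   # a 2×3 matrix
vec(A)               # in-memory order: [1, 4, 2, 5, 3, 6], i.e., column by column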

Want to see something crazy?

Computers can learn! Not Julia, but the actual CPU.

In the following benchmark, we look at the simple challenge of returning all the indices at which an element of an array is true:

In [66]:
x = rand(Bool, 5)
Out[66]:
5-element Array{Bool,1}:
 false
  true
 false
  true
  true
In [67]:
findall(x)
Out[67]:
3-element Array{Int64,1}:
 2
 4
 5

Specifically, we're going to time how fast the computer is when the array has a periodic structure, e.g., [true, false, true, false, ...] or [true, true, false, true, true, false, ...].

As we will see, my MacBook can learn really long periods!

In [68]:
using Printf, BenchmarkTools

const CPU_SPEED_GHZ = Sys.cpu_info()[1].speed / 1_000
const CPU_MODEL = Sys.cpu_info()[1].model;

function print_line(n, time, reference_time, cycle_time, penalty)
    @printf(
        "Period %5d: %7.2f us = %7.2f cycles per idx. Relative time %6.2f%%\n", 
        n, 
        time * 10^6, 
        time * cycle_time, 
        time / reference_time * 100
#         100 * (time - reference_time) * cycle_time / penalty 
    )
end

begin 
    # Calibrate performance with 30,000 values.
    N = 30_000
    # list = [true, false, true, false, ..., true, false]
    list = fill(false, N)
    list[1:2:end] .= true 
    true_false_time = @belapsed findall($list)
    # Now make list totally random
    list .= rand(Bool, N)
    random_time = @belapsed findall($list)
    # Adjustment to convert time to CPU cycles
    time_to_cycle = 10^9 / N * CPU_SPEED_GHZ
    # Compute how many cycles it takes if we mis-predict a branch
    penalty = 2 * (random_time - true_false_time) * time_to_cycle
    @printf(
        "\n\n%s; branch-miss penalty: %4.1f ns = %4.1f cycles\n\n", 
        CPU_MODEL, penalty / CPU_SPEED_GHZ, penalty
    )

    print_line(30_000, random_time, random_time, time_to_cycle, penalty)
    for n in [10_000, 5_000, 3_000, 2_500, 1_000, 500, 100, 2]
        # Generate a random Boolean pattern of length n, so now 
        #     list = [pattern, pattern, ..., pattern]
        pattern = rand(Bool, n)
        for i = 1:n:N 
            list[i:(i + n - 1)].= pattern
        end
        # Time how long to find all `true`s
        time = @belapsed findall($list)

        print_line(n, time, random_time, time_to_cycle, penalty)
    end
end

Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz; branch-miss penalty:  6.1 ns = 14.1 cycles

Period 30000:  118.55 us =    9.09 cycles per idx. Relative time 100.00%
Period 10000:  110.94 us =    8.51 cycles per idx. Relative time  93.58%
Period  5000:   98.21 us =    7.53 cycles per idx. Relative time  82.84%
Period  3000:   79.36 us =    6.08 cycles per idx. Relative time  66.94%
Period  2500:   65.48 us =    5.02 cycles per idx. Relative time  55.23%
Period  1000:   27.55 us =    2.11 cycles per idx. Relative time  23.24%
Period   500:   26.50 us =    2.03 cycles per idx. Relative time  22.35%
Period   100:   26.39 us =    2.02 cycles per idx. Relative time  22.26%
Period     2:   31.09 us =    2.38 cycles per idx. Relative time  26.22%
In [69]:
using Statistics
function run_experiment(N, n, K)
    list = fill(false, N)
    pattern = rand(Bool, n)
    for i = 1:n:N
        list[i:(i + n - 1)] .= pattern
    end
    list
    GC.gc()
    times = [@elapsed findall(list) for _ in 1:K]
    plot!(
        times, 
        ylims=(0, 0.0002),
        legend = false
    )
end


plot()
for i = 1:30
    run_experiment(30000, 100, 50)
end
plot!(xlabel = "Repetition", ylabel = "Execution time")
Out[69]:
[Figure: execution time vs. repetition for 30 runs of the period-100 findall experiment]

Takeaways

  • Real numbers are not real
  • Computers make mistakes when they do math
  • The code you write isn't the code that gets executed
  • Algorithms are not just O(operation), they are also O(memory)
  • There are different types of memory!
  • CPUs learn!