The same behavior can be observed using
numpy directly. However, this only appears when using
dtype=np.float32 rather than
dtype=np.float64. Switching your dtype to
np.float64 corrects the problem.
To understand why, you need to understand how floating-point numbers are stored in memory. Let us consider
a when represented in single precision and in double precision:
    import numpy as np

    a = -1.55786165e+14
    a_single = np.array([a], dtype=np.float32)
    a_double = np.array([a], dtype=np.float64)

    a_single, a_double, a
    # The line above prints:
    # (-155786160000000.0, -155786165000000.0, -155786165000000.0)
As you can see,
a is truncated when using single precision. But why is that?
The base-2 logarithm of
abs(a) is between 47 and 48. Hence,
a can be written as
-1 * 2^47 * 1.x. To represent a floating-point number, one has to encode the exponent (47) and the fraction (x).
In our case,
.x would be approximately equal to:
-a / pow(2, 47) - 1
which is equal to
0.1069272787267437. What we want now is to write this number as a sum of negative powers of 2, starting from 2^-1. This means that if we use
N bits to represent it, we will store in memory the integer part of
0.1069272787267437 * pow(2, N).
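The arithmetic above can be reproduced with a few lines of plain Python. This is only a sketch: the names `exponent`, `fraction` and `stored_fraction` are made up for illustration, and the helper simply takes the integer part of the scaled fraction, exactly as described in the text.

```python
import math

a = -1.55786165e+14

# Exponent e such that 2**e <= abs(a) < 2**(e + 1).
exponent = math.floor(math.log2(abs(a)))

# Fractional part x of the significand 1.x.
fraction = -a / 2**exponent - 1


def stored_fraction(x, n_bits):
    """Integer part of x * 2**n_bits, i.e. the first n_bits binary digits of x."""
    return int(x * 2**n_bits)


print(exponent)                        # 47
print(fraction)                        # roughly 0.10692727872674
print(stored_fraction(fraction, 23))   # 896971
```

Passing `n_bits=52` instead of 23 gives the integer that double precision would store for the same fraction.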
In single precision, we use
N = 23 bits to represent this number. Since the integer part of
0.1069272787267437 * pow(2, 23) is 896971, whose binary expansion is
11011010111111001011, which is 20 bits long, the number stored in memory is that expansion padded with three leading zeros to fill the 23 bits:
00011011010111111001011.
When using double precision, however, the number stored in memory is
0001101101011111100101100000110100101110100000000000. Note that the large number of trailing zeroes may indicate that the exact value of
a is stored (since we don’t need more precision to represent it), which is the case here.
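If you want to inspect the stored bits yourself, one possible way is to reinterpret the float as an integer with the standard `struct` module and mask out the fraction field. The helper names below are hypothetical, and the sketch assumes the usual IEEE 754 layouts (23 fraction bits for single precision, 52 for double).

```python
import struct

a = -1.55786165e+14


def fraction_bits_single(value):
    """Return the 23 fraction bits of value stored as an IEEE 754 single."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    return format(raw & ((1 << 23) - 1), "023b")


def fraction_bits_double(value):
    """Return the 52 fraction bits of value stored as an IEEE 754 double."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    return format(raw & ((1 << 52) - 1), "052b")


print(fraction_bits_single(a))
# 00011011010111111001011
print(fraction_bits_double(a))
# 0001101101011111100101100000110100101110100000000000
```

The single-precision field is exactly the first 23 bits of the double-precision one, so everything after that point is lost in float32.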
This explains why
a is truncated when represented as a single-precision float. The same reasoning works for adding
a. Since the exponent of the resulting float will be
47, the finest resolution you can get in single precision is
2^47 * 2^-23 = 2^24, while the finest resolution you can get in double precision is
2^47 * 2^-52 = 2^-5. Since you are working with integers, which need a resolution of 1, this explains why you get an exact result with double precision and an incorrect one with single precision.
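The gap between neighbouring representable values can be checked numerically. The sketch below uses only the standard library; `to_single` is a hypothetical helper that rounds a Python float (which is double precision) to the nearest single-precision value, and adding 1 is just an illustrative integer-sized step, not the operation from the original question.

```python
import math
import struct

a = -1.55786165e+14


def to_single(x):
    """Round a double-precision Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


exponent = math.floor(math.log2(abs(a)))   # 47
gap_single = 2.0 ** (exponent - 23)        # 2^24: spacing of float32 values near a
gap_double = 2.0 ** (exponent - 52)        # 2^-5: spacing of float64 values near a

a_single = to_single(a)

# An integer-sized step is far below the float32 gap, so it is rounded away.
single_sum = to_single(a_single + 1.0)
# The float64 gap is below 1, so the same step is represented exactly.
double_sum = a + 1.0

print(gap_single, gap_double)    # 16777216.0 0.03125
print(single_sum == a_single)    # True
print(double_sum == a)           # False
```

With NumPy, `np.spacing` applied to `np.float32(a)` and `np.float64(a)` should report gaps of the same magnitude.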