How Floating Point Numbers Work

Original article was published on Deep Learning on Medium

How Floating Point Numbers Work

With Applications to Deep Learning and Digital Photography

It is a pesky fact that computers work in binary approximations while humans tend to think in terms of exact values. This is why, in your high school physics class, you may have experienced “rounding error” when computing intermediate numerical values in your solutions and why, if you open a python terminal and compute 0.1 * 3, you will get a weird result.¹

>>> 0.1 + 0.1 + 0.1

this makes floating point numbers an example of a leaky abstraction. Normally, python and numerical computing libraries like numpy or PyTorch handle this behind the scenes. But understanding the details can help you avoid otherwise unexpected errors and speed up many machine learning computations. For example, Google’s Tensor Processing Units (TPUs) use a modified floating point format to substantially improve computational efficiency while trying to maintain good results.

In this article we’ll dig into the nuts and bolts of floating point numbers, cover the edge cases (numerical underflow and overflow), and close with applications: TPU’s bfloat16 format and HDR imaging. The main background assumed is that you understand how to count in binary, as well as how binary fractions work.

Representing Integers

Let’s briefly review counting to 5 in binary: 0, 1, 10, 11, 100, 101. Got it? This is great for an unsigned integer; one which is never negative. For example, if we have an 8 bit unsigned integer, we can represent numbers between 00000000 and 11111111. In decimal, that’s between 0 and 2⁸-1=255. For example, most standard image formats are 8-bit color, which is why the “RGB” values go from 0 to 255.

Note also that we would typically abbreviate this with a hexadecimal (base 16) representation: 0x00 to 0xFF. The 0x prefix means “this is a hex number”. The hexadecimal digits are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F; so F is essentially short for the four bits “1111” (both 0xF and 1111 are 15 in base-10). Also 8 bits are a byte, so our number is a measly 1 byte. But we won’t focus too much on hexadecimal in this article.

Signed Integers

Now, you will notice that with an unsigned int, we can’t represent simple numbers like -2. One way you could try to solve this is to make the first bit represent the sign. Say “0” means negative and “1” means positive. Thinking about 4-bit numbers, 0111 would be -7, while 1111 would be +7. However, this has some weird features. For example, 0000 is “-0” while 1000 is “+0”. This is not great: comparing two numbers for equality would get tricky; plus we are wasting space.

The standard solution to this is to use Two’s Complement, which is what most implementations use for signed integers. (There is also a little-used One’s Complement). However, this isn’t what we are going to need for floating point numbers, so we won’t delve into it.

Let’s consider instead a biased representation of a signed 8-bit integer. It’s biased because, well it’s off by a bit. Instead of letting 00000000 represent 0, we will instead use 01111111 to represent 0. This would normally represent 127 in base 10. But we have biased our representation by 127. That means that 00000000 represents –127, while 11111111 represents 128.

Double Precision Floating Point Numbers

Since most recently produced personal computers use a 64 bit processor, it’s pretty common for the default floating-point implementation to be 64 bit. This is called “double precision” because it is double of the previous-standard 32-bit precision (common computers switched to 64 bit processors sometime in the last decade).

Scientific Notation

For context, the basic idea of a floating point number is to use the binary-equivalent of scientific notation. Your high-school science teachers hopefully drilled into you exactly how to do this (along with a whole bunch about those dreaded signficant figures – sigfigs). For example, the scientific representation of 8191.31 is:

Scientific Notation for a Miscellaneous Decimal Number

You should notice three key elements. First, a sign (is the number + or -?). Second, we always write the number with a single digit (between 1 and 9 inclusive), followed by a decimal point, followed by a number of digits. Compare that to the below, which are not in scientific notation even though they are true mathematical facts.

Representations NOT in Scientific Notation

With that in mind, let’s think about what will change when we go to binary. First of all, instead of using 10 as the base of the exponent (also called the radix), we’ll want to use 2. Secondly, instead of decimal fractions, we’ll want to use binary fractions.

Binary versus Decimal Scientific Notation

Please note that I have chosen to write the radix (2 or 10) and their exponents (1 or 0 respectively) in their decimal forms while the numbers on the left hand side and the significands are in binary or decimal respectively.

The binary number 1101 is 13 in base 10. And 13/16 is 0.8125. This is a binary fraction. If you haven’t played with these yet, you should convince yourself of the following:

Some simple binary fractions

This is the binary version of the fact that 0.3 is 3/10 and 0.55 is 55/100 (which can be further simplified, of course).

Great. We are now ready to dig into the details of floating point numbers

The IEEE 754 Standard

Here is the diagram for the “IEEE 754” standard commonly implemented. The first bit is the sign. 0 is positive and 1 is negative (the opposite of what we naïvely suggested above). There are 11 bits for the exponent and 52 or 53 (depending how you count) bits for the fraction, also called the “mantissa” or “significand”. The sign just works like the flag we saw above, so we’ll go into each of the last two in some depth.

IEEE 754 double-precision floating point number (Wikipedia)

The Exponent

The exponent is an 11-bit biased (signed) integer like we saw before, but some caveats. The bias is 2¹⁰–1=1023, so that the 11 bits 01111111111 represent 0.

This would normally mean that the largest possible exponent is represented by the 11 bits 11111111111 (representing 2¹¹–1–1023=1024) and the smallest possible exponent is represented by the 11 bits 00000000000 (representing –1023).

However, as we will discuss:

  • The exponent represented by 11111111111 is reserved for infinities and NaNs.
  • The 00000000000 exponent is reserved for representing 0 and something else we’ll get to.

This means that the exponent can, in normal circumstances, be between –1022 and 1023 (2046 possible values).

The Significand

The 52-bit significand represents a binary fraction. If you review the scientific notation section above, you’ll see that whenever we write a binary number in “binary scientific notation,” the leading digit is always 1. (In base 10 it could be between 1 and 9, but 2–9 aren’t binary digits). Since we know the leading digit will always be 1 (with some caveats to be discussed), we don’t need to actually store it on the computer (this would be wasteful). This is why I said the significand is 53 bits “depending on how you count.”

In other words, the 52 bits stored on the computer represent the 52 bits that come after the decimal point (or maybe we should call it a “binary point”). A leading 1 is always assumed.

Normal Numbers

I keep mentioning some caveats, and I intend to put them off for as long as possible. A “normal number” is a non-zero number that doesn’t use any of these caveats, and we are in a position to give some examples.

Recall the three components:

  • 1 bit for the sign
  • 11 bits for the exponent, which is (in decimal) between –1022 and +1023. It is represented as a biased integer in the binary encoding.
  • 52 bits for the significand.

How would we represent the decimal number 1?

Well, the sign is positive, so the sign bit is 0. (Think of 1 as a flag for “negative”). The exponent is 0. Remembering that the biased representation means we add 1023, we get the binary representation 01111111111. Finally, all the fraction bits are 0. Easy:

Floating-Point Representation of 1

I’ve written the binary floating-point representation with a space separating the three parts. As usual, the radix and exponent in the “binary scientific” representation are actually in base 10.

What about a harder example, like 3? 3 is 1.5 times 2 (in decimal), so turning that into a binary fraction, we have 1.1. The exponent 2¹ is represented as 10000000000 accounting for bias.

What’s the largest (normal) number we can get? We should make the exponent 11111111110 (we can’t make it all ones, that’s reserved), which in decimal is 1023.

The Maximum Floating Point Value

We can compute this:

but we can also take advantage of the fact that Python has native arbitrary-precision integer arithmetic to gratuitously write out all 309 digits in base 10:

>>> 2 ** 1024 - 2 ** 971179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368

The smallest possible float is just the negative of this. But what is the smallest positive (normal) float? We already said the smallest positive exponent is –1022. Make the significand all 0s, and that means the smallest positive normal floating point number is:

Again, arbitrary precision integer arithmetic means we can exploit the middle fraction to easily get an exact decimal value in all its glory.

>>> numerator = 5 ** 1022
>>> print('0', str(numerator).rjust(1022, '0'), sep='.')

You know, just in case you were curious. By the way, you can check all of this on your python + hardware setup with:

>>> import sys
>>> sys.float_info
sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)

and essentially every other programming language has a similar feature.

Infinities and NaNs

Okay, here’s where things get weird. If all of the exponent bits are 1, then the number represented is either infinite or not a number (NaN):

  • If the fraction bits are all 0, the number is infinite. The sign bit controls whether it is –∞ or +∞.
  • If the fraction bits are not all 0, the “number” is not a number (NaN). Depending on the first bit it can be either a quiet NaN or a signaling NaN. a quiet NaN propagates (add another number to it and you just get NaN). A signaling NaN is supposed to “throw an error”, roughly speaking.² The remaining bits are typically not used.

The thing I initially found surprising about this is that this is a hardware implementation on commonly used chips. This means, for example, you can use it on a GPU. Why would you want to do that? Well, consider the convenient fact that e to the power of –∞ is 0.

>>> from math import exp
>>> minus_infinity = float('-inf')
>>> exp(minus_infinity)

In the paper that introduced the transformer architecture for NLP tasks (the one used by BERT, GPT-2, and their more recent cousins), the training was autoregressive which meant that in the attention module’s softmax layers, certain outputs were required to be 0. But if you look at the formula for the softmax and recall that your high school math teacher told you that “there is no number such that exponentiating to it is 0,” you will see it’s tricky to make a softmax return 0. Unless of course, you make (minus) infinity a number!

And, crucially, this is a hardware implementation. If it was a gimmicky Python (or PyTorch, or Numpy) workaround that represented numbers as an object which might sometimes contain a floating point number, this would substantially slow down numerical computations.

Also, the unending complexity of computer hardware is always impressive.


But wait, there’s more! We haven’t even described how to represent 0 yet. Using our exponents and our fraction bits, we were only able to make a very small positive number, not actually 0. The solution of course is that if the exponent bits are all 0 and so is the fraction, then the number is 0. In other words, if the exponent bits are 00000000000 and the fraction bits are also all zeros. Note this means that 0 is “signed” – there is both +0 and –0. In Python, they are stored differently, but they are equal to each other.

>>> zero = 0.0
>>> print(zero, -zero)
0.0 -0.0>>> zero == -zeroTrue

There are a few edge cases where things get weird though. When trying to compute an angle with atan2, you will see that they are in fact represented differently:

>>> from math import atan2
>>> zero = 0.0
>>> print(atan2(zero, zero),
>>> atan2(zero, -zero))
0.0 3.141592653589793

Subnormal Numbers

The final case to cover is when all the exponent bits are 0, but the fraction bits are not 0. If we have a representation that doesn’t use some possible bit sequences, we are wasting space. So why not use it to represent even smaller numbers? These numbers are called subnormal (or denormal) numbers.

Basically, the rule is that the exponent is still considered to have its minimal value (–1022) and instead of our “binary scientific” notation always starting with a 1 (as in 1.001), we assume instead that it starts with a 0. So we can have 0.001 times 2 to the power of –1022. This lets us represent numbers up with an exponent 52 less (as small as –1074). Thus:

>>> 2 ** -1074
>>> 2 ** -1075
>>> 2 ** -1075 == 0

The benefits of subnormal numbers are that, when you subtract two different normal floats, you are guaranteed to get a non-zero result. The cost is lost precision (there is no precision stored in the leading 0s – remember how sigfigs work?). This is called gradual underflow. As floats get smaller and smaller, they gradually lose precision.

Without subnormal numbers you would have to flush to zero, losing all your precision at once and significantly increasing the chance that you’ll accidentally end up dividing by 0. However, subnormal numbers significantly slow down calculations.


Okay, we spent all this time talking about floating point numbers. Besides some weird edge case about 0.1 * 3 that never really comes up, who cares?

Tensor Processing Units (TPUs)

Besides the 64-bit float we explored at length, there are also 32-bit floats (single precision) and 16-bit floats (half-precision) commonly available. PyTorch and other numerical computing libraries tend to stick to 32-bit floats by default. Half the size means the computations can be done faster (half as many bits to crunch).

But lower precision comes with a cost. With a standard half-precision float (5 exponent bits, 10 significand bits), the smallest number bigger than 1 is about 1.001. You can’t represent the integer 2049 (you have to pick either 2050 or 2048; and no decimals in between either). 65535 is the largest possible number (or close, depending on precise implementation details).

Google’s Tensor Processing Units instead use a modified 16-bit format for multiplication as part of their many optimizations for deep-learning tasks. The 8-bit exponent with 7-bit significand has just as many exponent bits as a 32-bit floating point number. And it turns out that in deep learning applications, this matters more than the significand bits. Also, when multiplying, the exponents can be added (easy) while the significand bits have to be multiplied (harder). Making the significand smaller makes the silicon that multiplies floats about 8 times smaller.

Plus, the TPU float format flushes to zero instead of using subnormal numbers to boost speed.

High Dynamic Range (HDR) Images

If you read the Google blog post about their custom 16-bit float format, you’ll see they talk about “dynamic range.” In fact, this something similar is going on with HDR images (like the ones you can capture on your phone).

A standard image uses an 8-bit RGB encoding. Those 8 bits represent an unsigned integer between 0 and 255. The problem with this is that the relative precision (% jump between consecutive values) is much worse when its darker. For example, between a (decimal) pixel value of 10 and 11, there is a 10% jump! But for bright values, the relative difference between 250 and 251 is just 0.4%.

Now the human eye is more sensitive to changes in brightness with dark tones than with bright ones. Meaning the fixed-precision representation is the opposite of what we’d want. Thus, a standard digital or phone camera shooting a JPEG or similar adjusts its sensitivity by recording relatively more precision in the darker tones using a gamma encoding.

The downside to this is that, even if you add bits (say with a 16-bit RBG image), you don’t necessarily gain as much precision in the parts of your image that are bright.

So, an HDR image uses floating point numbers to represent the pixels! This allows a high “dynamic” range (the exponent can be high or low) while still maintaining relative precision across all brightness scales. Perfect for keeping the data from scenes with high contrast. For example in the Radiance HDR format, the exponent is shared across the three colors (channels) in each pixel.


This might be more than you ever wanted to know about floating point numbers. With any luck, you won’t encounter too much numerical under-flow or over-flow that can’t be solved with a simple log-sum-exp or arbitrary-precision integers. But if you do, you’ll be well-prepared! Hopefully, you are also well-positioned to think about just how much precision you need in your machine-learning models as well.