Mixed-precision training for deep neural networks


We have considered the IEEE 754 half-, single-, and double-precision formats above. Any finite number can be represented by four components:
1. Sign (S)
2. Base (b)
3. Significand (m)
4. Exponent (e)
We take the base as 2 (binary). With these components, we evaluate any numerical value as

value = (−1)^S × m × 2^e
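As a quick illustration, the formula above can be sketched in Python (the function name `decode` is ours, not part of any library):

```python
# Evaluate a value from its sign bit, significand, and exponent (base 2),
# mirroring the formula above: value = (-1)^S * m * 2^e.
def decode(sign_bit: int, significand: float, exponent: int) -> float:
    return (-1) ** sign_bit * significand * 2 ** exponent

print(decode(1, 1.5, 2))  # (-1)^1 * 1.5 * 2^2 = -6.0
```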

In neural networks, irrespective of the framework we use, all the parameters are stored by default in the single-precision (binary32/FP32) format. Let us look at the single-precision format in more detail, with an example.

Sign-bit: 0 for positive values and 1 for negative values.

Exponent bits: With 8 bits, the stored field can hold values from 0 to 255. Actual exponents can be negative as well, to represent very small magnitudes, so the field must cover a signed range. To achieve this, we add 127 to the actual exponent before storing it.

For example, if the actual exponent value is -12, the stored value will be -12 + 127 = 115. Similarly, for 0 the stored value will be 127, and for -127 it will be 0. The value 127 is the bias that we add. For normal numbers, the range of actual exponents is -126 to 127.
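The biasing rule above is simple enough to sketch directly (the helper name and `BIAS` constant are ours):

```python
# Biased-exponent encoding for FP32: stored = actual + 127.
BIAS = 127

def encode_exponent(actual: int) -> int:
    return actual + BIAS

print(encode_exponent(-12))   # 115
print(encode_exponent(0))     # 127
print(encode_exponent(-127))  # 0
```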
Thus, the range of (normal) numerical values that we can represent in single-precision is approximately ±1.18 × 10⁻³⁸ to ±3.40 × 10³⁸.

This is important to remember, and we’ll refer to this range of values later.
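As a sanity check, these bounds follow directly from the field widths described above; a small Python sketch:

```python
# The single-precision range follows from the exponent/mantissa layout:
# smallest normal value = 2^-126, largest = (2 - 2^-23) * 2^127.
smallest_normal = 2.0 ** -126
largest = (2.0 - 2.0 ** -23) * 2.0 ** 127

print(smallest_normal)  # ~1.18e-38
print(largest)          # ~3.40e+38
```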

Did you know?
In single-precision format, an exponent field of all 0s (actual exponent -127) is reserved for zero and subnormal numbers, while an exponent field of all 1s (stored value 255) is reserved for inf (mantissa all 0s) and NaN (mantissa non-zero).
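A quick way to see these special encodings is to build the bit patterns directly and reinterpret them as floats; a sketch using Python's `struct` module (the helper name `bits_to_float` is ours):

```python
import struct

# Reinterpret a 32-bit integer pattern as an FP32 value. An all-1s exponent
# field encodes inf (mantissa = 0) or NaN (mantissa != 0); an all-0s exponent
# field encodes zero and subnormal numbers.
def bits_to_float(bits: int) -> float:
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]

print(bits_to_float(0b0_11111111_00000000000000000000000))  # inf
print(bits_to_float(0b0_11111111_00000000000000000000001))  # nan
print(bits_to_float(0b0_00000000_00000000000000000000001))  # smallest subnormal (~1.4e-45)
```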

Mantissa bits: The 23 bits that store the fractional part after the radix point of the normalised value. We’ll see this with an example.

Take π as an example. In binary, π = 11.00100100001111110110101…, which normalises to 1.100100100001111110110101… × 2¹. From this normalised value, we can see that,
Sign = 0 (positive)
m = 1.100100100001111110110101… (in binary), where 1 ≤ m < 2
Actual exponent = 1
Represented exponent = 1 + bias = 1 + 127 = 128 (decimal) = 10000000 (binary)
Mantissa bits = 10010010000111111011011 (the 23 bits after the radix point, with the last bit rounded to nearest)
Therefore, in single-precision, the value of π is represented as,

0 | 10000000 | 10010010000111111011011
(sign | exponent | mantissa)

I hope this example has made the representation of values in single-precision (FP32) very clear.
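We can verify this bit layout directly by reinterpreting the bytes of a float (a small sketch using Python's standard `struct` module):

```python
import math
import struct

# Pack pi as an FP32 value, then split the 32-bit pattern into its fields.
bits = int.from_bytes(struct.pack(">f", math.pi), "big")
sign = bits >> 31
exponent = (bits >> 23) & 0xFF
mantissa = bits & 0x7FFFFF

print(sign)                # 0
print(exponent)            # 128  (actual exponent 1 + bias 127)
print(f"{mantissa:023b}")  # the 23 mantissa bits
```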
Using the single-precision format gives high precision, but it also increases the computation time and the amount of memory required to store the parameters.