Training and inference of large models frequently involve the concept of numerical precision. There are many precision types, and even at the same bit width there are different formats. In practice there are also the related concepts of multi-precision and mixed precision.
Common precision formats
Floating-point formats: double precision (FP64), single precision (FP32, TF32), half precision (FP16, BF16), 8-bit (FP8), and 4-bit (FP4, NF4). Quantized integer formats: INT8 and INT4 (also INT3/INT5/INT6).
A floating-point number consists of three parts: the sign bit, the exponent bits, and the mantissa bits. The more exponent bits, the larger the range of representable numbers; the more mantissa bits, the higher the precision.
The table below summarizes the formats:

| Format | Sign bits | Exponent bits | Mantissa bits | Total bits |
|---|---|---|---|---|
| FP64 | 1 | 11 | 52 | 64 |
| FP32 | 1 | 8 | 23 | 32 |
| TF32 | 1 | 8 | 10 | 19 |
| BF16 | 1 | 8 | 7 | 16 |
| FP16 | 1 | 5 | 10 | 16 |
| FP8 E4M3 | 1 | 4 | 3 | 8 |
| FP8 E5M2 | 1 | 5 | 2 | 8 |
| FP4 | 1 | 2 | 1 | 4 |
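The sign/exponent/mantissa split above can be inspected directly. Below is a minimal sketch, using only Python's standard `struct` module, that pulls the three bit fields out of an FP32 value; the function name `fp32_fields` is just an illustrative choice, not a standard API.

```python
import struct

def fp32_fields(x: float):
    """Split an FP32 value into its sign, exponent, and mantissa bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # raw 32-bit pattern
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits (biased by 127)
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits (implicit leading 1)
    return sign, exponent, mantissa

# -1.5 = (-1)^1 * 1.5 * 2^0: biased exponent 127, mantissa 0b100...0
print(fp32_fields(-1.5))  # (1, 127, 4194304)
```

The same decoding works for any of the formats in the table once you adjust the field widths and the exponent bias.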
- FP32: 32-bit float, 4 bytes per value
- TF32: 19-bit format used inside Tensor Cores; values are stored as FP32 (4 bytes) in memory
- FP16: 16-bit float, 2 bytes per value
- BF16: 16-bit float, 2 bytes per value
- INT8: 8-bit integer, 1 byte per value
- INT4: 4-bit integer, 0.5 bytes per value
Why so many precisions?
Because of the trade-off between cost and accuracy. Higher precision is more accurate but brings higher compute and storage costs; lower precision sacrifices some accuracy but improves computational efficiency and throughput. Having a variety of precisions lets you choose the most suitable one for each situation. Double precision, for example, is more accurate than single precision but takes twice the storage and more time to compute.
Why do large models need to be quantized?
1. Reduce memory usage Large models usually use 32-bit floating-point numbers (FP32) or 16-bit floating-point numbers (FP16) to represent weights and activation values. By quantizing, these high-precision values can be converted into lower-precision representations (e.g., 8-bit integers, INT8), significantly reducing the storage space of the model. This is important for deployment on resource-limited devices such as mobile devices, embedded systems, etc.
2. Accelerate inference Quantized models run more efficiently on hardware. Many modern accelerators (GPUs, TPUs, NPUs, etc.) have specialized support for low-precision computing and can execute quantized operations faster. In addition, low-precision calculations involve fewer bit operations, reducing computational complexity and further speeding up inference.
3. Reduce power consumption The quantized model not only reduces the need for computing resources but also reduces power consumption. This is especially important for battery-powered devices such as smartphones, IoT devices, etc., where low power consumption means longer battery life.
4. Easy to deploy edge devices Many large models were initially trained and deployed in the cloud, but with the development of edge computing, more and more application scenarios require models to be deployed on edge devices. With limited computing power and storage resources on edge devices, quantization can help these models run more efficiently on edge devices.
5. Reduce bandwidth requirements In the process of distributed inference or model updates, quantization can reduce the bandwidth required for model transfer. This is useful for environments with limited network bandwidth, such as IoT devices in remote areas.
6. Maintain model performance Although quantization introduces some loss of precision, the original performance of the model can be largely preserved with appropriate methods (such as mixed-precision quantization, post-training quantization, or quantization-aware training). In practice, quantization therefore strikes a good balance between performance and efficiency.
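To make the FP32 → INT8 conversion above concrete, here is a minimal sketch of symmetric post-training quantization in plain Python. The function names and the toy weight values are illustrative only; real toolchains (e.g., per-channel scales, zero points) are more elaborate.

```python
def quantize_int8(values):
    """Symmetric INT8 quantization: map [-max|v|, +max|v|] onto [-127, 127]."""
    scale = max(abs(v) for v in values) / 127  # one scale for the whole tensor
    q = [round(v / scale) for v in values]     # stored as 8-bit integers
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point values from the INT8 codes."""
    return [qi * scale for qi in q]

weights = [0.42, -1.37, 0.05, 0.99]            # toy FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value is within half a quantization step of the original.
print(q)
print(restored)
```

Storage drops from 4 bytes to 1 byte per weight, at the cost of an error bounded by half the quantization step (`scale / 2`).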
Memory reference
| Type | Memory per billion parameters |
|---|---|
| float32 | 4 GB |
| fp16/bf16 | 2 GB |
| int8 | 1 GB |
| int4 | 0.5 GB |
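The table is just bytes-per-parameter arithmetic, which can be sketched in a few lines (the function name `model_memory_gb` is illustrative; 1 GB is taken as 1e9 bytes here):

```python
bytes_per_param = {"float32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

def model_memory_gb(n_params: float, dtype: str) -> float:
    """Rough weight-memory estimate: parameter count times bytes per value."""
    return n_params * bytes_per_param[dtype] / 1e9

# A 7B-parameter model under each format:
for dtype in bytes_per_param:
    print(f"{dtype:>9}: {model_memory_gb(7e9, dtype):.1f} GB")  # 28 / 14 / 7 / 3.5
```

Note this covers weights only; activations, KV cache, and (in training) optimizer states add substantially on top.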
FP64 (Double Precision)
64-bit floating-point, typically a double-precision binary floating-point format defined by IEEE 754, has:
- 1 sign bit
- 11 exponent bits
- 52 mantissa bits
Range: ~2.23e-308 ... ~1.80e308 with full 15-17 decimal precision.
Usage:
This format is used for scientific calculations that require high precision; it is not typically used for deep learning.
Software support: the double type on most C/C++ systems. Supported in TensorFlow (tf.float64) and PyTorch (torch.float64 or torch.double).
Hardware support: typically supported on x86 CPUs. Most GPUs, especially gaming GPUs including the RTX series, are severely limited in FP64 performance (often 1/32 of FP32 throughput rather than 1/2). GPUs with unrestricted FP64 include the GP100 (Tesla P100, Quadro GP100), the GV100 (Tesla V100, Quadro GV100, Titan V), and the GA100 in the A100. Interestingly, the Ampere architecture's third-generation Tensor Cores add IEEE-compliant FP64 processing, delivering 2.5x the FP64 performance of the V100.
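Python's built-in `float` is exactly this format (the C `double`), so the FP64 limits quoted above can be checked from the standard library:

```python
import sys

# Python's float is IEEE 754 double precision (FP64).
print(sys.float_info.max)       # ~1.80e308, the largest finite value
print(sys.float_info.min)       # ~2.23e-308, the smallest normal value
print(sys.float_info.dig)       # 15 guaranteed significant decimal digits
print(sys.float_info.mant_dig)  # 53 mantissa bits (52 stored + 1 implicit)
```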
FP32 (Single Precision)
This format has long been a workhorse for deep learning. Another IEEE 754 format, single-precision floating-point has:
- 1 sign bit
- 8 exponent bits
- 23 mantissa bits

Ideally, both training and inference would be done in FP32, but FP32 is roughly twice as slow as FP16/BF16, so mixed-precision methods are often used in practice: FP32 weights serve as the exact "master weights", FP16/BF16 copies are used for the forward and backward passes to improve training speed, and the FP32 master weights are then updated with the FP16/BF16 gradients.
During training, the master weights are always FP32. At inference time, half-precision weights often give accuracy similar to FP32, because the exact FP32 weights are only needed when gradients update the model. This means we can run inference with half-precision weights and get essentially the same result with only half the GPU memory.
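Why master weights must be FP32 can be shown without any framework. The sketch below (assumptions: toy scalar "model", illustrative helper `to_fp16` built on `struct`'s IEEE half-precision `'e'` format) applies a small update 100 times: in FP16 the update is smaller than the representable spacing near 1.0 and vanishes, while the FP32 master weight accumulates it correctly.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a value to the nearest FP16 and back ('e' is IEEE half precision)."""
    return struct.unpack("e", struct.pack("e", x))[0]

lr, grad = 1e-2, 1e-2      # per-step update: lr * grad = 1e-4
w_fp16 = 1.0               # weight kept only in FP16
w_master = 1.0             # FP32 master weight
for _ in range(100):
    # FP16 spacing near 1.0 is ~9.8e-4, so a 1e-4 update rounds away:
    w_fp16 = to_fp16(w_fp16 - lr * grad)
    w_master = w_master - lr * grad   # full-precision update accumulates

print(w_fp16)    # still 1.0 — every update was lost
print(w_master)  # ~0.99 — 100 updates of 1e-4 applied
```

This is exactly the failure that keeping FP32 master weights (and updating them with the half-precision gradients) avoids.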
Range: ~1.18e-38 ... ~3.40e38 with an accuracy of 6-9 significant decimals.
Usage:
The standard type for neural network computing for a long time: weights, activations, and other values defaulted to FP32. For many scientific calculations, especially iterative ones, the precision is insufficient and errors accumulate.
Software support: the float type on most C/C++ systems. Supported in TensorFlow (tf.float32) and PyTorch (torch.float32 or torch.float).
Hardware support: typically supported on x86 CPUs and on NVIDIA/AMD GPUs.
FP16 (Half Precision)
Similarly, the IEEE 754 standard format, the half-precision floating-point format has:
- 1 sign bit
- 5 exponent bits
- 10 mantissa bits

FP16 has a much smaller numerical range than FP32, so it is at risk of overflow (when representing very large numbers) and underflow (very small ones). For example, 10,000 * 10,000 should be 100,000,000, which FP16 cannot represent: the largest value FP16 can hold is 65,504. The result becomes inf or NaN, and because neural network calculations proceed layer by layer and batch by batch, a single NaN poisons everything computed downstream. In general this can be mitigated by loss scaling, but that approach doesn't always work.
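The 10k * 10k overflow can be reproduced with the standard library alone: `struct`'s `'e'` format is IEEE half precision and refuses values beyond the FP16 range (the helper name `to_fp16` is illustrative).

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]

print(to_fp16(65504.0))   # 65504.0 — the largest finite FP16 value, stored exactly

try:
    to_fp16(10_000.0 * 10_000.0)   # 1e8, far beyond the FP16 range
except OverflowError as err:
    print("overflow:", err)        # struct refuses to pack it as a half
```

On real accelerators the out-of-range result typically becomes inf rather than raising, and inf arithmetic then produces the NaNs described above.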
Range: ~5.96e-8 (subnormal) / ~6.10e-5 (smallest normal) ... 65504, with ~4 significant decimal digits.
Usage:
Deep learning tends to use FP16 instead of FP32, since the lower precision doesn't seem to matter much for neural networks: the extra precision buys nothing, while being slower, using more memory, and slowing down communication. FP16 can be used for training, usually via mixed-precision training (TensorFlow/PyTorch), and for post-training quantization to speed up inference (TensorFlow Lite). Other formats used for post-training quantization include the integers INT8 (8-bit), INT4 (4-bit), and even INT1 (binary values).
Software support: not currently in the C/C++ standard (though there is a short float proposal); some C/C++ systems support an __fp16 type, and otherwise it can be used via special libraries. Supported in TensorFlow (tf.float16) and PyTorch (torch.float16 or torch.half).
Hardware support: not supported as a distinct type on x86 CPUs. Support on older gaming GPUs is poor (FP16 ran at a small fraction of FP32 throughput; see the post on GPUs for details), but it is well supported on modern GPUs such as the NVIDIA RTX series.
BFLOAT16 (Half-Precision)
Another 16-bit format originally developed by Google is called "Brain Floating Point Format", or "bfloat16" for short. The name comes from Google Brain.
The original IEEE FP16 was designed without deep learning applications in mind, and its dynamic range was too narrow. BFLOAT16 solves this problem, providing the same dynamic range as the FP32.
Therefore, BFLOAT16 has:

- 1 sign bit
- 8 exponent bits
- 7 mantissa bits
The bfloat16 format is a truncated version of IEEE 754 FP32, which allows fast conversion to and from FP32. When converting to bfloat16, the exponent bits are preserved while the mantissa field is reduced by truncation.
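That truncation is literally "keep the top 16 bits of the FP32 pattern", which makes it easy to emulate with the standard `struct` module (the helper name `to_bf16` is illustrative):

```python
import struct

def to_bf16(x: float) -> float:
    """Convert FP32 -> bfloat16 by truncation: drop the low 16 mantissa bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bf16(0.3952))   # close to 0.3952, but only ~2-3 decimal digits survive
print(to_bf16(3.0e38))   # still finite: the full FP32 exponent range is preserved
```

Compare with FP16, where 3.0e38 is unrepresentable: BF16 trades mantissa precision for FP32's dynamic range.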
Range: ~1.18e-38 ... ~3.40e38 with ~3 significant decimal digits.

Usage:
It now seems to be replacing FP16. Unlike FP16, which often needs special handling such as loss scaling, BF16 is almost a drop-in replacement for FP32 when training and running deep neural networks.
Software support: not in the C/C++ standard; can be used via special libraries. Supported in TensorFlow (tf.bfloat16) and PyTorch (torch.bfloat16).
TF32
TensorFloat-32 or TF32 is the new math mode in NVIDIA A100 GPUs.
TF32 uses the same 10-bit mantissa as half precision (FP16), which has proven to give enough margin for the precision requirements of AI workloads, and the same 8-bit exponent as FP32, so it supports the same numerical range.
Technically, it is a 19-bit format. Think of it as an extended-precision BFLOAT16 (a "BFLOAT19" of sorts) or a reduced-precision FP32.
So, TF32 has:
- 1 sign bit
- 8 exponent bits
- 10 mantissa bits

The advantage of TF32 is that it keeps the range and interface of FP32. When computing an inner product in TF32, the mantissas of the input operands are rounded from 23 to 10 bits; the rounded operands are then multiplied exactly and accumulated in normal FP32.
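The numeric effect of that mantissa reduction can be simulated in plain Python. The sketch below (illustrative helper `round_to_tf32`; it truncates, whereas real hardware rounds) masks off the low 13 mantissa bits of each operand and then multiplies in full precision, mimicking the round-inputs-then-accumulate-in-FP32 scheme:

```python
import struct

def round_to_tf32(x: float) -> float:
    """Simulate TF32 input reduction: keep FP32's exponent, truncate the
    mantissa from 23 to 10 bits (hardware rounds; truncation is close enough
    for illustration)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return struct.unpack(">f", struct.pack(">I", bits & ~((1 << 13) - 1)))[0]

a, b = 1.2345678, 8.7654321
# Only the multiplication inputs lose precision; the product and the
# accumulation stay in full precision, as in the TF32 Tensor Core pipeline.
product = round_to_tf32(a) * round_to_tf32(b)
print(product)   # close to a * b, limited by ~10 bits of input precision
```

In PyTorch this mode is toggled globally, e.g. via `torch.backends.cuda.matmul.allow_tf32`, rather than by touching individual tensors.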
The TF32 Tensor Core runs on FP32 inputs and generates results in FP32 without code changes. Non-matrix operations continue to use FP32. This provides an easy way to accelerate FP32 input/output data in deep learning frameworks and HPC.
Range: ~1.18e-38 ... ~3.40e38 with ~4 significant decimal digits.

Usage:
One of the great things about TF32 is that it only needs compiler support at the deepest level, i.e., inside the CUDA compiler; the rest of the code just sees FP32 with less precision but the same dynamic range. Enabling TF32 is mostly a matter of library calls or flags indicating whether it may be used, which allows quick plug-in speedups from the Tensor Cores without much work. Formats like FP16 and BFLOAT16 require more tweaking because they involve different bit layouts, but using them also reduces memory bandwidth, allowing even faster execution. For comparison, the A100's peak performance is:
- FP32 without Tensor Cores: 19.5 TFLOPS
- TF32 Tensor Cores: 156 TFLOPS (so switching from FP32 to TF32 is an easy speedup)
- FP16/BF16 Tensor Cores: 312 TFLOPS (a deliberate switch to FP16/BF16 brings further gains, at the cost of more code changes)

Software support: not in the C/C++ standard; supported since CUDA 11.
Hardware support: the NVIDIA A100 was the first GPU to support it.
FP8
Introduced with the H100 GPU, FP8 enables faster matrix multiplications and convolutions, but at lower precision.
The FP8 data types supported by the H100 are actually two different formats, useful for different parts of neural network training:
- E4M3: 1 sign bit, 4 exponent bits, and 3 mantissa bits. It can store values up to +/-448, plus NaN.
- E5M2: 1 sign bit, 5 exponent bits, and 2 mantissa bits. It can store values up to +/-57344, plus +/-inf and NaN.

The trade-off for the larger dynamic range is that the stored values are less precise.
(Figure: structure of the floating-point data types. All values shown, in FP16, BF16, FP8 E4M3, and FP8 E5M2, are the closest representation of 0.3952.)
Both types can be used during neural network training. In general, forward activations and weights require higher precision, so the E4M3 format is best during the forward pass. In backpropagation, however, the gradients flowing through the network are generally less sensitive to loss of precision but need a larger dynamic range, so E5M2 is the better format for storing them. The H100 Tensor Cores accept any combination of these types as input, allowing each tensor to be stored at its preferred precision.
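The +/-448 and +/-57344 limits quoted above follow directly from the bit layouts. A small sketch, assuming the H100 convention that E4M3 reserves only the all-ones exponent-plus-mantissa pattern for NaN (keeping its top exponent usable) while E5M2 follows the usual IEEE rule of reserving the whole top exponent for inf/NaN (the function name `max_normal` is illustrative):

```python
def max_normal(exp_bits: int, man_bits: int, e4m3_style: bool) -> float:
    """Largest finite value of a small floating-point layout."""
    bias = 2 ** (exp_bits - 1) - 1
    if e4m3_style:
        # Top exponent is still usable; only the all-ones mantissa there is NaN,
        # so the largest significand is one step below all-ones.
        max_exp = (2**exp_bits - 1) - bias
        significand = 1 + (2**man_bits - 2) / 2**man_bits
    else:
        # IEEE-like: the whole top exponent is reserved for inf/NaN.
        max_exp = (2**exp_bits - 2) - bias
        significand = 2 - 2**-man_bits
    return significand * 2.0**max_exp

print(max_normal(4, 3, True))    # 448.0   (FP8 E4M3)
print(max_normal(5, 2, False))   # 57344.0 (FP8 E5M2)
print(max_normal(5, 10, False))  # 65504.0 (FP16, as a sanity check)
```

The same formula reproduces the FP16 maximum of 65,504 quoted earlier, which is a useful cross-check on the layout table at the top of this article.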