Brain Numbers

David M. Berry


Due to limitations on space I have had to cut this from my forthcoming Artificial Intelligence and Critical Theory (MUP) book. But I didn't want to lose the information, and I think others will also find the overview useful, so I have pasted it into this blog post. 



In 2017, Google's Brain team designed a 16-bit floating-point format for training neural networks on their AI chips called Tensor Processing Units (TPUs). They called it bfloat16, brain float, and the name has stuck. BF16 is now the default numerical type for most large-scale AI training. Every large language model we interact with was almost certainly trained in brain numbers.

The standard 32-bit floating-point number, FP32, uses 8 bits for the exponent and 23 for the mantissa, giving roughly 7.2 significant decimal digits, which works out at 4.3 billion representable values. BF16 keeps the same 8 exponent bits but reduces the mantissa to 7, yielding about 2.4 decimal digits and 65,000 representable values.[1] The conversion is done by literally chopping off the lower 16 bits of the FP32 encoding. No complex algorithm or processing is used, just a simple cut.
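The cut can be made visible in a few lines of code. The sketch below, plain Python using only the standard library's struct module, reinterprets an FP32 value as its 32-bit pattern, keeps the top 16 bits, and reads the result back as a float. The function names are my own, not any library's.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Truncate an FP32 value to bfloat16 by dropping its lower 16 bits."""
    fp32_bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return fp32_bits >> 16  # sign, 8 exponent bits, top 7 mantissa bits survive

def bf16_bits_to_float(bits: int) -> float:
    """Read 16 bfloat16 bits back as a float (lower 16 bits zeroed)."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

# Pi survives the cut with only about two decimal digits intact:
print(bf16_bits_to_float(fp32_to_bf16_bits(3.14159265)))  # 3.140625
```

Production hardware typically rounds to nearest rather than truncating outright, but the destination grid of representable values is the same.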

What this new number format keeps is range; what it loses is precision. A BF16 number can represent values from roughly 10 to the minus 38 to 10 to the 38, the same span as FP32. But within that span, it cannot distinguish between values that differ by less than about 1%. If a weight is 1.0, the next representable value is 1.0078125. Between the two lies a discrete gap. The brain number does not know the difference, and the network does not seem to care.
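That gap can be checked directly. A minimal round-trip through the truncation (the helper below is my own sketch, not a library call) shows nearby values collapsing onto the same brain number:

```python
import struct

def bf16_round_trip(x: float) -> float:
    """Snap x onto the bfloat16 grid by zeroing the low 16 bits of its FP32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# 1.0 and 1.007 land on the same grid point; 1.0078125 is the next one up.
print(bf16_round_trip(1.007))      # 1.0
print(bf16_round_trip(1.0078125))  # 1.0078125
```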

Neural networks cope with imprecision in ways that scientific computing cannot. A climate simulation, a structural engineering model, a financial risk calculation: all of these require precision because the phenomena they model are sensitive to small numerical differences, the butterfly effect writ small. If the number is wrong, the bridge might fall down. Neural networks operate under a completely different logic.[2] The so-called "loss function", the signal that guides training, rewards prediction of the next token, not fidelity to any underlying numerical truth. The gradients that flow backwards through the network can be coarsened by a factor of 65,000 to one, relative to FP32, and the model still seems to converge to the same geometry.

But this is only the beginning of the precision cascade. FP32 tends to be reserved for research. BF16 is used for training. For deployment, the numbers get smaller still. FP8, standardised by NVIDIA, Arm, and Intel in 2022, halves BF16 again, giving 256 representable values per parameter. INT8 quantisation maps the learned weights to integers. INT4 compresses further, to 16 values per parameter. Dettmers and colleagues ran 35,000 experiments and identified the economically optimal precision: a 60-billion-parameter model quantised to 4 bits outperforms a 30-billion-parameter model at 8 bits, with the same memory use (Dettmers et al. 2023). Coarser and larger appears to beat finer and smaller. Below 4 bits, the quality falls off a cliff. Above it, you are paying for precision the manifold does not really use.
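The mechanics of that compression can be sketched in a few lines. The simplest scheme, symmetric absmax quantisation, rescales a group of weights so that the largest fits in a signed 4-bit integer and rounds the rest onto the resulting 16-level grid. This is a toy illustration in plain Python; deployed schemes, including those studied by Dettmers and colleagues, add per-block scales and outlier handling on top of this basic idea.

```python
def quantise_int4(weights):
    """Symmetric absmax quantisation: map floats onto 16 integer levels, -8..7."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate weights; every value now sits on a 16-point grid."""
    return [v * scale for v in q]

weights = [0.42, -0.13, 0.91, -0.77]
q, scale = quantise_int4(weights)
approx = dequantise(q, scale)  # coarse reconstructions of the originals
```

Each original weight comes back only to within one grid step, which is exactly the point: the error is bounded by the scale, and below 4 bits that bound becomes too loose for the model to tolerate.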

The geometry of meaning sits in a substrate that seems to have been deliberately impoverished. The model that writes emails, that summarises texts, that generates images, is operating in a numerical space where each parameter can take one of 16 values. The distance between adjacent representable numbers in that space is not small; it is large enough to erase entire distinctions that a more precise numerical format would preserve. Which distinctions survive this compression and which are silently lost is not well understood, because we cannot yet interpret the weights very well. The epistemological opacity of the manifold is compounded by the engineering opacity of quantisation. Two layers of illegibility are stacked on top of each other, and there is no method for determining what was lost in either.[3]

The manifold's geometry, the high-dimensional space in which all meaning is supposed to reside as position, never used the precision that the mathematical formalism implies. The cosine similarities, the attention scores, the embedding coordinates, these are not points in a continuous mathematical space. They are positions on a coarse grid, a grid that gets coarser at every stage of the pipeline from training to deployment. The vector space the mathematics presupposes is, we might say, a regulative ideal. This is why the concept of the manifold, as I develop it elsewhere, is important for understanding the materiality of artificial intelligence (Berry 2026). What the user encounters is this manifold, instantiated through brain numbers, then quantised again for the economics of inference. The geometry of meaning reaches you doubly impoverished, first by the economics of training, then by the economics of being cheap enough to offer as a product.

The brain is not a floating-point processor, but I think the naming reveals what the designers think they are building, and what they think thinking is. They have built a system that works with roughly two significant decimal digits and which does not need precision because precision is, presumably, not what intelligence requires. Perhaps that tells us less about the precision of the system than about the coarseness of the ideology.



Notes

[1] In reality it is 7 mantissa bits plus a hidden leading bit, giving 8 bits of precision. The hidden bit is a convention of IEEE 754 floating-point representation, not an additional bit stored in memory. The truncation remains what it is, the lower 16 bits of the FP32 encoding.

[2] In a climate simulation, the gap between 1.0 and 1.0078125 can be the difference between a storm and a clear sky. In a large language model, the same gap is the difference between two slightly different shades of helpfulness. Scientific computing models the world, and the world tends to show the mistakes made by imprecision. The neural network models the plausibility of the world, and we might say that plausibility is cheap to approximate.

[3] If the manifold already forgets the temporal depth of its training data, dissolving the sedimented history of cultural production into geometric positions from which no original can be recovered, then quantisation is a second forgetting, one that operates on the geometry itself. The first forgetting is semantic, and what was meaning becomes a coordinate. The second is numerical, and what was a coordinate becomes an even coarser coordinate.



Bibliography

Berry, D.M. (2026) ‘Vector Theory’, Stunlaw. Available at: https://stunlaw.blogspot.com/2026/02/vector-theory.html.

Dettmers, T., Pagnoni, A., Holtzman, A. and Zettlemoyer, L. (2023) ‘The case for 4-bit precision: k-bit inference scaling laws’, Proceedings of the 40th International Conference on Machine Learning (ICML). Available at: https://arxiv.org/abs/2212.09720.

 
