And they use full size floats for training.
[0] "full precision" in ML usually means 16 bit floats like bfloat16
In fact, it can be slower, because current hardware is probably not optimized for the 1-bit case. That also means there may be a lot of low-hanging fruit for hardware designers, and we may see improvements in the next generation of hardware.
The CPU can handle up to word-size bits at once. I believe they mean that a lot of assembly was written for integer math rather than bit math, and word sizes are 4+ bytes on modern CPUs. However, it is unlikely we'll see improvements in this area because, by definition, using 64-bit floats already saturates the word size. So that's the max throughput. Sending 1 bit at a time instead of 64 bits at a time would be considerably slower, so this entire approach is funny.
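To make the "bit math" concrete: with binary weights and activations you can pack 64 values into a single machine word and compute a dot product with an XNOR plus a popcount, which is the kind of instruction-level support being discussed. A rough sketch in Python/NumPy (the packing scheme and function names are purely illustrative):

```python
import numpy as np

def pack_signs(signs):
    """Pack a +/-1 vector into 64-bit words (bit 1 = +1, bit 0 = -1)."""
    bits = (np.asarray(signs) > 0).astype(np.uint8)
    padded = np.pad(bits, (0, (-len(bits)) % 64))   # pad to a multiple of 64
    return np.packbits(padded).view(np.uint64)

def popcount(words):
    """Count set bits across an array of uint64 words."""
    return int(np.unpackbits(words.view(np.uint8)).sum())

def binary_dot(a_packed, b_packed, n):
    """Dot product of two length-n +/-1 vectors from their packed forms.

    Positions where the signs agree contribute +1, disagreements -1,
    so dot = agreements - disagreements = 2*agreements - n.
    Padding bits (zero in both vectors) always agree, so subtract them.
    """
    agreements = popcount(~(a_packed ^ b_packed))
    pad = a_packed.size * 64 - n
    return 2 * (agreements - pad) - n

# Quick check against the ordinary dot product.
rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=1000)
b = rng.choice([-1, 1], size=1000)
assert binary_dot(pack_signs(a), pack_signs(b), 1000) == int(a @ b)
```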
Today, with models that are actually useful, training costs matter much less than inference costs. A 10x increase in training costs is not necessarily prohibitive if you get a 10x decrease in inference costs.
So, I strongly disagree, unless we're talking about the five or six companies that already spend tens of millions on training and keep repeating that spend. Outside of them, medium to large models are trained infrequently, or as one-offs, by a small number of other companies. The rest of us are stuck with their pretraining efforts because we can't afford it ourselves.
On my end, I'd rather see a model that drops pretraining costs to almost nothing but costs 10-32x more to run at inference. My uses would produce mere MBs of output versus the hundreds of GBs to TBs that pretraining requires. A competitive use case at 32x current prices would probably still be profitable for me, and optimizations, which are plentiful for inference, might bring that down further.
"A 10x increase in training costs is not necessarily prohibitive if you get a 10x decrease in inference costs."
Given training runs costing millions and up, I'd like that side to be 10x cheaper even if inference were 10x more expensive. Then it could do research or coding for me at $15/hr instead of $1.50/hr, and I'd just use it carefully, with batching.
The forward and backward passes cost roughly the same, with the backward pass being maybe 50% more expensive. So let's say a full training step is about three times the cost of a forward pass.
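Spelling out the arithmetic behind that estimate, using the comment's own rough 1.5x figure for the backward pass:

```python
forward = 1.0                      # cost of one forward (inference) pass
backward = 1.5 * forward           # "maybe 50% more expensive"
training_step = forward + backward
print(training_step)               # 2.5 -- round up and call it ~3x a forward pass
```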
You can't make training faster by making inference slower.
That leaves computation and memory use of two passes plus interlayer communication.
I don't think backpropagation occurs in the brain, since it appears to use local learning, though global optimization probably happens during sleep/dreaming. I have a lot of papers on removing backpropagation, Hebbian learning, and local learning rules.
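For anyone unfamiliar with the term, a Hebbian/local rule updates each weight using only the activity of the two units it connects, with no error signal propagated backwards. A minimal sketch in Python (purely illustrative; the tanh activation and the normalization step are arbitrary choices):

```python
import numpy as np

def hebbian_step(W, x, eta=0.01):
    """One Hebbian update: strengthen weights between co-active units.

    W : (n_out, n_in) weight matrix
    x : (n_in,) input activity
    Each weight update uses only pre- and post-synaptic activity
    (a purely local rule), unlike a backpropagated gradient.
    """
    y = np.tanh(W @ x)                              # post-synaptic activity
    W += eta * np.outer(y, x)                       # delta_w_ij = eta * y_i * x_j
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # keep weights bounded
    return W, y

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16)) * 0.1
for _ in range(100):
    W, _ = hebbian_step(W, rng.normal(size=16))
```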
Beyond that, many groups are publishing ways to do training at 8-bit and below. A recent one mixed low-bit training with sub-1-bit storage for the weights. The NoLayer architecture might address the interlayer communication better.
People keep trying to build analog accelerators, but there are mismatches between the features of typical neural networks and what the analog hardware can do. Recent work has come up with analog NN's that map well onto analog hardware.
A combination of those would likely get cost down dramatically on both inference and training. Also, energy use would be lower.
There's nothing to backpropagate with a step function. The derivative is zero everywhere.
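The usual workaround in binarized networks is the straight-through estimator: apply the sign/step function in the forward pass, but substitute a surrogate (roughly the identity, clipped) for its derivative in the backward pass so a nonzero gradient gets through. A minimal PyTorch-style sketch of the idea:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign function with a straight-through estimator for its gradient."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)        # step-like: derivative is 0 almost everywhere

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the gradient straight through, clipped to |x| <= 1
        # (the usual "hard tanh" surrogate), instead of multiplying by 0.
        return grad_output * (x.abs() <= 1).float()

x = torch.randn(8, requires_grad=True)
y = BinarizeSTE.apply(x).sum()
y.backward()
print(x.grad)   # nonzero where |x| <= 1, despite the step in the forward pass
```

In practice (BinaryConnect/BinaryNet-style training), higher-precision "latent" weights are kept for the optimizer to update, and the binarization only happens on the forward pass.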
Analog NN's, including spiking ones, share some of those properties. Several chips, like TrueNorth, are designed to take advantage of that on the biologically-inspired side. Others, like Mythic AI's, are accelerating more conventional types of ML models.
TLDR abstract of a draft paper I wrote years ago, for those interested in the real limits of quantized networks:
We investigate the storage capacity of single‐layer threshold neurons under three synaptic precision regimes—binary (1‐bit), ternary (≈1.585‐bit), and quaternary (2‐bit)—from both information‐theoretic and algorithmic standpoints. While the Gardner bound stipulates maximal loads of α=0.83, 1.5 and 2.0 patterns per weight for the three regimes, practical algorithms only reach α_alg≈0.72, 1.0 and 2.0, respectively. By converting these densities into storage‐efficiency metrics—bits of synaptic memory per stored pattern—we demonstrate that only quaternary weights achieve the theoretical optimum in realistic settings, requiring exactly 1 bit of memory per pattern. Binary and ternary schemes incur 39 % and 58 % overheads, respectively.
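To make the overhead numbers concrete, here is the arithmetic behind them, using the algorithmic capacities quoted above (bits of synaptic memory per weight divided by patterns stored per weight):

```python
import math

# (bits per weight, achievable patterns per weight alpha_alg) from the abstract
regimes = {
    "binary":     (1.0,          0.72),
    "ternary":    (math.log2(3), 1.0),    # ~1.585 bits
    "quaternary": (2.0,          2.0),
}

for name, (bits_per_weight, alpha) in regimes.items():
    bits_per_pattern = bits_per_weight / alpha
    overhead = bits_per_pattern - 1.0      # relative to the 1 bit/pattern optimum
    print(f"{name:10s}: {bits_per_pattern:.2f} bits/pattern "
          f"({overhead:.0%} overhead)")

# binary    : 1.39 bits/pattern (39% overhead)
# ternary   : 1.58 bits/pattern (58% overhead)
# quaternary: 1.00 bits/pattern (0% overhead)
```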
So this is only for inference. Also activations aren't quantized, I think?
The very last conclusion: "Future work will focus on the implementation of binary normalization layers using single-bit arrays operations, as well as on quantizing layer activations to 8 or 16-bit precision. These improvements are expected to further enhance the efficiency and performance of the binary neural network models."
If you use quantized differentiation you can get away with using integers for gradient updates. Explaining how takes a whole paper, and in the end it doesn't even work very well.
At university, way back at the end of the last AI winter, I ended up using genetic algorithms to train the models. It was very interesting because the weights were trained along with the hyperparameters. It was nowhere near practical, though, because gradient descent is so much better at getting real-world results in reasonable time frames, and, surprisingly, it's also more memory efficient.
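For anyone curious what that looks like, here is a minimal sketch of the idea: evolve flattened weight vectors, plus a hyperparameter (here the mutation scale), by mutation and selection against a fitness function. The toy task and all names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(weights, X, y):
    """Negative mean-squared error of a tiny linear model (toy stand-in)."""
    return -np.mean((X @ weights - y) ** 2)

def evolve(X, y, pop_size=50, generations=200):
    n = X.shape[1]
    # Each individual carries its weights plus its own mutation scale,
    # so a hyperparameter is evolved alongside the weights.
    pop = [(rng.normal(size=n), 0.1) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda ind: fitness(ind[0], X, y), reverse=True)
        parents = scored[: pop_size // 5]            # keep the top 20%
        pop = []
        for w, sigma in parents:
            pop.append((w, sigma))                   # elitism
            for _ in range(4):                       # 4 mutated children each
                new_sigma = sigma * np.exp(0.1 * rng.normal())
                pop.append((w + new_sigma * rng.normal(size=n), new_sigma))
    return max(pop, key=lambda ind: fitness(ind[0], X, y))

# Toy regression problem: recover a known weight vector.
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(100, 3))
y = X @ true_w
best_w, best_sigma = evolve(X, y)
print(best_w, best_sigma)
```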