Quick Link: "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits"

2024-03-15

The link: The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits.

The key part of the abstract: In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption.
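For context on the name: with three possible values per weight, each parameter carries at most log2(3) ≈ 1.58 bits of information, which is where the "1.58" comes from. The paper quantizes weights with an absmean-style scheme (scale by the mean absolute weight, then round and clip to {-1, 0, 1}); here's a minimal illustrative sketch in plain NumPy, with function names and details of my own, not the authors' code:

```python
import numpy as np

def absmean_ternary_quantize(W, eps=1e-5):
    """Quantize a full-precision weight matrix to {-1, 0, +1}.

    Rough version of the absmean scheme described for BitNet b1.58:
    scale by the mean absolute weight, then round and clip to [-1, 1].
    """
    gamma = np.abs(W).mean()                      # per-tensor scale
    W_ternary = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return W_ternary.astype(np.int8), gamma

W = np.random.randn(4, 4).astype(np.float32)
W_ternary, gamma = absmean_ternary_quantize(W)
print(W_ternary)          # entries are only -1, 0, or 1
print(np.log2(3))         # ~1.585 bits per ternary weight
```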


Big if broadly applicable (from the software engineering point of view) because... it would offer a direct path to significantly improving the efficiency of training and running large models, which is a nontrivial development cost these days in the foundation model space.

Big if broadly applicable (from the technology industry point of view) because... it'd open up demand for a new and very specialized type of chip (whether you're Nvidia or trying to compete with them); see the sketch after these notes for why the arithmetic changes.

Big if broadly applicable (from the theoretical AI point of view) because... if a radically simplified configuration space gets us the same performance, then understanding where that performance comes from and how to improve it becomes easier — my bet is that shifting from full-precision weights to {-1, 0, 1} allows the use of different mathematical tools.
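To make the efficiency and hardware points concrete: with weights restricted to {-1, 0, 1}, a matrix-vector product needs no multiplications at all, only additions, subtractions, and skips, which is exactly the kind of operation a specialized chip could exploit. A toy sketch (my own illustration, not from the paper):

```python
import numpy as np

def ternary_matvec(W_ternary, x):
    """Matrix-vector product where every weight is -1, 0, or +1.

    No multiplications: +1 weights add the activation, -1 weights
    subtract it, and 0 weights are skipped. Real kernels would pack
    weights into 2-bit codes and vectorize, but the idea is the same.
    """
    out = np.empty(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))              # ternary weights
x = rng.standard_normal(8)
assert np.allclose(ternary_matvec(W, x), W @ x)   # matches a normal matmul
```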

A skeptical note: This paper looks very interesting and the team behind it is solid, so I'm not skeptical about the specific experiments and claims. That this can be done on all LLMs (for some reasonable definition of "all") is a much stronger claim, and I'd need to see much more experimental work before I let go of all those bits. (Hence the "Big if..." prefaces above.) And from a purely conceptual point of view it's wild: moving to, say, 8 bits, sure, but I wouldn't be surprised if networks trained this way work in qualitatively different ways from those trained with full-precision weights. If this works in general, of course.

A couple of final notes on wrong reasons for skepticism: I don't think most of the industry has many short-term incentives to explore this possibility, so it'll take a while either way. This paper comes from Microsoft Research Asia; I know Microsoft doesn't have the largest hype footprint in the AI space, but Microsoft Research has always been very good at this sort of thing, even if the company itself hasn't always taken full advantage of it.