Generative AI is taking the world by storm. OpenAI's GPT-4 can pass a bar exam, Midjourney images can win art prizes, and Sora creates bafflingly realistic videos from text, to name a few recent highlights. Artificial general intelligence has gone from a Kurzweilian pipe dream to a multi-billion-dollar industry, and there is a distinct possibility that, in the not-too-distant future, some of our colleagues may be silicon-based life forms. (Humans are likely to remain essential, however, due to their ability to make coffee.)
Given all the excitement, money, and promise (which remains significant even when normalized for hype), it's natural to ask if quantum computers can help with generative AI. Indeed, there is a small but rapidly growing field (e.g. 1, 2, 3, 4, 5, 6, 7) exploring quantum transformers, that is, quantum variants of the machine learning architecture used in GPT-4 and its relatives. These end-to-end quantum pipelines either import ideas from (classical) transformers (e.g. implementing "quantum" self-attention) or try to replicate the transformer architecture wholesale, inside a quantum computer.
In this blog post, I want to critically examine this enterprise, from the viewpoint of quantum machine learning (QML) and quantum computing more generally. I'll start by considering the compute requirements for training large language models (LLMs), and from some back-of-the-envelope calculations, argue that even with a mild quadratic speedup, we could train GPT-4 in an afternoon, and perhaps (eventually) for less than it cost OpenAI. I'll then explain the catch—the number of qubits needed to embed words—and why this catch actually narrows our focus in a useful way and suggests directions for research.
The compute bottleneck
Let's start with the clearest argument for quantum interventions in the AI realm: the compute bottleneck. An LLM like GPT-4 is very expensive to train. We start by curating a bunch of text, splitting it into a sequence of D separate tokens (roughly, words), and then straining these tokens through an elaborate transformer architecture with N parameters, which we update by gradient descent as we go. The total number of floating point operations is n_\text{ML} \sim 6ND, with approximately 6 floating point add/multiply operations per token per parameter. (See this paper for more careful bookkeeping, including the mechanics of batching tokens, which we ignore here.)
According to rumors, GPT-4 has N = 1.76 \times 10^{12} parameters and was trained on a dataset of D = 13\times 10^{12} tokens. A natural unit for compute is the petaFLOP-millennium (or PF-millennium), meaning 10^{15} floating point operations per second, running for a thousand years:

1 \text{ PF-millennium} = 10^{15} \text{ FLOPs/s} \times 10^3 \text{ years} \approx 3 \times 10^{25} \text{ FLOPs}.
Note that FLOPs is the plural of "floating point operation," not "floating point operations per second" (FLOPS). If the rumors are true, then GPT-4 took around 4 PF-millennia to train! To give a sense of scale, if I tried to do this on my laptop, it would take half a million years. It's not the best pitch for a startup. On a log scale, this is how the training time on GPUs and a laptop would compare:
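As a sanity check, here is that back-of-the-envelope arithmetic in a few lines of Python; the laptop throughput of roughly 10 TFLOPS is an assumed round number, while the rest of the figures come from the estimates above.

```python
# Back-of-the-envelope training compute for a GPT-4-sized model.
N = 1.76e12          # rumored parameter count
D = 13e12            # rumored number of training tokens
flops = 6 * N * D    # ~6 floating point operations per parameter per token

SECONDS_PER_YEAR = 3.15e7
pf_millennium = 1e15 * 1000 * SECONDS_PER_YEAR   # FLOPs in one PF-millennium

print(f"Total training compute: {flops:.1e} FLOPs")       # ~1.4e26
print(f"In PF-millennia: {flops / pf_millennium:.1f}")     # ~4

# Assumed laptop throughput of ~10 TFLOPS (a generous round number).
laptop_flops_per_second = 1e13
laptop_years = flops / laptop_flops_per_second / SECONDS_PER_YEAR
print(f"Laptop training time: {laptop_years:.1e} years")   # ~half a million
```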
Sam Altman revealed that all that compute set the company back more than USD 100 million. Although GPT-4 has been a smash hit and well worth the investment, the cost of training a terascale LLM is obviously prohibitively high for most players. AI companies large and small are now eagerly seeking alternatives to these multi-million-dollar training runs.
QML to the rescue?
One possible alternative is quantum machine learning. Quantum computers can, in some cases, exponentially reduce the number of steps needed for a calculation, so it's natural to hope they might help widen the LLM training bottleneck. To see how this might work, let's first convert FLOPs into an analogous quantum unit. One simple option is to use circuit layer operations, or CLOPs, which are essentially just the number of layers in our circuit. We'll take the total number of CLOPs to scale as some root of the number of classical FLOPs. To obtain the total time, we can estimate the CLOPs per second.
This will depend on the type of quantum computer we have, but to make our lives easy, we'll just pick one where the CLOPs per second are easy to estimate. As Terry Rudolph explains, for matter-based quantum computers there is a simple hack: the Heisenberg uncertainty principle. Recall that, in textbooks, the uncertainties in energy and time are related as follows:

\Delta E \, \Delta t \gtrsim \frac{h}{4},
where h = 6.63\times 10^{-34} \text{ J} \cdot\text{s} is Planck's constant. In the context of quantum computing, this can be rigorously interpreted as the statement that, to rotate a state into an orthogonal state using a Hamiltonian with an energy spread \Delta E, it will take a minimum time of h/(4\Delta E). For photonic quantum computers, there is no \Delta E we can use to set a timescale, and we need to think about the architecture more carefully. For superconducting qubits, trapped ions, etc., we can use \Delta E \sim k_\text{B} T, where T is the operating temperature and k_\text{B} \approx 1.38 \times 10^{-23} \text{ J/K} is the Boltzmann constant. The CLOPS (CLOPs per second) go as

\text{CLOPS} \sim \frac{1}{\Delta t} = \frac{4\Delta E}{h} \sim \frac{4 k_\text{B} T}{h}.
For a superconducting quantum computer operating at 10 \text{ mK}, we get optimal CLOPS of 1 \text{ GHz}. Real devices don't achieve this in practice, but let's ignore that for now and continue living in a hypothetical future where we don't worry about noise.
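For concreteness, here is the same Heisenberg-style estimate in code, with the 10 mK operating temperature as the only input:

```python
# Optimal circuit-layer operations per second (CLOPS) from the energy-time bound,
# taking Delta E ~ k_B * T for a matter-based quantum computer.
h = 6.63e-34      # Planck's constant (J s)
k_B = 1.38e-23    # Boltzmann constant (J/K)
T = 10e-3         # operating temperature of a superconducting device (K)

delta_E = k_B * T
clops = 4 * delta_E / h        # minimum layer time is h / (4 * Delta E)
print(f"Optimal CLOPS: {clops:.1e} layers per second")   # ~1e9, i.e. ~1 GHz
```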
So, suppose we can train a GPT-4 sized model using a quantum computer, and find a "generic" quadratic speedup, like the one associated with Grover's search. Then the number of CLOPs n_\text{QML} is, by assumption,

n_\text{QML} = \sqrt{n_\text{ML}} = \sqrt{6ND} \approx 10^{13}.
This corresponds to a total training time of t_\text{QML} = n_\text{QML} \Delta t_\text{QML} \approx 3 \text{ hours}. And we're not even thinking about parallelization! In contrast, it took OpenAI three months and 25,000 top-of-the-line NVIDIA GPUs. Is the conclusion, then, that we should all be working on quantum transformers?
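Putting the pieces together, here is a minimal sketch of that estimate; the square-root scaling and the ~1 ns layer time (from the ~1 GHz optimal CLOPS above) are the assumptions.

```python
import math

N, D = 1.76e12, 13e12
n_ml = 6 * N * D                # classical FLOPs, as before

n_qml = math.sqrt(n_ml)         # assumed quadratic speedup: CLOPs ~ sqrt(FLOPs)
dt_qml = 1e-9                   # ~1 ns per layer, from the ~1 GHz optimal CLOPS

t_qml = n_qml * dt_qml
print(f"Quantum training time: {t_qml / 3600:.1f} hours")   # ~3 hours
```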
The price is right
Before we rush out and change fields, we should make sure that we're not replacing something slow and expensive with something fast and even more expensive!
The concern is that a quadratic (or more generally, polynomial) speedup is only useful when the argument is big, and "big" scales with the comparative cost per basic operation. From GPT-4, we can estimate the cost per FLOP, c_\text{FLOP}, by dividing the roughly USD 100 million training bill by the roughly 10^{26} FLOPs it required.
It's not exactly a fair comparison, but we can use IBM's maximum number of CLOPS (~1500) and cloud pricing (USD 1.6 per second) to deduce a cost per CLOP of c_\text{CLOP} = \frac{\text{USD }1.6}{1500} \approx \text{USD } 10^{-3}, so a tenth of a cent. Consider the classical (ML) and quantum (QML) cost of training a model needing x FLOPs, with a polynomial speedup of index m:

\text{cost}_\text{ML} = c_\text{FLOP}\, x, \qquad \text{cost}_\text{QML} = c_\text{CLOP}\, x^{1/m}.
The quantum training will be cheaper when

c_\text{CLOP}\, x^{1/m} < c_\text{FLOP}\, x, \quad \text{i.e.} \quad x > \left(\frac{c_\text{CLOP}}{c_\text{FLOP}}\right)^{m/(m-1)}.
We can picture this as follows:
For a terascale dataset D \sim 10^{13} and the cost per basic operation given above, quantum training with polynomial speedup m becomes cost-effective for

N \gtrsim \frac{1}{6D}\left(\frac{c_\text{CLOP}}{c_\text{FLOP}}\right)^{m/(m-1)}.
For a quadratic speedup, m = 2, we need 100 trillion parameters (100 times larger than GPT-4) to get value for money. Of course, the cost per CLOP will decrease as quantum computers get bigger, better, and more widely available. If c_\text{CLOP} decreases by a factor of a thousand or so, it is not only faster but cheaper to train GPT-4. As m increases, the threshold for cost-effectiveness lowers, and for large m, it becomes cheaper to train any realistically sized LLM on a quantum computer, even with the current CLOP cost. We probably also need larger m to compensate for the overhead of fault-tolerance.
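To make the comparison concrete, here is a quick sketch of the numbers for a GPT-4-sized run, using the rough per-operation costs above; the quadratic speedup and the ~USD 100 million training bill are the assumptions, and since both costs are order-of-magnitude estimates, the outputs should be read loosely.

```python
import math

N, D = 1.76e12, 13e12
x = 6 * N * D                    # classical FLOPs for a GPT-4-sized run

c_flop = 1e8 / x                 # ~USD 100 million spread over x FLOPs
c_clop = 1.6 / 1500              # IBM cloud estimate: USD 1.6 per second at ~1500 CLOPS

classical_cost = c_flop * x                 # ~USD 1e8 by construction
quantum_cost = c_clop * math.sqrt(x)        # m = 2: only sqrt(x) CLOPs needed

print(f"Classical cost:                     USD {classical_cost:.1e}")
print(f"Quantum cost (current CLOP price):  USD {quantum_cost:.1e}")
# A ~1000x drop in the CLOP price brings the quantum cost below the classical one.
print(f"Quantum cost (1000x cheaper CLOPs): USD {quantum_cost / 1000:.1e}")
```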
The embedding bottleneck
We seem to be home free: quantum computers with modest algorithmic improvements can train LLMs more quickly and cheaply than their classical counterparts. This would be a great way to end the post, but sadly, there is a catch. The parameter count N we've been talking about is exclusively for non-embedding parameters; it doesn't count the "fixed" cost of embedding words, attentional queries, and so on, in intermediate vector spaces.
Let's focus on word embedding for simplicity. In modern transformers, the dimension of the word embedding space is d \sim 10^4, and in most proposed quantum transformers, we need at least d qubits. No matter how small N is, we need at least 10^4 error-corrected qubits to pluck the apparently low-hanging fruits of quantum advantage. This is prohibitively large.
Suppose for the sake of argument that Hartmut Neven's "law" is true, and the number of error-corrected qubits doubles every 18 months (akin to Moore's law). We will generously posit a single error-corrected qubit right now. Then we expect it will take around 18 \text{ months} \times \log_2 (10^4) \approx 20 \text{ years} to reach the point where we can perform word embeddings for LLMs using currently available techniques. And in 20 years, our classical methods will be hopelessly superannuated! (For reference, the state of the art in word representations 20 years ago was totally different and used d \sim 100 features.)
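Under the same generous assumptions (one error-corrected qubit today, doubling every 18 months), the timeline arithmetic is a one-liner; the d = 640 case of word2vec, discussed below, is included for comparison.

```python
import math

def years_until(d_qubits, current_qubits=1, doubling_time_years=1.5):
    """Years until we have d_qubits error-corrected qubits, assuming the count
    doubles every 18 months (Neven's 'law') starting from current_qubits."""
    return doubling_time_years * math.log2(d_qubits / current_qubits)

print(f"d = 10^4 (modern LLM embedding): {years_until(1e4):.0f} years")   # ~20
print(f"d = 640  (original word2vec):    {years_until(640):.0f} years")   # ~14
```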
Of course, we could restrict to smaller embedding spaces. But (a) even the original and outdated word2vec method, using d = 640, would take around 14 years to reach using the estimate above; and (b) as the original transformer paper (see Table 3) shows, even with many parameters, reducing the embedding dimension seriously harms performance.

Non-embedding parameters get all the press, but the success of transformers is clearly tied to the scale of embedding. This bottleneck will affect quantum transformers long before speedups help with compute. In 5 years, we can hope for ~30 logical qubits, so this seems like a good target for compression.
Solving the embedding problem sounds hard, but we are not starting from scratch. Amplitude encoding methods use a logarithmic number of qubits, though at considerably lower accuracy than is needed for state-of-the-art applications. The excellent paper of Guo et al. also proposes a transformer scheme with compressed embedding via block-encoded matrices; as they clearly state, however, the architectural requirements for embedding remain formidable. These issues of performance and architecture are part of the bottleneck!
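As a rough illustration of that logarithmic scaling, here is a minimal PennyLane sketch that amplitude-encodes a toy "word vector" of dimension d = 8 into just three qubits; the vector itself is a random stand-in, not a learned embedding.

```python
import numpy as np
import pennylane as qml

d = 8                                  # toy embedding dimension
n_qubits = int(np.ceil(np.log2(d)))    # amplitude encoding needs only log2(d) qubits
                                       # (a d ~ 10^4 embedding would need just 14)

dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def embed(word_vector):
    # Store the d-dimensional vector in the amplitudes of an n_qubits-qubit state.
    qml.AmplitudeEmbedding(word_vector, wires=range(n_qubits), normalize=True)
    return qml.state()

word_vector = np.random.randn(d)       # stand-in for a learned word embedding
state = embed(word_vector)
print(f"{d}-dimensional vector stored on {n_qubits} qubits "
      f"({len(state)} amplitudes)")
```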
In a different direction, existing tools from QML, such as feature learning and quantum kernels, provide ways to implicitly access and embed data in exponentially large Hilbert spaces. Further, the all-important attentional layers of a transformer architecture can be viewed in terms of kernel machines, suggesting that our quantum toolkit may teach us more about transformers than we expect.
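For instance, here is a minimal sketch of a fidelity-style quantum kernel in PennyLane, which compares two data points via the overlap of their embedded states without ever writing down the exponentially large feature vectors; the angle embedding and the toy data are illustrative choices, not a prescription.

```python
import numpy as np
import pennylane as qml

n_qubits = 3
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def overlap_circuit(x1, x2):
    # Embed x1, then "un-embed" x2; the probability of the all-zeros outcome is
    # the fidelity |<phi(x2)|phi(x1)>|^2 -- a kernel on a 2^n_qubits-dimensional
    # feature space that is never constructed explicitly.
    qml.AngleEmbedding(x1, wires=range(n_qubits))
    qml.adjoint(qml.AngleEmbedding)(x2, wires=range(n_qubits))
    return qml.probs(wires=range(n_qubits))

def kernel(x1, x2):
    return float(overlap_circuit(x1, x2)[0])

x1, x2 = np.random.randn(n_qubits), np.random.randn(n_qubits)
print(f"k(x1, x2) = {kernel(x1, x2):.3f}")
```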
Conclusion
Current research on the QML/transformer overlap tries to port classical transformer routines into the quantum realm and then perform what I like to think of as "advantage hacking," i.e. looking for asymptotic quantum speedups. If the order-of-magnitude calculations above are correct, they suggest that any such speedup, if achievable, is likely to significantly ease the compute bottleneck facing LLMs like GPT-4. Great! But speedups are easy; achievability is the real problem. To train or run even a small LLM, we currently need on the order of 10,000 qubits to perform word and attention embedding. This is out of reach for the foreseeable future, and by the time the foreseeable future arrives, transformers are likely to have fallen by the wayside, like GANs before them.
The embedding bottleneck is the problem of reducing the number of qubits needed to perform this embedding, while retaining the benefits of the transformer architecture. There are precedents and tools from QML that make this problem feasible, but this change of focus also rests on a different philosophy of quantum computing. Rather than viewing quantum computers as general-purpose advantage machines that will be arbitrarily scalable, it takes them to be janky little contraptions with eccentric superpowers we have yet to understand.
Given this philosophy, compressed embedding should be our research goal, rather than advantage. And there's reason to think that, in the process, we might learn something new about transformers! In an interview with Kara Swisher last year, Sam Altman remarked that he's not worried about "language model, startup number 217." What keeps him up at night is the prospect of "three smart people in a garage with some very different theory of how to build AGI." The general-purpose approach sounds like startup number 217. The special-purpose, eccentric superhero is waiting for us in the garage. We just have to get in there and start tinkering, using the sort of toolkit offered by quantum software libraries like PennyLane (hint: take a look at our new datasets!), along with well-founded priors about how quantum computers could be of real utility in AI.
About the author
David Wakeham
Having fun at the intersection of quantum algorithms, symmetry, statistical learning theory and large language models.