Why measuring performance is our biggest blind spot in quantum machine learning

Maria Schuld

Have a look at the following abstract:

Sounds familiar? While it is entirely made up (including the authors – apologies if you do exist!), variations on this theme are posted to the arXiv every week and mark an explosion of the research area of quantum machine learning (QML) in the past few years.

A great share of quantum machine learning research is driven by the question of whether quantum computers will be useful hardware platforms for machine learning once they are available at scale. They can be used to analyse traditional “classical data”, or extract information from quantum states. Considerable progress has been made in this quest, but can we really claim to know that quantum machine learning is “powerful” as the fake abstract suggests – especially if we interpret “powerful” as solving a useful application well?

In this blog post, I want to argue that we have a severe methodological blind spot in our literature, which limits the answers we find to this question. Thinking about this blind spot is like reading an image into a cloud formation — once you are aware of it, you cannot unsee it. In particular, you cannot unsee it in abstracts like the ones sketched above. What I am talking about is the lack of a suitable framework to study the question of what makes a quantum algorithm “good” for machine learning.

The blog is addressed to those of you who do or want to do quantum machine learning research (with or without PennyLane). I hope that after reading it you will recognise that the moments where you feel uncomfortable about how to design your quantum machine learning investigation may actually derive from some larger unresolved methodological challenges the field is facing.

If you find this topic interesting, you can find some aspects discussed in more depth in our new perspective Is quantum advantage the right goal for quantum machine learning?.

How we measure performance

Unsurprisingly, the two most wide-spread figures of merit in quantum machine learning research are borrowed from its parent disciplines: running benchmarks (“we show with simulations that the model learns to classify MNIST data well“) and proving an asymptotic runtime advantage (“we show that the ansatz gives rise to models that are classically intractable“). These methods were adopted without much of a critical debate: Machine learning papers need tables comparing the prediction power of algorithms, and quantum computing papers talk about speedups – that’s just how it is, right?

A few other notions for “good” quantum machine learning started to emerge along the way, especially in the attempt to design new machine learning models based on quantum circuits. The fake abstract implicitly mentions two that I will discuss here: the “expressivity” or generality of a computation (“here we propose a new, highly expressive variational ansatz…“), and that a quantum model emulates a popular concept from classical machine learning (“…that uses convolutional layers“).

When one looks critically at these and other ways to demonstrate the “power” of quantum machine learning in anything but a purely academic setting, one quickly sees some serious flaws. Flaws we cannot escape from easily, but which we can certainly become more aware of.

MNIST, Iris & Co

From machine learning, quantum machine learning adopted the practice of solving benchmark tasks to measure the performance of models. Benchmarks typically measure the error a trained model makes on a test set of data samples when actually running the algorithm and serves as an indication of its generalisation power. They already come with a bag of methodological challenges (see, for example, the panel discussion The Role of Benchmarks in the Scientific Progress of Machine Learning at NeurIPS 2021), such as what a fair comparison is, how we can control the impact of our specific choices for hyperparameters in optimization, or how we avoid overfitting a single dataset over the years.

But unlike classical machine learning, we do not even have the hardware or simulation capability to run benchmarks on relevant scales. We fix this by choosing tiny datasets and few-qubit-experiments, proposing hybrid classical-quantum setups, by downscaling the data, or by focusing on models that can be classically emulated. Then we produce graphs like this one:

In most cases we don’t know (or even ask) what effects these limitations have on the results.

Here is the point: If we use benchmarks we should think about their scope a lot more, choose them more carefully, and put more effort into drawing the right conclusions from them. For example, the made-up abstract above somehow suggests that the good performance on MNIST is evidence that the quantum model “works well”. Does this conclusion really lie in the scope of the benchmark? What was the quantum model compared to, a classical neural net with similar parameters, or similar expressive power, or a similar training time? Was the data down-scaled by another algorithm, and how does that influence the result? How robust is the outcome when we change the ansatz or hyperparameters in training? What settings break the model?

Answering these questions is within reach of numerical experiments, but producing meaningful evidence is an art that requires time and effort. Since we are scientists, I believe that we need to incentivize a more conscious use of benchmarks as a community rather than trying to make the quantum line look better than the classical competitor.

The holy grail of proven speedups

From quantum computing research, we adopted the notion of asymptotic runtime complexity as a figure of merit. It says that an algorithm is good if we can prove that its runtime grows slower with the size of the input – which in this context is usually the size of the dataset – than the runtime of some relevant contender in the asymptotic limit. This certainly makes for great papers for a quantum audience, some of which have shown that there exist problems that are intractable for classical machine learning but possible on quantum computers. But I think one can find many arguments as to why computational runtime complexity is not the right figure of merit for machine learning, especially when it comes to the goal of understanding how we may use quantum hardware for real problems one day:

  • Machine learning problems are messy and, therefore, genuinely difficult for complexity analysis. Realistic machine learning problems try to capture our world and are therefore not as simplistic as prime factorisation or unstructured search. The goal is not to find the correct solution, but a solution that generalises well on unseen data. We also do not often know a lot about the general mathematical structure of the data – think of language translations or stock market prediction – which is arguably why we need machines to do the learning in the first place. Often, a problem has to be reduced to an absurd minimum to make it accessible to complexity considerations.
  • There are often no comprehensive theoretical proofs for the effective runtime of classical algorithms. For example, training neural networks means solving a non-convex optimization problem that is in principle NP-hard, but something in the combination of model, training method and data makes it efficient. If complexity is a blunt tool to describe how the best classical machine learning algorithms perform, how can we use this tool to show that quantum algorithms are better for machine learning in anything but highly artificial comparisons?
  • Most machine learning algorithms already take only linear time in the number of data in practice. It is difficult to think of an alternative: a quadratic scaling becomes problematic for data sets larger than tens of thousands of examples, while sublinear algorithms require assumptions that circumvent the linear-time process of loading the data. But it means that there is not much room for speedups, at least if we want to improve on machine learning for settings where it is already used widely.
  • Only considering what is provable can be severely limiting. Due to the above, if we limit ourselves to models that are accessible by complexity analysis on paper, we shrink our window of investigation to an incredibly small part of the universe of quantum algorithms that are out there. One could argue that most of machine learning used today would not have been developed under this constraint.

To be fair, there may be cases where runtime complexity may be the right figure of merit, for example, when we want to find areas where machine learning has never been attempted because it was deemed to be impossible (again, think of the debate around learning from “quantum data”). This may be important for applications in the very very far future, applications that have not been invented yet. But until these applications are more concrete, I simply doubt that proving cases of classical intractability on paper, as claimed proudly in the fake abstract, tells us a lot about the practical power of quantum machine learning.

Expressivity is not always good

Another figure of merit that appears in the fake abstract is a notion of expressivity: “here we propose a new, maximally expressive variational ansatz“.

Expressivity is a term with many flavours, which can be confusing at first. It is often used when considering a family of quantum models defined by a parametrized quantum circuit (whose internal design is fixed by an ansatz). Training selects the parameter set or family member which solves some machine learning task best.

Expressivity can refer to two very different concepts in this setting:

  1. How large is the class of unitaries that an ansatz can express?
  2. How large is the class of models that the quantum circuit can express?

I would argue that for both concepts, more expressivity is not necessarily better. Let me give a few examples:

  • A very expressive parametrized ansatz is hard to train. This is an important result in the literature on “barren plateaus” (and the fake abstract mentions the feat of having avoided them). Much simplified, variational circuits that can prepare states in widely spread corners of Hilbert space give rise to flat optimization landscapes, and in high dimensions it is difficult to find the right search directions for minima.
  • Expressive model classes can overfit. Traditionally, too much expressivity in a model class was considered harmful for learning because it leads to overfitting. We know now that the picture is much more complex, as in highly overparametrized deep neural networks, the training algorithm can restrict the model class in beneficial ways. A debate about expressivity in quantum machine learning should at least be aware of these complexities.
  • Theoretical statements about large classes of models tend to be loose. For example, deriving a generalisation bound – an upper bound to how much worse a method will perform on unseen data compared to the training data – that is true for a wide range of datasets and models has to account for a lot of “bad” models too, and therefore doesn’t say much about what we’ll observe in practice. Proving something about all quantum models \(tr(\rho(x) M)\) with any measurement \(M\) and any encoding \(\rho(x)\) does therefore only give us limited information.
  • Theoretical statements about large classes of models can even be misleading. An illustrative example is quantum kernels, which compute the similarity of two data points by encoding them into two quantum states and measuring their “overlap” (roughly speaking). Increasingly, researchers warned that this method cannot scale: for large qubit numbers, two quantum states have a high probability of having a zero overlap, and the distance measure will become useless. This statement is true for most encodings in the class of all possible encodings. But quantum – just as classical – kernels should not just be made from arbitrary encodings in the first place.

In summary, expressivity is certainly an interesting measure, but it is not something we should blindly maximise.

Just write “quantum” in front of it

The last figure of merit we’ll discuss here is whether or not it emulates concepts that are currently popular (or “state-of-the-art”, as one likes to say) in classical machine learning. If we can find quantum convolutional neural networks, quantum perceptrons, quantum deep reinforcement learning models, quantum GANs or quantum neural tangent kernels, surely they have to be great?

This, of course, is nothing other than wishful thinking. What works in classical ML does not have to work when combined with quantum circuits. While overparametrised deep neural networks trained by stochastic gradient descent dominate AI, overparametrised deep quantum circuits trained by stochastic gradient descent could be useless on large scales (and anyone training them on small scales knows they are far away from working “out-of-the-box”). Neural tangent kernels help us to understand deep learning, but they may not help us to understand quantum machine learning. Perceptrons are the building blocks of neural networks, but they may not be a suitable mathematical mechanism to build models from quantum theory.

The right figure of merit here should not be that our methods emulate classical methods, but that they target the crucial principles that lead to the success of classical methods. Maybe the right design of quantum machine learning algorithms does not use layer structures. Maybe it does not use gradient descent. Maybe it does not use the concept of training by optimisation at all (an interesting thought voiced by Mikail Belkin, a great machine learning theorist, is that it is unclear why methods like nearest neighbour don’t work better).

Of course, this is an extremely ambitious demand, since the secret sauce of good machine learning is a rather complicated puzzle with many moving parts. But we should not fall prey to spurious beliefs that adding “quantum” to fancy classical terms is a valid argument for good performance.

What this all means for your next paper

Maybe the above convinced you that measuring performance in quantum machine learning is a difficult methodological problem that – while providing the foundation of our research – receives hardly any attention. If so, there are a few things you can do to address this blind spot when reading, reviewing or writing papers.

  1. When you read a paper, look for arguments used by the authors when advertising their methods. Don’t just accept them blindly. Ask which culture of science they come from, if they make sense in the study, and find out if the authors discuss their limitations convincingly.

  2. When you review a paper written by other researchers acknowledge how hard it is to design a study that is aware of methodological issues, and reward those who make an effort to ask uncomfortable questions about their own work.

  3. When you write a paper, be courageous to focus on small but well-posed questions that help our understanding. Resist the pressure to sell quantum machine learning, and consider investigation designs that try to break or test our methods. Be clear about your personal goal: the papers that will be relevant for quantum machine learning in 5 years’ time may be very different from the ones that make it into prestigious journals today.

I believe that these small adjustments in awareness could help to fundamentally reshape the tone in quantum machine learning research. I also believe that thinking about how to define good QML is crucial for innovation in the field and that such innovations will be critical if we want quantum technologies to find their place in machine learning one day.