March 13, 2024
Benchmarking near-term quantum algorithms is an art
In this blog post, the authors of Xanadu’s recently released paper Better than classical? The subtle art of benchmarking quantum machine learning models discuss the subtle art of quantum machine learning benchmarking, and why this is a hugely important topic the field must address.
What is the most difficult, “purest” kind of scientific work? Chances are you are thinking of proofs, mathematical theory and the research of the likes of Stephen Hawking (or, to use an example from quantum computing, Peter Shor). Benchmarking quantum algorithms with software simulations, on the other hand, would probably rank very low: it is often seen as an engineering task that can be handed to any graduate student for a Master’s thesis. Find some code online, change it up a bit and generate a few numbers; the only real challenge is to produce plots that look good in a paper.
Well, if you thought that, you would be wrong. Like most work with complex data-generating systems, benchmarking in near-term quantum computing is a subtle, underappreciated art that requires utmost scientific rigor — and time — to produce meaningful results. This painful but enlightening insight comes from working on our recent paper Better than classical? The subtle art of benchmarking quantum machine learning models.
We embarked on this study because of a growing suspicion that the overly positive picture from the literature, where quantum machine learning models readily outperform tried-and-tested classical models, was misleading. We also wanted to take up the challenge and follow our own call for better benchmarking in quantum machine learning. What would we find if we systematically tested popular quantum models on binary classification tasks in the fairest way we could imagine? (Spoiler alert: the study did not show systematic evidence that “quantum beats classical” and uncovered a lot of other interesting questions — read more about it here).
In this blog post we want to share with you the four most important reasons why benchmarking quantum algorithms is so difficult, and so important. Our examples stem from machine learning applications, but the lessons learnt might help in other disciplines as well.
Contents
- The dramatic effect of small decisions
- A million circuits instead of one
- Seeing the forest for the trees
- "Is quantum better?" is not the most interesting question
The dramatic effect of small decisions
When designing a procedure to benchmark a quantum algorithm, there are hundreds of decisions to make. Some are general and obvious: What task to test on? What implementation of my algorithm do I choose? What do I compare it to? What scoring metric do I use? Others are more technical: Do I discard training runs where gradient descent didn’t converge? Which subset of the data do I use to initialize a data preprocessing model? How do I choose the batch size during training?
The problem is that these decisions can have quite a large impact on the results. For example, we found that the score we defined in our research — the accuracy of a trained quantum model on a hold-out dataset — changed by up to 15% if we trained an already converged model for just a bit longer, or pre-scaled our data to a different interval, or changed from one standard loss function to another. But even if you wanted to, you cannot test all decisions: even 20 choices of two options each result in over a million experiments you’d have to conduct.
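To make the combinatorial explosion concrete, here is a tiny back-of-the-envelope sketch; the decision names are invented purely for illustration and are not the actual choices from our study:

```python
from itertools import product

# 20 independent design decisions with two options each; placeholder names,
# not the real design choices from the benchmark.
decisions = {f"decision_{i}": ["option_a", "option_b"] for i in range(20)}

grid = product(*decisions.values())
print(sum(1 for _ in grid))  # 1048576 -- over a million experiment configurations
```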
To get reliable results under these conditions, one needs two things: firstly, an intimate knowledge of the experimental setup, which provides an intuition for what is important and what is not; secondly, a top-notch scientific methodology behind how choices are made, and why.
A million circuits instead of one
Quantum algorithms are notoriously difficult to simulate classically (that is the entire point, of course). But there is a huge difference between running a circuit once, which today’s software tools can proudly handle at the scale of around 30 qubits, and testing a quantum machine learning algorithm:
- Oftentimes we do not only need to compute the result of a circuit, but its gradient with respect to gate parameters. This poses different challenges to software.
- Quantum models can involve multiple circuits. For example, in standard quantum kernel methods the number of circuits grows quadratically with the size of the data.
- Test results have to be averaged over several random initialisations of a training procedure.
- We have to rerun the overall benchmark for different choices of hyperparameters to determine the ones suited for a task.
Putting these together, to get a single benchmark score we do not run a single circuit, but thousands or even millions of them.
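As a rough illustration, here is a back-of-the-envelope count for a quantum kernel classifier; the numbers below are made up for the sketch and are not the settings used in our paper:

```python
# Circuit evaluations needed for a single benchmark score of a quantum
# kernel classifier (illustrative numbers only).
n_train, n_test = 250, 250
n_seeds = 5            # random repetitions of the experiment
n_hyperparams = 12     # hyperparameter configurations searched over

# symmetric training kernel matrix (diagonal entries are trivially 1)
train_circuits = n_train * (n_train - 1) // 2
# rectangular test-vs-train kernel matrix
test_circuits = n_test * n_train

total = n_seeds * n_hyperparams * (train_circuits + test_circuits)
print(f"{total:,} circuit evaluations")  # 5,617,500 -- about 5.6 million
```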
Even worse, some circuits differ only in their trainable parameters, while others differ in architecture. Some can be run in parallel, while others depend on each other. And we need to run them forward and “backward” (a fancy term for calculating gradients). Combining the right performance tools, such as parallelisation, compilation or high-performance backends, requires a lot of highly specialized work. For example, we only figured out after weeks that the right order of just-in-time compilation and parallelisation via vmap in jax leads to tremendous speedups. Or that some code editors introduce memory leaks. We also had to rewrite our workflow several times to optimize job acceptance on a cluster, and became masters at assembling Slurm scripts using inspiration from ChatGPT. We still don’t know if GPUs would help us. Benchmarking pushed us to become better programmers, as much as it pushes quantum software to its current limits.
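As a minimal sketch of the jit/vmap pattern, assuming a toy cost function in place of a real quantum circuit (the function below is our own stand-in, not a model from the paper):

```python
import jax
import jax.numpy as jnp

# Stand-in for a quantum model's expectation value; in the real workflow this
# would be a circuit, but a toy cost keeps the sketch self-contained.
def cost(params, x):
    return jnp.sum(jnp.cos(params) * x)

# "Backward" pass: gradient with respect to the trainable parameters.
grad_cost = jax.grad(cost, argnums=0)

# vmap batches the gradient over many random initialisations at once;
# wrapping the *batched* function in jit compiles a single program rather
# than recompiling for every initialisation.
batched_grads = jax.jit(jax.vmap(grad_cost, in_axes=(0, None)))

params = jax.random.normal(jax.random.PRNGKey(0), (1000, 8))  # 1000 random inits
x = jnp.ones(8)
grads = batched_grads(params, x)  # shape (1000, 8)
```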
Seeing the forest for the trees
Benchmarking is not a linear but a cyclic process. A set of results is just an inspiration for the next iteration of the benchmark design, in which bugs get fixed, conceptual issues resolved and better questions asked. Every such iteration produces tons of data, full of inconclusive observations: sometimes the learning rate is an important hyperparameter while in other experiments it does not matter, and on some datasets a training procedure converges, whereas on others it does not. After enough iterations, a signal slowly emerges from the noise and remains robust against changes in the design. That is when the data starts to tell us a more general story, or when we start seeing the forest for the trees. And that is when the fun begins.
Once the forest emerges, one can start to formulate open research questions that follow from the picture one sees. In our case, we decided to select five striking observations, and walked a few tiny steps in analyzing them further with simple follow-up experiments enabled by our code. Why were hybrid quantum-classical neural networks always performing so well; can we visualize what the quantum parts are doing to data? Why were the quantum kernel methods behaving similarly to the classical kernel method; do the kernels measure a similar distance? Why were models based on amplitude encoding doing so badly; would more features have helped them? Exploring the forest a little more is not only a means of reaping the fruit from all the hard labor, but also a big service to other researchers looking for research opportunities!
"Is quantum better?" is not the most interesting question
Benchmarks of near-term quantum algorithms often have the goal of assessing whether a quantum method is “better” than some comparable classical algorithm. But the results contain much more information, including smoking guns that can lead to further studies. Why do all models with a certain design feature perform badly? Why do two very different models always perform similarly? Does the performance of a quantum model increase with the dimension of the data inputs?
And of course, we can use benchmarks to study the effect of interventions into algorithmic design. If I remove entanglement, does my quantum model generalise as well as before? Does the performance depend on the type of cost function I choose?
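As an example of such an intervention, a minimal PennyLane sketch of an “entanglement ablation” could look as follows; the toy circuit is our own illustration, not one of the models benchmarked in the paper:

```python
import pennylane as qml
import numpy as np

dev = qml.device("default.qubit", wires=4)

def make_classifier(entangle=True):
    @qml.qnode(dev)
    def circuit(params, x):
        qml.AngleEmbedding(x, wires=range(4))
        if entangle:
            # RX rotations followed by a ring of entangling CNOTs
            qml.BasicEntanglerLayers(params, wires=range(4))
        else:
            # identical single-qubit rotations, with the entangling gates removed
            for layer in params:
                for wire, angle in enumerate(layer):
                    qml.RX(angle, wires=wire)
        return qml.expval(qml.PauliZ(0))
    return circuit

params = np.random.uniform(0, 2 * np.pi, size=(3, 4))  # 3 layers, 4 qubits
x = np.array([0.1, 0.2, 0.3, 0.4])
with_ent = make_classifier(entangle=True)(params, x)
without_ent = make_classifier(entangle=False)(params, x)
```

Training and scoring both variants on the same datasets would then show whether the entangling gates actually contribute to the model’s performance.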
In a sense, such results are much richer than a binary decision on whether something outperforms something else. They allow us to understand a quantum algorithm, which is the first step towards finding better designs.
We hope that the (sometimes tedious, often enlightening) work that went into our benchmark study inspires more independent investigations of model performance and the reproduction of other researchers’ results.
If you, equipped with these insights, want to get your hands dirty, have a look at (or even contribute to) the software repository, or try your own benchmark using some of the datasets. As for us, we move on humbled and eager to become better “benchmarkers”. As a start, we are already collaborating with the PennyLane development team to push the scale of QML model testing.
About the authors
Maria Schuld
Dedicated to making quantum machine learning a reality one day.
Joseph Bowles
Quantum Machine Learning researcher at Xanadu
Shahnawaz Ahmed
Code. Quantum. ML