March 06, 2024
HPC4U&ME: Accelerate your quantum research with PennyLane Lightning
In this blog post, the authors of Xanadu’s recent paper Hybrid quantum programming with PennyLane Lightning on HPC platforms discuss how you can incorporate high-performance computing tools for quantum to accelerate your research through PennyLane and the PennyLane Lightning simulator suite.
Quantum computing research can be difficult, and this is even more the case as we scale up the size of the systems we are exploring. Yet, when it comes to classical compute power, we have more and more tools available to us these days, from GPUs to supercomputing centers.
But these technologies can sometimes feel out of reach, and difficult to use. How do we easily take advantage of this classical computing power to accelerate research?
In our latest paper, we showcase the PennyLane Lightning simulator suite, and demonstrate its performance across a variety of problems — from clever algorithms such as the adjoint differentiation method, to scaling up circuit simulations on a single node, and even breaking up workflows into chunks and executing them in parallel across hundreds (yes, hundreds!) of GPUs.
Read on to learn more, and to discover how you can use PennyLane and Lightning to reach new levels of performance.
Contents
- CPUs and GPUs and HPCs, oh my!
- Highly-optimized gate performance
- Using big computers for big workloads
- Great, where do I use this?
- More qubits, more gradients
- How to get started
CPUs and GPUs and HPCs, oh my!
When working with complex problems, once the system becomes big enough or the problem tough enough, finding solutions often means we have to move from pen-and-paper to a computer. Expressing your problem correctly for this to be helpful is the most important step in this transition, but it can also be very important to express it in a way that allows it to be solved efficiently, within a reasonable time, and in a scalable way.
For quantum problems of a given complexity (where the definition of complexity is left as an exercise to the reader 😉), being able to ask our computer for a solution becomes crucial. This is where PennyLane comes in. Our goal with PennyLane is to enable users to easily express their complex quantum problems, painlessly rely on cutting-edge algorithms and implementations to solve them, and enjoy the best performance while doing so.
To this end, automatically installed alongside PennyLane are our suite of high-performance Lightning simulators — modern C++-backed state vector simulators including:
- lightning.qubit: PennyLane's 'supported everywhere' simulator. A highly-optimized CPU state vector simulator, compatible with all modern operating systems and architectures.
- lightning.gpu: A GPU-compatible state vector simulator built on top of lightning.qubit, directly offloading to the NVIDIA cuQuantum SDK for newer CUDA-based GPUs.
- lightning.kokkos: Built on top of the Kokkos framework for parallel CPU execution, as well as execution on both AMD and NVIDIA GPUs.
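A minimal sketch of how these backends are selected (the two-qubit circuit is just a placeholder): every Lightning device is created through the same qml.device call, and only the device name changes.

```python
import pennylane as qml

# Every Lightning backend is created through the same qml.device interface.
# The commented-out lines assume the optional pennylane-lightning-gpu /
# pennylane-lightning-kokkos packages are installed.
dev = qml.device("lightning.qubit", wires=2)      # optimized CPU simulator
# dev = qml.device("lightning.gpu", wires=2)      # NVIDIA cuQuantum-backed
# dev = qml.device("lightning.kokkos", wires=2)   # Kokkos: CPUs, AMD/NVIDIA GPUs

@qml.qnode(dev)
def bell(theta):
    qml.RY(theta, wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

print(bell(0.3))
```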
We also aim to make Lightning as easy to install as possible, with binaries provided for most modern operating systems and architectures, including Intel, AMD, Apple, NVIDIA, IBM, and more. In most cases, it's as simple as pip install pennylane, whether you are on your laptop or a supercomputer.
Highly-optimized gate performance
As quantum circuits become deeper and wider (by the number of gates, and the number of qubits), ensuring that we can run these circuits requires us to use bigger and bolder computers. At 30 qubits, we already need 16 GB of memory to represent a state vector, and so using our laptop may not be the ideal solution anymore. In addition, at that size, applying a single-qubit gate requires updates to potentially all of the 2^{30} state vector coefficients. With numbers like those, we really want to make applying these operations as speedy as possible.
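As a quick sanity check on that 16 GB figure: an n-qubit state vector of double-precision complex amplitudes takes 16 \times 2^{n} bytes, since each amplitude occupies 16 bytes.

```python
# Memory needed to store an n-qubit state vector of complex128 amplitudes
# (16 bytes per amplitude): 16 * 2**n bytes.
for n in (30, 34, 41):
    gib = 16 * 2 ** n / 2 ** 30
    print(f"{n} qubits -> {gib:,.0f} GiB")

# 30 qubits -> 16 GiB
# 34 qubits -> 256 GiB
# 41 qubits -> 32,768 GiB (32 TiB, which is where MPI and many GPUs come in)
```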
Of course, that’s a lot to ask of a computer, or even a researcher working with such a computer, so we can try to remove as much of the pain as possible, letting you focus on the research. For example, compare the runtime-averaged gate performance for Hadamard (top), RX (middle), and CNOT (bottom) operations applied to a 30-qubit state vector on Lightning against other leading simulator frameworks:
By ensuring that gate applications and outcomes are fast regardless of your workload or computing hardware, our goal is to make sure everyone has access to the best tools for solving their problems, with minimal roadblocks to getting started.
With Lightning, you can write your code on your laptop, and still allow it to scale up to larger machines later when your research problem grows in size.
Using big computers for big workloads
When we think of supercomputers, we typically think of a single simulation, and pushing the number of qubits. But there may be parts of our workflows we can ‘chunk’ and execute in parallel.
Let’s briefly think about what it means to run a large quantum problem, say with over 40 thousand circuits of various sizes. How long does this take to simulate once we require more than a few qubits? Well, we did simulate this, so let's find out!
To develop the PennyLane Lightning suite, we committed ourselves to building tooling that allowed for the solving of a large number of circuits as quickly as possible. These types of workloads appear naturally in quantum algorithm workflows, where transformations of quantum circuits can often generate many sub-circuits that have to subsequently be run, such as with parameter-shifted gradients, quantities such as the metric tensor, and in particular, circuit-cutting techniques, where a large circuit can be sliced up into small chunks that can be solved independently.
We had previously explored circuit cutting with a 79-qubit QAOA MaxCut workload. As the tl;dr, we take a large circuit, slice it into smaller chunks that can fit onto a simulator, run these chunks in parallel, and pull the values back at the end to determine the circuit’s result with some tensor contractions to help. To see how far we could push this problem, we then expanded upon it, getting more than 40 thousand sub-circuits in the 27–29 qubit range, starting with the following graph:
This graph was mapped to a circuit, and we let PennyLane's qml.cut_circuit transform take over for the decomposition. With that, we finally reached the stage where we needed a way to run all of these circuits as quickly as possible.
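If you want to try circuit cutting yourself, here is a minimal sketch, not the 79-qubit QAOA workload from the paper: the gates, observable, and cut location are purely illustrative. A qml.WireCut marker tells the qml.cut_circuit transform where to slice, letting a three-wire circuit run on a two-wire device.

```python
import pennylane as qml
from pennylane import numpy as np

# A 2-wire device, even though the logical circuit below uses 3 wires:
# qml.cut_circuit slices the circuit at the qml.WireCut marker into fragments
# small enough to simulate, then recombines their results.
dev = qml.device("lightning.qubit", wires=2)

@qml.cut_circuit
@qml.qnode(dev)
def circuit(x):
    qml.RX(x, wires=0)
    qml.RY(0.9, wires=1)
    qml.RX(0.3, wires=2)
    qml.CZ(wires=[0, 1])
    qml.RY(-0.4, wires=0)
    qml.WireCut(wires=1)  # mark where the circuit is cut
    qml.CZ(wires=[1, 2])
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1) @ qml.PauliZ(2))

print(circuit(np.array(0.531, requires_grad=True)))
```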
An obvious answer to this problem is to run it on a supercomputer.
Supercomputers are some of the largest computing systems available, with each part able to work independently on different tasks, or them all operating together to solve problems that cannot run on smaller systems. Using these machines often requires subtle (or not so subtle) changes to how we express our problem: we want to make sure we use all of the CPUs, GPUs, and RAM efficiently, that the different machines talk to each other quickly, and that they know how to handle themselves when something goes wrong.
To build and solve our 40-thousand-circuit problem, we used NERSC’s Perlmutter supercomputer, with lightning.gpu doing the heavy lifting. Our code is available on GitHub, and helps set up the problem, get it running on the supercomputer, and calculate our answer (namely, the expectation value of the original circuit).
Using hundreds of GPUs, we executed all of the circuits, and finalised the results in under 15 minutes! This also showed us how easily we can scale the performance to solve more circuits per second with more GPUs, letting us tune the resources to suit our problem needs.
Great, where do I use this?
There are multiple ways you can flexibly access a high-performance simulator backend using PennyLane Lightning, from our lightning.qubit device, to our performance-portable lightning.kokkos device, to our lightning.gpu simulator with MPI support.
In addition, all Lightning device backends support the classically-efficient adjoint differentiation method as well as parallelism when taking quantum circuit gradients, allowing you to optimize quantum circuit parameters much faster than other methods. This allows us to express and solve bigger problems, such as using VQE for the ground-state energy of large Hamiltonians in a matter of minutes.
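Before turning to the benchmark data, here is a minimal sketch of what requesting adjoint differentiation and batched observable evaluation looks like in code (the four-qubit circuit is a placeholder, not one of the paper's workloads):

```python
import pennylane as qml
from pennylane import numpy as np

# batch_obs=True lets Lightning parallelize over observables / gradient terms
dev = qml.device("lightning.qubit", wires=4, batch_obs=True)

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    for w in range(4):
        qml.RY(params[w], wires=w)
    for w in range(3):
        qml.CNOT(wires=[w, w + 1])
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(3))

params = np.array([0.1, 0.2, 0.3, 0.4], requires_grad=True)
print(qml.grad(circuit)(params))  # gradient via adjoint differentiation
```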
Let’s take an example of this, where our goal is to evaluate small and large circuits (from tens to thousands of gates) with expectation values of small-to-large Hamiltonians (also with tens to thousands of terms). If we have access to both CPUs and GPUs, how do we know when to use which backend? Let's take a look at the VQE data in the following figure, and try to answer that question.
We can see that different devices win the race at different scales — determining the fastest device for your workload depends on many things: your computing hardware (CPU cores, GPU types, memory bandwidth), the number of gates in your circuit, the number of terms in your Hamiltonian, the number of stars in the sky. Well, maybe not the last one, but there are often many more things that affect runtimes that are out of our control, and sometimes we need to do some benchmarking to see which option best fits our workload.
As lightning.qubit supports parallelized gradient evaluations, we see it taking a wide lead below 20 qubits, beyond which the GPU backends catch up and begin to shine. By setting batch_obs=True for the device, and diff_method="adjoint" in the QNode, we see a huge boost in performance over other simulators when calculating the Jacobian. Since the goal of VQE is to find the minimum energy, having direct access to the Jacobian means we can ask our friendly optimizer to use it and find that minimum for us.
This gives us an advantage over gradient-free optimizers, as the quantum circuit can immediately provide some landscape information to traverse (namely, the direction of the steepest slope), allowing for a better solution in fewer steps. This advantage quickly becomes enormous as the number of dimensions or parameters grows.
Let's take a look at optimizing \text{H}_2\text{O} using lightning.qubit. We can also set the number of threads with the environment variable OMP_NUM_THREADS, which can be adjusted to suit our system and problem:
```python
import pennylane as qml
from pennylane import numpy as np

mol = qml.data.load("qchem", molname="H2O", bondlength=0.958, basis="STO-3G")[0]
hf_state, ham = mol.hf_state, mol.hamiltonian
wires = ham.wires
dev = qml.device("lightning.qubit", wires=wires, batch_obs=True)

n_electrons = mol.molecule.n_electrons
singles, doubles = qml.qchem.excitations(n_electrons, len(wires))

# Create the QNode
@qml.qnode(dev, diff_method="adjoint")
def cost(weights):
    qml.AllSinglesDoubles(weights, wires, hf_state, singles, doubles)
    return qml.expval(ham)

params = qml.numpy.array(np.random.normal(0, np.pi, len(singles) + len(doubles)))

opt = qml.AdagradOptimizer(stepsize=0.2)
max_iterations = 1000
energies = [np.nan]
conv_tolerance = 1e-7

# Loop until converged, or the max iterations are hit
for n in range(max_iterations):
    params, prev_energy = opt.step_and_cost(cost, params)
    energies.append(prev_energy)

    if not n % 10:  # print every 10 steps
        print(f"Step={n}, energy={prev_energy}")

    if np.abs(energies[-1] - energies[-2]) < conv_tolerance:
        break

print(f"Energy={cost(params)}")
```
We can just as easily swap the device from lightning.qubit to lightning.kokkos or lightning.gpu, and the code will continue running on whatever hardware we have available. This allows us to easily tailor our simulator to our problem, at whatever scale we are interested in.
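For example, assuming the corresponding Lightning plugin packages are installed, the only line in the script above that needs to change is the device construction; the adjoint gradients and optimizer loop stay exactly the same:

```python
# Swap in a different Lightning backend; the rest of the VQE script is unchanged.
# These lines assume the corresponding Lightning plugin is installed.
dev = qml.device("lightning.kokkos", wires=wires, batch_obs=True)
# or, on an NVIDIA GPU:
# dev = qml.device("lightning.gpu", wires=wires, batch_obs=True)
```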
More qubits, more gradients
Sometimes you just need more qubits, and hence a bigger state vector simulator; we can help you with that too. For a general set of problems, state vector simulators are our best option, though they come with a curse: namely, every extra qubit means 2× the memory needs and 2× the operations per gate application. Let's suppose we need a 34-qubit state vector. This will require 16\times 2^{34} bytes to store (256 GiB), which can be a lot to ask of our humble laptops. But if we had access to lots of big GPUs, each with 64 GiB of memory — is there a way for us to run our big problem on multiple devices at once?
Sure there is, and the answer is MPI, the standard for building supercomputing-scale distributed workloads. MPI allows us to express our problem in a way that uses multiple CPUs, and even GPUs, all working together to solve it.
For lightning.gpu, we have direct support for building large-scale state vector workloads using MPI across many GPUs. In our paper, we show lightning.gpu running on over 500 GPUs on Perlmutter, where we simulated a workload of up to 41 qubits!
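As a rough sketch of what a distributed run looks like (assuming an MPI-enabled build of lightning.gpu, a CUDA-aware MPI library, and the mpi4py package; the circuit and qubit count are illustrative), the state vector is sharded across ranks by passing mpi=True to the device, and the script is launched with mpirun or srun:

```python
# run with, e.g.: mpirun -np 8 python distributed_sv.py
from mpi4py import MPI  # initialize MPI before creating the device
import pennylane as qml
from pennylane import numpy as np

n_wires = 32  # the 2**32-amplitude state vector is split across all ranks

dev = qml.device("lightning.gpu", wires=n_wires, mpi=True)

@qml.qnode(dev)
def circuit(theta):
    for w in range(n_wires):
        qml.RX(theta, wires=w)
    for w in range(n_wires - 1):
        qml.CNOT(wires=[w, w + 1])
    return qml.expval(qml.PauliZ(n_wires - 1))

# Every rank participates in the simulation and receives the same result.
print(circuit(np.array(0.5, requires_grad=True)))
```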
We even added support for calculating quantum circuit gradients at this scale, because why should anybody be limited to quantum circuit gradients that only fit on a laptop!? That's right, they shouldn't! 😎
How to get started
One of the best things about Lightning is that if you use PennyLane, you are already good to go: lightning.qubit comes included with every installation of PennyLane, so you can immediately get started on your laptop, workstation, or even supercomputer.
If you want to use more tailored high-performance backends, you can check our high-performance PennyLane install page, which will help you get started. We currently have installation instructions for PyPI, conda, Docker, Spack, and the tried-and-true GitHub source installs.
And finally, don't forget to check out our paper and our GitHub repository for all the details on the workflows and benchmarks discussed above.
Happy HPC-ing!
About the authors
Lee O'Riordan
Physicist, purveyor of angular momentum, GPUs, pointy guitars, and computational things. Working on quantum stuff.
Josh Izaac
Josh is a theoretical physicist, software tinkerer, and occasional baker. At Xanadu, he contributes to the development and growth of Xanadu’s open-source quantum software products.