March 11, 2024
PennyLane + AMD = ❤️: Running Lightning in a heterogeneous world
In this blog post, the authors of Xanadu’s recent paper Hybrid quantum programming with PennyLane Lightning on HPC platforms discuss how you can take advantage of AMD CPUs and GPUs to scale up your quantum computing workflows.
Say you want to run the variational quantum eigensolver algorithm on the \text{C}_2\text{H}_4 molecule — which has just under 9000 terms in its Hamiltonian. Throw gradients into the mix, and you have potentially thousands of circuit executions to simulate.
But what if we had access to lots of AMD GPUs, each with over 60 GB of memory — can we run our big problem on multiple GPUs, all at once? With PennyLane thrown into the mix, we can run this entire simulation in a little over 6 minutes per iteration.
In our latest paper, we tackle running large-scale quantum circuit optimization workloads on the LUMI supercomputer using over 250 (yes, 250!) AMD GPUs — all using features built directly into PennyLane Lightning.
Read on to learn more about how you can take advantage of AMD hardware, PennyLane and the Lightning plugins to speed up your workflows.
Contents
- AMD and PennyLane
- Fine-grained optimization: Running AVX-512 workloads on AMD CPUs
- Distributed workflows: VQE with MPI for fun
- How to get started
AMD and PennyLane
These days, we tend to equate GPU-superpowered simulations with NVIDIA GPUs and the CUDA software framework. But when we start to explore the world of high-performance computing, we quickly discover that this is not always the case — we live in a heterogeneous world, where hardware accelerators come from a variety of vendors, including Intel, NVIDIA, and AMD, each providing unique advantages and selling points.
For example, the LUMI supercomputer, the EuroHPC Joint Undertaking system at CSC in Finland (number 5 on the TOP500 list of the world's most powerful supercomputers), is powered by AMD Instinct MI250X GPU accelerators and AMD EPYC 7003 series CPUs. Similarly, the Setonix HPE Cray supercomputer at the Pawsey Centre, the most powerful supercomputer in the southern hemisphere, has over 180 GPU nodes, each with a 64-core AMD EPYC CPU and eight AMD Instinct MI250X GPUs.
When doing quantum research, we want to make sure that our computational workflows are compatible with — and take advantage of! — this heterogeneous landscape, and the PennyLane Lightning suite of simulators is designed to do exactly that:
- `lightning.qubit`: PennyLane's "supported everywhere" simulator. A highly optimized CPU state vector simulator, compatible with all modern operating systems and architectures.
- `lightning.gpu`: A GPU-compatible state vector simulator built on top of `lightning.qubit`, directly offloading to the NVIDIA cuQuantum SDK for newer CUDA-based GPUs.
- `lightning.kokkos`: Built on top of the Kokkos framework for parallel CPU execution, as well as both AMD and NVIDIA GPU execution.
Kokkos is a performance portability framework allowing us to write C++ code that compiles and runs on a variety of hardware, including CPUs as well as NVIDIA, AMD, and Intel GPUs. It does this by tying into other frameworks such as OpenMP, CUDA, HIP, and SYCL. We rewrote the high-performance `lightning.qubit` backend using Kokkos to produce `lightning.kokkos`, achieving better performance portability. `lightning.kokkos` is therefore the weapon of choice for executing on the widest variety of available classical compute hardware.
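Switching between these backends is a one-line change at the device level. Here is a minimal sketch (the wire count and circuit are arbitrary illustrations, and assume the chosen backend is installed):

```python
import pennylane as qml

# Pick a Lightning backend by name; the circuit code itself stays the same.
dev = qml.device("lightning.kokkos", wires=20)  # or "lightning.qubit" / "lightning.gpu"

@qml.qnode(dev)
def bell_state():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.probs(wires=[0, 1])

print(bell_state())
```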
We also aim to make the Lightning suite as easy to install as possible, with binaries provided for most modern operating systems and architectures, including Windows, macOS, and Linux, on Intel, AMD, Apple, NVIDIA, IBM, and more. In most cases, it's as simple as `pip install pennylane`, whether you are on your laptop or a supercomputer.
Fine-grained optimization: Running AVX-512 workloads on AMD CPUs
Making sure we get the best performance for any given workload often requires fine-tuning, optimizing our implementations in place for a given classical architecture, or just straight-up buying bigger hardware. Even though GPUs are often the backbone of modern large-scale workloads, ensuring the fastest CPU-based performance remains of the utmost importance too: every HPC system in the land has CPUs, and ignoring them leaves a lot of performance gains on the table.
With `lightning.qubit`, we recently showed how our implementations of targeted SIMD kernels gave us performance gains for quantum gate applications. Here, we'll demonstrate running these kernels on the latest generation of AMD EPYC processors, on an AWS M7a instance. Since these processors natively support AVX-512, Lightning's highly optimized gate kernels can shine.
The above figure shows a full sweep of all qubit indices for applying a CNOT gate on a 30-qubit state vector, all running on a single socket with 32 OpenMP threads. From left to right, we start with `lightning.qubit`'s default kernel implementation, with added parallelization over the coefficient interaction function. Next, we make use of our AVX2 kernel backend, which immediately gives us just under 19% faster runtimes. Finally, by switching on the AVX-512 kernels, we see upwards of 24% runtime performance improvement over the already speedy default kernels. All of these kernels are available in `lightning.qubit` right now, and will be automatically dispatched to the best-performing option on your given hardware!
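As a rough illustration of this kind of sweep (not the benchmarking harness used for the figure; the qubit count and repetition numbers below are arbitrary, and QNode overhead is included in the timing), one can time gate application across target wires directly from Python:

```python
import timeit
import pennylane as qml

# Time a CNOT applied to each target wire on lightning.qubit; kernel
# selection (default/AVX2/AVX-512) happens automatically for the host CPU.
# 24 qubits is used here to keep memory modest compared to the 30-qubit figure.
n_wires = 24
dev = qml.device("lightning.qubit", wires=n_wires)

@qml.qnode(dev)
def apply_cnot(target):
    qml.CNOT(wires=[0, target])
    return qml.expval(qml.PauliZ(0))

for target in range(1, n_wires):
    runtime = timeit.timeit(lambda: apply_cnot(target), number=3) / 3
    print(f"CNOT on wires [0, {target}]: {runtime:.4f} s")
```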
Distributed workflows: VQE with MPI for fun
Nothing says "HPC" like running workloads with MPI. MPI stands for Message Passing Interface, and it is the de facto programming paradigm for distributed workloads. "Distributed" here covers anything from workloads made up of thousands of tasks that can be handled by independent nodes, to workloads in which unique data structures are partitioned and held in the memory of several nodes at once (which generally results in communication-intensive algorithms).

The latter type is precisely what happens in, say, `lightning.gpu`, which allows the state vector itself to be distributed over an arbitrary number of GPUs as of PennyLane v0.31. Some of the latest feats of `lightning.gpu` were described in our previous post.
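From the user's side, distributing the state vector is a single keyword argument. A small sketch (assuming an MPI-enabled `lightning.gpu` build and a CUDA-capable GPU available to each rank):

```python
import pennylane as qml

# Run under MPI, e.g.: mpirun -np 4 python distributed_sv.py
# With mpi=True, the 32-qubit state vector is partitioned across the GPUs
# attached to the participating ranks.
dev = qml.device("lightning.gpu", wires=32, mpi=True)

@qml.qnode(dev)
def circuit():
    qml.Hadamard(wires=0)
    for w in range(31):
        qml.CNOT(wires=[w, w + 1])
    return qml.expval(qml.PauliZ(31))

print(circuit())
```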
In that blog post, we also reported the strong VQE performance achieved using the `lightning.kokkos` device with a CUDA backend. VQE problems are usually not limited by the sheer amount of memory needed to store the state vector, but by the number of iterations required to optimize the total energy. For instance, the largest molecule in our single-node Lightning benchmarks was \text{C}_2\text{H}_2, described with 24 qubits.
We then thought: would it not be great if `lightning.kokkos` could also scale with MPI, enabling us to find the ground state of even larger molecules with VQE? For that, we would need a lot of high-end GPUs, and we had just found the right partner: the LUMI supercomputer at the Finnish IT Center for Science (CSC). LUMI is the fastest supercomputer in Europe, equipped with high-end AMD cards, so surely if that doesn't do it, nothing will.
But why is it difficult to push beyond \text{C}_2\text{H}_2? While executing a 24-qubit circuit is not bad, the \text{C}_2\text{H}_2 Hamiltonian has 6401 terms, which is a lot. This means a single evaluation of the total energy requires computing a weighted sum of 6401 expectation values after applying a few thousand gates. As if this were not bad enough, each optimization step requires computing a gradient vector with thousands of elements. One can also try their luck with gradient-free methods, but those typically require even more iterations to converge. Fortunately, `lightning.kokkos` implements the so-called adjoint differentiation method (as do the other Lightning backends), which gives gradients at a relatively low cost.
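To make that concrete, here is a minimal sketch of adjoint differentiation on `lightning.kokkos`, with a toy four-qubit Hamiltonian standing in for the thousands-of-terms molecular one:

```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("lightning.kokkos", wires=4)

# Toy Hamiltonian with a handful of terms; a molecular Hamiltonian has thousands.
H = qml.Hamiltonian(
    [0.2, -0.5, 0.3],
    [qml.PauliZ(0) @ qml.PauliZ(1), qml.PauliX(2), qml.PauliY(3)],
)

@qml.qnode(dev, diff_method="adjoint")
def energy(params):
    for w in range(4):
        qml.RY(params[w], wires=w)
    qml.CNOT(wires=[0, 1])
    qml.CNOT(wires=[2, 3])
    return qml.expval(H)

params = np.array([0.1, 0.2, 0.3, 0.4], requires_grad=True)
print("E =", energy(params))
print("gradient =", qml.grad(energy)(params))  # computed via the adjoint method
```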
In quantum chemistry research, one typically does not stop after optimizing the ground state, because that ground state is usually parametrized in terms of the atomic positions, an electric field, and other properties. For example, the total energy as a function of the atomic coordinates (treated as classical variables) is a multi-dimensional manifold called the Born–Oppenheimer surface. Knowledge of that surface can tell us a lot about a given physical system: local minima are stable molecular geometries, and the curvature close to those minima reveals the vibrational modes, which can then be used to estimate many thermodynamic quantities such as heat capacity. So executing a single circuit is far from enough; we want to execute numerous circuits, and fast.
The key software ingredients
Large VQE problems necessitate the computation of thousands of expectation values, which means thousands of so-called backward passes during the gradient computation. Again, we do not merely seek the total energy for a single wavefunction ansatz; we want to optimize the thousands of ansatz parameters that will reveal the true quantum ground state of our molecule. A natural parallelization strategy was therefore to map the various total energy contributions onto the available processing units, in the present case hefty AMD Instinct MI250X GPUs. We needed a way to communicate data across nodes prior to backpropagation and to reduce contributions to the total energy and gradient, and hence we needed MPI.
A recent addition to the widely used `mpi4py` package is an interface that matches Python's `concurrent.futures`, namely `mpi4py.futures`. These Python executor objects define an interface for submitting units of work to run in predetermined locations. Python provides support for both a thread-pool and a process-pool executor, which work great for local parallelism. However, sometimes you need things to run beyond a single computer: using the `MPIPoolExecutor`, we can easily farm out tasks over all available MPI processes.
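The sketch below shows the general idea rather than the implementation from the paper: a hypothetical helper `term_expval` evaluates one weighted Hamiltonian term, and `MPIPoolExecutor` spreads those evaluations over the available MPI ranks.

```python
from mpi4py.futures import MPIPoolExecutor

import pennylane as qml
from pennylane import numpy as np

N_WIRES = 4  # toy size; the real workloads use 20+ qubits

def term_expval(coeff, op, params):
    """Evaluate one weighted Hamiltonian term for a fixed ansatz (hypothetical helper)."""
    dev = qml.device("lightning.kokkos", wires=N_WIRES)

    @qml.qnode(dev)
    def circuit():
        for w in range(N_WIRES):
            qml.RY(params[w], wires=w)
        return qml.expval(op)

    return coeff * circuit()

if __name__ == "__main__":
    coeffs = [0.2, -0.5, 0.3]
    ops = [qml.PauliZ(0) @ qml.PauliZ(1), qml.PauliX(2), qml.PauliY(3)]
    params = np.array([0.1, 0.2, 0.3, 0.4])

    # Launch with, e.g.: mpiexec -n 4 python -m mpi4py.futures vqe_terms.py
    with MPIPoolExecutor() as executor:
        contributions = executor.map(term_expval, coeffs, ops, [params] * len(ops))
        print("Total energy:", sum(contributions))
```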
We wanted to see how this would work, so we programmed a simple solution at the Python level, forked the code, upgraded our serialization classes, added pickle bindings, refactored expectation value and adjoint differentiation to support MPI distribution, and voilà! Hmmm, not so fast; we're not quite there yet.
Bleeding-edge tooling: ROCm like a hurricane!
Of course, using the newest features often demands building things yourself, so that's what we did. Running on LUMI, we built Lightning-Kokkos to run through the ROCm toolchain (the AMD-developed SDK for GPU software development, analogous to CUDA), generating support directly for the AMD Instinct MI250X GPUs. Next, we built OpenMPI 5.0 ourselves against a multithreaded UCX library, and finally built `mpi4py` from the latest repository version. With this tooling in place, we got ourselves some GPUs to talk to.
From here we decided to pick a suitably large problem: running VQE on the ethylene (\text{C}_2\text{H}_4) molecule, which is represented in the PennyLane Datasets as a 28-qubit Hamiltonian with approximately 9000 terms.
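As a hedged sketch of how one can fetch such a Hamiltonian from PennyLane Datasets (the `molname` and `basis` keys below are assumptions; `qml.data.list_datasets()` shows what is actually available):

```python
import pennylane as qml

# Assumed dataset keys for ethylene; adjust based on qml.data.list_datasets().
[dataset] = qml.data.load("qchem", molname="C2H4", basis="STO-3G")

H = dataset.hamiltonian
coeffs, ops = H.terms()
print(len(ops), "terms acting on", len(H.wires), "qubits")
```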
We fed the Hamiltonian into PennyLane's gradient machinery to calculate a Jacobian for use with our optimizer of choice. Running this on over 250 GPUs was no problem at all, yielding a gradient with respect to over three thousand parameters (for the ansatz wavefunction) in a little over 6 minutes. To keep the scaling comparison fair, and in the interest of time, we limited ourselves to 10 iterations of PennyLane's `GradientDescentOptimizer`, and the total energy already dropped from -64.658 Hartree (Ha) to -73.739 Ha, with a residual of 0.149 Ha. This is not yet chemical accuracy (~0.002 Ha), which is usually required in high-quality quantum chemistry research, but we're well on our way there thanks to Lightning-Kokkos backed by the AMD Instinct MI250X GPUs and its ability to use gradient-based optimization, with nothing in the way but a few more iterations.
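Continuing the sketch above, a simplified single-node version of that optimization loop could look like the following (the hardware-efficient ansatz here is a stand-in for the chemistry-inspired ansatz used in the actual run, and `H` is the Hamiltonian loaded earlier):

```python
from pennylane import numpy as np

n_qubits = len(H.wires)
dev = qml.device("lightning.kokkos", wires=n_qubits)

@qml.qnode(dev, diff_method="adjoint")
def cost(params):
    # Stand-in ansatz; the paper's run optimized thousands of parameters.
    qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
    return qml.expval(H)

shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_qubits)
params = np.random.uniform(0, np.pi, size=shape, requires_grad=True)

opt = qml.GradientDescentOptimizer(stepsize=0.1)
for step in range(10):
    params, energy = opt.step_and_cost(cost, params)
    print(f"Step {step}: E = {energy:.6f} Ha")
```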
How to get started
The Lightning simulators enable quantum researchers to push frontiers where performance is a limiting factor. This is enabled everywhere by `lightning.qubit`, which is supported on any major OS (Windows, macOS, Linux), by the new distributed state vector capabilities of `lightning.gpu`, and by `lightning.kokkos`, which brings our performance portability to a new level, supporting multi-core CPUs, NVIDIA GPUs, and AMD GPUs (to name a few).
One of the best things about Lightning is that if you use PennyLane, you are already ready to go: `lightning.qubit` comes included with every installation of PennyLane, so you can immediately get started on your laptop, workstation, or even a supercomputer.
If you want to use more tailored high-performance backends, check out our high-performance PennyLane install page, which will help you get started. We currently have installation instructions for PyPI, conda, Docker, Spack, and the tried-and-true GitHub source installs as well.
And finally, don't forget to check out our paper and our GitHub repository for all the details on the workflows and benchmarks in the paper. All the Lightning MPI development mentioned in this post lives on the `update/pickle_bindings` branch of the PennyLaneAI/pennylane-lightning repository.
Happy HPC-ing!
About the authors
Lee O'Riordan
Physicist, purveyor of angular momentum, GPUs, pointy guitars, and computational things. Working on quantum stuff.
Vincent Michaud-Rioux
I work in the PennyLane performance team, where we write, maintain and deliver high-performance quantum simulation backends (the PennyLane-Lightning plugins) that greatly accelerate computations by exploiting multi-core CPUs, NVIDIA and AMD GPUs. I also ...
Josh Izaac
Josh is a theoretical physicist, software tinkerer, and occasional baker. At Xanadu, he contributes to the development and growth of Xanadu’s open-source quantum software products.