
March 11, 2024

PennyLane + AMD = ❤️: Running Lightning in a heterogeneous world

In this blog post, the authors of Xanadu’s recent paper Hybrid quantum programming with PennyLane Lightning on HPC platforms discuss how you can take advantage of AMD CPUs and GPUs to scale up your quantum computing workflows.

Say you want to run the variational quantum eigensolver (VQE) algorithm on the ethylene molecule, whose Hamiltonian has just under 9000 terms. Throw gradients into the mix, and you have potentially thousands of circuit executions to simulate.

But what if we had access to lots of AMD GPUs, each with over 60 GB of memory? Can we run our big problem on multiple GPUs, all at once? With PennyLane thrown into the mix, we can run this entire simulation in a little over 6 minutes per iteration.

Becoming friends with PennyLane and AMD

In our latest paper, we tackle running large-scale quantum circuit optimization workloads on the LUMI supercomputer using over 250 (yes, 250!) AMD GPUs — all using features built directly into PennyLane Lightning.

Read on to learn more about how you can take advantage of AMD hardware, PennyLane and the Lightning plugins to speed up your workflows.


AMD and PennyLane

These days, we tend to equate GPU-superpowered simulations with NVIDIA GPUs and the CUDA software framework. But when we start to explore the world of high-performance computing, we quickly discover that this is not always the case: we live in a heterogeneous world, where hardware accelerators come from a variety of vendors, including Intel, NVIDIA, and AMD, each providing unique advantages and selling points.

For example, the LUMI supercomputer, the EuroHPC Joint Undertaking system hosted at CSC in Finland (number 5 on the TOP500 list of the world's most powerful supercomputers), is powered by AMD Instinct MI250X GPU accelerators and AMD EPYC 7003 series CPUs. Similarly, the Setonix HPE Cray supercomputer at the Pawsey Supercomputing Centre, the most powerful supercomputer in the southern hemisphere, has over 180 GPU nodes built around 64-core AMD EPYC CPUs, each node hosting eight AMD Instinct MI250X GPUs.

When doing quantum research, we want to make sure that our computational workflows are compatible with — and take advantage of! — this heterogeneous landscape, and the PennyLane Lightning suite of simulators is designed to do exactly that:

  • lightning.qubit: PennyLane's "supported everywhere" simulator. A highly optimized CPU state vector simulator, compatible with all modern operating systems and architectures.

  • lightning.gpu: A GPU-compatible state vector simulator built on top of lightning.qubit, directly offloading to the NVIDIA cuQuantum SDK for newer CUDA-based GPUs.

  • lightning.kokkos: Built on top of the Kokkos framework for parallel CPU execution, as well as both AMD and NVIDIA GPU execution.

Kokkos is a performance portability framework that lets us write C++ code that compiles and runs on a variety of hardware, including CPUs as well as NVIDIA, AMD, and Intel GPUs. It does this by tying into other frameworks such as OpenMP, CUDA, HIP, and SYCL. We rewrote the high-performance lightning.qubit backend using Kokkos to produce lightning.kokkos, achieving better performance portability. lightning.kokkos is therefore the weapon of choice for executing on the widest variety of available classical compute hardware.

We also aim to make the Lightning suite as easy to install as possible, with binaries provided for most modern operating systems (Windows, macOS, and Linux) and architectures from Intel, AMD, Apple, NVIDIA, IBM, and more. In most cases, it's as simple as pip install pennylane, whether you are on your laptop or a supercomputer.
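To make the choice of backend concrete, here is a minimal sketch (assuming PennyLane is installed, and the corresponding plugin for the Kokkos or GPU backends): switching simulators is a one-line change to the device name.

```python
import pennylane as qml
from pennylane import numpy as np

# lightning.qubit ships with every PennyLane install; swapping in
# "lightning.kokkos" or "lightning.gpu" only requires installing the
# corresponding plugin and changing the device name below.
dev = qml.device("lightning.qubit", wires=20)

@qml.qnode(dev, diff_method="adjoint")
def circuit(weights):
    qml.StronglyEntanglingLayers(weights, wires=range(20))
    return qml.expval(qml.PauliZ(0))

shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=20)
weights = np.array(np.random.random(size=shape), requires_grad=True)
print(circuit(weights))
```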

Fine-grained optimization: Running AVX-512 workloads on AMD CPUs

Making sure we get the best performance for any given workload often requires fine-tuning: optimizing our implementations for a given classical architecture, or just straight-up buying bigger hardware. Even though GPUs are the backbone of many modern large-scale workloads, ensuring the fastest CPU-based performance remains of the utmost importance too. Every HPC system in the land has CPUs, and ignoring them leaves a lot of performance on the table.

With lightning.qubit, we recently showed how our implementations of targeted SIMD kernels gave us performance gains for quantum gate applications. Here, we'll demonstrate running these kernels on the latest-generation AMD EPYC processors, on an AWS M7a instance. Since these processors natively support AVX-512, Lightning's highly optimized gate kernels can shine.

Figure: More power with PennyLane Lightning! Applying a CNOT gate on a 30-qubit state vector, on a single socket with 32 OpenMP threads.

The above figure shows a full sweep of all qubit indices for applying a CNOT gate on a 30-qubit state vector, all running on a single socket with 32 OpenMP threads. From left to right, we start with lightning.qubit's default kernel implementation, with added parallelization over the coefficient interaction function. Next, we make use of our AVX2 kernel backend, which immediately gives us just under 19% faster runtimes. Finally, by switching on the AVX-512 kernels, we see upwards of 24% runtime performance improvement over the already speedy default kernels. All of these kernels are available in lightning.qubit right now, and will be automatically dispatched to the best-performing option on your given hardware!
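As a rough illustration of the kind of sweep behind the figure, here is a hypothetical micro-benchmark sketch. Kernel selection (default, AVX2, or AVX-512) happens automatically inside lightning.qubit based on what your CPU supports, and the OpenMP thread count can be steered via the OMP_NUM_THREADS environment variable.

```python
import timeit
import pennylane as qml

# The figure above uses 30 qubits; we use 26 here so the state vector
# (~1 GB at double precision) fits comfortably on a laptop.
# Set OMP_NUM_THREADS before launching Python to control the thread count.
n_wires = 26
dev = qml.device("lightning.qubit", wires=n_wires)

@qml.qnode(dev)
def apply_cnot(target):
    qml.CNOT(wires=[0, target])
    return qml.expval(qml.PauliZ(target))

for target in range(1, n_wires):
    runtime = timeit.timeit(lambda: apply_cnot(target), number=3) / 3
    print(f"CNOT(0, {target}): {runtime:.4f} s")
```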

Distributed workflows: VQE with MPI for fun

Nothing says "HPC" like running workloads with MPI. MPI stands for Message Passing Interface, and it is the de facto programming paradigm for distributed workloads. By distributed workloads, we mean anything from workloads comprising thousands of independent tasks that can be managed by separate nodes, to workloads in which a single data structure is partitioned across the memory of several nodes at once (which generally results in communication-intensive algorithms). The latter type is precisely what happens in, say, lightning.gpu, which allows the state vector itself to be distributed over an arbitrary number of GPUs as of PennyLane v0.31. Some of the latest feats of lightning.gpu were described in our previous post.
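For reference, here is what the distributed state vector looks like from the user's side: a minimal sketch based on lightning.gpu's documented mpi=True option, assuming an MPI-enabled build of the plugin and a launch along the lines of mpirun -np 8.

```python
# distributed_sv.py -- run with, e.g.:  mpirun -np 8 python distributed_sv.py
from mpi4py import MPI  # initializes MPI before the device is created
import pennylane as qml

comm = MPI.COMM_WORLD

# mpi=True asks lightning.gpu to partition the 32-qubit state vector across
# all participating GPUs instead of keeping it on a single card.
dev = qml.device("lightning.gpu", wires=32, mpi=True)

@qml.qnode(dev)
def circuit():
    qml.Hadamard(wires=0)
    for target in range(1, 32):
        qml.CNOT(wires=[0, target])
    return qml.expval(qml.PauliZ(0))

result = circuit()  # every rank participates in the computation
if comm.Get_rank() == 0:
    print(result)
```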

In that blog post, we also reported the strong VQE performance achieved using the lightning.kokkos device with a CUDA backend. VQE problems are usually limited not by the sheer amount of memory needed to store the state vector, but by the number of iterations required to optimize the total energy. For instance, the largest molecule in our single-node Lightning benchmarks was described with 24 qubits. We then thought: would it not be great if lightning.kokkos could also scale with MPI, enabling us to find the ground state of even larger molecules with VQE? For that, we would need a lot of high-end GPUs, and we had just found the right partner: the LUMI supercomputer at CSC, the Finnish IT Center for Science. LUMI is the fastest supercomputer in Europe, equipped with high-end AMD cards, so surely if that doesn't do the trick, nothing will.

But why is it difficult to push beyond this? While executing a 24-qubit circuit is not bad, the Hamiltonian has 6401 terms, which is a lot. This means a single evaluation of the total energy requires computing a weighted sum of 6401 expectation values after applying a few thousand gates. As if this were not bad enough, each optimization step requires computing a gradient vector with thousands of elements. One can also try their luck with gradient-free methods, but those typically require even more iterations to converge. Fortunately, lightning.kokkos implements the so-called adjoint differentiation method (as do the other Lightning backends), which gives gradients at a relatively low cost.
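To make the adjoint method concrete, here is a toy sketch with a two-term stand-in Hamiltonian; the molecular Hamiltonians discussed here have thousands of terms, but the gradient call looks the same, and the same diff_method works on any Lightning backend.

```python
import pennylane as qml
from pennylane import numpy as np

# Stand-in for a molecular Hamiltonian: a weighted sum of Pauli words.
H = qml.Hamiltonian(
    [0.5, -0.8],
    [qml.PauliZ(0), qml.PauliZ(0) @ qml.PauliZ(1)],
)

dev = qml.device("lightning.qubit", wires=2)

# diff_method="adjoint" asks Lightning for adjoint differentiation, which
# reuses the state vector instead of re-executing one circuit per parameter.
@qml.qnode(dev, diff_method="adjoint")
def energy(params):
    qml.RY(params[0], wires=0)
    qml.CNOT(wires=[0, 1])
    qml.RY(params[1], wires=1)
    return qml.expval(H)

params = np.array([0.1, 0.2], requires_grad=True)
print(energy(params), qml.grad(energy)(params))
```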

In quantum chemistry research, one typically does not stop after optimizing the ground state, because that ground state is usually parametrized in terms of the atomic positions, an electric field, and other properties. For example, the total energy as a function of atomic coordinates (considered as classical variables) is a multi-dimensional manifold called the Born–Oppenheimer surface. Knowledge of that surface may tell us a lot about a physical system: local minima are stable molecular geometries, and the curvature close to those minima reveals the vibration modes, which can then be used to estimate many thermodynamic quantities such as heat capacity. So executing a single circuit is far from enough; we want to execute numerous circuits, and fast.

The key software ingredients

Large VQE problems necessitate the computation of thousands of expectation values, which means thousands of so-called backward passes during the gradient computation. Again, we do not merely seek the total energy for a single wavefunction ansatz; we want to optimize the thousands of ansatz parameters that will reveal the true quantum ground state of our molecule. A natural parallelization strategy was therefore to map the various total energy contributions to the available processing units, in the present case hefty AMD Instinct MI250X GPUs. We needed a way to communicate data across nodes prior to the backward passes, and to reduce contributions to the total energy and gradient, and hence we needed MPI.

A recent addition to the widely used mpi4py package is an interface that matches Python's concurrent.futures, namely mpi4py.futures. These Python executor objects define an interface for submitting units of work to run in predetermined locations. Python provides support for both a thread-pool and a process-pool executor, which work great for local parallelism. However, sometimes you need things to run beyond a single computer: using the MPIPoolExecutor, we can easily farm out tasks over all available MPI processes.
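Here is a deliberately tiny sketch of that pattern; the four-term "Hamiltonian" and the partial_energy helper are made up for illustration. Each MPI worker evaluates the expectation values for its chunk of terms, and the results are reduced into a total energy.

```python
# vqe_pool.py -- run with, e.g.:  mpiexec -n 4 python -m mpi4py.futures vqe_pool.py
from mpi4py.futures import MPIPoolExecutor
import pennylane as qml

# Toy four-term "Hamiltonian"; a molecular one would have thousands of terms.
COEFFS = [0.2, -0.5, 0.3, 0.1]
OBS = [
    qml.PauliZ(0),
    qml.PauliZ(1),
    qml.PauliX(0) @ qml.PauliX(1),
    qml.PauliY(0) @ qml.PauliY(1),
]

def partial_energy(chunk):
    """Evaluate one chunk of (coefficient, observable index) pairs on a worker."""
    dev = qml.device("lightning.qubit", wires=2)

    @qml.qnode(dev)
    def expval(i):
        qml.RY(0.4, wires=0)
        qml.CNOT(wires=[0, 1])
        return qml.expval(OBS[i])

    return sum(c * expval(i) for c, i in chunk)

if __name__ == "__main__":
    # Split the terms into one chunk per worker and reduce the contributions.
    chunks = [[(COEFFS[0], 0), (COEFFS[1], 1)], [(COEFFS[2], 2), (COEFFS[3], 3)]]
    with MPIPoolExecutor() as executor:
        total_energy = sum(executor.map(partial_energy, chunks))
    print(total_energy)
```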

We wanted to see how this would work, so we programmed a simple solution at the Python level: we forked the code, upgraded our serialization classes, added pickle bindings, refactored the expectation value and adjoint differentiation routines to support MPI distribution, and voilà! Hmmm, not so fast; we're not quite there yet.

Bleeding-edge tooling: ROCm like a hurricane!

Of course, using the newest features often demands building things yourself, so that's what we did. Running on LUMI, we built Lightning-Kokkos through the ROCm toolchain (the AMD-developed SDK for GPU software development, analogous to CUDA), generating support directly for the AMD Instinct MI250X GPUs. Next, we built OpenMPI 5.0 ourselves against a multithreaded UCX library, and finally built mpi4py from the latest repository version. With this tooling in place, we got ourselves some GPUs to talk to.

From here we decided to pick a suitably large problem: running VQE on the ethylene (C₂H₄) molecule, which is represented in the PennyLane Datasets as a 28-qubit Hamiltonian with approximately 9000 terms.
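If you want to build a comparable Hamiltonian yourself rather than downloading it, PennyLane's quantum chemistry module can generate one from a molecular geometry. Below is a sketch using an approximate ethylene geometry (coordinates in bohr, for illustration only, not the geometry used in the paper); note that building a 28-qubit Hamiltonian this way can take several minutes.

```python
import numpy as np
import pennylane as qml

# Approximate planar ethylene geometry in bohr (illustrative only).
symbols = ["C", "C", "H", "H", "H", "H"]
coordinates = np.array(
    [
        [0.000,  0.000,  1.265],
        [0.000,  0.000, -1.265],
        [0.000,  1.755,  2.328],
        [0.000, -1.755,  2.328],
        [0.000,  1.755, -2.328],
        [0.000, -1.755, -2.328],
    ]
)

# Builds the qubit Hamiltonian (28 qubits in the STO-3G basis).
H, n_qubits = qml.qchem.molecular_hamiltonian(symbols, coordinates, basis="sto-3g")
print(n_qubits, len(H.terms()[0]))
```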

More power!

We fed the Hamiltonian into PennyLane's gradient machinery to calculate a Jacobian for use with our optimizer of choice. Running this on over 250 GPUs was no problem at all, yielding a gradient with respect to over three thousand parameters (for the ansatz wavefunction) in a little over 6 minutes. To get fair scaling numbers, and in the interest of time, we limited ourselves to 10 iterations of PennyLane's GradientDescentOptimizer; the total energy already dropped from -64.658 Hartree (Ha) to -73.739 Ha, with a residual of 0.149 Ha. This is not yet chemical accuracy (~0.002 Ha), which is usually required in high-quality quantum chemistry research, but we are well on our way there thanks to Lightning-Kokkos backed by the AMD Instinct MI250X GPUs and gradient-based optimization, with nothing in the way but a few more iterations.
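The optimization loop itself is plain PennyLane. Here is a toy sketch of the same ten GradientDescentOptimizer steps, shrunk to a two-qubit stand-in problem so it runs anywhere; for the full-scale run you would swap in the ethylene Hamiltonian, lightning.kokkos, and the MPI-distributed gradient described above.

```python
import pennylane as qml
from pennylane import numpy as np

# Toy two-qubit stand-in for the molecular problem.
H = qml.Hamiltonian([0.3, -0.7], [qml.PauliZ(0), qml.PauliX(0) @ qml.PauliX(1)])
dev = qml.device("lightning.qubit", wires=2)

@qml.qnode(dev, diff_method="adjoint")
def cost(params):
    qml.RY(params[0], wires=0)
    qml.CNOT(wires=[0, 1])
    qml.RY(params[1], wires=1)
    return qml.expval(H)

opt = qml.GradientDescentOptimizer(stepsize=0.2)
params = np.array([0.1, 0.1], requires_grad=True)

for step in range(10):
    params, energy = opt.step_and_cost(cost, params)
    print(f"step {step}: E = {energy:.6f}")
```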

How to get started

The Lightning simulators enable quantum researchers to push frontiers where performance is a limiting factor. This is enabled everywhere by lightning.qubit, which is supported on any major OS (Windows, macOS, Linux), by the new distributed state vector capabilities of lightning.gpu, and by lightning.kokkos, which brings our performance portability to a new level, supporting multi-core CPUs, NVIDIA GPUs, and AMD GPUs (to name a few).

One of the best things about Lightning is that if you use PennyLane, you are already ready to go: lightning.qubit comes included with every installation of PennyLane, so you can immediately get started on your laptop, workstation, or even supercomputer.

If you want to use more tailored high-performance backends, you can check our high-performance PennyLane install page, which will help you to get started. We currently have installation instructions for PyPI, conda, Docker, Spack, and the tried-and-true GitHub source installs, as well.

And finally, don't forget to check out our paper and our GitHub repository for all the details on the workflows and benchmarks in the paper. All the Lightning MPI development mentioned in this post lives on the update/pickle_bindings branch of the PennyLaneAI/pennylane-lightning repository.

Happy HPC-ing!

Last modified: August 06, 2024
