September 12, 2023
Distributing quantum simulations using lightning.gpu with NVIDIA cuQuantum
As part of the PennyLane v0.31 release, we recently introduced support for distributed state vector simulations for the lightning.gpu
simulator device, which offloads quantum gate calls to cuStateVec (v1.3.0 and later), and other distributed operations to the Message Passing Interface (MPI).
The distributed lightning.gpu
backend is at feature parity with the single-GPU backend, allowing users' workloads to be scaled up seamlessly. By adding a few additional arguments when constructing a lightning.gpu
device, a distributed simulation can be launched and optimized easily, leading to better performance. We believe this new feature will be a valuable complement to your research and development.
This work was completed in partnership with NVIDIA and also described in an NVIDIA blog post.
Background
Simulating quantum systems is a notoriously resource-heavy task. For many quantum circuit workloads, state vector simulation remains the method of choice, especially when dealing with deep circuits, workloads requiring many samples, or cases where direct access to the state is important.
However, state vector simulators have one well-known weakness — the exponential growth of memory requirements with the number of qubits being simulated. On a desktop computer, we can often reach at most around 30 qubits, and adding just a few more can easily exceed the memory capacity of a single high-performance computing (HPC) node. Memory management is even more important for graphics processing units (GPUs). Given the performance advantage GPUs provide for simulation, we can very likely benefit from using multiple GPUs in tandem to store and operate on a single state vector.
Scaling state vector simulations
In a state vector simulator, a complex coefficient (amplitude) is stored for each computational basis state, and is updated by every operation in a given quantum circuit that acts on that state.
When performing double-precision calculations, the memory required to store the state vector of an n-qubit system is 2^(n+4) bytes (each of the 2^n complex amplitudes occupies 16 bytes). This exponential memory demand becomes a concern, particularly considering that GPU device memory is limited. Typically, a single GPU offers up to 96 gigabytes of memory. Consequently, the available device memory can only accommodate the state vector storage for 30–32-qubit systems, not even considering the additional memory needed for storing gates, observables, and intermediate results. To simulate larger systems, it becomes necessary to distribute the state vector over multiple GPUs, all working together to simulate the same system.
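As a quick back-of-the-envelope check (not part of the original benchmark code), the following snippet tabulates this growth in plain Python:

# Memory footprint of a double-precision state vector: each of the 2**n
# amplitudes is a complex128 value occupying 16 = 2**4 bytes, so the
# total is 2**(n + 4) bytes.
for n in (30, 32, 34, 36):
    num_bytes = 2 ** (n + 4)
    print(f"{n} qubits: {num_bytes / 2**30:.0f} GiB")

# 30 qubits: 16 GiB
# 32 qubits: 64 GiB
# 34 qubits: 256 GiB
# 36 qubits: 1024 GiB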
Introducing lightning.gpu with MPI support
As of PennyLane v0.31, lightning.gpu
enables distributed state vector simulation using a CUDA-aware Message Passing Interface (MPI) and the NVIDIA cuQuantum software development kit (SDK).
Enabling distributed state vector simulation support is straightforward: just include the mpi=True
argument when creating a lightning.gpu
device.
dev = qml.device("lightning.gpu", wires=wires, mpi=True)
With this, lightning.gpu
will automatically distribute and initialize the complete state vector across multiple GPUs, be they local or remote. Each GPU is assigned a sub-state vector, along with a portion of device memory set aside as a data buffer to assist with the MPI communication required for quantum gate calls.
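For intuition, the local memory required per GPU shrinks in proportion to the number of GPUs used. The sketch below is illustrative only (it ignores the MPI data buffer and other overheads):

# Local state vector memory per GPU when a 2**n-amplitude state vector is
# split evenly across num_gpus devices (double precision, buffer excluded).
def local_state_bytes(n_qubits, num_gpus):
    return 2 ** (n_qubits + 4) // num_gpus

print(local_state_bytes(33, 16) / 2**30, "GiB per GPU")  # 8.0 GiB per GPU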
For performance tuning purposes, users have the option to specify the data buffer size by using the mpi_buf_size=n
parameter during the construction of the lightning.gpu
device, with n
expressed in mebibytes (MiB) of device memory for each GPU.
dev = qml.device("lightning.gpu", wires=wires, mpi=True, mpi_buf_size=n)
The full set of features supported by lightning.gpu
has been added to the distributed state vector backend, for both the forward and backward pass.
We added support for finite shots on the circuit forward pass, allowing direct use of lightning.gpu
with sampling workloads. Samples can also be generated from the state vector and returned as a NumPy-compatible array on the host machine for each MPI process. Probabilities can also be evaluated directly on the distributed state vector, with each MPI process returning the results for its local sub-state vector.
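As a minimal sketch (not taken from the original post), a finite-shot sampling workload on a distributed lightning.gpu device might look like the following; the wire count, shot count, and circuit are illustrative, and the script is assumed to be launched with one MPI process per GPU:

# Sketch: finite-shot sampling with a distributed lightning.gpu device.
from mpi4py import MPI
import pennylane as qml

comm = MPI.COMM_WORLD

n_wires = 30
dev = qml.device("lightning.gpu", wires=n_wires, mpi=True, shots=1000)

@qml.qnode(dev)
def circuit():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.sample(qml.PauliZ(0))

samples = circuit()  # NumPy-compatible array available on each MPI process
if comm.Get_rank() == 0:
    print(samples.shape)  # (1000,)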
Both the parameter-shift and adjoint differentiation methods are supported. Since memory capacity is a major concern for distributed state vector calculations, users can also choose how GPU memory is utilized when dealing with multiple observables. Adding the batch_obs=True
argument when constructing a distributed lightning.gpu
device reduces the device memory required by adjoint differentiation, allowing larger quantum systems to be simulated, though possibly at the expense of execution time.
dev = qml.device("lightning.gpu", wires=wires, mpi=True, batch_obs=True)
Note that the mpi4py
package, which provides Python bindings for MPI and is commonly used for building distributed Python APIs, is required to initialize and run distributed state vector simulations with a lightning.gpu
device, even though the lightning.gpu
device itself and its backend do not depend on mpi4py
. All MPI communication operations inside the lightning.gpu
device interface are offloaded to its C++ backend. There, the communication of quantum gate calls and associated data across multiple nodes and GPUs is handled by the appropriate kernels and functions of the cuStateVec library, while data communication for other calls, such as quantum circuit differentiation, sampling, and data initialization, is coordinated through direct MPI function calls.
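To illustrate this division of labour, here is a minimal sketch of the expected usage pattern (the wire count is illustrative): importing mpi4py initializes MPI and handles rank bookkeeping on the Python side, while all state vector communication happens inside the device's C++ backend.

# Sketch: mpi4py initializes MPI; lightning.gpu's C++ backend does the rest.
from mpi4py import MPI          # import before constructing the device
import pennylane as qml

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # one MPI process per GPU
size = comm.Get_size()

dev = qml.device("lightning.gpu", wires=30, mpi=True)

if rank == 0:
    print(f"State vector distributed over {size} GPUs")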
Installing lightning.gpu with MPI
Use of the PennyLane-Lightning-GPU
plugin with multi-node/multi-GPU support also requires the installation of the NVIDIA cuQuantum SDK (currently supported cuQuantum version: cuquantum-cu11), mpi4py
and CUDA-aware MPI.
CUDA-aware MPI allows data exchange between the GPU memory spaces of different nodes without the need for CPU-mediated transfers. Both the MPICH
and OpenMPI
libraries are supported, provided they are compiled with CUDA support.
It is recommended to install NVIDIA cuQuantum and the mpi4py
Python package within the Python environment site-packages
directory using pip
or conda
. Please see the NVIDIA cuQuantum SDK, mpi4py, MPICH, or OpenMPI install guides for more information.
The following procedure can be used to build a wheel with multi-node/multi-GPU support from the package sources using the direct SDK path:
cmake -BBuild -DPLLGPU_ENABLE_MPI=on -DCUQUANTUM_SDK=<path to sdk>
cmake --build ./Build --verbose
python -m pip install wheel
python setup.py build_ext --define="PLLGPU_ENABLE_MPI=ON" --cuquantum=<path to sdk>
python setup.py bdist_wheel
The built wheel can now be installed (see more details in the PennyLane-Lightning-GPU documentation).
python -m pip install ./dist/PennyLane_Lightning_GPU-*.whl
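As a hypothetical smoke test (not from the original post), one might run a small circuit across two GPUs to confirm the distributed build works; the launcher command, wire count, and circuit below are illustrative, e.g. launched as mpirun -np 2 python smoke_test.py with one process per GPU:

# Sketch: verify the distributed build with a Bell-state expectation value.
from mpi4py import MPI
import pennylane as qml

dev = qml.device("lightning.gpu", wires=20, mpi=True)

@qml.qnode(dev)
def bell():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

if MPI.COMM_WORLD.Get_rank() == 0:
    print(bell())  # expected: 1.0 for the Bell state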
Demonstration of large-scale simulation
To demonstrate that larger systems can be simulated with lightning.gpu
with MPI support, and as a multi-node extension of the example from our original PennyLane Lightning-GPU blog post, we first evaluate the Jacobian of a strongly entangling layered (SEL) circuit using adjoint differentiation. The simulations were run on four-GPU nodes of the Perlmutter (NERSC-9) supercomputer, each equipped with 4 NVIDIA A100 40GB Tensor Core GPUs.
from mpi4py import MPI
import pennylane as qml
from pennylane import numpy as np
from timeit import default_timer as timer

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Set number of runs for timing averaging
num_runs = 3

# Choose number of qubits (wires) and circuit layers
n_wires = 32
n_layers = 2

# Instantiate CPU (lightning.qubit) or GPU (lightning.gpu) device
# mpi=True to switch on distributed simulation
# batch_obs=True to reduce the device memory demand for adjoint backpropagation
dev = qml.device('lightning.gpu', wires=n_wires, mpi=True, batch_obs=True)

# Create QNode of device and circuit
@qml.qnode(dev, diff_method="adjoint")
def circuit_adj(weights):
    qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires)))
    return qml.math.hstack([qml.expval(qml.PauliZ(i)) for i in range(n_wires)])

# Set trainable parameters for calculating circuit Jacobian at the rank=0 process
if rank == 0:
    params = np.random.random(qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires))
else:
    params = None

# Broadcast the trainable parameters across MPI processes from rank=0 process
params = comm.bcast(params, root=0)

# Run, calculate the quantum circuit Jacobian and average the timing results
timing = []
for t in range(num_runs):
    start = timer()
    jac = qml.jacobian(circuit_adj)(params)
    end = timer()
    timing.append(end - start)

# MPI barrier to ensure all calculations are done
comm.Barrier()

if rank == 0:
    print("num_gpus: ", size, " wires: ", n_wires, " layers ", n_layers, " time: ", qml.numpy.mean(timing))
Using 4 GPU nodes (16 NVIDIA A100 40GB GPUs), this simulation can easily be scaled up to 32 qubits. For smaller circuits with fewer qubits, the overhead of MPI communication across GPUs and nodes tends to dominate, but for simulations in the region of 28 qubits and above, the computation scales almost linearly!
Next, we present probability evaluation with the distributed state vector simulator. These simulations were run on up to 64 GPU nodes of the Perlmutter (NERSC-9) supercomputer.
from mpi4py import MPI
import pennylane as qml
import numpy as np
from timeit import default_timer as timer

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Set number of runs for timing averaging
num_runs = 3

# Choose number of qubits (wires) and circuit layers
n_wires = 33
n_layers = 2

# Instantiate CPU (lightning.qubit) or GPU (lightning.gpu) device.
# mpi=True to switch on distributed simulation
dev = qml.device('lightning.gpu', wires=n_wires, mpi=True)

# Set target wires for probability calculation
prob_wires = range(n_wires)

# Create QNode of device and circuit
@qml.qnode(dev)
def circuit(weights):
    qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires)))
    return qml.probs(wires=prob_wires)

# Set trainable parameters at the rank=0 process
if rank == 0:
    params = np.random.random(qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires))
else:
    params = None

# Broadcast the trainable parameters across MPI processes from rank=0 process
params = comm.bcast(params, root=0)

# Run, calculate the probabilities and average the timing results
timing = []
for t in range(num_runs):
    start = timer()
    local_probs = circuit(params)
    end = timer()
    timing.append(end - start)

# MPI barrier to ensure all calculations are done
comm.Barrier()

if rank == 0:
    print("num_gpus: ", size, " wires: ", n_wires, " layers ", n_layers, " time: ", qml.numpy.mean(timing))
We measured the performance of a 33-qubit system distributed across many GPUs. The results show that the computation is accelerated by using more GPUs, and that the scaling is almost linear, with no reduction in parallel efficiency even when using 256 GPUs.
Conclusions
With MPI support for lightning.gpu
, it becomes possible to explore and execute large-scale workloads that would otherwise be intractable without multiple nodes and GPUs. Our benchmarks demonstrate that systems with over 30 qubits can be reached through distributed simulations, and that employing more GPUs leads to better performance. With the MPI backend at full feature parity with lightning.gpu
, all simulations that previously ran on a single GPU can now be extended to multiple nodes and GPUs. Activating and optimizing a distributed simulation is as easy as adding a few additional arguments when constructing a lightning.gpu
device.
We encourage you to try out lightning.gpu
on multiple nodes and GPUs and read NVIDIA's report on this work! As always, we warmly welcome suggestions, discussions, collaborations and contributions. Please don’t hesitate to engage in discussions and development in our GitHub repository.
About the authors
Shuli Shu
Performance of Lightning plug-ins
Vincent Michaud-Rioux
I work in the PennyLane performance team where we write, maintain and deliver high-performance quantum simulation backends (the PennyLane-Lightning plugins) that greatly accelerate computations exploiting multi-core CPUs, NVIDIA and AMD GPUs. I also ...
Lee O'Riordan
Physicist, purveyor of angular momentum, GPUs, pointy guitars, and computational things. Working on quantum stuff.