September 12, 2023
Distributing quantum simulations using lightning.gpu with NVIDIA cuQuantum
As part of the PennyLane v0.31 release, we recently introduced support for distributed state vector simulations for the lightning.gpu
simulator device, which offloads quantum gate calls to cuStateVec (v1.3.0 and later), and other distributed operations to the Message Passing Interface (MPI).
The distributed lightning.gpu
backend is at feature parity with the single-GPU backend, allowing users' workloads to be scaled up seamlessly. By adding a few additional arguments when constructing a lightning.gpu
device, a distributed simulation can be launched and optimized easily, leading to better performance. We believe this new feature will be a valuable complement to your research and development.
This work was completed in partnership with NVIDIA and also described in an NVIDIA blog post.
Background
Simulating quantum systems is a notoriously resource-heavy task. For many quantum circuit workloads, state vector simulation remains the method of choice, especially when dealing with deep circuits, workloads requiring many samples, or cases where direct access to the state is important.
However, state vector simulators have one well-known weakness — the exponential growth of memory requirements with the number of qubits being simulated. On a desktop computer, we can often reach at most around 30 qubits, and adding just a few more can easily exceed the memory capacity of a single high-performance computing (HPC) node. Memory management is even more important for graphics processing units (GPUs). Given the performance advantage GPUs provide for simulation, we can very likely benefit from using multiple GPUs in tandem to store and operate on a single state vector.
Scaling state vector simulations
In a state vector simulator, a complex coefficient (amplitude) is stored for each computational basis state, and is updated by every operation in a given quantum circuit that acts on that state.
When performing double-precision calculations, the memory required to store the state vector of an n-qubit system is 2^(n+4) bytes (each of the 2^n complex amplitudes occupies 16 bytes). This exponential memory demand becomes a concern, particularly considering that GPU device memory is limited. Typically, a single GPU offers up to 96 gigabytes of memory. Consequently, the available device memory can only accommodate the state vector storage for 30–32-qubit systems, not even considering the additional memory needed for storing gates, observables, and intermediate results. To simulate larger systems, it becomes necessary to distribute the state vector over multiple GPUs, all working together to simulate the same system.
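As a quick back-of-the-envelope check (not part of the original benchmark code), the following snippet tabulates this growth in plain Python:

# Memory footprint of a double-precision state vector: each of the 2**n
# amplitudes is a complex128 value occupying 16 = 2**4 bytes, so the
# total is 2**(n + 4) bytes.
for n in (30, 32, 34, 36):
    num_bytes = 2 ** (n + 4)
    print(f"{n} qubits: {num_bytes / 2**30:.0f} GiB")

# 30 qubits: 16 GiB
# 32 qubits: 64 GiB
# 34 qubits: 256 GiB
# 36 qubits: 1024 GiB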
Introducing lightning.gpu with MPI support
As of PennyLane v0.31, lightning.gpu
enables distributed state vector simulation using a CUDA-aware Message Passing Interface (MPI) and the NVIDIA cuQuantum software development kit (SDK).
Enabling distributed state vector simulation support is straightforward: just include the mpi=True
argument when creating a lightning.gpu
device.
dev = qml.device("lightning.gpu", wires=wires, mpi=True)
With this, lightning.gpu
will automatically distribute and initialize the complete state vector across multiple GPUs, be they local or remote. Each GPU is assigned a sub-state vector, along with a portion of device memory set aside as a data buffer to assist with the MPI communication required for quantum gate calls.
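For intuition, the local memory required per GPU shrinks in proportion to the number of GPUs used. The sketch below is illustrative only (it ignores the MPI data buffer and other overheads):

# Local state vector memory per GPU when a 2**n-amplitude state vector is
# split evenly across num_gpus devices (double precision, buffer excluded).
def local_state_bytes(n_qubits, num_gpus):
    return 2 ** (n_qubits + 4) // num_gpus

print(local_state_bytes(33, 16) / 2**30, "GiB per GPU")  # 8.0 GiB per GPU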
For performance tuning purposes, users have the option to specify the data buffer size by using the mpi_buf_size=n
parameter during the construction of the lightning.gpu
device, with n
expressed in mebibytes (MiB) of device memory for each GPU.
dev = qml.device("lightning.gpu", wires=wires, mpi=True, mpi_buf_size=n)
The full set of features supported by lightning.gpu
has been added to the distributed state vector backend, for both the forward and backward pass.
We added support for finite shots on the circuit forward pass, allowing direct use of lightning.gpu
with sampling workloads. Samples can also be generated from the state vector and returned as a NumPy-compatible array on the host machine for each MPI process. Probabilities can also be evaluated directly on the distributed state vector, with each MPI process returning the results for its local sub-state vector.
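As a minimal sketch (not taken from the original post), a finite-shot sampling workload on a distributed lightning.gpu device might look like the following; the wire count, shot count, and circuit are illustrative, and the script is assumed to be launched with one MPI process per GPU:

# Sketch: finite-shot sampling with a distributed lightning.gpu device.
from mpi4py import MPI
import pennylane as qml

comm = MPI.COMM_WORLD

n_wires = 30
dev = qml.device("lightning.gpu", wires=n_wires, mpi=True, shots=1000)

@qml.qnode(dev)
def circuit():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.sample(qml.PauliZ(0))

samples = circuit()  # NumPy-compatible array available on each MPI process
if comm.Get_rank() == 0:
    print(samples.shape)  # (1000,)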
Both the parameter-shift and adjoint differentiation methods are supported. Since memory capacity is a major concern for distributed state vector calculations, users can also choose how GPU memory is utilized when dealing with multiple observables. Adding the batch_obs=True
argument when constructing a distributed lightning.gpu
device reduces the device memory required by adjoint differentiation, allowing larger quantum systems to be simulated, though possibly at the expense of execution time.
dev = qml.device("lightning.gpu", wires=wires, mpi=True, batch_obs=True)
Note that the mpi4py
package, which provides Python bindings for MPI and is commonly used for building distributed Python APIs, is required to initialize and run distributed state vector simulations with a lightning.gpu
device, even though the lightning.gpu
device itself and its backend do not depend on mpi4py
. All MPI communication operations inside the lightning.gpu
device interface are offloaded to its C++ backend. There, the communication of quantum gate calls and associated data across multiple nodes and GPUs is handled by the appropriate kernels and functions of the cuStateVec library, while data communication for other calls, such as quantum circuit differentiation, sampling, and data initialization, is coordinated through direct MPI function calls.
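To illustrate this division of labour, here is a minimal sketch of the expected usage pattern (the wire count is illustrative): importing mpi4py initializes MPI and handles rank bookkeeping on the Python side, while all state vector communication happens inside the device's C++ backend.

# Sketch: mpi4py initializes MPI; lightning.gpu's C++ backend does the rest.
from mpi4py import MPI          # import before constructing the device
import pennylane as qml

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # one MPI process per GPU
size = comm.Get_size()

dev = qml.device("lightning.gpu", wires=30, mpi=True)

if rank == 0:
    print(f"State vector distributed over {size} GPUs")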
Installing lightning.gpu with MPI
Use of the PennyLane-Lightning-GPU
plugin with multi-node/multi-GPU support also requires the installation of the NVIDIA cuQuantum SDK (currently supported cuQuantum version: cuquantum-cu11), mpi4py
and CUDA-aware MPI.
CUDA-aware MPI allows data exchange between the GPU memory spaces of different nodes without the need for CPU-mediated transfers. Both the MPICH
and OpenMPI
libraries are supported, provided they are compiled with CUDA support.
It is recommended to install NVIDIA cuQuantum and the mpi4py
Python package within the Python environment site-packages
directory using pip
or conda
. Please see the NVIDIA cuQuantum SDK, mpi4py, MPICH, or OpenMPI install guides for more information.
The following procedure can be used to build a wheel with multi-node/multi-GPU support from the package sources using the direct SDK path:
cmake -BBuild -DPLLGPU_ENABLE_MPI=on -DCUQUANTUM_SDK=<path to sdk>
cmake --build ./Build --verbose
python -m pip install wheel
python setup.py build_ext --define="PLLGPU_ENABLE_MPI=ON" --cuquantum=<path to sdk>
python setup.py bdist_wheel
The built wheel can now be installed (see more details in the PennyLane-Lightning-GPU documentation).
python -m pip install ./dist/PennyLane_Lightning_GPU-*.whl
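As a hypothetical smoke test (not from the original post), one might run a small circuit across two GPUs to confirm the distributed build works; the launcher command, wire count, and circuit below are illustrative, e.g. launched as mpirun -np 2 python smoke_test.py with one process per GPU:

# Sketch: verify the distributed build with a Bell-state expectation value.
from mpi4py import MPI
import pennylane as qml

dev = qml.device("lightning.gpu", wires=20, mpi=True)

@qml.qnode(dev)
def bell():
    qml.Hadamard(wires=0)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

if MPI.COMM_WORLD.Get_rank() == 0:
    print(bell())  # expected: 1.0 for the Bell state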
Demonstration of large-scale simulation
To demonstrate that larger systems can be simulated with lightning.gpu
with MPI support, and as a multi-node extension of the example from our original PennyLane Lightning-GPU blog post, we first evaluate the Jacobian of a strongly entangling layered (SEL) circuit using adjoint differentiation. The simulations were run on four-GPU nodes of the Perlmutter (NERSC-9) supercomputer, each equipped with 4 NVIDIA A100 40GB Tensor Core GPUs.
from mpi4py import MPI
import pennylane as qml
from pennylane import numpy as np
from timeit import default_timer as timer

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Set number of runs for timing averaging
num_runs = 3

# Choose number of qubits (wires) and circuit layers
n_wires = 32
n_layers = 2

# Instantiate CPU (lightning.qubit) or GPU (lightning.gpu) device
# mpi=True to switch on distributed simulation
# batch_obs=True to reduce the device memory demand for adjoint backpropagation
dev = qml.device('lightning.gpu', wires=n_wires, mpi=True, batch_obs=True)

# Create QNode of device and circuit
@qml.qnode(dev, diff_method="adjoint")
def circuit_adj(weights):
    qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires)))
    return qml.math.hstack([qml.expval(qml.PauliZ(i)) for i in range(n_wires)])

# Set trainable parameters for calculating circuit Jacobian at the rank=0 process
if rank == 0:
    params = np.random.random(qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires))
else:
    params = None

# Broadcast the trainable parameters across MPI processes from rank=0 process
params = comm.bcast(params, root=0)

# Run, calculate the quantum circuit Jacobian and average the timing results
timing = []
for t in range(num_runs):
    start = timer()
    jac = qml.jacobian(circuit_adj)(params)
    end = timer()
    timing.append(end - start)

# MPI barrier to ensure all calculations are done
comm.Barrier()

if rank == 0:
    print("num_gpus: ", size, " wires: ", n_wires, " layers ", n_layers, " time: ", qml.numpy.mean(timing))
Using 4 GPU nodes (16 NVIDIA A100 40GB GPUs), this simulation can easily be scaled up to 32 qubits. For smaller circuits with fewer qubits, the overhead of MPI communication across GPUs and nodes tends to dominate, but for simulations in the region of 28 qubits and above, the computation scales almost linearly!
Next, we present probability evaluation with the distributed state vector simulator. These simulations were run on up to 64 GPU nodes of the Perlmutter (NERSC-9) supercomputer.
from mpi4py import MPI
import pennylane as qml
import numpy as np
from timeit import default_timer as timer

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Set number of runs for timing averaging
num_runs = 3

# Choose number of qubits (wires) and circuit layers
n_wires = 33
n_layers = 2

# Instantiate CPU (lightning.qubit) or GPU (lightning.gpu) device.
# mpi=True to switch on distributed simulation
dev = qml.device('lightning.gpu', wires=n_wires, mpi=True)

# Set target wires for probability calculation
prob_wires = range(n_wires)

# Create QNode of device and circuit
@qml.qnode(dev)
def circuit(weights):
    qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires)))
    return qml.probs(wires=prob_wires)

# Set trainable parameters at the rank=0 process
if rank == 0:
    params = np.random.random(qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires))
else:
    params = None

# Broadcast the trainable parameters across MPI processes from rank=0 process
params = comm.bcast(params, root=0)

# Run, calculate the probabilities and average the timing results
timing = []
for t in range(num_runs):
    start = timer()
    local_probs = circuit(params)
    end = timer()
    timing.append(end - start)

# MPI barrier to ensure all calculations are done
comm.Barrier()

if rank == 0:
    print("num_gpus: ", size, " wires: ", n_wires, " layers ", n_layers, " time: ", qml.numpy.mean(timing))
We measured the performance of a 33-qubit system distributed across many GPUs. The results show that the computation is accelerated by using more GPUs, and that the scaling is almost linear, with no reduction in parallel efficiency even when using 256 GPUs.
Conclusions
With MPI support for lightning.gpu
, it becomes possible to explore and execute large-scale workloads that would otherwise be intractable without multiple nodes and GPUs. Our benchmarks demonstrate that systems with over 30 qubits can be reached through distributed simulations, and that employing more GPUs leads to better performance. With the MPI backend at full feature parity with lightning.gpu
, all simulations that previously ran on a single GPU can now be extended to multiple nodes and GPUs. Activating and optimizing a distributed simulation is as easy as adding a few additional arguments when constructing a lightning.gpu
device.
We encourage you to try out lightning.gpu
on multiple nodes and GPUs and read NVIDIA's report on this work! As always, we warmly welcome suggestions, discussions, collaborations and contributions. Please don’t hesitate to engage in discussions and development in our GitHub repository.
About the authors
Shuli Shu
Performance of Lightning plug-ins
Vincent Michaud-Rioux
I work in the PennyLane performance team where we write, maintain and deliver high-performance quantum simulation backends (the PennyLane-Lightning plugins) that greatly accelerate computations exploiting multi-core CPUs, NVIDIA and AMD GPUs. I also ...
Lee O'Riordan
Physicist, purveyor of angular momentum, GPUs, pointy guitars, and computational things. Working on quantum stuff.