PennyLane v0.23 introduced support for a new simulator device, `lightning.gpu`, which offloads quantum gate calls to the NVIDIA cuQuantum SDK. Let’s do a deep dive on why we use GPUs, and how we use them to speed up the simulation of quantum circuits.

## Why are GPUs important for quantum simulation?

GPUs are well known to offer great performance for workloads that depend heavily on linear algebra; in essence, they are floating-point workhorses. For classical machine-learning workloads, especially those in deep learning, GPUs are the go-to choice: training is typically much faster than the equivalent operations on a CPU.

Though GPUs usually have less on-device memory than a CPU node, they have huge on-device memory bandwidth: data transfer rates between the GPU cores and main global memory can be on the order of hundreds of GB per second, and higher still for the more specialized on-device memory. GPUs also tend to support a vast number of concurrent threads. Coupled with such high memory bandwidth, this allows memory-bound operations (vector-vector products, vector scaling, or pretty much anything that performs roughly as many memory reads as compute operations) to run incredibly well on such devices.

Similarly, having grown out of graphics and signal-processing workloads, GPUs have well-optimized implementations of foundational mathematical operations. Applying these to compute-bound problems such as dense matrix-matrix products (with some clever blocking along the way) lets such evaluations finish in a fraction of the time they would take even on a beefy CPU.
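To make the memory-bound versus compute-bound distinction concrete, the rough sketch below counts floating-point operations per byte moved (often called arithmetic intensity) for a vector scaling and a dense matrix-matrix product. The function names are illustrative, and the byte counts assume the idealized case where each operand is read or written exactly once.

```python
def intensity_vector_scale(n, bytes_per_element=8):
    """FLOPs per byte for y = a * x on length-n vectors."""
    flops = n                                 # one multiply per element
    bytes_moved = 2 * n * bytes_per_element   # read x, write y
    return flops / bytes_moved


def intensity_matmul(n, bytes_per_element=8):
    """FLOPs per byte for C = A @ B on n x n matrices (idealized)."""
    flops = 2 * n**3                              # n^3 multiplies and n^3 adds
    bytes_moved = 3 * n * n * bytes_per_element   # read A and B, write C
    return flops / bytes_moved


# Vector scaling stays at the same low intensity however large n gets,
# while the matrix product's intensity grows linearly with n.
for n in (256, 1024, 4096):
    print(n, intensity_vector_scale(n), intensity_matmul(n))
```

Under these assumptions the vector operation is limited by how fast memory can be streamed, whereas the matrix product can keep the floating-point units busy once the data is on the device.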

As quantum mechanics is most often expressed in the language of linear algebra and complex numbers, we can take the above tools and efficiently apply them to problems expressed in a quantum way: in this instance, simulating quantum circuits.
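As a small illustration of the linear algebra involved, here is a pure-Python sketch of applying a single-qubit gate to a state vector. It is not how `lightning.gpu` or cuStateVec are implemented; the function name and the bit-ordering convention (qubit 0 as the most significant bit of the amplitude index) are assumptions made for this example.

```python
import math


def apply_single_qubit_gate(state, gate, target, n_qubits):
    """Apply a 2x2 gate to qubit `target` of an n-qubit state vector, in place.

    Assumption of this sketch: qubit 0 is the most significant bit of the
    amplitude index.
    """
    stride = 1 << (n_qubits - 1 - target)
    for i in range(len(state)):
        if i & stride == 0:      # index i has the target bit set to 0
            j = i | stride       # partner index with the target bit set to 1
            a0, a1 = state[i], state[j]
            state[i] = gate[0][0] * a0 + gate[0][1] * a1
            state[j] = gate[1][0] * a0 + gate[1][1] * a1
    return state


h = 1 / math.sqrt(2)
HADAMARD = [[h, h], [h, -h]]

state = [1 + 0j, 0j, 0j, 0j]     # two qubits in |00>
apply_single_qubit_gate(state, HADAMARD, target=0, n_qubits=2)
print(state)                     # amplitudes of (|00> + |10>) / sqrt(2)
```

Every pair of amplitudes is updated independently of the others, which is the kind of uniform, data-parallel work that maps well onto many GPU threads.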

## Introducing lightning.gpu

`lightning.gpu` currently works by offloading quantum gate calls to the appropriate kernels and functions in the cuQuantum cuStateVec library. As quantum circuit differentiability is an important part of the PennyLane experience, we extend our work directly from the `lightning.qubit` device, which captures and manipulates the associated NumPy data buffer in-place. This enables direct support for interfacing with machine-learning frameworks such as TensorFlow, PyTorch and JAX.

Since the heavy lifting in circuit simulation is applying the gates to the state-vector memory buffer, we begin by copying the data from our NumPy buffer directly onto the GPU device. As the data now exist on the GPU device, we can call all GPU-implemented functions directly to perform the required manipulations. To ensure the best overall performance, any gate kernels not currently supported on the GPU are first generated on the CPU, transferred to the GPU device, and cached on the fly, allowing for later reuse.
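The generate-once, reuse-later idea for unsupported gates can be sketched with a small cache. This is a hypothetical illustration only: `GateCache`, its `get` method, and the `phase_shift` generator are invented for this example, and the "upload" step is represented by simply storing the matrix.

```python
import cmath


class GateCache:
    """Hypothetical cache keyed by gate name and parameter value."""

    def __init__(self):
        self._cache = {}
        self.misses = 0

    def get(self, name, param, generator):
        key = (name, param)
        if key not in self._cache:
            # Cache miss: build the matrix on the host. In a real device this
            # is where the host-to-GPU transfer would happen.
            self.misses += 1
            self._cache[key] = generator(param)
        return self._cache[key]


def phase_shift(phi):
    """2x2 phase-shift matrix diag(1, exp(i * phi))."""
    return [[1, 0], [0, cmath.exp(1j * phi)]]


cache = GateCache()
first = cache.get("PhaseShift", 0.5, phase_shift)
second = cache.get("PhaseShift", 0.5, phase_shift)   # served from the cache
print(cache.misses)
```

The second lookup returns the stored object, so the cost of generating and transferring the matrix is paid only once per distinct gate and parameter.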

Once all operations have been applied to the state vector residing in the GPU data buffer, we can also offload all expectation-value calls so that they are evaluated directly on the GPU device. When the GPU portion of the computation is finished, the state-vector buffer is copied back to the host, where it is once again accessible as a NumPy-compatible array, allowing ease of manipulation for the end-user.
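For intuition about what an expectation-value evaluation involves, the sketch below computes the expectation of a Pauli-Z observable on one qubit directly from the amplitudes. It is a plain-Python illustration under an assumed convention (qubit 0 is the most significant index bit), not the cuStateVec routine.

```python
import math


def expval_pauli_z(state, target, n_qubits):
    """<psi| Z_target |psi> for a normalized n-qubit state vector."""
    stride = 1 << (n_qubits - 1 - target)
    total = 0.0
    for i, amp in enumerate(state):
        # Z contributes +1 when the target bit is 0 and -1 when it is 1.
        sign = -1.0 if i & stride else 1.0
        total += sign * abs(amp) ** 2
    return total


h = 1 / math.sqrt(2)
bell = [h, 0, 0, h]             # (|00> + |11>) / sqrt(2)
one_zero = [0, 0, 1, 0]         # |10>

print(expval_pauli_z(bell, target=0, n_qubits=2))
print(expval_pauli_z(one_zero, target=0, n_qubits=2))
print(expval_pauli_z(one_zero, target=1, n_qubits=2))
```

The result is a single real number per observable, so evaluating it where the state already lives avoids moving the full state vector back to the host just to reduce it.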

## What about gradients?

To ensure we have the best overall experience for differentiable quantum workloads, `lightning.gpu` has direct support for two methods of quantum circuit differentiation: parameter-shift, which works on simulators and quantum hardware alike, and adjoint backpropagation, which, by sacrificing a little extra compute, becomes much more memory efficient than traditional backpropagation implementations, allowing more qubits to be simulated. With GPUs it is often much more efficient to perform extra computation than to incur memory-transfer overheads, making the adjoint backpropagation method a great candidate for differentiation of quantum circuits.
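To see how the two methods relate, here is a toy single-qubit sketch that differentiates the expectation of Z after `RX(t0)` followed by `RY(t1)`, once with the two-term parameter-shift rule and once with an adjoint-style backward sweep. All helper names are made up for this example, and the code only illustrates the general technique; it is not the `lightning.gpu` implementation.

```python
import math

Z = [[1, 0], [0, -1]]


def rx(t):
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[c, -1j * s], [-1j * s, c]]


def d_rx(t):  # elementwise derivative of rx(t) with respect to t
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[-s / 2, -0.5j * c], [-0.5j * c, -s / 2]]


def ry(t):
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[c, -s], [s, c]]


def d_ry(t):  # elementwise derivative of ry(t) with respect to t
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[-s / 2, -c / 2], [c / 2, -s / 2]]


GATES = [(rx, d_rx), (ry, d_ry)]  # toy circuit: RX(t0), then RY(t1), on |0>


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dagger(m):
    return [[complex(m[j][i]).conjugate() for j in range(2)] for i in range(2)]


def inner(a, b):
    return sum(complex(x).conjugate() * y for x, y in zip(a, b))


def run(thetas):
    psi = [1 + 0j, 0j]
    for (gate, _), t in zip(GATES, thetas):
        psi = matvec(gate(t), psi)
    return psi


def expval_z(thetas):
    psi = run(thetas)
    return inner(psi, matvec(Z, psi)).real


def parameter_shift_gradient(thetas, shift=math.pi / 2):
    """Two circuit evaluations per parameter."""
    grads = []
    for k in range(len(thetas)):
        plus, minus = list(thetas), list(thetas)
        plus[k] += shift
        minus[k] -= shift
        grads.append((expval_z(plus) - expval_z(minus)) / (2 * math.sin(shift)))
    return grads


def adjoint_gradient(thetas):
    """One forward pass, then a backward sweep that un-applies each gate."""
    psi = run(thetas)
    lam = matvec(Z, psi)                      # observable applied to final state
    grads = [0.0] * len(thetas)
    for k in reversed(range(len(GATES))):
        gate, d_gate = GATES[k]
        u_dag = dagger(gate(thetas[k]))
        psi = matvec(u_dag, psi)              # state just before gate k
        grads[k] = 2 * inner(lam, matvec(d_gate(thetas[k]), psi)).real
        lam = matvec(u_dag, lam)
    return grads


thetas = [0.4, 0.7]
print(parameter_shift_gradient(thetas))
print(adjoint_gradient(thetas))
```

For this circuit the expectation value is `cos(t0) * cos(t1)`, so both functions should agree with the analytic gradient. The adjoint sweep keeps only two state vectors alive at any time, rather than storing every intermediate state, which is the memory saving described above.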

## Benchmarks

As a demonstration, we take a sample quantum circuit Jacobian evaluation and compare the run-times for `lightning.qubit` and `lightning.gpu`, using adjoint backpropagation in both cases, but allowing the `lightning.qubit` threading support to take full advantage of the available CPU.
The circuit below evaluates the Jacobian of a strongly entangling layered circuit, and was run on an NVIDIA DGX A100, comparing an A100 80GB GPU to an AMD Epyc 7742 CPU:

```
import pennylane as qml
from timeit import default_timer as timer

# To set the number of threads used when executing this script,
# export the OMP_NUM_THREADS environment variable.

# Choose number of qubits (wires) and circuit layers
wires = 20
layers = 3

# Set number of runs for timing averaging
num_runs = 5

# Instantiate CPU (lightning.qubit) or GPU (lightning.gpu) device
dev = qml.device('lightning.gpu', wires=wires)


# Create QNode of device and circuit
@qml.qnode(dev, diff_method="adjoint")
def circuit(parameters):
    qml.StronglyEntanglingLayers(weights=parameters, wires=range(wires))
    return [qml.expval(qml.PauliZ(i)) for i in range(wires)]


# Set trainable parameters for calculating circuit Jacobian
shape = qml.StronglyEntanglingLayers.shape(n_layers=layers, n_wires=wires)
weights = qml.numpy.random.random(size=shape)

# Run, calculate the quantum circuit Jacobian and average the timing results
timing = []
for t in range(num_runs):
    start = timer()
    jac = qml.jacobian(circuit)(weights)
    end = timer()
    timing.append(end - start)

print(qml.numpy.mean(timing))
```

For short circuits with small numbers of qubits, the overheads of running the GPU device (initialization, memory allocations, initial copies) tend to dominate, but for simulations in the region of 20 qubits and above the GPU device shines! The above figure shows an ever-widening gap for Jacobian evaluations, with over an order-of-magnitude reduction in run time for large simulations relative to both single- and multi-threaded CPU-based simulation. We also expect to see the same behavior for deep quantum circuits, where the initial setup overhead is small compared to the number of gates required for the simulation.
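One way to see why the balance tips as the qubit count grows is to look at the size of the state vector, which doubles with every added qubit while the fixed setup costs do not. The helper below is an illustrative back-of-the-envelope calculation, assuming double-precision complex amplitudes (16 bytes each); its name is not part of any library.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory needed for a dense state vector of 2**n_qubits amplitudes."""
    return (2 ** n_qubits) * bytes_per_amplitude


for n in (10, 20, 30, 32):
    size = statevector_bytes(n)
    print(f"{n} qubits: {size / 2**20:.3f} MiB ({size / 2**30:.6f} GiB)")
```

At 10 qubits the whole state is only 16 KiB, so there is very little work per gate to amortize the overheads against; at 20 qubits it is already 16 MiB, and at 30 qubits it reaches 16 GiB.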

## What’s next?

Our implementation recently added support for finite shots on the circuit forward pass, allowing direct use of `lightning.gpu` with sampling workloads, as well as with the parameter-shift gradient rule, mimicking gradient calculations as they would be handled on actual quantum hardware.
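To illustrate what a finite-shot evaluation means, the sketch below draws measurement outcomes from a state vector's probabilities and estimates a Pauli-Z expectation value from them. The function names, the fixed seed, and the bit-ordering convention (qubit 0 as the most significant bit) are assumptions for this example, which runs in plain Python rather than on the device.

```python
import math
import random


def sample_basis_states(state, shots, rng):
    """Draw `shots` computational-basis outcomes with probability |amplitude|^2."""
    probs = [abs(a) ** 2 for a in state]
    return rng.choices(range(len(state)), weights=probs, k=shots)


def estimate_expval_z(samples, target, n_qubits):
    """Average the +1 / -1 eigenvalues of Z on `target` over the samples."""
    stride = 1 << (n_qubits - 1 - target)
    return sum(-1 if s & stride else 1 for s in samples) / len(samples)


theta = 0.6
state = [math.cos(theta / 2), -1j * math.sin(theta / 2)]   # RX(theta)|0>

rng = random.Random(0)                                     # seeded for repeatability
samples = sample_basis_states(state, shots=10_000, rng=rng)
estimate = estimate_expval_z(samples, target=0, n_qubits=1)
print(estimate, math.cos(theta))                           # estimate vs exact value
```

The estimate fluctuates around the exact value with a statistical error that shrinks roughly as one over the square root of the number of shots, which is also what a parameter-shift gradient built from such estimates would inherit.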

Given the inherent flexibility of the adjoint backpropagation method for data-parallelism over observables, we have also recently added direct support to batch the available computation over multiple GPU devices. This allows gradient workloads with large numbers of expectation value evaluations to make the best use of all available GPU hardware, and will be featured in our upcoming `v0.25.0` release.
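The data-parallel idea can be sketched as splitting the list of observables into near-equal batches, one per device, with each batch then differentiated independently. The helper below is hypothetical and only shows the partitioning step; how `lightning.gpu` actually schedules work across GPUs is not described here.

```python
def split_observables(observables, n_devices):
    """Partition observables into n_devices contiguous, near-equal batches."""
    base, extra = divmod(len(observables), n_devices)
    batches, start = [], 0
    for device in range(n_devices):
        size = base + (1 if device < extra else 0)   # spread the remainder
        batches.append(observables[start:start + size])
        start += size
    return batches


# Example: 20 single-qubit PauliZ observables, labeled by wire, over 8 devices.
observables = [f"PauliZ({wire})" for wire in range(20)]
batches = split_observables(observables, n_devices=8)
for device, batch in enumerate(batches):
    print(device, batch)
```

Because each observable's gradient contribution is independent of the others, the per-device results can simply be concatenated in order to recover the full Jacobian.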

Finally, scaling out workloads is a big area of interest, and with `lightning.gpu` we anticipate running huge workloads that would otherwise be intractable without the availability of such hardware. Stay tuned for more!