April 25, 2023
PennyLane goes Kokkos: A novel hardware-agnostic parallel backend for quantum simulations
It can be very time-consuming to write performant software on a wide variety of platforms. For high-performance computing (HPC) systems, the porting of applications and libraries from older hardware generations can be fraught with problems: new architectures, new compilers, new paradigms, new use cases, new bugs, and new headaches. Due to these problems, designing, implementing, fixing, and releasing a software application requires more effort and careful planning.
For this reason, we have developed lightning.kokkos, a high-performance and hardware-agnostic simulator device for PennyLane, which comes with built-in support for a variety of CPU- and GPU-based systems.

Performance portability
The notion of “performance portability” is often discussed as a means to develop code that can be repurposed across hardware generations or architectures seamlessly, while at the same time maintaining a useful performance baseline. Many toolkits and standards aim to provide this abstraction, such as SYCL/oneAPI, RAJA, and Kokkos, all of which allow a researcher, scientist, or developer to write their application and target most modern HPC system architectures, without having to provide specific instructions for one hardware type versus another.
Introducing lightning.kokkos
After an evaluation in early 2022, we decided to build a simulator device with portability abstractions, and chose to base it atop Kokkos, as it offered the best supported feature set for our needs (kernels for various linear algebra routines, CPU support with OpenMP for x86_64, aarch64 and ppc64le, and GPU support for CUDA and ROCm/HIP platforms). With this, let us introduce lightning.kokkos, our newest research device for PennyLane. Given the flexibility enabled by the Kokkos layer, we can write our simulator using a backend-agnostic paradigm, and simply “compile-in” the support for a given system.
To date, we have validated lightning.kokkos with a variety of backends:
- Serial (CPU): This backend enables a non-parallelized serial runtime for the simulator kernels. Since PennyLane v0.24.0, we also natively support this runtime environment in lightning.qubit to handle sparse matrix operations.
- OpenMP (CPU): This offers CPU-based multithreaded gate kernels using OpenMP. For workloads involving large numbers of gates and few observables, this offers the best performance on CPU-based systems, and directly supports all OpenMP environment variables for tuning the performance to a given system.
- CUDA (GPU): NVIDIA CUDA GPU devices are supported natively, targeting Pascal-era devices and newer, with support for CUDA 11 with Kokkos v3, and CUDA 12 with Kokkos v4. We aim to enable Kokkos v4 support soon, to allow for the NVIDIA H100 GPUs to also be targeted.
- ROCm/HIP (GPU): The AMD ROCm™️ GPU framework can be targeted natively and shows exceptional device utilization for dense circuit simulations. We have to date validated AMD Instinct™️ MI100 and MI200 series GPUs with ROCm from 4.5 up to 5.4, with both device generations showing exceptional gains compared to the OpenMP CPU backend.
This flexibility to support whichever architectures are available ensures that PennyLane can run optimally on all classical CPU and GPU hardware platforms.
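Because the OpenMP backend honors the standard OpenMP environment variables, thread counts and pinning can be adjusted without touching the simulation code. The following is a minimal sketch, assuming the workload lives in a separate script that is launched as a child process. `OMP_NUM_THREADS`, `OMP_PROC_BIND`, and `OMP_PLACES` are standard OpenMP variables, but the helper names `openmp_env` and `run_with_openmp` are hypothetical and the values shown are illustrative rather than recommended settings.

```python
import os
import subprocess
import sys


def openmp_env(num_threads, proc_bind="spread", places="cores"):
    """Return a copy of the current environment with OpenMP tuning variables set."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(num_threads)  # number of worker threads
    env["OMP_PROC_BIND"] = proc_bind           # thread-to-core binding policy
    env["OMP_PLACES"] = places                 # granularity used for binding
    return env


def run_with_openmp(argv, num_threads, **kwargs):
    """Run a command in a child process that sees the OpenMP settings."""
    return subprocess.run(
        argv,
        env=openmp_env(num_threads, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for launching a simulation script: the child process simply
    # reports the thread count it would hand to the OpenMP runtime.
    child = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
    result = run_with_openmp(child, num_threads=8)
    print(result.stdout.strip())
```

Launching the workload in a child process guarantees that the variables are in place before the OpenMP runtime initializes. Exporting them in the shell before starting Python achieves the same thing.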
The OpenMP-backed CPU device has recently been made available via PyPI, and it can be installed through pip install pennylane-lightning-kokkos. This allows a user to immediately take advantage of well-optimized parallel kernels, including support for the adjoint differentiation method that makes lightning.qubit and lightning.gpu so powerful for trainable quantum circuit workloads.
Performance of lightning.kokkos
As an example, we demonstrate lightning.kokkos running natively on a single AMD Instinct™️ MI250X GPU die, compared to the OpenMP backend and lightning.qubit, both running on an AMD Epyc 7763 CPU. All of the following benchmarks were run using PennyLane v0.29, and they followed the workload described in the blog post Lightning-fast simulations with PennyLane and the NVIDIA cuQuantum SDK.

The performance of the AMD GPU here is significantly better than the OpenMP-backed CPU simulator beyond 18 qubits, which is where we expect the GPUs to begin to shine. In addition, we can also see that the SIMD-focused lightning.qubit is faster than lightning.kokkos up to 20 qubits, after which the OpenMP-backed lightning.kokkos CPU device edges ahead. Given the variety of different types of workloads a user may have in mind, taking the time to compare different backends across devices can be helpful to determine the best recommendation for a given problem.
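One way to run such a comparison is to time the same workload on each candidate backend and keep the best of several repetitions. The sketch below is a generic harness, not the benchmark code behind the results discussed here. `best_time` and `compare` are hypothetical helper names, and the two toy workloads stand in for functions that would execute the same circuit on different devices.

```python
import time


def best_time(workload, repeats=5):
    """Return the shortest wall-clock time (in seconds) over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(workloads, repeats=5):
    """Time each named workload and return (name, seconds) pairs, fastest first."""
    results = {name: best_time(fn, repeats) for name, fn in workloads.items()}
    return sorted(results.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Toy stand-ins: in practice each entry would run the same circuit on a
    # different device.
    workloads = {
        "backend_a": lambda: sum(i * i for i in range(200_000)),
        "backend_b": lambda: sum(i * i for i in range(50_000)),
    }
    for name, seconds in compare(workloads):
        print(f"{name}: {seconds:.6f} s")
```

Taking the minimum over several runs reduces the influence of one-off costs such as warm-up and system noise. Repeating the measurement across a range of qubit counts shows where one backend overtakes another.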
As hybrid quantum workloads look forward to a future in which quantum and classical devices will have become tightly coupled in a heterogeneous execution environment, we find comparisons like these highly important to infer tooling maturity, support and performance.
Installing lightning.kokkos
The OpenMP-backed CPU device is a great candidate to replace lightning.qubit for workloads with high qubit counts and systems with many available CPU cores. It can be installed with:
python -m pip install pennylane-lightning-kokkos
For GPU support, you can build and install directly from source, following these installation guidelines.
For more explicit control of the build options, we suggest installing lightning.kokkos in a dedicated Spack environment. For instance, let's build and install with support for AMD MI200 series GPUs (amdgpu_target=gfx90a). We can create and activate a Spack environment with the required package as follows:
spack env create pll-kokkos-amd
spack env activate pll-kokkos-amd
spack add py-pennylane-lightning-kokkos +rocm amdgpu_target=gfx90a
spack install
Similarly, for NVIDIA A100 series GPU support:
spack env create pll-kokkos-nv
spack env activate pll-kokkos-nv
spack add py-pennylane-lightning-kokkos +cuda cuda_arch=80
spack install
You can simply substitute the Kokkos-supported architecture strings for both AMD and NVIDIA GPUs.
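To make the substitution concrete, the sketch below assembles the same four Spack commands for a chosen backend and architecture string. The helper `spack_commands` is hypothetical and only formats text. The variant names (`+rocm amdgpu_target=...`, `+cuda cuda_arch=...`) are taken from the commands above, while the example values `gfx908` (AMD MI100) and `70` (NVIDIA V100) are assumptions that should be checked against the architectures your Kokkos and Spack versions support.

```python
PACKAGE = "py-pennylane-lightning-kokkos"

# Variant templates copied from the AMD and NVIDIA examples in this post.
VARIANTS = {
    "rocm": "+rocm amdgpu_target={arch}",
    "cuda": "+cuda cuda_arch={arch}",
}


def spack_commands(env_name, backend, arch):
    """Return the Spack commands that build the package for one GPU target."""
    if backend not in VARIANTS:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(VARIANTS)}"
        )
    variant = VARIANTS[backend].format(arch=arch)
    return [
        f"spack env create {env_name}",
        f"spack env activate {env_name}",
        f"spack add {PACKAGE} {variant}",
        "spack install",
    ]


if __name__ == "__main__":
    # Example values are assumptions: gfx908 for AMD MI100, 70 for NVIDIA V100.
    for line in spack_commands("pll-kokkos-mi100", "rocm", "gfx908"):
        print(line)
    for line in spack_commands("pll-kokkos-v100", "cuda", "70"):
        print(line)
```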
Note that, since Spack compiles all dependencies from the ground up, the above may take some time, and it is mostly targeted towards supercomputing systems.
Once installed, we can create a 20-qubit device as dev = qml.device("lightning.kokkos", wires = 20) and use it like any other PennyLane device.
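When choosing the number of wires, it helps to keep in mind that a state-vector simulator stores 2^n complex amplitudes for n qubits, so the memory requirement doubles with every added qubit. The sketch below works through that arithmetic, assuming double-precision complex amplitudes (16 bytes each) and ignoring any workspace the simulator allocates on top. `statevector_bytes` and `human_readable` are hypothetical helpers, not part of PennyLane.

```python
def statevector_bytes(num_qubits, bytes_per_amplitude=16):
    """Memory needed to hold 2**num_qubits amplitudes.

    The default of 16 bytes assumes complex128 (two 8-byte floats).
    """
    if num_qubits < 0:
        raise ValueError("num_qubits must be non-negative")
    return (2 ** num_qubits) * bytes_per_amplitude


def human_readable(num_bytes):
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    value = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if value < 1024 or unit == "TiB":
            return f"{value:.0f} {unit}"
        value /= 1024


if __name__ == "__main__":
    for n in (20, 25, 30):
        print(f"{n} qubits: {human_readable(statevector_bytes(n))}")
```

Under these assumptions, the 20-qubit state vector occupies 16 MiB, while 30 qubits already require 16 GiB, which is why the available host or GPU memory bounds the qubit count in practice.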
As always, you can find all the details in the PennyLane docs. We hope you try out lightning.kokkos soon!
About the authors
Lee O'Riordan
Physicist, purveyor of angular momentum, GPUs, pointy guitars, and computational things. Working on quantum stuff.
Shuli Shu
Performance of Lightning plug-ins
Vincent Michaud-Rioux
I work in the PennyLane performance team, where we write, maintain and deliver high-performance quantum simulation backends (the PennyLane-Lightning plugins) that greatly accelerate computations by exploiting multi-core CPUs and NVIDIA and AMD GPUs. I also ...