# Loop Boundary Optimization
## Loop boundary commutation
A special property of quantum circuits is that every gate has an inverse (adjoint) operation. This lets us define a loop boundary commutation pass that transforms a program
```python
for i in range(n):
    qml.G(..., wires=0)
    qml.F(..., wires=0)
```
into a new program where the gate `G` has been moved (commuted) from one end of the loop body to the other:
```python
qml.G(..., wires=0)
for i in range(n):
    qml.F(..., wires=0)
    qml.G(..., wires=0)
adjoint(qml.G(..., wires=0))
```
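As a quick sanity check (for illustration only, not part of the pass itself), we can verify this identity numerically with NumPy, using the Hadamard and T matrices as concrete stand-ins for `G` and `F`:

```python
import numpy as np

# Concrete single-qubit unitaries standing in for G and F.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard (plays G)
T = np.diag([1, np.exp(1j * np.pi / 4)])       # T gate (plays F)

n = 5
# Original program: apply G then F, n times (later gates multiply on the left).
original = np.linalg.matrix_power(T @ H, n)

# Commuted program: G up front, loop body (F then G), adjoint(G) at the end.
commuted = H.conj().T @ np.linalg.matrix_power(H @ T, n) @ H

assert np.allclose(original, commuted)
```

The extra `adjoint(G)` at the end is exactly what makes the telescoping work: the trailing `G` of each iteration cancels against the leading `G` of the next, and the final surplus `G` is undone explicitly.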
## Loop boundary optimization
Loop boundary commutation can be used in conjunction with pattern-based optimization passes. We refer to this combined action as loop boundary optimization (LBO). Take for example the following quantum program.
```python
def q_program1():
    for i in range(3):
        qml.Hadamard(0)
        qml.T(0)
        qml.Hadamard(0)
```
Defining `qml.G = qml.Hadamard` and `qml.F = qml.Hadamard ∘ qml.T`, and performing loop boundary commutation, we obtain:
```python
def q_program1_commuted():
    qml.Hadamard(0)
    for i in range(3):
        qml.T(0)
        qml.Hadamard(0)
        qml.Hadamard(0)
    adjoint(qml.Hadamard(0))
```
Using the fact that the Hadamard gate is self-adjoint, we can cancel the inverses to obtain:
```python
def q_program1_optimized():
    qml.Hadamard(0)
    for i in range(3):
        qml.T(0)
    qml.Hadamard(0)
```
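We can confirm numerically that the optimized program implements the same unitary by multiplying out the gate matrices with NumPy (again a check for illustration, not something the compiler does):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])

# q_program1: the body (H; T; H) repeated three times -> matrix (H T H)^3
original = np.linalg.matrix_power(H @ T @ H, 3)

# q_program1_optimized: H; (T three times); H -> matrix H T^3 H
optimized = H @ np.linalg.matrix_power(T, 3) @ H

assert np.allclose(original, optimized)
```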
Note that in this example we are only concerned with cancelling self-inverse gates at the loop boundaries. In particular, further simplification of the remaining T gates is not considered here.
We can also have several consecutive cancellations at the loop boundaries.
```python
def q_program2(angles):
    for i in range(3):
        qml.PauliY(0)
        qml.Hadamard(0)
        qml.CNOT((0, 1))
        qml.RZ(angles[i], 0)
        qml.CNOT((0, 1))
        qml.Hadamard(0)
        qml.PauliY(0)
```
We obtain the following optimized version after performing loop boundary optimization.
```python
def q_program2_optimized(angles):
    qml.PauliY(0)
    qml.Hadamard(0)
    qml.CNOT((0, 1))
    for i in range(3):
        qml.RZ(angles[i], 0)
    qml.CNOT((0, 1))
    qml.Hadamard(0)
    qml.PauliY(0)
```
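A NumPy sketch can again confirm the equivalence on two qubits. Here `on_q0` and `matmul_seq` are small helpers introduced only for this check (they are not PennyLane APIs):

```python
import numpy as np

I2 = np.eye(2)
Y = np.array([[0, -1j], [1j, 0]])
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])

def on_q0(U):
    # Single-qubit gate U acting on qubit 0 of a 2-qubit register.
    return np.kron(U, I2)

def RZ(theta):
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

def matmul_seq(ops):
    # ops listed in application order; later ops multiply on the left
    U = np.eye(4)
    for op in ops:
        U = op @ U
    return U

angles = [0.3, -0.7, 1.1]

# q_program2: loop body Y, H, CNOT, RZ(angles[i]), CNOT, H, Y
original = np.eye(4)
for a in angles:
    body = matmul_seq([on_q0(Y), on_q0(H), CNOT, on_q0(RZ(a)),
                       CNOT, on_q0(H), on_q0(Y)])
    original = body @ original

# q_program2_optimized: prefix Y, H, CNOT; loop of RZ only; suffix CNOT, H, Y
optimized = matmul_seq(
    [on_q0(Y), on_q0(H), CNOT]
    + [on_q0(RZ(a)) for a in angles]
    + [CNOT, on_q0(H), on_q0(Y)]
)

assert np.allclose(original, optimized)
```

The cancellation works because the suffix of each iteration (`CNOT; H; Y`) is exactly the inverse of the prefix of the next (`Y; H; CNOT`), and all three gates are self-inverse.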
Again, further optimizations, such as summarizing the loop as a single `RZ(sum(angles))`, or using the fact that an `RZ` acting on the control qubit commutes through a `CNOT`, are not part of this simple example.
## Combination with merge-rotations and cancel-inverses
Passes like `catalyst.passes.merge_rotations` or `catalyst.passes.cancel_inverses` make use of loop boundary optimization.
Consider, for example, the following circuit with `qml.RX` gates at the loop boundaries. We can use `catalyst.passes.merge_rotations`, which performs loop boundary optimization, to simplify the program. We output the compiled circuit in the form of its MLIR representation.
```python
from catalyst import *
import pennylane as qml

# autograph allows us to use Python loops and still compile them
# keep_intermediate is required to inspect intermediate compilation results
@qjit(autograph=True, keep_intermediate=True)
@passes.merge_rotations
@qml.qnode(qml.device("null.qubit", wires=1))
def circuit(n_iter: int):
    for _ in range(n_iter):
        qml.RX(0.1, wires=0)
        qml.T(0)
        qml.RX(0.2, wires=0)
    return qml.probs()

print("initial IR:\n", circuit.mlir, "\n---\n")
print("optimized IR:\n", debug.get_compilation_stage(circuit, "EnforceRuntimeInvariantsPass"), "\n---\n")
```
```
initial IR:
module @circuit_merge_rotations {
  func.func public @jit_circuit_merge_rotations(%arg0: tensor<i64>) -> tensor<2xf64> attributes {llvm.emit_c_interface} {
    %0 = catalyst.launch_kernel @module_circuit_merge_rotations::@circuit_merge_rotations(%arg0) : (tensor<i64>) -> tensor<2xf64>
    return %0 : tensor<2xf64>
  }
  module @module_circuit_merge_rotations {
    module attributes {transform.with_named_sequence} {
      transform.named_sequence @__transform_main(%arg0: !transform.op<"builtin.module">) {
        %0 = transform.apply_registered_pass "merge-rotations" to %arg0 : (!transform.op<"builtin.module">) -> !transform.op<"builtin.module">
        transform.yield
      }
    }
    func.func public @circuit_merge_rotations(%arg0: tensor<i64>) -> tensor<2xf64> attributes {diff_method = "parameter-shift", llvm.linkage = #llvm.linkage<internal>, qnode} {
      %c1 = arith.constant 1 : index
      %c0 = arith.constant 0 : index
      %c0_i64 = arith.constant 0 : i64
      quantum.device shots(%c0_i64) ["/Users/davidi/work/catalyst/frontend/catalyst/utils/../../../runtime/build/lib/librtd_null_qubit.dylib", "NullQubit", "{'shots': 0}"]
      %0 = quantum.alloc( 1) : !quantum.reg
      %extracted = tensor.extract %arg0[] : tensor<i64>
      %1 = arith.index_cast %extracted : i64 to index
      %2 = scf.for %arg1 = %c0 to %1 step %c1 iter_args(%arg2 = %0) -> (!quantum.reg) {
        %6 = quantum.extract %arg2[ 0] : !quantum.reg -> !quantum.bit
        %out_qubits = quantum.static_custom "RX" [1.000000e-01] %6 : !quantum.bit
        %out_qubits_0 = quantum.custom "T"() %out_qubits : !quantum.bit
        %out_qubits_1 = quantum.static_custom "RX" [2.000000e-01] %out_qubits_0 : !quantum.bit
        %7 = quantum.insert %arg2[ 0], %out_qubits_1 : !quantum.reg, !quantum.bit
        scf.yield %7 : !quantum.reg
      }
      %3 = quantum.extract %2[ 0] : !quantum.reg -> !quantum.bit
      %4 = quantum.compbasis %3 : !quantum.obs
      %5 = quantum.probs %4 : tensor<2xf64>
      quantum.dealloc %2 : !quantum.reg
      quantum.device_release
      return %5 : tensor<2xf64>
    }
  }
  func.func @setup() {
    quantum.init
    return
  }
  func.func @teardown() {
    quantum.finalize
    return
  }
}
---

optimized IR:
module @circuit_merge_rotations {
  func.func public @jit_circuit_merge_rotations(%arg0: tensor<i64>) -> tensor<2xf64> attributes {llvm.emit_c_interface} {
    %0 = call @circuit_merge_rotations_0(%arg0) : (tensor<i64>) -> tensor<2xf64>
    return %0 : tensor<2xf64>
  }
  func.func public @circuit_merge_rotations_0(%arg0: tensor<i64>) -> tensor<2xf64> attributes {diff_method = "parameter-shift", llvm.linkage = #llvm.linkage<internal>, qnode} {
    %c1 = arith.constant 1 : index
    %c0 = arith.constant 0 : index
    %c0_i64 = arith.constant 0 : i64
    quantum.device shots(%c0_i64) ["/Users/davidi/work/catalyst/frontend/catalyst/utils/../../../runtime/build/lib/librtd_null_qubit.dylib", "NullQubit", "{'shots': 0}"]
    %0 = quantum.alloc( 1) : !quantum.reg
    %extracted = tensor.extract %arg0[] : tensor<i64>
    %1 = arith.index_cast %extracted : i64 to index
    %2 = quantum.extract %0[ 0] : !quantum.reg -> !quantum.bit
    %out_qubits = quantum.custom "RX"() %2 : !quantum.bit
    %3 = quantum.insert %0[ 0], %out_qubits : !quantum.reg, !quantum.bit
    %4 = scf.for %arg1 = %c0 to %1 step %c1 iter_args(%arg2 = %3) -> (!quantum.reg) {
      %10 = quantum.extract %arg2[ 0] : !quantum.reg -> !quantum.bit
      %out_qubits_1 = quantum.custom "T"() %10 : !quantum.bit
      %11 = quantum.insert %arg2[ 0], %out_qubits_1 : !quantum.reg, !quantum.bit
      scf.yield %11 : !quantum.reg
    }
    %5 = quantum.extract %4[ 0] : !quantum.reg -> !quantum.bit
    %out_qubits_0 = quantum.custom "RX"() %5 : !quantum.bit
    %6 = quantum.insert %4[ 0], %out_qubits_0 : !quantum.reg, !quantum.bit
    %7 = quantum.extract %6[ 0] : !quantum.reg -> !quantum.bit
    %8 = quantum.compbasis %7 : !quantum.obs
    %9 = quantum.probs %8 : tensor<2xf64>
    quantum.dealloc %6 : !quantum.reg
    quantum.device_release
    return %9 : tensor<2xf64>
  }
  func.func @setup() {
    quantum.init
    return
  }
  func.func @teardown() {
    quantum.finalize
    return
  }
}
---
```
A close examination of the MLIR result reveals the following optimized program structure:
```python
def circuit_optimized(n_iter: int):
    qml.RX(0.1, wires=0)
    for i in range(n_iter):
        qml.T(0)
        qml.RX(0.3, wires=0)
    qml.RX(-0.1, wires=0)
```
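We can check this reconstructed structure numerically with NumPy: the boundary rotations telescope, so the two programs implement the same unitary for any iteration count (a check for illustration, using the standard RX and T matrices):

```python
import numpy as np

def RX(theta):
    # Standard single-qubit X-rotation matrix.
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

T = np.diag([1, np.exp(1j * np.pi / 4)])

n = 4
# original loop body: RX(0.1); T; RX(0.2)  (later gates multiply on the left)
original = np.linalg.matrix_power(RX(0.2) @ T @ RX(0.1), n)

# after LBO + merge-rotations: RX(0.1); loop of (T; RX(0.3)); RX(-0.1)
optimized = RX(-0.1) @ np.linalg.matrix_power(RX(0.3) @ T, n) @ RX(0.1)

assert np.allclose(original, optimized)
```

The merged `RX(0.3)` comes from `RX(0.2)` at the end of one iteration meeting the commuted `RX(0.1)` at its start, since `RX(a) RX(b) = RX(a + b)`.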
## Unrolling loops
One might wonder whether loop boundary optimization is simply loop unrolling in disguise.
Take our initial example:
```python
def q_program1():
    for i in range(3):
        qml.Hadamard(0)
        qml.T(0)
        qml.Hadamard(0)
```
Indeed, one may obtain the same optimized program by unrolling the loop to
```python
[qml.Hadamard(0), qml.T(0), qml.Hadamard(0),
 qml.Hadamard(0), qml.T(0), qml.Hadamard(0),
 qml.Hadamard(0), qml.T(0), qml.Hadamard(0)]
```
and then using the property that consecutive Hadamard gates cancel. This, however, is not what happens in LBO. In particular, LBO removes these redundant gate sequences without actually unrolling the loop. While it is true that for many examples like these we could obtain a similarly optimized circuit by unrolling the loop first and then applying optimization patterns, there are some important differences:
- Unrolling loops can drastically increase the program size, making compilation far more computationally expensive. At large scales, this can translate to spending years rather than minutes on compilation tasks [1].
- Even temporary unrolling and "re-rolling" of a loop is a non-trivial task, as raising abstractions (like picking out a looping pattern from arbitrary code) is always more difficult than lowering them.
- Loop unrolling becomes harder and less effective when loop bounds are not statically known. If the number of iterations is unknown at compile time, it is impossible to generate a perfectly flat sequence of instructions. Instead, compilers may unroll loops by a small factor (e.g., 2, 4, or 8), which is not only significantly less effective than LBO, but also introduces additional conditionals to handle cases where the iteration count is not a perfect multiple of that factor.
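To illustrate the last point, here is a hypothetical sketch, in plain Python rather than compiler output, of unrolling by a factor of 4 when the trip count `n` is only known at runtime; `apply_gate` stands in for one iteration's work:

```python
def run(n, apply_gate):
    """Apply apply_gate(i) for i in 0..n-1, unrolled by a factor of 4."""
    i = 0
    # unrolled main loop: four iterations per pass, fewer branch checks
    while i + 4 <= n:
        apply_gate(i)
        apply_gate(i + 1)
        apply_gate(i + 2)
        apply_gate(i + 3)
        i += 4
    # remainder loop: handles trip counts that are not a multiple of 4
    while i < n:
        apply_gate(i)
        i += 1

calls = []
run(7, calls.append)
assert calls == list(range(7))
```

The remainder loop is exactly the extra control flow that partial unrolling introduces; LBO avoids it entirely, since it never changes the loop's trip count.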