# Loop Boundary Optimization
## Loop boundary commutation
A special property of quantum circuits is that every gate has an inverse (adjoint) operation. This lets us define a loop boundary commutation pass that transforms a program
```python
for i in range(n):
    qml.G(..., wires=0)
    qml.F(..., wires=0)
```
into a new program where the gate `G` has been moved (commuted) from one end of the loop body to the other:
```python
qml.G(..., wires=0)
for i in range(n):
    qml.F(..., wires=0)
    qml.G(..., wires=0)
adjoint(qml.G(..., wires=0))
```
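As a quick sanity check (for illustration only, not part of the pass itself), we can verify this identity numerically with NumPy, using the Hadamard and T matrices as concrete stand-ins for `G` and `F`:

```python
import numpy as np

# Concrete single-qubit unitaries standing in for G and F.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard (plays G)
T = np.diag([1, np.exp(1j * np.pi / 4)])       # T gate (plays F)

n = 5
# Original program: apply G then F, n times (later gates multiply on the left).
original = np.linalg.matrix_power(T @ H, n)

# Commuted program: G up front, loop body (F then G), adjoint(G) at the end.
commuted = H.conj().T @ np.linalg.matrix_power(H @ T, n) @ H

assert np.allclose(original, commuted)
```

The extra `adjoint(G)` at the end is exactly what makes the telescoping work: the trailing `G` of each iteration cancels against the leading `G` of the next, and the final surplus `G` is undone explicitly.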
## Loop boundary optimization
Loop boundary commutation can be used in conjunction with pattern-based optimization passes. We refer to this combined action as loop boundary optimization (LBO). Take for example the following quantum program.
```python
def q_program1():
    for i in range(3):
        qml.Hadamard(0)
        qml.T(0)
        qml.Hadamard(0)
```
Defining `qml.G = qml.Hadamard` and `qml.F = qml.Hadamard ∘ qml.T`, and performing loop boundary commutation, we obtain:
```python
def q_program1_commuted():
    qml.Hadamard(0)
    for i in range(3):
        qml.T(0)
        qml.Hadamard(0)
        qml.Hadamard(0)
    adjoint(qml.Hadamard(0))
```
Using the fact that the Hadamard gate is self-adjoint, we can cancel the inverses to obtain:
```python
def q_program1_optimized():
    qml.Hadamard(0)
    for i in range(3):
        qml.T(0)
    qml.Hadamard(0)
```
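We can confirm numerically that the optimized program implements the same unitary by multiplying out the gate matrices with NumPy (again a check for illustration, not something the compiler does):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])

# q_program1: the body (H; T; H) repeated three times -> matrix (H T H)^3
original = np.linalg.matrix_power(H @ T @ H, 3)

# q_program1_optimized: H; (T three times); H -> matrix H T^3 H
optimized = H @ np.linalg.matrix_power(T, 3) @ H

assert np.allclose(original, optimized)
```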
Note that in this example we are only concerned with cancelling self-inverse gates at the loop boundaries. In particular, further simplification of the remaining T gates is not considered here.
We can also have several consecutive cancellations at the loop boundaries.
```python
def q_program2(angles):
    for i in range(3):
        qml.PauliY(0)
        qml.Hadamard(0)
        qml.CNOT((0, 1))
        qml.RZ(angles[i], 0)
        qml.CNOT((0, 1))
        qml.Hadamard(0)
        qml.PauliY(0)
```
We obtain the following optimized version after performing loop boundary optimization.
```python
def q_program2_optimized(angles):
    qml.PauliY(0)
    qml.Hadamard(0)
    qml.CNOT((0, 1))
    for i in range(3):
        qml.RZ(angles[i], 0)
    qml.CNOT((0, 1))
    qml.Hadamard(0)
    qml.PauliY(0)
```
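A NumPy sketch can again confirm the equivalence on two qubits. Here `on_q0` and `matmul_seq` are small helpers introduced only for this check (they are not PennyLane APIs):

```python
import numpy as np

I2 = np.eye(2)
Y = np.array([[0, -1j], [1j, 0]])
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])

def on_q0(U):
    # Single-qubit gate U acting on qubit 0 of a 2-qubit register.
    return np.kron(U, I2)

def RZ(theta):
    return np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])

def matmul_seq(ops):
    # ops listed in application order; later ops multiply on the left
    U = np.eye(4)
    for op in ops:
        U = op @ U
    return U

angles = [0.3, -0.7, 1.1]

# q_program2: loop body Y, H, CNOT, RZ(angles[i]), CNOT, H, Y
original = np.eye(4)
for a in angles:
    body = matmul_seq([on_q0(Y), on_q0(H), CNOT, on_q0(RZ(a)),
                       CNOT, on_q0(H), on_q0(Y)])
    original = body @ original

# q_program2_optimized: prefix Y, H, CNOT; loop of RZ only; suffix CNOT, H, Y
optimized = matmul_seq(
    [on_q0(Y), on_q0(H), CNOT]
    + [on_q0(RZ(a)) for a in angles]
    + [CNOT, on_q0(H), on_q0(Y)]
)

assert np.allclose(original, optimized)
```

The cancellation works because the suffix of each iteration (`CNOT; H; Y`) is exactly the inverse of the prefix of the next (`Y; H; CNOT`), and all three gates are self-inverse.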
Again, further optimizations, such as summarizing the loop as a single `RZ(sum(angles))`, or using the fact that an `RZ` acting on the control qubit commutes through a `CNOT`, are not part of this simple example.
## Combination with merge-rotations and cancel-inverses
Passes like `catalyst.passes.merge_rotations` or `catalyst.passes.cancel_inverses` make use of loop boundary optimization.
Consider, for example, the following circuit with `qml.RX` gates at the loop boundaries. We can use `catalyst.passes.merge_rotations`, which performs loop boundary optimization, to simplify the program. We output the compiled circuit in the form of its MLIR representation.
```python
from catalyst import *
import pennylane as qml

# autograph allows us to use Python loops and still compile them
# keep_intermediate is required to inspect intermediate compilation results
@qjit(autograph=True, keep_intermediate=True)
@passes.merge_rotations
@qml.qnode(qml.device("null.qubit", wires=1))
def circuit(n_iter: int):
    for _ in range(n_iter):
        qml.RX(0.1, wires=0)
        qml.T(0)
        qml.RX(0.2, wires=0)
    return qml.probs()

print("initial IR:\n", circuit.mlir, "\n---\n")
print("optimized IR:\n", debug.get_compilation_stage(circuit, "EnforceRuntimeInvariantsPass"), "\n---\n")
```
```
initial IR:
module @circuit_merge_rotations {
  func.func public @jit_circuit_merge_rotations(%arg0: tensor<i64>) -> tensor<2xf64> attributes {llvm.emit_c_interface} {
    %0 = catalyst.launch_kernel @module_circuit_merge_rotations::@circuit_merge_rotations(%arg0) : (tensor<i64>) -> tensor<2xf64>
    return %0 : tensor<2xf64>
  }
  module @module_circuit_merge_rotations {
    module attributes {transform.with_named_sequence} {
      transform.named_sequence @__transform_main(%arg0: !transform.op<"builtin.module">) {
        %0 = transform.apply_registered_pass "merge-rotations" to %arg0 : (!transform.op<"builtin.module">) -> !transform.op<"builtin.module">
        transform.yield
      }
    }
    func.func public @circuit_merge_rotations(%arg0: tensor<i64>) -> tensor<2xf64> attributes {diff_method = "parameter-shift", llvm.linkage = #llvm.linkage<internal>, qnode} {
      %c1 = arith.constant 1 : index
      %c0 = arith.constant 0 : index
      %c0_i64 = arith.constant 0 : i64
      quantum.device shots(%c0_i64) ["/Users/davidi/work/catalyst/frontend/catalyst/utils/../../../runtime/build/lib/librtd_null_qubit.dylib", "NullQubit", "{'shots': 0}"]
      %0 = quantum.alloc( 1) : !quantum.reg
      %extracted = tensor.extract %arg0[] : tensor<i64>
      %1 = arith.index_cast %extracted : i64 to index
      %2 = scf.for %arg1 = %c0 to %1 step %c1 iter_args(%arg2 = %0) -> (!quantum.reg) {
        %6 = quantum.extract %arg2[ 0] : !quantum.reg -> !quantum.bit
        %out_qubits = quantum.static_custom "RX" [1.000000e-01] %6 : !quantum.bit
        %out_qubits_0 = quantum.custom "T"() %out_qubits : !quantum.bit
        %out_qubits_1 = quantum.static_custom "RX" [2.000000e-01] %out_qubits_0 : !quantum.bit
        %7 = quantum.insert %arg2[ 0], %out_qubits_1 : !quantum.reg, !quantum.bit
        scf.yield %7 : !quantum.reg
      }
      %3 = quantum.extract %2[ 0] : !quantum.reg -> !quantum.bit
      %4 = quantum.compbasis %3 : !quantum.obs
      %5 = quantum.probs %4 : tensor<2xf64>
      quantum.dealloc %2 : !quantum.reg
      quantum.device_release
      return %5 : tensor<2xf64>
    }
  }
  func.func @setup() {
    quantum.init
    return
  }
  func.func @teardown() {
    quantum.finalize
    return
  }
}
---

optimized IR:
module @circuit_merge_rotations {
  func.func public @jit_circuit_merge_rotations(%arg0: tensor<i64>) -> tensor<2xf64> attributes {llvm.emit_c_interface} {
    %0 = call @circuit_merge_rotations_0(%arg0) : (tensor<i64>) -> tensor<2xf64>
    return %0 : tensor<2xf64>
  }
  func.func public @circuit_merge_rotations_0(%arg0: tensor<i64>) -> tensor<2xf64> attributes {diff_method = "parameter-shift", llvm.linkage = #llvm.linkage<internal>, qnode} {
    %c1 = arith.constant 1 : index
    %c0 = arith.constant 0 : index
    %c0_i64 = arith.constant 0 : i64
    quantum.device shots(%c0_i64) ["/Users/davidi/work/catalyst/frontend/catalyst/utils/../../../runtime/build/lib/librtd_null_qubit.dylib", "NullQubit", "{'shots': 0}"]
    %0 = quantum.alloc( 1) : !quantum.reg
    %extracted = tensor.extract %arg0[] : tensor<i64>
    %1 = arith.index_cast %extracted : i64 to index
    %2 = quantum.extract %0[ 0] : !quantum.reg -> !quantum.bit
    %out_qubits = quantum.custom "RX"() %2 : !quantum.bit
    %3 = quantum.insert %0[ 0], %out_qubits : !quantum.reg, !quantum.bit
    %4 = scf.for %arg1 = %c0 to %1 step %c1 iter_args(%arg2 = %3) -> (!quantum.reg) {
      %10 = quantum.extract %arg2[ 0] : !quantum.reg -> !quantum.bit
      %out_qubits_1 = quantum.custom "T"() %10 : !quantum.bit
      %11 = quantum.insert %arg2[ 0], %out_qubits_1 : !quantum.reg, !quantum.bit
      scf.yield %11 : !quantum.reg
    }
    %5 = quantum.extract %4[ 0] : !quantum.reg -> !quantum.bit
    %out_qubits_0 = quantum.custom "RX"() %5 : !quantum.bit
    %6 = quantum.insert %4[ 0], %out_qubits_0 : !quantum.reg, !quantum.bit
    %7 = quantum.extract %6[ 0] : !quantum.reg -> !quantum.bit
    %8 = quantum.compbasis %7 : !quantum.obs
    %9 = quantum.probs %8 : tensor<2xf64>
    quantum.dealloc %6 : !quantum.reg
    quantum.device_release
    return %9 : tensor<2xf64>
  }
  func.func @setup() {
    quantum.init
    return
  }
  func.func @teardown() {
    quantum.finalize
    return
  }
}
---
```
A close examination of the MLIR result reveals the following optimized program structure:
```python
def circuit_optimized(n_iter: int):
    qml.RX(0.1, wires=0)
    for i in range(n_iter):
        qml.T(0)
        qml.RX(0.3, wires=0)
    qml.RX(-0.1, wires=0)
```
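We can check this reconstructed structure numerically with NumPy: the boundary rotations telescope, so the two programs implement the same unitary for any iteration count (a check for illustration, using the standard RX and T matrices):

```python
import numpy as np

def RX(theta):
    # Standard single-qubit X-rotation matrix.
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

T = np.diag([1, np.exp(1j * np.pi / 4)])

n = 4
# original loop body: RX(0.1); T; RX(0.2)  (later gates multiply on the left)
original = np.linalg.matrix_power(RX(0.2) @ T @ RX(0.1), n)

# after LBO + merge-rotations: RX(0.1); loop of (T; RX(0.3)); RX(-0.1)
optimized = RX(-0.1) @ np.linalg.matrix_power(RX(0.3) @ T, n) @ RX(0.1)

assert np.allclose(original, optimized)
```

The merged `RX(0.3)` comes from `RX(0.2)` at the end of one iteration meeting the commuted `RX(0.1)` at its start, since `RX(a) RX(b) = RX(a + b)`.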
## Unrolling loops
One might wonder whether loop boundary optimization is simply loop unrolling in disguise.
Take our initial example:
```python
def q_program1():
    for i in range(3):
        qml.Hadamard(0)
        qml.T(0)
        qml.Hadamard(0)
```
Indeed, one may obtain the same optimized program by unrolling the loop to
```python
[qml.Hadamard(0), qml.T(0), qml.Hadamard(0),
 qml.Hadamard(0), qml.T(0), qml.Hadamard(0),
 qml.Hadamard(0), qml.T(0), qml.Hadamard(0)]
```
and then using the property that consecutive Hadamard gates cancel. This, however, is not what happens in LBO. In particular, LBO removes these redundant gate sequences without actually unrolling the loop. While it is true that for many examples like these we could obtain a similarly optimized circuit by unrolling the loop first and then applying optimization patterns, there are some important differences:
- Unrolling loops can drastically increase the program size, making compilation far more computationally expensive. At large scales, this can translate to spending years rather than minutes on compilation tasks [1].
- Even temporary unrolling and "re-rolling" of a loop is a non-trivial task, as raising abstractions (like picking out a looping pattern from arbitrary code) is always more difficult than lowering them.
- Loop unrolling becomes harder and less effective when loop bounds are not statically known. If the number of iterations is unknown at compile time, it is impossible to generate a perfectly flat sequence of instructions. Instead, compilers may unroll loops by a small factor (e.g., 2, 4, or 8), which is not only significantly less effective than LBO, but also introduces additional conditionals to handle cases where the iteration count is not a perfect multiple of that factor.
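To illustrate the last point, here is a hypothetical sketch, in plain Python rather than compiler output, of unrolling by a factor of 4 when the trip count `n` is only known at runtime; `apply_gate` stands in for one iteration's work:

```python
def run(n, apply_gate):
    """Apply apply_gate(i) for i in 0..n-1, unrolled by a factor of 4."""
    i = 0
    # unrolled main loop: four iterations per pass, fewer branch checks
    while i + 4 <= n:
        apply_gate(i)
        apply_gate(i + 1)
        apply_gate(i + 2)
        apply_gate(i + 3)
        i += 4
    # remainder loop: handles trip counts that are not a multiple of 4
    while i < n:
        apply_gate(i)
        i += 1

calls = []
run(7, calls.append)
assert calls == list(range(7))
```

The remainder loop is exactly the extra control flow that partial unrolling introduces; LBO avoids it entirely, since it never changes the loop's trip count.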