How to choose your optimizer
Variational hybrid algorithms are a staple of the NISQ era, since they allow us to test new ideas and applications of quantum computing that do not need faulttolerant quantum computation nor a large number of qubits.
These algorithms usually seek to optimize a cost function \(C(\theta)\) of the form
This equation describes, formally, the operations that we need to perform through our hybrid algorithm. In particular, we are interested in minimizing a function of the parameters \(\theta\) made up of \(k\) different terms. Each term is obtained by measuring on the quantum computer the expectation value of the observable \(O_{k}\) after applying the unitary gate \(U(\theta)\) on the quantum register in the state \(\rho_{k}\). Typically \(f_{k}\) and \(O_{k}\) define our problem of interest while \(U(\theta)\) is the unitary ansatz that will prepare the trial state for which we calculate the cost function.
If we still do not know which ansatz to use, we can have a look at the precoded templates provided by PennyLane.
Further, it is important to realize that the choice of optimizer will depend on all these quantities, as they will affect the optimization landscape complexity and the computational demand of evaluating the gradient and the function itself.
In this howto, we first take a look at the basic usage of PennyLane’s builtin optimizers. Then, we go through some of these optimizers and highlight their usefulness according to the complexity of the problem.
Optimization with PennyLane in a nutshell
Before we dive into the question of finding the perfect optimizer, let’s take a brief look at how a simple optimization is structured in PennyLane.
The basic ingredients are: a quantum function to be optimized, a device to evaluate our function, and one of PennyLane’s optimizers.
First we show the circuit which will be used in our examplecode below. In particular we have reported, in brackets, the values for the optimal parameters entering the definition of the rotation gates.
In this particular example, referring to the previous equation, we highlight that \(k\) is equal to 1 and the observable of interest (\(O_k\)) is a simple Hamiltonian \(H = 1*X_0 + 2*Z_1 + 3*Y_0 X_1\). The subscripts specify the qubits on which each Pauli operator acts. Furthermore, for this specific example, the (only) function \(f_k\) acts on the computed expectation value by returning its value.
import pennylane as qml
from pennylane import numpy as np
# Define your Hamiltonian
coeffs = [1., 2., 3.]
observables = [qml.PauliX(0), qml.PauliZ(1), qml.PauliY(0) @ qml.PauliX(1)]
hamiltonian = qml.Hamiltonian(coeffs, observables)
# This is our cost function!
@qml.qnode(qml.device("default.qubit", wires=2))
def quant_fun(parameters):
qml.RX(parameters[0], wires=[0])
qml.RY(parameters[1], wires=[1])
qml.CNOT(wires=[0,1])
qml.RY(parameters[0], wires=[0])
return qml.expval(hamiltonian)
theta = np.array([0.0, 0.0], requires_grad=True) # Initial guess parameters
angle = [theta] # Store the values of the circuit parameter
cost = [quant_fun(theta)] # Store the values of the cost function
opt = qml.GradientDescentOptimizer() # Our optimizer!
max_iterations = 100 # Maximum number of calls to the optimizer
conv_tol = 1e06 # Convergence threshold to stop our optimization procedure
for n in range(max_iterations):
theta, prev_cost = opt.step_and_cost(quant_fun, theta)
cost.append(quant_fun(theta))
angle.append(theta)
conv = np.abs(cost[1]  prev_cost)
if n % 10 == 0:
print(f"Step = {n}, Cost function = {cost[1]:.8f} ")
if conv <= conv_tol:
break
print("\n" f"Final value of the cost function = {cost[1]:.8f} ")
print("\n" f"Optimal value of the first circuit parameter = " + str(angle[1][0]))
print("\n" f"Optimal value of the second circuit parameter = " + str(angle[1][1]))
Here quant_fun
corresponds to the cost function \(C(\theta)\) we want to optimize. For the example above, the quantum register is initialized in the state \(0\rangle^{\otimes 2}\), corresponding to what we previously called \(\rho_k\). Finally, hamiltonian
is the operator that corresponds to \(O_{k}\).
Which outputs the following:
>>> Step = 0, Cost function = 1.95959868
>>> Step = 10, Cost function = 1.45001807
>>> Step = 20, Cost function = 0.67351257
>>> Step = 30, Cost function = 0.45838600
>>> Step = 40, Cost function = 1.82725556
>>> Step = 50, Cost function = 2.99032196
>>> Step = 60, Cost function = 3.65093838
>>> Step = 70, Cost function = 3.92970151
>>> Step = 80, Cost function = 4.03027845
>>> Step = 90, Cost function = 4.06417188
>>> Final value of the cost function = 4.07464171
>>> Optimal value of the first circuit parameter = 2.18238511
>>> Optimal value of the second circuit parameter = 0.
Once placed within the framework of quantum differentiable programming, any of our code, no matter how complicated, can be traced back to this simpler structure presented above.
Having now done this, we can stop and wonder: which optimizer can solve our problem in time and with the desired accuracy?
Hard problems require clever choices of the optimizer
Optimization problems in multidimensional spaces are daunting tasks. Choosing the best strategy to find a solution in terms of accuracy and computation time is a real art. This is particularly true for the optimization landscapes of variational hybrid algorithms, whose cost functions display many local minima and vast regions of flat optimization landscape.
While these are difficult problems to tackle, we have a wide choice of tools to face these challenges. We will try to understand which weapon to choose by looking at two reference metrics: the number of variables to be optimized and the time required to evaluate our cost function.
Furthermore, as new, more powerful and precise hardware continues to be developed, it is exciting to note that even the notions of what ‘many variables’ or a ‘long computation times’ mean are subject to change, giving us some leeway on how to tackle these challenges.
We can think of three different scenarios for the task that the optimizer should solve.

Few variables + easytoevaluate function/gradients: If our problem falls into this category, we should consider ourselves lucky: it is likely that any of the available optimizers will be able to provide us with the desired solution within a reasonable time!
Nevertheless, not all the choices are equivalent. As we said at the beginning, even solving simplelooking problems can be interesting, especially if we are able to test them in presence of simulated noise or real quantum computers!
In this case the most advantageous option would be to go with a gradientfree optimizer. The idea is that these optimizers, which generally converge more slowly than gradientbased ones, are better suited to NISQ devices, as they provide a noiseresistant and computationally cheap strategy to update the parameters at the next iteration (Shaydulin et al., (2019)).
PennyLane provides two different solutions, namely the RotoselectOptimizer and the RotosolveOptimizer. In the former, in our circuit to be trained, it will be possible to associate to each parameter a corresponding gate of the type \(R_x\), \(R_y\) or \(R_z\). The optimizer will update each rotation parameter independently in each iteration while keeping the others fixed. But we have some additional flexibility. This update, as proposed by Ostaszewski et al., (2021), may not only modify the parameters’ values but also the change the rotation axis onthefly, replacing a gate with another rotation among the set \(\{R_x, R_y, R_z\}\).
On the other hand, the
RotosolveOptimizer
will perform the same optimization on the variational parameters (i.e. changing one parameter at a time) but maintain the circuit structure fixed. In return, we can now use any kind of parameterized gate to build our ansatz. 
Many variables + easytoevaluate function/gradients or few variables + hardtoevaluate function/gradients: This category corresponds to most of the cases we encounter in classical optimization, for which it is advisable to adopt gradient based optimizers. In the case of variational hybrid algorithms, these are also the obvious choices, as exact gradients of quantum functions can be computed automatically by means of the parametershift rule (Schuld et al., 2018).
All the gradientbased optimizers provided by PennyLane update the variational parameters moving along the direction indicated by the gradient at the specified point. The simplest version is implemented in GradientDescentOptimizer where the optimizer takes a step in this direction by an amount called learning rate:
$$ \theta^{(t+1)} = \theta^{(t)}  \eta \nabla C(\theta^{(t)}) $$Here, \(\theta\) represents any one of the circuit parameters at different iterations, \(\eta\) is the learning rate and \(\nabla C(\theta)\) is the gradient of the cost function. By adopting this strategy we are ensured to reach local minima.Further, we can speed up the convergence rate of the optimization making the learning rate dependent on the previous step of the optimization:
$$ \theta^{(t+1)} = \theta^{(t)}  \eta(\theta^{(t1)}, \nabla C(\theta^{(t1)})) \nabla C(\theta^{(t)}) . $$This last option is featured in PennyLane with the AdagradOptimizer and the AdamOptimizer.

Many variables + hardtoevaluate function/gradients: This last category is one in which we would never want to find ourselves. But isn’t it when the going gets tough that things get interesting? To deal with these situations we need more sophisticated strategies to cope with the inherent computational complexity.
A possible option is to begin our optimization by using a stochastic gradient descent strategy, (Sweke et al., 2019). In this framework the cost of evaluating the gradient at each iteration is significantly reduced by evaluating only a random subset of the components of the gradient. When dealing with highdimensional optimization problems, this algorithm has been proven to be not only more computationally efficient, but also able to avoid saddle points and local minima. It also features superior convergence properties compared to vanilla gradient descent.
Unfortunately, a highdimensional problem is not only a difficult task from a computational point of view, but it also eases the emergence of barren plateaus, (McClean et al., 2018).
Specifically, for a large class of quantum circuits, the average value of the gradient objective function tends to zero and as the Hilbert dimension increases the more states will correspond to flat optimization landscape regions.
For this reason we may want to move our optimization strategy on a different landscape trying to avoid these regions where our gradientbased search gets stuck.
Luckily, for the second phase of our optimization, PennyLane has a builtin optimizer that suit our needs: the QNGOptimizer.
We can now improve the quality of our optimization landscape by no longer moving in the direction that corresponds to the steepest descent in parameter space, but by moving along the steepest direction in the Hilbert space where our quantum computer’s wavefunction lives. For this reason, this algorithm is called quantum natural gradient.
To do this the parameter update rule is modified in this way:
$$ \theta^{(t+1)} = \theta^{(t)}  \eta g^{+}(\theta^{(t)}) \nabla C(\theta^{(t)}) $$Where \(g(\theta^{(t)})\) is the metric tensor in the space of quantum states. In this context, it allows to rotate the gradient vector in a direction aware of the underlying curvature of the Hilbert space.
This twophase optimization strategy will allow you to get the best out of PennyLane’s functionalities exploiting automatic differentiability, the computational efficiency of stochastic gradient descent and increased robustness in the presence of barren plateaus with the quantum natural gradient.
Now we are all set! Whether our circuit is deep or shallow, messy or wellstructured we are ready to optimize it.