{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# This cell is added by sphinx-gallery\n# It can be customized to whatever you like\n%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"::: {#kernel_based_training}\n:::\n\nKernel-based training of quantum models with scikit-learn\n=========================================================\n\n::: {.meta}\n:property=\\\"og:description\\\": Train a quantum machine learning model\nbased on the idea of quantum kernels. :property=\\\"og:image\\\":\n\n:::\n\n::: {.related}\ntutorial\\_variational\\_classifier Variational classifier\n:::\n\n*Author: PennyLane dev team. Posted: 3 Feb 2021. Last updated: 3 Feb\n2021.*\n\nOver the last few years, quantum machine learning research has provided\na lot of insights on how we can understand and train quantum circuits as\nmachine learning models. While many connections to neural networks have\nbeen made, it becomes increasingly clear that their mathematical\nfoundation is intimately related to so-called *kernel methods*, the most\nfamous of which is the [support vector machine\n(SVM)](https://en.wikipedia.org/wiki/Support-vector_machine) (see for\nexample [Schuld and Killoran (2018)](https://arxiv.org/abs/1803.07128),\n[Havlicek et al. (2018)](https://arxiv.org/abs/1804.11326), [Liu et al.\n(2020)](https://arxiv.org/abs/2010.02174), [Huang et al.\n(2020)](https://arxiv.org/pdf/2011.01938), and, for a systematic summary\nwhich we will follow here, [Schuld\n(2021)](https://arxiv.org/abs/2101.11020)).\n\nThe link between quantum models and kernel methods has important\npractical implications: we can replace the common [variational\napproach](https://pennylane.ai/qml/glossary/variational_circuit.html) to\nquantum machine learning with a classical kernel method where the\nkernel---a small building block of the overall algorithm---is computed\nby a quantum device. In many situations there are guarantees that we get\nbetter or at least equally good results.\n\nThis demonstration explores how kernel-based training compares with\n[variational\ntraining](https://pennylane.ai/qml/demos/tutorial_variational_classifier.html)\nin terms of the number of quantum circuits that have to be evaluated.\nFor this we train a quantum machine learning model with a kernel-based\napproach using a combination of PennyLane and the\n[scikit-learn](https://scikit-learn.org/) machine learning library. We\ncompare this strategy with a variational quantum circuit trained via\nstochastic gradient descent using\n[PyTorch](https://pennylane.readthedocs.io/en/stable/introduction/interfaces/torch.html).\n\nWe will see that in a typical small-scale example, kernel-based training\nrequires only a fraction of the number of quantum circuit evaluations\nused by variational circuit training, while each evaluation runs a much\nshorter circuit. In general, the relative efficiency of kernel-based\nmethods compared to variational circuits depends on the number of\nparameters used in the variational model.\n\n![](../demonstrations/kernel_based_training/scaling.png){.align-center}\n\nIf the number of variational parameters remains small, e.g., there is a\nsquare-root-like scaling with the number of data samples (green line),\nvariational circuits are almost as efficient as neural networks (blue\nline), and require much fewer circuit evaluations than the quadratic\nscaling of kernel methods (red line). However, with current\nhardware-compatible training strategies, kernel methods scale much\nbetter than variational circuits that require a number of parameters of\nthe order of the training set size (orange line).\n\nIn conclusion, **for quantum machine learning applications with many\nparameters, kernel-based training can be a great alternative to the\nvariational approach to quantum machine learning**.\n\nAfter working through this demo, you will:\n\n- be able to use a support vector machine with a quantum kernel\n computed with PennyLane, and\n- be able to compare the scaling of quantum circuit evaluations\n required in kernel-based versus variational training.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Background\n==========\n\nLet us consider a *quantum model* of the form\n\n$$f(x) = \\langle \\phi(x) | \\mathcal{M} | \\phi(x)\\rangle,$$\n\nwhere $| \\phi(x)\\rangle$ is prepared by a fixed [embedding\ncircuit](https://pennylane.ai/qml/glossary/quantum_embedding.html) that\nencodes data inputs $x$, and $\\mathcal{M}$ is an arbitrary observable.\nThis model includes variational quantum machine learning models, since\nthe observable can effectively be implemented by a simple measurement\nthat is preceded by a variational circuit:\n\n![](../demonstrations/kernel_based_training/quantum_model.png){.align-center}\n\n| \n\nFor example, applying a circuit $G(\\theta)$ and then measuring the\nPauli-Z observable $\\sigma^0_z$ of the first qubit implements the\ntrainable measurement\n$\\mathcal{M}(\\theta) = G^{\\dagger}(\\theta) \\sigma^0_z G(\\theta)$.\n\nThe main practical consequence of approaching quantum machine learning\nwith a kernel approach is that instead of training $f$ variationally, we\ncan often train an equivalent classical kernel method with a kernel\nexecuted on a quantum device. This *quantum kernel* is given by the\nmutual overlap of two data-encoding quantum states,\n\n$$\\kappa(x, x') = | \\langle \\phi(x') | \\phi(x)\\rangle|^2.$$\n\nKernel-based training therefore bypasses the processing and measurement\nparts of common variational circuits, and only depends on the data\nencoding.\n\nIf the loss function $L$ is the [hinge\nloss](https://en.wikipedia.org/wiki/Hinge_loss), the kernel method\ncorresponds to a standard [support vector\nmachine](https://en.wikipedia.org/wiki/Support-vector_machine) (SVM) in\nthe sense of a maximum-margin classifier. Other convex loss functions\nlead to more general variations of support vector machines.\n\n::: {.note}\n::: {.admonition-title}\nNote\n:::\n\nMore precisely, we can replace variational with kernel-based training if\nthe optimisation problem can be written as minimizing a cost of the form\n\n$$\\min_f \\lambda\\; \\mathrm{tr}\\{\\mathcal{M}^2\\} + \\frac{1}{M}\\sum_{m=1}^M L(f(x^m), y^m),$$\n\nwhich is a regularized empirical risk with training data samples\n$(x^m, y^m)_{m=1\\dots M}$, regularization strength\n$\\lambda \\in \\mathbb{R}$, and loss function $L$.\n\nTheory predicts that kernel-based training will always find better or\nequally good minima of this risk. However, to show this here we would\nhave to either regularize the variational training by the trace of the\nsquared observable, or switch off regularization in the classical SVM,\nwhich removes a lot of its strength. The kernel-based and the\nvariational training in this demonstration therefore optimize slightly\ndifferent cost functions, and it is out of our scope to establish\nwhether one training method finds a better minimum than the other.\n:::\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Kernel-based training\n=====================\n\nFirst, we will turn to kernel-based training of quantum models. As\nstated above, an example implementation is a standard support vector\nmachine with a kernel computed by a quantum circuit.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We begin by importing all sorts of useful methods:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\nimport torch\nfrom torch.nn.functional import relu\n\nfrom sklearn.svm import SVC\nfrom sklearn.datasets import load_iris\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import accuracy_score\n\nimport pennylane as qml\nfrom pennylane.templates import AngleEmbedding, StronglyEntanglingLayers\nfrom pennylane.operation import Tensor\n\nimport matplotlib.pyplot as plt\n\nnp.random.seed(42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The second step is to define a data set. Since the performance of the\nmodels is not the focus of this demo, we can just use the first two\nclasses of the famous [Iris data\nset](https://en.wikipedia.org/wiki/Iris_flower_data_set). Dating back to\nas far as 1936, this toy data set consists of 100 samples of four\nfeatures each, and gives rise to a very simple classification problem.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"X, y = load_iris(return_X_y=True)\n\n# pick inputs and labels from the first two classes only,\n# corresponding to the first 100 samples\nX = X[:100]\ny = y[:100]\n\n# scaling the inputs is important since the embedding we use is periodic\nscaler = StandardScaler().fit(X)\nX_scaled = scaler.transform(X)\n\n# scaling the labels to -1, 1 is important for the SVM and the\n# definition of a hinge loss\ny_scaled = 2 * (y - 0.5)\n\nX_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the [angle-embedding\ntemplate](https://pennylane.readthedocs.io/en/stable/code/api/pennylane.templates.embeddings.AngleEmbedding.html)\nwhich needs as many qubits as there are features:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"n_qubits = len(X_train[0])\nn_qubits"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To implement the kernel we could prepare the two states\n$| \\phi(x) \\rangle$, $| \\phi(x') \\rangle$ on different sets of qubits\nwith angle-embedding routines $S(x), S(x')$, and measure their overlap\nwith a small routine called a [SWAP\ntest](https://en.wikipedia.org/wiki/Swap_test).\n\nHowever, we need only half the number of qubits if we prepare\n$| \\phi(x)\\rangle$ and then apply the inverse embedding with $x'$ on the\nsame qubits. We then measure the projector onto the initial state\n$|0..0\\rangle \\langle 0..0|$.\n\n![](../demonstrations/kernel_based_training/kernel_circuit.png){.align-center}\n\nTo verify that this gives us the kernel:\n\n$$\\begin{aligned}\n\\begin{align*}\n \\langle 0..0 |S(x') S(x)^{\\dagger} \\mathcal{M} S(x')^{\\dagger} S(x) | 0..0\\rangle &= \\langle 0..0 |S(x') S(x)^{\\dagger} |0..0\\rangle \\langle 0..0| S(x')^{\\dagger} S(x) | 0..0\\rangle \\\\\n &= |\\langle 0..0| S(x')^{\\dagger} S(x) | 0..0\\rangle |^2\\\\\n &= | \\langle \\phi(x') | \\phi(x)\\rangle|^2 \\\\\n &= \\kappa(x, x').\n\\end{align*}\n\\end{aligned}$$\n\nNote that a projector $|0..0 \\rangle \\langle 0..0|$ can be constructed\nusing the `qml.Hermitian` observable in PennyLane.\n\nAltogether, we use the following quantum node as a *quantum kernel\nevaluator*:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"dev_kernel = qml.device(\"default.qubit\", wires=n_qubits)\n\nprojector = np.zeros((2**n_qubits, 2**n_qubits))\nprojector[0, 0] = 1\n\n@qml.qnode(dev_kernel)\ndef kernel(x1, x2):\n \"\"\"The quantum kernel.\"\"\"\n AngleEmbedding(x1, wires=range(n_qubits))\n qml.adjoint(AngleEmbedding)(x2, wires=range(n_qubits))\n return qml.expval(qml.Hermitian(projector, wires=range(n_qubits)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A good sanity check is whether evaluating the kernel of a data point and\nitself returns 1:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"kernel(X_train[0], X_train[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The way an SVM with a custom kernel is implemented in scikit-learn\nrequires us to pass a function that computes a matrix of kernel\nevaluations for samples in two different datasets A, B. If A=B, this is\nthe [Gram matrix](https://en.wikipedia.org/wiki/Gramian_matrix).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def kernel_matrix(A, B):\n \"\"\"Compute the matrix whose entries are the kernel\n evaluated on pairwise data from sets A and B.\"\"\"\n return np.array([[kernel(a, b) for b in B] for a in A])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training the SVM optimizes internal parameters that basically weigh\nkernel functions. It is a breeze in scikit-learn, which is designed as a\nhigh-level machine learning library:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"svm = SVC(kernel=kernel_matrix).fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compute the accuracy on the test set.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"predictions = svm.predict(X_test)\naccuracy_score(predictions, y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The SVM predicted all test points correctly. How many times was the\nquantum device evaluated?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"dev_kernel.num_executions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This number can be derived as follows: For $M$ training samples, the SVM\nmust construct the $M \\times M$ dimensional kernel gram matrix for\ntraining. To classify $M_{\\rm pred}$ new samples, the SVM needs to\nevaluate the kernel at most $M_{\\rm pred}M$ times to get the pairwise\ndistances between training vectors and test samples.\n\n::: {.note}\n::: {.admonition-title}\nNote\n:::\n\nDepending on the implementation of the SVM, only $S \\leq M_{\\rm pred}$\n*support vectors* are needed.\n:::\n\nLet us formulate this as a function, which can be used at the end of the\ndemo to construct the scaling plot shown in the introduction.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def circuit_evals_kernel(n_data, split):\n \"\"\"Compute how many circuit evaluations one needs for kernel-based\n training and prediction.\"\"\"\n\n M = int(np.ceil(split * n_data))\n Mpred = n_data - M\n\n n_training = M * M\n n_prediction = M * Mpred\n\n return n_training + n_prediction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With $M = 75$ and $M_{\\rm pred} = 25$, the number of kernel evaluations\ncan therefore be estimated as:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"circuit_evals_kernel(n_data=len(X), split=len(X_train) /(len(X_train) + len(X_test)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The single additional evaluation can be attributed to evaluating the\nkernel once above as a sanity check.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A similar example using variational training\n============================================\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the variational principle of training, we can propose an *ansatz*\nfor the variational circuit and train it directly. By increasing the\nnumber of layers of the ansatz, its expressivity increases. Depending on\nthe ansatz, we may only search through a subspace of all measurements\nfor the best candidate.\n\nRemember from above, the variational training does not optimize\n*exactly* the same cost as the SVM, but we try to match them as closely\nas possible. For this we use a bias term in the quantum model, and train\non the hinge loss.\n\nWe also explicitly use the\n[parameter-shift](https://pennylane.ai/qml/glossary/parameter_shift.html)\ndifferentiation method in the quantum node, since this is a method which\nworks on hardware as well. While `diff_method='backprop'` or\n`diff_method='adjoint'` would reduce the number of circuit evaluations\nsignificantly, they are based on tricks that are only suitable for\nsimulators, and can therefore not scale to more than a few dozen qubits.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"dev_var = qml.device(\"default.qubit\", wires=n_qubits)\n\n@qml.qnode(dev_var, interface=\"torch\", diff_method=\"parameter-shift\")\ndef quantum_model(x, params):\n \"\"\"A variational quantum model.\"\"\"\n\n # embedding\n AngleEmbedding(x, wires=range(n_qubits))\n\n # trainable measurement\n StronglyEntanglingLayers(params, wires=range(n_qubits))\n return qml.expval(qml.PauliZ(0))\n\ndef quantum_model_plus_bias(x, params, bias):\n \"\"\"Adding a bias.\"\"\"\n return quantum_model(x, params) + bias\n\ndef hinge_loss(predictions, targets):\n \"\"\"Implements the hinge loss.\"\"\"\n all_ones = torch.ones_like(targets)\n hinge_loss = all_ones - predictions * targets\n # trick: since the max(0,x) function is not differentiable,\n # use the mathematically equivalent relu instead\n hinge_loss = relu(hinge_loss)\n return hinge_loss"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now summarize the usual training and prediction steps into two\nfunctions similar to scikit-learn\\'s `fit()` and `predict()`. While it\nfeels cumbersome compared to the one-liner used to train the kernel\nmethod, PennyLane---like other differentiable programming\nlibraries---provides a lot more control over the particulars of\ntraining.\n\nIn our case, most of the work is to convert between numpy and torch,\nwhich we need for the differentiable `relu` function used in the hinge\nloss.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def quantum_model_train(n_layers, steps, batch_size):\n \"\"\"Train the quantum model defined above.\"\"\"\n\n params = np.random.random((n_layers, n_qubits, 3))\n params_torch = torch.tensor(params, requires_grad=True)\n bias_torch = torch.tensor(0.0)\n\n opt = torch.optim.Adam([params_torch, bias_torch], lr=0.1)\n\n loss_history = []\n for i in range(steps):\n\n batch_ids = np.random.choice(len(X_train), batch_size)\n\n X_batch = X_train[batch_ids]\n y_batch = y_train[batch_ids]\n\n X_batch_torch = torch.tensor(X_batch, requires_grad=False)\n y_batch_torch = torch.tensor(y_batch, requires_grad=False)\n\n def closure():\n opt.zero_grad()\n preds = torch.stack(\n [quantum_model_plus_bias(x, params_torch, bias_torch) for x in X_batch_torch]\n )\n loss = torch.mean(hinge_loss(preds, y_batch_torch))\n\n # bookkeeping\n current_loss = loss.detach().numpy().item()\n loss_history.append(current_loss)\n if i % 10 == 0:\n print(\"step\", i, \", loss\", current_loss)\n\n loss.backward()\n return loss\n\n opt.step(closure)\n\n return params_torch, bias_torch, loss_history\n\n\ndef quantum_model_predict(X_pred, trained_params, trained_bias):\n \"\"\"Predict using the quantum model defined above.\"\"\"\n\n p = []\n for x in X_pred:\n\n x_torch = torch.tensor(x)\n pred_torch = quantum_model_plus_bias(x_torch, trained_params, trained_bias)\n pred = pred_torch.detach().numpy().item()\n if pred > 0:\n pred = 1\n else:\n pred = -1\n\n p.append(pred)\n return p"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's train the variational model and see how well we are doing on the\ntest set.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"n_layers = 2\nbatch_size = 20\nsteps = 100\ntrained_params, trained_bias, loss_history = quantum_model_train(n_layers, steps, batch_size)\n\npred_test = quantum_model_predict(X_test, trained_params, trained_bias)\nprint(\"accuracy on test set:\", accuracy_score(pred_test, y_test))\n\nplt.plot(loss_history)\nplt.ylim((0, 1))\nplt.xlabel(\"steps\")\nplt.ylabel(\"cost\")\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The variational circuit has a slightly lower accuracy than the SVM---but\nthis depends very much on the training settings we used. Different\nrandom parameter initializations, more layers, or more steps may indeed\nget perfect test accuracy.\n\nHow often was the device executed?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"dev_var.num_executions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That is a lot more than the kernel method took!\n\nLet's try to understand this value. In each optimization step, the\nvariational circuit needs to compute the partial derivative of all\ntrainable parameters for each sample in a batch. Using parameter-shift\nrules, we require roughly two circuit evaluations per partial\nderivative. Prediction uses only one circuit evaluation per sample.\n\nWe can formulate this as another function that will be used in the\nscaling plot below.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def circuit_evals_variational(n_data, n_params, n_steps, shift_terms, split, batch_size):\n \"\"\"Compute how many circuit evaluations are needed for\n variational training and prediction.\"\"\"\n\n M = int(np.ceil(split * n_data))\n Mpred = n_data - M\n\n n_training = n_params * n_steps * batch_size * shift_terms\n n_prediction = Mpred\n\n return n_training + n_prediction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This estimates the circuit evaluations in variational training as:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"circuit_evals_variational(\n n_data=len(X),\n n_params=len(trained_params.flatten()),\n n_steps=steps,\n shift_terms=2,\n split=len(X_train) /(len(X_train) + len(X_test)),\n batch_size=batch_size,\n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The estimate is a bit higher because it does not account for some\noptimizations that PennyLane performs under the hood.\n\nIt is important to note that while they are trained in a similar manner,\nthe number of variational circuit evaluations differs from the number of\nneural network model evaluations in classical machine learning, which\nwould be given by:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def model_evals_nn(n_data, n_params, n_steps, split, batch_size):\n \"\"\"Compute how many model evaluations are needed for neural\n network training and prediction.\"\"\"\n\n M = int(np.ceil(split * n_data))\n Mpred = n_data - M\n\n n_training = n_steps * batch_size\n n_prediction = Mpred\n\n return n_training + n_prediction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In each step of neural network training, and due to the clever\nimplementations of automatic differentiation, the backpropagation\nalgorithm can compute a gradient for all parameters in (more-or-less) a\nsingle run. For all we know at this stage, the no-cloning principle\nprevents variational circuits from using these tricks, which leads to\n`n_training` in `circuit_evals_variational` depending on the number of\nparameters, but not in `model_evals_nn`.\n\nFor the same example as used here, a neural network would therefore have\nfar fewer model evaluations than both variational and kernel-based\ntraining:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"model_evals_nn(\n n_data=len(X),\n n_params=len(trained_params.flatten()),\n n_steps=steps,\n split=len(X_train) /(len(X_train) + len(X_test)),\n batch_size=batch_size,\n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which method scales best?\n=========================\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The answer to this question depends on how the variational model is set\nup, and we need to make a few assumptions:\n\n1. Even if we use single-batch stochastic gradient descent, in which\n every training step uses exactly one training sample, we would want\n to see every training sample at least once on average. Therefore,\n the number of steps should scale at least linearly with the number\n of training data samples.\n2. Modern neural networks often have many more parameters than training\n samples. But we do not know yet whether variational circuits really\n need that many parameters as well. We will therefore use two cases\n for comparison:\n\n 2a) the number of parameters grows linearly with the training data,\n or `n_params = M`,\n\n 2b) the number of parameters saturates at some point, which we model\n by setting `n_params = sqrt(M)`.\n\nNote that compared to the example above with 75 training samples and 24\nparameters, a) overestimates the number of evaluations, while b)\nunderestimates it.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is how the three methods compare:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"variational_training1 = []\nvariational_training2 = []\nkernelbased_training = []\nnn_training = []\nx_axis = range(0, 2000, 100)\n\nfor M in x_axis:\n\n var1 = circuit_evals_variational(\n n_data=M, n_params=M, n_steps=M, shift_terms=2, split=0.75, batch_size=1\n )\n variational_training1.append(var1)\n\n var2 = circuit_evals_variational(\n n_data=M, n_params=round(np.sqrt(M)), n_steps=M,\n shift_terms=2, split=0.75, batch_size=1\n )\n variational_training2.append(var2)\n\n kernel = circuit_evals_kernel(n_data=M, split=0.75)\n kernelbased_training.append(kernel)\n\n nn = model_evals_nn(\n n_data=M, n_params=M, n_steps=M, split=0.75, batch_size=1\n )\n nn_training.append(nn)\n\n\nplt.plot(x_axis, nn_training, linestyle='--', label=\"neural net\")\nplt.plot(x_axis, variational_training1, label=\"var. circuit (linear param scaling)\")\nplt.plot(x_axis, variational_training2, label=\"var. circuit (srqt param scaling)\")\nplt.plot(x_axis, kernelbased_training, label=\"(quantum) kernel\")\nplt.xlabel(\"size of data set\")\nplt.ylabel(\"number of evaluations\")\nplt.legend()\nplt.tight_layout()\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the plot we saw at the beginning. With current\nhardware-compatible training methods, whether kernel-based training\nrequires more or fewer quantum circuit evaluations than variational\ntraining depends on how many parameters the latter needs. If variational\ncircuits turn out to be as parameter-hungry as neural networks,\nkernel-based training will outperform them for common machine learning\ntasks. However, if variational learning only turns out to require few\nparameters (or if more efficient training methods are found),\nvariational circuits could in principle match the linear scaling of\nneural networks trained with backpropagation.\n\nThe practical take-away from this demo is that unless your variational\ncircuit has significantly fewer parameters than training data, kernel\nmethods could be a much faster alternative!\n\nFinally, it is important to note that fault-tolerant quantum computers\nmay change the picture for both quantum and classical machine learning.\nAs mentioned in [Schuld (2021)](https://arxiv.org/abs/2101.11020), early\nresults from the quantum machine learning literature show that larger\nquantum computers will most likely enable us to reduce the quadratic\nscaling of kernel methods to linear scaling, which may make classical as\nwell as quantum kernel methods a strong alternative to neural networks\nfor big data processing one day.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}