{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Conditional quantum wasserstein generative adversarial network for learning DIS stock prices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generative adversarial networks (GANs) are notoriously hard to train. Since GANs don't converge—instead reaching an equilibrium—it is hard to gauge when one should stop training. However, there are several challenges towards reaching an equilibrium, the most prominent of which: diminished gradients, non-convergence, and mode collapse. In a paper by Arjovsky et al. in 2017 [[1]](https://arxiv.org/abs/1701.07875), they propose the Wasserstein generative adversarial network (WGAN), using the Wasserstein metric or Earth-Mover's distance, which mitigates several of these problems in training GANs. Here, I take inspiration from the classical WGAN to implement a hybrid quantum-classical wasserstein generative adversarial network conditioned by DIS stock prices. To be transparent, the application of a CQWGAN to stock price prediction is not well motivated but the intention here is not to complete such an intractable machine learning task, but to evaluate and explore the learnability of a CQWGAN given the vanishing gradient problem we came across in the QGAN implementation. With that said, I provide a brief outline of classical WGANs and this specific model implementation below.\n", "\n", "## Wasserstein Generative Adversarial Networks\n", "\n", "### Wasserstein Loss\n", "Wasserstein GANs improve training by adopting a smooth metric for measuring the distance between two probability distributions. We will briefly overview this metric and why it has smooth gradients. As encountered in the QGAN implementation, the discriminator rapidly learns to differentiate between fake and real data samples. Although this may seem beneficial at first, in reality it overpowers the generator causing near-zero gradients for the generator to train off of. 
Without a rich gradient, the generator is forced to optimize without enough feedback on its previously generated data samples. Fortunately, the Wasserstein loss is mathematically proven to have a much broader support (the subset of the domain containing non-zero values) and therefore avoids vanishing gradients even when the discriminator is trained to optimality. \n", "\n", "For those seeking a deeper intuition behind why the Wasserstein distance works, I explain it here. The traditional loss functions that generative models are trained on depend on the Kullback-Leibler (KL) divergence or the Jensen-Shannon (JS) divergence, which I need not define mathematically; know only that they both quantify the distance between two probability distributions. This is a crucial aspect of GAN loss functions since the discriminator's ability to discern between fake and real data depends centrally on learning the distance between the respective data distributions.\n", "\n", "Unfortunately, the KL and JS divergences both yield near-zero gradients as the generated distribution, $p_g$, ventures further from the real distribution, $p_r$. In addition, at the start of a GAN training procedure—when $p_g$ is highly inaccurate—the discriminator $D$ rapidly learns to differentiate between $p_g$ and $p_r$. The loss then saturates, leaving no gradient with which to update the generator during learning iterations.\n", "\n", "With these problems in mind, we can address Earth Mover's distance and how it can be of help. The Wasserstein distance, also called Earth Mover's distance (EM distance for short), can be informally interpreted as the minimum energy needed to move and transform a pile of dirt in the shape of one probability distribution into the shape of the other distribution. The cost is quantified by the sum of the amount of dirt moved multiplied by the respective moving distance for each move. 
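This dirt-moving cost can be checked numerically. Below is a minimal sketch (the two distributions are made up for illustration, and NumPy/SciPy are assumed to be available) that computes the EM distance between two discrete distributions by hand—using the fact that in one dimension the optimal transport cost reduces to the area between the two CDFs—and compares it against `scipy.stats.wasserstein_distance`:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two made-up discrete distributions on the shared support {0, 1}
support = np.array([0.0, 1.0])
p = np.array([0.75, 0.25])  # "pile of dirt" to reshape
q = np.array([0.25, 0.75])  # target shape

# In 1D, the EM distance is the area between the two CDFs:
# integrate |CDF_p - CDF_q| over each gap between support points
cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]  # both CDFs end at 1
manual = float(np.sum(cdf_gap * np.diff(support)))  # 0.5 of the mass moves distance 1

# SciPy's implementation agrees
assert np.isclose(manual, wasserstein_distance(support, support, p, q))
print(manual)  # 0.5
```

Here the cheapest plan moves 0.5 of the mass from point 0 to point 1, a distance of 1, giving an EM distance of 0.5.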
Below is a depiction of this idea in the case of two discrete distributions ([source](https://arxiv.org/pdf/1904.08994.pdf)).\n", "\n", "\n", "\n", "In the spirit of this idea, we will generalize to find a formula for EM distance that includes continuous distributions. To start, let an individual transport plan, $\\gamma(x,y)$, be the percentage of dirt at point $x$ that needs to be transported such that $x$ follows the same probability distribution as $y$.\n", "\n", "Taking $x$ as the starting point and $y$ as the ending point, we know that the amount of mass needed to complete the individual transformation is $\\gamma (x, y)$ and that the distance covered is $||x - y||$. Hence, the cost for a given transformation $(x \\rightarrow y)$ is $\\gamma(x, y) \\cdot ||x-y||$. Now, to calculate the cost of a certain transport plan from $p_g$ to $p_r$, we just need to sum over the constituent costs of each individual transformation,\n", "$$\\underset{x, y}{\\sum} \\gamma (x, y) \\cdot ||x - y|| = \\mathop{\\mathbb{E}}_{(x,y)∼\\gamma} [||x - y||]$$\n", "\n", "After trying all such transport plans from $p_g$ to $p_r$, we settle on the transport plan with the lowest cost. The cost of this most efficient transport plan is what we define as the EM distance. Mathematically, we take the *infimum* (greatest lower bound), or put simply, the smallest cost in the set of all costs associated with each possible transport plan.\n", "\n", "$$W(p_r, p_g) = \\underset{\\gamma∼\\Pi (p_r,p_g)}{\\text{inf}} \\mathop{\\mathbb{E}}_{(x,y)∼\\gamma}[||x − y||]$$\n", "\n", "Above is the Wasserstein distance between two distributions $p_r$ and $p_g$, where $\\Pi(p_r, p_g)$ defines the set of all possible dirt transport plans, $\\gamma \\in \\Pi (p_r,p_g)$, i.e. the joint probability distributions whose marginals are $p_r$ and $p_g$.\n", "\n", "### Applied to GANs\n", "To remind ourselves of the motivation behind using the Wasserstein loss, consider two disjoint distributions. 
If there truly is zero overlap, then you can expect the KL divergence to blow up to $+\\infty$ and the JS divergence to jump to $\\log 2$, rendering both useless for gradient-based learning at that point. Using the Wasserstein distance instead, the loss remains finite and varies smoothly with the separation between the distributions, making for a stable learning process through gradient updates. This quality is especially useful for GANs since it ensures that the generator will be able to train smoothly in the beginning, no matter how poor its generated samples turn out to be.\n", "\n", "Due to the intractability of evaluating all possible joint distributions in $\\Pi(p_r, p_g)$, the authors of the [original WGAN paper](https://arxiv.org/abs/1701.07875) introduced a reformulation of the original Wasserstein distance based on the Kantorovich-Rubinstein duality:\n", "\n", "$$W(p_r, p_g) = \\frac{1}{K} \\underset{ ||f||_L \\leq K}{\\text{sup}} \\space \\mathop{\\mathbb{E}}_{x∼p_r} [f(x)] - \\mathop{\\mathbb{E}}_{x∼p_g} [f(x)]$$\n", "\n", "where the *supremum* denotes the least upper bound, or put simply, the maximum value. The derivation is out of the scope of this notebook, but if you're interested in digging further I recommend reading through this [post](https://vincentherrmann.github.io/blog/wasserstein/). It's important to note that the new real-valued function $f$ must satisfy the condition $||f||_L \\leq K$, meaning that the function must be $K$-Lipschitz continuous.\n", "\n", "A real-valued function $f$ is called $K$-Lipschitz continuous if there exists a real constant $K ≥ 0$ such that, for all $x_1, x_2 \\in \\mathbb{R}$,\n", "\n", "$$ |f(x_1) - f(x_2)| \\leq K|x_1 - x_2|$$\n", " \n", "Intuitively, the Lipschitz constant $K$ bounds how quickly $f$ can change, which rules out exploding gradients. 
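As a quick numeric illustration of this definition (a sketch, not part of the model): take f(x) = sin(x), which is 1-Lipschitz because its derivative satisfies |cos(x)| ≤ 1 everywhere, and verify the inequality on random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 1.0  # sin is 1-Lipschitz
x1 = rng.uniform(-10.0, 10.0, size=1000)
x2 = rng.uniform(-10.0, 10.0, size=1000)

# Every sampled pair must satisfy |f(x1) - f(x2)| <= K * |x1 - x2|
lhs = np.abs(np.sin(x1) - np.sin(x2))
rhs = K * np.abs(x1 - x2)
assert np.all(lhs <= rhs + 1e-12)  # small epsilon for floating point
```

A function like f(x) = x² fails this check on an unbounded domain, which is exactly the kind of unbounded slope the Lipschitz condition rules out.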
Think of it as a condition that keeps the loss function smooth over the whole domain by placing an upper bound on its slope.\n", "\n", "If we restrict $f$ to a parameterized family of $K$-Lipschitz continuous functions, ${\\{f_w\\}}_{w\\in W}$, the discriminator's new role is to learn $w$ and arrive at a suitable $K$-Lipschitz continuous function $f_w$ such that the loss function measures the Wasserstein distance between $p_r$ and $p_g$. Using the above information, we can formally define the loss function and optimization task as\n", "\n", "$$ L(p_r, p_g) = W(p_r, p_g) = \\underset{w\\in W}{\\text{max}} \\space \\mathop{\\mathbb{E}}_{x∼p_r} [f_w(x)] - \\mathop{\\mathbb{E}}_{z∼p_z(z)} [f_w(g_\\theta(z))]$$\n", "\n", "Notice the absence of any logarithms. The Wasserstein GAN loss function depends entirely on the $K$-Lipschitz continuous functions. Even more importantly, the discriminator here doesn't play the role of a direct classifier, assigning a value in $[0,1]$ to evaluate the authenticity of a presented data sample. Rather, it outputs an unbounded score. Discriminator training simply tries to make this score larger for real instances than for fake instances. Because it doesn't output a probability of real versus fake, the WGAN discriminator is called a \"critic\" instead of a \"discriminator\". This matters because the critic's job is to widen the gap between the scores of $p_r$ and $p_g$, while the generator's objective is to close it. Unsurprisingly, the progression of the generator in this setting correlates with generated sample quality, a guarantee that conventional GANs lack. \n", "\n", "The only theoretical caveat to mention is the difficulty of maintaining $K$-Lipschitz continuity of the critic's function $f_w$. 
To overcome this, the authors of the original paper propose a simple weight clipping technique: after each gradient update, the weights $w$ are clipped to a small window such as $[-0.01, 0.01]$.\n", "\n", "### In practice\n", "WGANs are surprisingly simple to implement in practice, even the loss function! Since the essence of the loss function is the distance between the critic's evaluations of fake and real data samples, we can implement it in our QGAN like so:\n", "\n", "$$L_D = D(G(z)) - D(x) $$\n", "$$L_G = -D(G(z))$$\n", "\n", "The critic tries to minimize the first function. In other words, it tries to maximize the difference between its average score on real samples and its average score on fake samples. The generator tries to minimize the second function. In other words, it tries to maximize the critic's score for its fake data samples. By looking at Brownlee's [implementation](https://machinelearningmastery.com/how-to-code-a-wasserstein-generative-adversarial-network-wgan-from-scratch/) of a WGAN, I arrived at this elegant implementation of the above loss function, with class labels of 1 and -1 assigned to fake data and real data, respectively. \n", "```\n", "from keras import backend\n", "\n", "def wasserstein_loss(y_true, y_pred):\n", "\treturn backend.mean(y_true * y_pred)\n", "```\n", "\n", "## General overview of QWGAN\n", "The quantum Wasserstein generative adversarial network consists of one 19-qubit parameterized quantum generator paired with a classical CNN as the critic. The 19-qubit quantum generator is compiled as a Keras layer; it takes as input the past 15 days of price data, sent in as conditional labels, and a Gaussian noise vector of length 12. The intention of the implementation is to investigate whether the vanishing gradient problem faced with our QGAN implementation is alleviated by the use of the Wasserstein loss function. Below is an image of the model graph. \n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
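As a sanity check of the label convention behind `wasserstein_loss` (a sketch with made-up critic scores, using NumPy in place of the Keras backend), averaging `y_true * y_pred` over a batch recovers exactly the two loss terms $L_D = D(G(z)) - D(x)$ and $L_G = -D(G(z))$:

```python
import numpy as np

def wasserstein_loss(y_true, y_pred):
    # NumPy stand-in for the Keras-backend version
    return np.mean(y_true * y_pred)

# Hypothetical critic scores for a small batch (illustrative numbers only)
real_scores = np.array([2.0, 3.0])  # D(x)
fake_scores = np.array([1.0, 0.0])  # D(G(z))

# Critic update: real samples labeled -1, fake samples labeled 1,
# so the combined loss is mean(D(G(z))) - mean(D(x))
critic_loss = (wasserstein_loss(-np.ones(2), real_scores)
               + wasserstein_loss(np.ones(2), fake_scores))
assert np.isclose(critic_loss, fake_scores.mean() - real_scores.mean())  # L_D

# Generator update: its fakes are labeled -1, so minimizing the loss
# drives the critic's score for fakes upward
gen_loss = wasserstein_loss(-np.ones(2), fake_scores)
assert np.isclose(gen_loss, -fake_scores.mean())  # L_G
```

Minimizing the critic loss widens the score gap between real and fake batches, while minimizing the generator loss closes it, matching the description above.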