Breaking the Linear Iteration Cost Barrier for Some Well-known Conditional Gradient Methods Using MaxIP Data-structures

Zhaozhuo Xu
Rice University
zx22@rice.edu

Zhao Song
Adobe Research
zsong@adobe.com

Anshumali Shrivastava
Rice University and ThirdAI Corp.
anshumali@rice.edu

Abstract

Conditional gradient methods (CGM) are widely used in modern machine learning. CGM’s overall running time usually consists of two parts: the number of iterations and the cost of each iteration. Most efforts focus on reducing the number of iterations as a means to reduce the overall running time. In this work, we focus on improving the per iteration cost of CGM. The bottleneck step in most CGM is maximum inner product search (MaxIP), which requires a linear scan over the parameters. In practice, approximate MaxIP data-structures are found to be helpful heuristics. However, theoretically, nothing is known about the combination of approximate MaxIP data-structures and CGM. In this work, we close this gap by providing a formal framework to combine locality sensitive hashing type approximate MaxIP data-structures with CGM algorithms. As a result, we obtain the first algorithms whose cost per iteration is sublinear in the number of parameters for many fundamental optimization algorithms, e.g., Frank-Wolfe, the Herding algorithm, and policy gradient.

1 Introduction

Conditional gradient methods (CGM), such as Frank-Wolfe and its variants, are well-known optimization approaches that have been extensively used in modern machine learning. For example, CGM has been applied to kernel methods [1, 2], structural learning [3] and online learning [4, 5, 6].

Running Time Acceleration in Optimization: Recent years have witnessed the success of large-scale machine learning models trained on vast amounts of data. In this learning paradigm, the computational overhead of most successful models is dominated by the optimization process [7, 8]. Therefore, reducing the running time of the optimization algorithm is of practical importance. The total running time in optimization can be decomposed into two components: (1) the number of iterations towards convergence, (2) the cost spent in each iteration. Reducing the number of iterations requires a better understanding of the geometric properties of the problem at hand and the invention of better potential functions to analyze the progress of the algorithm [9, 10, 11, 12, 13, 14, 15]. Reducing the cost spent per iteration usually boils down to designing problem-specific discrete data-structures. In the last few years, we have seen a remarkable growth of using data-structures to reduce iteration cost [16, 17, 18, 19, 20, 21, 22, 23, 24].

MaxIP Data-structures for Iteration Cost Reduction: A well-known strategy in optimization with CGM is to perform a greedy search over the weight vectors [9, 10, 13, 16, 25] or training samples [26, 27] in each iteration. In this situation, the cost spent in each iteration is linear in the number of parameters. In practical machine learning, recent works [28, 29, 30, 31, 32] formulate this linear cost in iterative algorithms as an approximate maximum inner product search (MaxIP) problem. They speed up the amortized cost per iteration via efficient data-structures from recent advances in approximate MaxIP [33, 34, 35, 36, 37, 38, 39, 40, 41, 42]. Among approximate MaxIP data-structures, locality sensitive hashing (LSH) achieves promising performance with efficient random projection based preprocessing strategies [33, 34, 35, 36]. Such techniques are widely used in practice for cost reduction in optimization. [28] proposes an LSH based gradient sampling approach that reduces the total empirical running time of adaptive gradient descent. [29] formulates the forward propagation of a deep neural network as a MaxIP problem and uses LSH to select a subset of neurons for backpropagation, so the total running time of neural network training can be reduced to sublinear in the number of neurons. [31] extends this idea with system-level design for further acceleration, and [30] modifies the LSH with learning and achieves promising acceleration in attention-based language models. [32] formulates the greedy step in iterative machine teaching (IMT) as a MaxIP problem and scales IMT to large datasets with LSH.

Challenges of Sublinear Iteration Cost CGM: Despite the practical success of cost-efficient iterative algorithms with approximate MaxIP data-structures, the theoretical analysis of their combination with CGM is not well-understood. In this paper, we focus on this combination and aim to answer the following questions: (1) how to transform the iteration step of CGM algorithms into an approximate MaxIP problem? (2) how does the approximation error in MaxIP affect the total number of iterations CGM needs to converge? (3) how to adapt approximate MaxIP data-structures to iterative CGM algorithms?

Our Contributions: We propose a theoretical formulation that combines approximate MaxIP with the convergence guarantees of CGM. In particular, we start with the popular Frank-Wolfe algorithm over the convex hull, where the direction search in each iteration is a MaxIP problem. Next, we propose a sublinear iteration cost Frank-Wolfe algorithm using LSH type MaxIP data-structures. We then analyze the trade-off of approximate MaxIP and its effect on the number of iterations needed by CGM to converge. We show that the approximation error caused by LSH only leads to a constant multiplicative factor increase in the number of iterations. As a result, we retain the sub-linearity of LSH with respect to the number of parameters while retaining the same asymptotic convergence as CGM.

Our complete contributions are summarized in Section 3 (Table 1 and Theorems 3.1–3.3).

The following sections are organized as follows: Section 2 introduces the related work on data-structures and optimization, Section 3 introduces our algorithms together with the main convergence statements, Section 4 provides the proof sketches of the main statements, Section 5 presents the societal impact, and Section 6 concludes the paper.

2 Related work

2.1 Maximum Inner Product Search for Machine Learning

Maximum Inner Product Search (MaxIP) is a fundamental problem with applications in machine learning. Given a query $x \in \mathbb{R}^d$ and an $n$-vector dataset $Y \subset \mathbb{R}^d$, MaxIP targets searching for the $y \in Y$ that maximizes the inner product $x^\top y$. The naive MaxIP solution takes $O(dn)$ time by comparing $x$ with each $y \in Y$. To accelerate this procedure, various algorithms have been proposed to reduce the running time of MaxIP [33, 34, 36, 35, 37, 38, 43, 44, 39, 45, 40, 41, 42]. We can categorize the MaxIP approaches into two categories: reduction methods and non-reduction methods. Reduction methods use transformations that turn approximate MaxIP into approximate nearest neighbor search (ANN) and solve it with ANN data-structures. One of the most popular such data-structures is locality sensitive hashing [46, 47].

Definition 2.1 (Locality Sensitive Hashing). Let $\bar{c} > 1$ denote a parameter. Let $p_1, p_2 \in (0, 1)$ denote two parameters with $p_1 > p_2$. We say a function family $\mathcal{H}$ is $(r, \bar{c} \cdot r, p_1, p_2)$-sensitive if and only if, for any vectors $x, y \in \mathbb{R}^d$ and any $h$ chosen uniformly at random from $\mathcal{H}$, we have: (1) if $\|x - y\|_2 \leq r$, then $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1$; (2) if $\|x - y\|_2 \geq \bar{c} \cdot r$, then $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2$.

Here we define the LSH functions for Euclidean distance; LSH functions can also be used for search under cosine [48, 49] or Jaccard similarity [50, 51]. [33] first shows that MaxIP can be solved by $\ell_2$ LSH with asymmetric transformations. After that, [34, 36, 35, 43] propose a series of methods to solve MaxIP via LSH functions for other distance measures. Besides LSH, graph-based ANN approaches [38] can also be used after the reduction.
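To make Definition 2.1 concrete, below is a minimal sketch of the classical p-stable (Gaussian) LSH family for Euclidean distance in the spirit of [46]; the class name and parameter choices are our own illustration rather than an API from prior work.

```python
import numpy as np

class L2HashFamily:
    """Sketch of a p-stable (Gaussian) LSH function for l2 distance.
    Vectors within distance r collide with probability at least p1,
    while vectors farther than c*r collide with probability at most p2."""

    def __init__(self, d, bucket_width, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.standard_normal(d)        # random projection direction
        self.b = rng.uniform(0, bucket_width)  # random offset in [0, w)
        self.w = bucket_width

    def hash(self, x):
        # h(x) = floor((<a, x> + b) / w): nearby vectors tend to share buckets
        return int(np.floor((self.a @ x + self.b) / self.w))
```

A full LSH data-structure concatenates several such functions into composite keys and repeats them over independent tables to amplify the gap between $p_1$ and $p_2$.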

On the other hand, non-reduction methods directly build data-structures for approximate MaxIP. [37, 42] use quantization to approximate the inner product and build codebooks for efficient approximate MaxIP. [38, 44] propose greedy algorithms for approximate MaxIP under computation budgets. [39, 40, 41] directly construct navigable graphs that achieve state-of-the-art empirical performance.

Recently, there has been remarkable growth in applying data-structures to machine learning [52, 53, 54]. Following this paradigm, approximate MaxIP data-structures have been applied to overcome the efficiency bottleneck of various machine learning algorithms. [38] formulates the inference of a neural network with a wide output layer as a MaxIP problem and uses a graph-based approach to reduce the inference time. In the same inference task, [55] proposes a learnable LSH data-structure that further improves the inference efficiency with less energy consumption. In neural network training, [29, 30, 31] use approximate MaxIP to retrieve the neurons of interest for backpropagation. In this way, the computational overhead of gradient updates in neural networks can be reduced. In linear regression and classification models, [28] uses approximate MaxIP data-structures to retrieve the samples with large gradient norm and performs standard gradient descent on them, which improves the total running time of stochastic gradient descent. [32] proposes a scalable machine teaching algorithm that enables iterative teaching on large-scale datasets. In the bandit problem, [56] proposes an LSH-based algorithm that solves the linear bandit problem with sublinear time complexity.

Despite the promising empirical results, there is little theoretical analysis on approximate MaxIP for machine learning. We summarize the major reasons as follows: (1) besides LSH, other approximate MaxIP data-structures do not provide theoretical guarantees on time and space complexity; (2) current approaches treat data-structures and learning dynamics separately, so there is no joint analysis of the effect of approximate MaxIP on machine learning.

2.2 Projection-free Optimization

The Frank-Wolfe algorithm [25] is a projection-free optimization method with wide applications in convex [9, 10] and non-convex [11, 12] optimization. The procedure of the Frank-Wolfe algorithm can be summarized in two steps: (1) given the gradient, find a vector in the feasible domain that has the maximum inner product with it, (2) update the current weight with the retrieved vector. Formally, given a function $g : \mathbb{R}^d \rightarrow \mathbb{R}$ over a convex set $S$, starting from an initial weight $w^0$, the Frank-Wolfe algorithm updates the weight with learning rate $\eta_t$ as follows:

s^t \leftarrow \arg \min_{s \in S} \langle s, \nabla g(w^t) \rangle,

w^{t+1} \leftarrow (1 - \eta_t) \cdot w^t + \eta_t \cdot s^t.
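As an illustration, the following is a minimal sketch of this two-step procedure when the feasible set is the convex hull of the rows of a matrix $S$; the function names are our own, and the linear scan in step (1) is exactly the per-iteration bottleneck this paper targets.

```python
import numpy as np

def frank_wolfe(grad_g, S, w0, T):
    """Sketch of the classical Frank-Wolfe update over the convex hull
    of the rows of S (shape n x d). grad_g(w) returns the gradient at w."""
    w = w0.copy()
    for t in range(T):
        # Step (1): direction search -- an O(dn) linear scan solving
        # argmin_{s in S} <s, grad g(w)>, equivalently a MaxIP query.
        s = S[np.argmin(S @ grad_g(w))]
        # Step (2): the convex combination keeps w inside the hull.
        eta = 2.0 / (t + 2)
        w = (1.0 - eta) * w + eta * s
    return w
```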

Previous literature focuses on reducing the number of iterations of the Frank-Wolfe algorithm over specific domains such as $\ell_p$ balls [9, 10, 13, 14]. The cost reduction in the iterative procedure of the Frank-Wolfe algorithm is hardly discussed, except in [57]. In this work, we focus on the Frank-Wolfe algorithm over the convex hull of a finite feasible set. This formulation is more general, and it includes recent Frank-Wolfe applications in probabilistic modeling [1, 2], structural learning [3] and policy optimization [5].

3 Our Sublinear Iteration Cost Algorithm

In this section, we formally present our results on the sublinear iteration cost CGM algorithms. We start with the preliminary definitions of the objective function. Then, we present the guarantees on the number of iterations and the cost per iteration for our sublinear CGM algorithms to converge.

3.1 Preliminaries

We provide the notations and settings for this paper. For a positive integer $n$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For a vector $x$, we use $\|x\|_2 := (\sum_{i=1}^{n} x_i^2)^{1/2}$ to denote its $\ell_2$ norm.

We say a function $L$ is convex if, for all $x, y$,

L(x) \geq L(y) + \langle \nabla L(y), x - y \rangle.

We say a function $L$ is $\beta$-smooth if, for all $x, y$,

L(y) \leq L(x) + \langle \nabla L(x), y - x \rangle + \frac{\beta}{2} \|y - x\|_2^2.

Given a set $A = \{x_i\}_{i \in [n]} \subset \mathbb{R}^d$, we say its convex hull $B(A)$ is the collection of all convex combinations $y = \sum_{i \in [n]} a_i \cdot x_i$, where $a_i \in [0, 1]$ for all $i \in [n]$ and $\sum_{i \in [n]} a_i = 1$.

Let $D_{\max}$ denote the maximum diameter of $B(A)$, so that $\|x - y\|_2 \leq D_{\max}$ for all $x, y \in B(A)$. We present the detailed definitions in Appendix A.

Next, we present the settings of our work. Let $S \subset \mathbb{R}^d$ denote an $n$-point finite set. Given a convex and $\beta$-smooth function $g : \mathbb{R}^d \rightarrow \mathbb{R}$ defined over the convex hull $B(S)$, our goal is to find a $w \in B(S)$ that minimizes $g(w)$. When $n$ is large and the dimension is high, the dominant per-iteration cost lies in finding the MaxIP of $\nabla g(w)$ with respect to $S$. In this setting, the fast learning rate of Frank-Wolfe over $\ell_p$ balls [9, 13, 16] cannot be achieved. We present the detailed problem setting of the Frank-Wolfe algorithm in Appendix C.

3.2 Our Results

We present our main results with a comparison to the original algorithms in Table 1. From the table, we show that with near-linear preprocessing time, our algorithms maintain the same number of iterations towards convergence while reducing the cost spent in each iteration to be sublinear in the number of possible parameters.

Statement | Preprocess | #Iters | Cost per iter
Frank-Wolfe [9] | — | $O(\beta D_{\max}^2/\epsilon)$ | $O(dn + T_g)$
Ours, Theorem 3.1 | $O(dn^{1+o(1)})$ | $O(\beta D_{\max}^2/\epsilon)$ | $O(dn^\rho + T_g)$
Herding [1] | — | $O(D_{\max}^2/\epsilon)$ | $O(dn)$
Ours, Theorem 3.2 | $O(dn^{1+o(1)})$ | $O(D_{\max}^2/\epsilon)$ | $O(dn^\rho)$
Policy gradient [5] | — | $O(\beta D_{\max}^2/\epsilon)$ | $O(dn + T_Q)$
Ours, Theorem 3.3 | $O(dn^{1+o(1)})$ | $O(\beta D_{\max}^2/\epsilon)$ | $O(dn^\rho + T_Q)$

Table 1: Comparison between the classical algorithms and our sublinear time algorithms. (1) “Frank-Wolfe” denotes the Frank-Wolfe algorithm [9] for convex functions over a convex hull; let $T_g$ denote the time for evaluating the gradient at any parameter. (2) “Herding” denotes the kernel Herding algorithm [1]. (3) “Policy gradient” denotes the projection-free policy gradient method [5]; let $T_Q$ denote the time for evaluating the policy gradient at any parameter. Let $\gamma \in (0, 1)$ denote the discount factor and $\mu_{\min}$ denote the minimum probability density of a state. Note that $n$ is the number of possible parameters, and $n^{o(1)}$ is smaller than $n^c$ for any constant $c > 0$. Let $\rho \in (0, 1)$ denote a fixed parameter determined by the LSH data-structure. The failure probability of our algorithms is $1/\text{poly}(n)$. $\beta$ is the smoothness factor, and $D_{\max}$ denotes the maximum diameter of the convex hull.

Next, we introduce the statement of our sublinear iteration cost algorithms. We start by introducing our result for improving the running time of Frank-Wolfe.

Theorem 3.1 (Sublinear time Frank-Wolfe, informal version of Theorem D.1). Let $g : \mathbb{R}^d \rightarrow \mathbb{R}$ denote a convex and $\beta$-smooth function. Let $T_g$ denote the complexity of calculating $\nabla g(x)$. Let $S \subset \mathbb{R}^d$ denote a set of $n$ points. Let $B \subset \mathbb{R}^d$ denote the convex hull of $S$ with maximum diameter $D_{\max}$. Let $\rho \in (0, 1)$ denote a fixed parameter. For any parameters $\epsilon, \delta$, there is an iterative algorithm (Algorithm 2) that takes $O(dn^{1+o(1)})$ time in pre-processing, takes $T = O(\beta D_{\max}^2/\epsilon)$ iterations and $O(dn^\rho + T_g)$ cost per iteration, starts from a random $w_0$ from $B$ as its initialization point, and outputs $w_T \in \mathbb{R}^d$ from $B$ such that

g(w_T) - \min_{w \in B} g(w) \leq \epsilon,

holds with probability at least $1 - 1/\text{poly}(n)$.

Next, we show our main result for the Herding algorithm. The Herding algorithm is widely applied in kernel methods [58]. [1] shows that the Herding algorithm is equivalent to a conditional gradient method with the least-squares loss function. Therefore, we extend our results and obtain the following statement.

Theorem 3.2 (Sublinear time Herding algorithm, informal version of Theorem E.3). Let $X \subset \mathbb{R}^d$ denote a feature set and $\Phi : \mathbb{R}^d \rightarrow \mathbb{R}^K$ denote a mapping. Let $D_{\max}$ denote the maximum diameter of $\Phi(X)$ and $B$ be the convex hull of $\Phi(X)$. Given a distribution $p(x)$ over $X$, we denote $\mu = \mathbb{E}_{x \sim p(x)}[\Phi(x)]$. Let $\rho \in (0, 1)$ denote a fixed parameter. For any parameters $\epsilon, \delta$, there is an iterative algorithm (Algorithm 3) that takes $O(dn^{1+o(1)})$ time in pre-processing, takes $T = O(D_{\max}^2/\epsilon)$ iterations and $O(dn^\rho)$ cost per iteration, starts from a random $w_0$ from $B$ as its initialization point, and outputs $w_T \in \mathbb{R}^K$ from $B$ such that

\frac{1}{2}\|w_T - \mu\|_2^2 - \min_{w \in B} \frac{1}{2}\|w - \mu\|_2^2 \leq \epsilon,

holds with probability at least $1 - 1/\text{poly}(n)$.

Finally, we present our result for policy gradient. Policy gradient [59] is a popular algorithm with wide applications in robotics [60] and recommendation [61]. [5] proposes a provable Frank-Wolfe method that maximizes the reward functions with policy gradient. However, the optimization requires a linear scan over all possible actions, which is unscalable in complex environments. We propose an efficient Frank-Wolfe algorithm with per iteration cost sublinear in the number of actions. Our statement is presented below.

Theorem 3.3 (Sublinear time policy gradient, informal version of Theorem F.3). Let $T_Q$ denote the time for computing the policy gradient. Let $D_{\max}$ denote the maximum diameter of the action space and let $\beta$ be a constant. Let $\gamma \in (0, 1)$, and let $\rho \in (0, 1)$ denote a fixed parameter. Let $\mu_{\min}$ denote the minimal density of states in $S$. There is an iterative algorithm (Algorithm 5) that spends $O(dn^{1+o(1)})$ time in preprocessing, takes $O(\frac{\beta D_{\max}^2}{\epsilon^2(1-\gamma)^2\mu_{\min}})$ iterations and $O(dn^\rho + T_Q)$ cost per iteration, starts from a random point $\pi_0$ as its initial point, and outputs $\pi_T$ whose average gap satisfies $\sqrt{\sum_{s \in S} g_T(s)^2} < \epsilon$ with probability at least $1 - 1/\text{poly}(n)$, where $g_T(s)$ is defined in Eq. (6).

4 Proof Overview

We present the overview of the proofs in this section. We start by introducing the efficient MaxIP data-structures. Next, we show how to transform the direction search in a conditional gradient approach into a MaxIP problem. Finally, we provide proof sketches for each main statement in Section 3. The detailed proofs are presented in the supplementary material.

4.1 Approximate MaxIP Data-structures

We present the LSH data-structures for approximate MaxIP in this section. The detailed description is presented in Appendix A. We use the reduction-based approximate MaxIP method with LSH data-structures to achieve sublinear iteration cost. Note that we choose this method due to its clear theoretical guarantee on the retrieval results. LSH data-structures are classically used for the approximate nearest neighbor problem. The following definition of approximate nearest neighbor search is standard in the literature [62, 46, 47, 63, 64, 65, 66, 67, 68, 69, 70].

Definition 4.1 (Approximate Nearest Neighbor (ANN)). Let $\overline{c} > 1$ and $r \in (0, 2)$ denote two parameters. Given an $n$-vector set $Y \subset S^{d-1}$ on a unit sphere, the objective of the $(\overline{c}, r)$-Approximate Nearest Neighbor (ANN) problem is to construct a data structure that, for any query $x \in S^{d-1}$ such that $\min_{y \in Y} \|y - x\|_2 \leq r$, returns a vector $z$ from $Y$ that satisfies $\|z - x\|_2 \leq \overline{c} \cdot r$.

In iterative-type optimization algorithms, the cost per iteration can be dominated by the approximate MaxIP problem (Definition 4.2), which is the dual problem of $(\overline{c}, r)$-ANN.

Definition 4.2 (Approximate MaxIP). Let $c \in (0, 1)$ and $\tau \in (0, 1)$ denote two parameters. Given an $n$-vector dataset $Y \subset S^{d-1}$ on a unit sphere, the objective of the $(c, \tau)$-MaxIP is to construct a data structure that, given a query $x \in S^{d-1}$ such that $\max_{y \in Y} \langle x, y \rangle \geq \tau$, retrieves a vector $z$ from $Y$ that satisfies $\langle x, z \rangle \geq c \cdot \max_{y \in Y} \langle x, y \rangle$.

Next, we present the primal-dual connection between ANN and approximate MaxIP. Given two unit vectors $x, y \in \mathbb{R}^d$, we have $\|x - y\|_2^2 = 2 - 2\langle x, y \rangle$. Therefore, we can maximize $\langle x, y \rangle$ by minimizing $\|x - y\|_2^2$. Based on this connection, we present how to solve $(c, \tau)$-MaxIP using $(\overline{c}, r)$-ANN, and start by showing how to solve $(\overline{c}, r)$-ANN with LSH.
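Before stating the LSH guarantee, the duality itself is easy to check numerically; the following sketch (variable names ours) confirms that, on the unit sphere, the inner-product maximizer and the nearest neighbor coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
# A unit-norm query and dataset: ||x - y||_2^2 = 2 - 2<x, y> on the sphere.
x = rng.standard_normal(8)
x /= np.linalg.norm(x)
Y = rng.standard_normal((100, 8))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

maxip = np.argmax(Y @ x)                        # MaxIP solution
ann = np.argmin(np.linalg.norm(Y - x, axis=1))  # nearest neighbor solution
assert maxip == ann  # the two problems are dual on the unit sphere
```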

Theorem 4.3 (Andoni, Laarhoven, Razenshteyn and Waingarten [67]). Let $\overline{c} > 1$ and $r \in (0, 2)$ denote two parameters. One can solve $(\overline{c}, r)$-ANN on a unit sphere in query time $O(d \cdot n^\rho)$ using preprocessing time $O(dn^{1+o(1)})$ and space $O(n^{1+o(1)} + dn)$, where $\rho = \frac{2}{\overline{c}^2} - \frac{1}{\overline{c}^4} + o(1)$.
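As a concrete instance of Theorem 4.3 (our own numerical example), taking $\overline{c} = 2$ gives

\rho = \frac{2}{\overline{c}^2} - \frac{1}{\overline{c}^4} + o(1) = \frac{1}{2} - \frac{1}{16} + o(1) \approx 0.44,

so the query time is roughly $O(dn^{0.44})$, strictly sublinear in $n$.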

Next, we solve $(c, \tau)$-MaxIP by solving $(\overline{c}, r)$-ANN using Theorem 4.3. We have

Corollary 4.4 (An informal statement of Corollary B.1). Let $c \in (0, 1)$ and $\tau \in (0, 1)$ denote two parameters. One can solve $(c, \tau)$-MaxIP on a unit sphere $S^{d-1}$ in query time $O(d \cdot n^\rho)$, where $\rho \in (0, 1)$, using LSH with both preprocessing time and space in $O(dn^{1+o(1)})$.

In our work, we consider a generalized form of approximate MaxIP, denoted as projected approximate MaxIP.

Definition 4.5 (Projected approximate MaxIP). Let $\phi, \psi : \mathbb{R}^d \rightarrow \mathbb{R}^k$ denote two transforms. Given an $n$-vector dataset $Y \subset \mathbb{R}^d$ so that $\psi(Y) \subset S^{k-1}$, the goal of the $(c, \phi, \psi, \tau)$-MaxIP is to construct a data structure that, given a query $x \in \mathbb{R}^d$ with $\phi(x) \in S^{k-1}$ such that $\max_{y \in Y} \langle \phi(x), \psi(y) \rangle \geq \tau$, retrieves a vector $z \in Y$ that satisfies $\langle \phi(x), \psi(z) \rangle \geq c \cdot (\phi, \psi)\text{-MaxIP}(x, Y)$.

For details of space-time trade-offs, please refer to Appendix B. The following sections show how to use projected approximate MaxIP to accelerate the optimization algorithm by reducing the cost per iteration.

4.2 Efficient Transformations

We have learned from Section 4.1 that we can solve $(c, \tau)$-MaxIP on a unit sphere $S^{d-1}$ using LSH for ANN. Therefore, the next step is to transform the direction search procedure in the iterative optimization algorithm into a MaxIP on a unit sphere. To achieve this, we formulate the direction search as a projected approximate MaxIP (see Definition A.5). We start by presenting a pair of transformations $\phi_0, \psi_0 : \mathbb{R}^d \rightarrow \mathbb{R}^{d+1}$ such that, given a function $g : \mathbb{R}^d \rightarrow \mathbb{R}$, for any $x, y$ in a convex set $K$, we have

\phi_0(x) := [\nabla g(x)^\top, x^\top \nabla g(x)]^\top, \quad \psi_0(y) := [-y^\top, 1]^\top. \tag{1}

In this way, we show that

\langle y - x, \nabla g(x) \rangle = -\langle \phi_0(x), \psi_0(y) \rangle, \qquad \arg \min_{y \in Y} \langle y - x, \nabla g(x) \rangle = \arg \max_{y \in Y} \langle \phi_0(x), \psi_0(y) \rangle. \tag{2}

Therefore, we could transform the direction search problem into a MaxIP problem.
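The identity in Eq. (2) can be checked numerically; below is a small sketch (function names ours) of $\phi_0, \psi_0$ and the sign-flip relation.

```python
import numpy as np

def phi0(x, grad):  # query transform of Eq. (1): [grad g(x)^T, x^T grad g(x)]^T
    return np.concatenate([grad, [x @ grad]])

def psi0(y):        # data transform of Eq. (1): [-y^T, 1]^T
    return np.concatenate([-y, [1.0]])

rng = np.random.default_rng(0)
x, y, g = rng.standard_normal((3, 5))  # g stands in for the gradient at x
lhs = (y - x) @ g
rhs = -(phi0(x, g) @ psi0(y))
assert np.isclose(lhs, rhs)  # <y - x, grad g(x)> = -<phi0(x), psi0(y)>
```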

Next, we present standard transformations [36] that connect MaxIP to ANN on the unit sphere. For any $x, y \in \mathbb{R}^d$, we use transformations $\phi_1, \psi_1 : \mathbb{R}^d \to \mathbb{R}^{d+2}$ such that

\phi_1(x) = \left[ (D_x^{-1} x)^\top, \; 0, \; \sqrt{1 - \|D_x^{-1} x\|_2^2} \right]^\top, \quad \psi_1(y) = \left[ (D_y^{-1} y)^\top, \; \sqrt{1 - \|D_y^{-1} y\|_2^2}, \; 0 \right]^\top. \tag{3}

Here $D_x, D_y$ are constants that ensure both $x/D_x$ and $y/D_y$ have norms at most $1$. Under these transformations, both $\phi_1(x)$ and $\psi_1(y)$ have unit norm, and $\arg \max_{y \in Y} \langle \phi_1(x), \psi_1(y) \rangle = \arg \max_{y \in Y} \langle x, y \rangle$.

Combining the transformations in Eq. (1) and Eq. (3), we obtain a query transform $\phi : \mathbb{R}^d \to \mathbb{R}^{d+3}$ with form $\phi(x) = \phi_1(\phi_0(x))$ and a data transform $\psi : \mathbb{R}^d \to \mathbb{R}^{d+3}$ with form $\psi(y) = \psi_1(\psi_0(y))$. Using $\phi$ and $\psi$, we transform the direction search problem in optimization into a MaxIP on the unit sphere. Moreover, given a set $Y \subset \mathbb{R}^d$ and a query $x \in \mathbb{R}^d$, the solution $z$ of $(c, \phi, \psi, \tau)$-MaxIP over $(x, Y)$ has the property that $\langle z - x, \nabla g(x) \rangle \leq c \cdot \min_{y \in Y} \langle y - x, \nabla g(x) \rangle$. Thus, we can approximate the direction search with the LSH-based MaxIP data-structure.
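A minimal sketch of Eq. (3) follows (function names ours; $D_x, D_y$ instantiated as the largest norms, which is one valid choice), verifying the unit-norm and argmax-preservation properties.

```python
import numpy as np

def phi1(x, Dx):  # query transform of Eq. (3); unit norm by construction
    u = x / Dx
    return np.concatenate([u, [0.0], [np.sqrt(max(0.0, 1.0 - u @ u))]])

def psi1(y, Dy):  # data transform of Eq. (3)
    v = y / Dy
    return np.concatenate([v, [np.sqrt(max(0.0, 1.0 - v @ v))], [0.0]])

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
Y = rng.standard_normal((50, 4))
Dx, Dy = np.linalg.norm(x), np.linalg.norm(Y, axis=1).max()

qx = phi1(x, Dx)
PY = np.array([psi1(y, Dy) for y in Y])
assert np.isclose(qx @ qx, 1.0)                 # query lies on the unit sphere
assert np.allclose((PY ** 2).sum(axis=1), 1.0)  # so does every data point
# <phi1(x), psi1(y)> = <x, y> / (Dx * Dy), so the argmax is preserved:
assert np.argmax(PY @ qx) == np.argmax(Y @ x)
```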

Note that LSH can only solve MaxIP problems with positive inner product values. We find that the direction search problem naturally satisfies this condition: if $g$ is convex, then for any set $S \subset \mathbb{R}^d$ we have $\min_{s \in S} \langle \nabla g(x), s - x \rangle \leq 0$ for any $x \in B(S)$, where $B(S)$ is the convex hull of $S$. Thus, $\max_{y \in Y} \langle \phi_0(x), \psi_0(y) \rangle$ is non-negative following Eq. (2).

4.3 Proof of Theorem 3.1

We present the proof sketch for Theorem 3.1 in this section. We refer the readers to Appendix D for the detailed proofs.

Let $g : \mathbb{R}^d \to \mathbb{R}$ denote a convex and $\beta$-smooth function. Let $T_g$ denote the complexity of calculating $\nabla g(x)$. Let $S \subset \mathbb{R}^d$ denote a set of $n$ points, and $B \subset \mathbb{R}^d$ be the convex hull of $S$ with maximum diameter $D_{\max}$. Let $\phi, \psi : \mathbb{R}^d \to \mathbb{R}^{d+3}$ denote the transformations defined in Section 4.2. Starting from a random vector $w^0 \in B(S)$, our sublinear Frank-Wolfe algorithm follows the update rule below at each step:

s^t \leftarrow (c, \phi, \psi, \tau)\text{-MaxIP of } w^t \text{ with respect to } S, \qquad w^{t+1} \leftarrow w^t + \eta_t \cdot (s^t - w^t).
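Below is a sketch of the resulting loop. `MaxIPIndex` is a hypothetical stand-in for the LSH-based $(c, \phi, \psi, \tau)$-MaxIP structure of Corollary 4.4 (preprocessing $O(dn^{1+o(1)})$, query $O(dn^\rho)$); we substitute an exact brute-force scan so the sketch runs end to end, not a concrete library API.

```python
import numpy as np

class MaxIPIndex:
    """Hypothetical stand-in for the LSH-based (c, phi, psi, tau)-MaxIP
    data-structure. A real index would hash psi(S) in preprocessing and
    answer queries in O(d n^rho); here we use an exact O(dn) scan."""

    def __init__(self, S):
        self.S = S  # preprocessing would build LSH tables over psi(S)

    def query(self, w, grad):
        # A real index queries with phi(w) and returns a c-approximate
        # minimizer of <s - w, grad> (cf. Eq. (4)); this scan is exact.
        return self.S[np.argmin((self.S - w) @ grad)]

def sublinear_frank_wolfe(grad_g, S, w0, T):
    """Sketch of Algorithm 2: Frank-Wolfe with the direction search
    served by an approximate MaxIP index instead of a per-step scan."""
    index = MaxIPIndex(S)  # one-time preprocessing
    w = w0.copy()
    for t in range(T):
        s = index.query(w, grad_g(w))  # sublinear-time direction search
        eta = 2.0 / (t + 2)
        w = w + eta * (s - w)
    return w
```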

We start by upper bounding $\langle s^t - w^t, \nabla g(w^t) \rangle$. Because $s^t$ is the $(c, \phi, \psi, \tau)$-MaxIP of $w^t$ with respect to $S$, we have

\langle s^t - w^t, \nabla g(w^t) \rangle \leq c \cdot \min_{s \in S} \langle s - w^t, \nabla g(w^t) \rangle \leq c \cdot \langle w^* - w^t, \nabla g(w^t) \rangle. \tag{4}

For convenience of the proof, for each $t$, we define $h_t = g(w^t) - g(w^*)$, where $w^* \in B$ denotes a minimizer of $g$ over $B$. Next, we upper bound $h_{t+1}$ as

\begin{align*} h_{t+1} &\leq g(w^t) + \eta_t \langle s^t - w^t, \nabla g(w^t) \rangle + \frac{\beta}{2} \eta_t^2 \|s^t - w^t\|_2^2 - g(w^*) \\ &\leq g(w^t) + c \eta_t \langle w^* - w^t, \nabla g(w^t) \rangle + \frac{\beta}{2} \eta_t^2 \|s^t - w^t\|_2^2 - g(w^*) \\ &\leq g(w^t) + c \eta_t \langle w^* - w^t, \nabla g(w^t) \rangle + \frac{\beta D_{\max}^2}{2} \eta_t^2 - g(w^*) \\ &\leq (1 - c \eta_t) g(w^t) + c \eta_t g(w^*) + \frac{\beta D_{\max}^2}{2} \eta_t^2 - g(w^*), \tag{5} \end{align*}

where the first step follows from the definition of $\beta$-smoothness, the second step follows from Eq. (4), the third step follows from the definition of $D_{\max}$, and the fourth step follows from the convexity of $g$.

Let $\eta_t = \frac{2}{c(t+2)}$ and $A_t = \frac{t(t+1)}{2}$. Combining them with Eq. (5), we show that

A_{t+1} h_{t+1} - A_t h_t = c^{-2} \cdot \frac{t+1}{t+2} \cdot \beta D_{\max}^2 < c^{-2} \beta D_{\max}^2.

Using induction from $1$ to $t$, we show that

A_t h_t < c^{-2} t \beta D_{\max}^2.

Taking $A_t = \frac{t(t+1)}{2}$ into consideration, we have

h_t < \frac{2 \beta D_{\max}^2}{c^2 (t + 1)}.

Given a constant approximation ratio $c$, $t$ should be $O(\frac{\beta D_{\max}^2}{\epsilon})$ so that $h_t \leq \epsilon$. Thus, we complete the proof.

Cost Per Iteration. After we take $O(dn^{1+o(1)})$ preprocessing time, the cost per iteration consists of three parts: (1) it takes $T_g$ time to compute $\nabla g(w^t)$, (2) it takes $O(d)$ time to perform the transforms $\phi$ and $\psi$, (3) it takes $O(dn^\rho)$ time to retrieve $s^t$ from the LSH data-structure. Thus, the total cost per iteration is $O(dn^\rho + T_g)$.

Next, we show how to extend the proof to the Herding problem. Following [1], we start by defining the function $g(w) = \frac{1}{2}\|w - \mu\|_2^2$. We show that $g$ is a convex and $1$-smooth function; therefore, the Herding algorithm is equivalent to the Frank-Wolfe algorithm over $g$. Using the proof of Theorem 3.1 with $\beta = 1$, we show that it takes $T = O(D_{\max}^2/\epsilon)$ iterations and $O(dn^\rho)$ cost per iteration to reach an $\epsilon$-optimal solution. Similar to Theorem 3.1, the cost per iteration is $O(dn^\rho)$, as it takes $O(d)$ time to compute $\nabla g(w^t)$.
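For completeness, the $1$-smoothness claim follows from a direct expansion using $\nabla g(w) = w - \mu$:

g(y) = \frac{1}{2}\|(x - \mu) + (y - x)\|_2^2 = g(x) + \langle \nabla g(x), y - x \rangle + \frac{1}{2}\|y - x\|_2^2,

so the $\beta$-smoothness inequality holds with equality for $\beta = 1$, and convexity follows since the quadratic remainder is non-negative.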

4.4 Proof of Theorem 3.3

We present the proof sketch for Theorem 3.3 in this section. We refer the readers to Appendix F for the detailed proofs.

In this paper, we focus on the action-constrained Markov Decision Process (ACMDP). In this setting, we are provided with a state space $S \subset \mathbb{R}^k$ and an action space $A \subset \mathbb{R}^d$. However, at each step $t \in \mathbb{N}$, we can only access a finite $n$-vector set of actions $C(s) \subset A$. We assume that $C(s)$ remains the same at each step. Let $D_{\max}$ denote the maximum diameter of $A$.

Under this ACMDP, a policy is defined as $\pi_\theta : S \rightarrow A$ with parameter $\theta$. Meanwhile, there exists a reward function $r : S \times A \rightarrow [0, 1]$. Then, we define the Q function as below,

Q(s, a|\pi_\theta) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a, \pi_\theta \right],

where $\gamma \in (0, 1)$ is a discount factor.

Given a state distribution $\mu$, the objective of policy gradient is to maximize $J(\mu, \pi_\theta) = \mathbb{E}_{s \sim \mu, a \sim \pi_\theta}[Q(s, a|\pi_\theta)]$ via the policy gradient [59] denoted as:

\nabla_\theta J(\mu, \pi_\theta) = \mathbb{E}_{s \sim d^\pi} \left[ \nabla_\theta \pi_\theta(s) \nabla_a Q(s, \pi_\theta(s)|\pi_\theta) \right].

[5] proposes an iterative algorithm that performs a MaxIP over actions at each iteration $k$ to find

g_k(s) = \max_{a \in C(s)} \langle a - \pi^k_\theta(s), \nabla_a Q(s, \pi^k_\theta(s)|\pi^k_\theta) \rangle. \tag{6}

In this work, we approximate Eq. (6) using $(c, \phi, \psi, \tau)$-MaxIP. Here we define $\phi : S \times \mathbb{R}^d \rightarrow \mathbb{R}^{d+1}$ and $\psi : \mathbb{R}^d \rightarrow \mathbb{R}^{d+1}$ as follows:

\phi(s, \pi^k_\theta) := [\nabla_a Q(s, \pi^k_\theta(s)|\pi^k_\theta)^\top, \; \pi^k_\theta(s)^\top \nabla_a Q(s, \pi^k_\theta(s)|\pi^k_\theta)]^\top, \quad \psi(a) := [a^\top, -1]^\top.

Then, we have $g_k(s) = \max_{a \in C(s)} \langle \phi(s, \pi^k_\theta), \psi(a) \rangle$. Note that we still require the transformations in Eq. (3) to generate unit vectors.

Next, we show that if we retrieve an action $\hat{a}^k_s$ using $(c, \phi, \psi, \tau)$-MaxIP, the resulting gap $\hat{g}_k(s)$ is lower bounded as

\hat{g}_k(s) = \langle \hat{a}^k_s - \pi^k_\theta(s), \nabla_a Q(s, \pi^k_\theta(s)|\pi^k_\theta) \rangle \geq c \cdot g_k(s). \tag{7}

Combining Eq. (7) with the standard induction in [5], we upper bound $\sum_{s \in S} g_T(s)^2$ as

\sum_{s \in S} g_T(s)^2 \leq \frac{1}{T + 1} \cdot \frac{2\beta D_{\max}^2}{c^2(1 - \gamma)^3 \mu_{\min}^2}, \tag{8}

where $\mu_{\min}$ denotes the minimal density of states in $S$ and $\beta$ is the smoothness factor.

In this way, given a constant factor $c$, if we would like to guarantee $\sum_{s \in S} g_T(s)^2 < \epsilon^2$, then $T$ should be $O(\frac{\beta D_{\max}^2}{\epsilon^2(1 - \gamma)^3 \mu_{\min}^2})$.

Cost Per Iteration. After we take $O(dn^{1+o(1)})$ preprocessing time, the cost per iteration consists of three parts: (1) it takes $T_Q$ time to compute the policy gradient, (2) it takes $O(d)$ time to perform the transforms $\phi$ and $\psi$, (3) it takes $O(dn^\rho)$ time to retrieve actions from the LSH data-structure. Thus, the total cost per iteration is $O(dn^\rho + T_Q)$.

4.5 Quantization for Adaptive Queries

In optimization, the gradients computed in successive iterations are not independent of each other. This creates a problem for MaxIP data-structures: if we use a vector containing the gradients as a query, the query failure probability in each iteration is not independent, so the total failure probability cannot be union bounded. As previous MaxIP data-structures assume that queries are independent, the original failure analysis cannot be directly applied.

This work uses a standard query quantization method to handle the adaptive query sequence in optimization. Given the known query space, we quantize it by lattices [71]. This quantization is similar to constructing a Voronoi diagram: each query falls into a cell with a center vector, and we perform the query using the center vector of the cell. Therefore, the failure probability of the MaxIP query sequence is equivalent to the probability that any center vector fails to retrieve its approximate MaxIP solution. As the centers of cells are independent, we can union bound this probability. On the other hand, as the maximum diameter of a cell is $\lambda$, this query quantization introduces a $\lambda$ additive error in the retrieved inner product. We refer the readers to Appendix G for the detailed quantization approach.
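A minimal sketch of this idea with a cubic lattice (our simplification of the lattice quantization in [71]) is below: each adaptive query is snapped to the center of its cell, so only the finitely many, query-independent cell centers ever reach the LSH data-structure.

```python
import numpy as np

def quantize(q, cell_width):
    """Snap an adaptive query q to the center of its lattice cell.
    The cell diameter bounds the additive error lambda on the retrieved
    inner product; failure probability is union bounded over centers."""
    return (np.floor(q / cell_width) + 0.5) * cell_width

# Two nearby gradients map to the same, fixed query vector:
q1 = np.array([0.12, -0.34])
q2 = np.array([0.14, -0.31])
assert np.allclose(quantize(q1, 0.1), quantize(q2, 0.1))
```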

4.6 Optimizing Accuracy-Efficiency Trade-off

In this work, we show that with the LSH-based MaxIP data-structure, the cost for direction search is $O(dn^\rho)$, where $\rho \in (0, 1)$. In Section D.2 of the supplementary material, we show that $\rho$ is a function of the constant $c$ and the parameter $\tau$ in approximate MaxIP (see Definition 4.2). Moreover, we also show in Section D.2 that LSH results in only a constant multiplicative factor increase in the number of iterations. Considering the cost per iteration and the number of iterations, we show that when our algorithms stop at an $\epsilon$-optimal solution, LSH achieves acceleration in the overall running time. Therefore, we can set the $c$ and $\tau$ parameters to balance the accuracy-efficiency trade-off of CGM and achieve the desired running time.

5 Potential Negative Societal Impact

This paper discusses the theoretical foundation of data-structures for conditional gradient methods. We believe that this paper does not have a negative societal impact on the environment, privacy, or other domains.

6 Concluding Remarks

In this work, we present the first Frank-Wolfe algorithms that achieve sublinear time cost per iteration. We also extend this result to the Herding algorithm and policy gradient methods. We formulate the direction search in the Frank-Wolfe algorithm as a projected approximate maximum inner product search problem with a pair of efficient transformations. Then, we use locality sensitive hashing data-structures to reduce the iteration cost to sublinear in the number of possible parameters. Our theoretical analysis shows that the sublinear iteration cost Frank-Wolfe algorithm preserves the same order in the number of iterations towards convergence. Moreover, we analyze and optimize the trade-off between saving iteration cost and increasing the number of iterations to achieve sublinear total running time. Furthermore, we identify the problems of existing MaxIP data-structures for cost reduction in iterative optimization algorithms and propose the corresponding solutions. We hope this work can be the starting point of future study on sublinear iteration cost algorithms for optimization.

Acknowledgements

This work was supported by National Science Foundation IIS-1652131, BIGDATA-1838177, AFOSR-YIP FA9550-18-1-0152, ONR DURIP Grant, and the ONR BRC grant on Randomized Numerical Linear Algebra. The authors would like to thank Beidi Chen for the helpful discussion on optimization. The authors would like to thank Lichen Zhang for valuable discussions about data-structures.

References

[1] Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. arXiv preprint arXiv:1203.4523, 2012.

[2] Paxton Turner, Jingbo Liu, and Philippe Rigollet. A statistical perspective on coreset density estimation. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2512–2520, 2021.

[3] Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate frank-wolfe optimization for structural svms. In International Conference on Machine Learning (ICML), pages 53–61, 2013.

[4] Robert M Freund, Paul Grigas, and Rahul Mazumder. An extended frank–wolfe method with “in-face” directions, and its application to low-rank matrix completion. SIAM Journal on Optimization, 27(1):319–346, 2017.

[5] Jyun-Li Lin, Wei Hung, Shang-Hsuan Yang, Ping-Chun Hsieh, and Xi Liu. Escaping from zero gradient: Revisiting action-constrained reinforcement learning via frank-wolfe policy optimization. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI), 2021.

[6] Elad Hazan and Satyen Kale. Projection-free online learning. In 29th International Conference on Machine Learning, ICML 2012, pages 521–528, 2012.

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[8] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

[9] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning (ICML), pages 427–435, 2013.

[10] Dan Garber and Elad Hazan. Faster rates for the frank-wolfe method over strongly-convex sets. In International Conference on Machine Learning (ICML), pages 541–549, 2015.