Breaking the Linear Iteration Cost Barrier for Some Well-known Conditional Gradient Methods Using MaxIP Data-structures
Zhaozhuo Xu
Rice University
zx22@rice.edu
Zhao Song
Adobe Research
zsong@adobe.com
Anshumali Shrivastava
Rice University and ThirdAI Corp.
anshumali@rice.edu
Abstract
Conditional gradient methods (CGM) are widely used in modern machine learning. CGM's overall running time usually consists of two parts: the number of iterations and the cost of each iteration. Most efforts focus on reducing the number of iterations as a means to reduce the overall running time. In this work, we focus on improving the per iteration cost of CGM. The bottleneck step in most CGM is maximum inner product search (MaxIP), which requires a linear scan over the parameters. In practice, approximate MaxIP data-structures are found to be helpful heuristics. However, theoretically, nothing is known about the combination of approximate MaxIP data-structures and CGM. In this work, we fill this gap by providing a formal framework that combines locality sensitive hashing type approximate MaxIP data-structures with CGM algorithms. As a result, we obtain the first algorithms whose cost per iteration is sublinear in the number of parameters, for many fundamental optimization problems, e.g., Frank-Wolfe, the Herding algorithm, and policy gradient.
1 Introduction
Conditional gradient methods (CGM), such as Frank-Wolfe and its variants, are well-known optimization approaches that have been extensively used in modern machine learning. For example, CGM has been applied to kernel methods [1, 2], structural learning [3] and online learning [4, 5, 6].
Running Time Acceleration in Optimization: Recent years have witnessed the success of large-scale machine learning models trained on vast amounts of data. In this learning paradigm, the computational overhead of most successful models is dominated by the optimization process [7, 8]. Therefore, reducing the running time of the optimization algorithm is of practical importance. The total running time in optimization can be decomposed into two components: (1) the number of iterations towards convergence, (2) the cost spent in each iteration. Reducing the number of iterations requires a better understanding of the geometric properties of the problem at hand and the invention of better potential functions to analyze the progress of the algorithm [9, 10, 11, 12, 13, 14, 15]. Reducing the cost spent per iteration usually boils down to designing problem-specific discrete data-structures. In the last few years, we have seen a remarkable growth of using data-structures to reduce iteration cost [16, 17, 18, 19, 20, 21, 22, 23, 24].
MaxIP Data-structures for Iteration Cost Reduction: A well-known strategy in optimization, with CGM, is to perform a greedy search over the weight vectors [9, 10, 13, 16, 25] or training samples [26, 27] in each iteration. In this situation, the cost spent in each iteration is linear in the number of parameters. In practical machine learning, recent works [28, 29, 30, 31, 32] formulate this linear cost in iterative algorithms as an approximate maximum inner product search (MaxIP) problem. They speed up the amortized cost per iteration via efficient data-structures from recent advances in approximate MaxIP [33, 34, 35, 36, 37, 38, 39, 40, 41, 42]. Among approximate MaxIP data-structures, locality sensitive hashing (LSH) achieves promising performance with efficient random projection based preprocessing strategies [33, 34, 35, 36]. Such techniques are widely used in practice for cost reduction in optimization. [28] proposes an LSH based gradient sampling approach that reduces the total empirical running time of adaptive gradient descent. [29] formulates the forward propagation of a deep neural network as a MaxIP problem and uses LSH to select a subset of neurons for backpropagation. Therefore, the total running time of neural network training could be reduced to sublinear in the number of neurons. [31] extends this idea with system-level design for further acceleration, and [30] modifies the LSH with learning and achieves promising acceleration in attention-based language models. [32] formulates the greedy step in iterative machine teaching (IMT) as a MaxIP problem and scales IMT to large datasets with LSH.
Challenges of Sublinear Iteration Cost CGM: Despite the practical success of cost-efficient iterative algorithms with approximate MaxIP data-structures, the theoretical analysis of their combination with CGM is not well-understood. In this paper, we focus on this combination and aim to answer the following questions: (1) how can the iteration step of CGM algorithms be transformed into an approximate MaxIP problem? (2) how does the approximation error in MaxIP affect the total number of iterations CGM needs to converge? (3) how can approximate MaxIP data-structures be adapted for iterative CGM algorithms?
Our Contributions: We propose a theoretical framework that combines approximate MaxIP with the convergence guarantees of CGM. In particular, we start with the popular Frank-Wolfe algorithm over the convex hull, where the direction search in each iteration is a MaxIP problem. Next, we propose a sublinear iteration cost Frank-Wolfe algorithm using LSH type MaxIP data-structures. We then analyze the trade-off of approximate MaxIP and its effect on the number of iterations needed by CGM to converge. We show that the approximation error caused by LSH only leads to a constant multiplicative factor increase in the number of iterations. As a result, we retain the sublinearity of LSH with respect to the number of parameters, and at the same time retain the same asymptotic convergence as CGMs.
The rest of the paper is organized as follows: Section 2 introduces the related work on data-structures and optimization, Section 3 introduces our algorithms together with the main convergence statements, Section 4 provides the proof sketch of the main statements, Section 5 presents the societal impact, and Section 6 concludes the paper.
2 Related Work

Maximum Inner Product Search (MaxIP) is a fundamental problem with applications in machine learning. Given a query $q \in \mathbb{R}^d$ and an $n$-vector dataset $Y \subset \mathbb{R}^d$, MaxIP targets searching for $y \in Y$ that maximizes the inner product $\langle q, y \rangle$. The naive MaxIP solution takes $O(dn)$ time by comparing $q$ with each $y \in Y$. To accelerate this procedure, various algorithms have been proposed to reduce the running time of MaxIP [33, 34, 36, 35, 37, 38, 43, 44, 39, 45, 40, 41, 42]. We could categorize the MaxIP approaches into two categories: reduction methods and non-reduction methods. The reduction methods use transformations that transform approximate MaxIP into approximate nearest neighbor search (ANN) and solve it with ANN data-structures. One of the most popular ANN data-structures is locality sensitive hashing [46, 47].
Definition 2.1 (Locality Sensitive Hashing). Let $c > 1$ denote a parameter. Let $p_1, p_2$ denote two parameters with $0 < p_2 < p_1 < 1$. We say a function family $\mathcal{H}$ is $(r, c \cdot r, p_1, p_2)$-sensitive if and only if, for any vectors $x, y \in \mathbb{R}^d$ and any $h$ chosen uniformly at random from $\mathcal{H}$, we have: if $\|x - y\|_2 \leq r$, then $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \geq p_1$; if $\|x - y\|_2 \geq c \cdot r$, then $\Pr_{h \sim \mathcal{H}}[h(x) = h(y)] \leq p_2$.
Here we define the LSH functions for Euclidean distance. LSH functions can also be used for search under cosine [48, 49] or Jaccard similarity [50, 51]. [33] first shows that MaxIP could be solved by LSH with asymmetric transformations. After that, [34, 36, 35, 43] propose a series of methods to solve MaxIP via LSH functions for other distance measures. Besides LSH, graph-based ANN approaches [38] could also be used after reduction.
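To make the definition concrete, the following is a minimal sketch of a single LSH function for Euclidean distance in the style of the p-stable random projection scheme; the class name `E2LSH` and all parameter values are our own illustrative choices, not the construction analyzed in this paper.

```python
import numpy as np

class E2LSH:
    """Sketch of one LSH function for Euclidean distance (p-stable style):
    h(x) = floor((a.x + b) / w), so nearby points collide with higher
    probability than distant points."""
    def __init__(self, dim, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=dim)  # random Gaussian projection direction
        self.b = rng.uniform(0, w)     # random offset
        self.w = w                     # bucket width

    def hash(self, x):
        return int(np.floor((self.a @ x + self.b) / self.w))

# Empirically, a close pair collides far more often than a distant pair.
rng = np.random.default_rng(1)
x = rng.normal(size=16)
near = x + 0.01 * rng.normal(size=16)  # small perturbation
far = x + 10.0 * rng.normal(size=16)   # large perturbation
hits_near = sum(E2LSH(16, seed=s).hash(x) == E2LSH(16, seed=s).hash(near)
                for s in range(200))
hits_far = sum(E2LSH(16, seed=s).hash(x) == E2LSH(16, seed=s).hash(far)
               for s in range(200))
assert hits_near > hits_far
```

In a full data-structure, many such functions are concatenated and repeated to amplify the gap between the collision probabilities $p_1$ and $p_2$.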
On the other hand, non-reduction methods directly build data-structures for approximate MaxIP. [37, 42] use quantization to approximate the inner product and build codebooks for efficient approximate MaxIP. [38, 44] propose greedy algorithms for approximate MaxIP under computation budgets. [39, 40, 41] directly construct navigable graphs that achieve state-of-the-art empirical performance.
Recently, there has been remarkable growth in applying data-structures to machine learning [52, 53, 54]. Following this paradigm, approximate MaxIP data-structures have been applied to overcome the efficiency bottleneck of various machine learning algorithms. [38] formulates the inference of a neural network with a wide output layer as a MaxIP problem and uses a graph-based approach to reduce the inference time. In the same inference task, [55] proposes a learnable LSH data-structure that further improves the inference efficiency with less energy consumption. In neural network training, [29, 30, 31] use approximate MaxIP to retrieve the neurons of interest for backpropagation. In this way, the computation overhead of gradient updates in neural networks could be reduced. In linear regression and classification models, [28] uses approximate MaxIP data-structures to retrieve the samples with large gradient norm and performs standard gradient descent on them, which improves the total running time of stochastic gradient descent. [32] proposes a scalable machine teaching algorithm that enables iterative teaching on large-scale datasets. In the bandit problem, [56] proposes an LSH based algorithm that solves the linear bandit problem with sublinear time complexity.
Despite the promising empirical results, there is little theoretical analysis of approximate MaxIP for machine learning. We summarize the major reasons as follows: (1) besides LSH, the other approximate MaxIP data-structures do not provide theoretical guarantees on time and space complexity; (2) current approaches treat data-structures and learning dynamics separately, so there is no joint analysis of the effect of approximate MaxIP on machine learning.
The Frank-Wolfe algorithm [25] is a projection-free optimization method with wide applications in convex [9, 10] and non-convex optimization [11, 12]. The procedure of the Frank-Wolfe algorithm could be summarized as two steps: (1) given the gradient, find a vector in the feasible domain that has the maximum inner product with the negative gradient, (2) update the current weight with the retrieved vector. Formally, given a function $f$ over a convex set $\mathcal{B}$, starting from an initial weight $w_0$, the Frank-Wolfe algorithm updates the weight with learning rate $\eta_t$ as follows:

$$s_t \leftarrow \arg\min_{s \in \mathcal{B}} \langle \nabla f(w_t), s \rangle, \qquad w_{t+1} \leftarrow (1 - \eta_t) \cdot w_t + \eta_t \cdot s_t.$$
Previous literature focuses on reducing the number of iterations of the Frank-Wolfe algorithm over specific domains such as balls [9, 10, 13, 14]. The cost reduction in the iterative procedure of the Frank-Wolfe algorithm is hardly discussed except in [57]. In this work, we focus on the Frank-Wolfe algorithm over the convex hull of a finite feasible set. This formulation is more general and includes recent Frank-Wolfe applications in probabilistic modeling [1, 2], structural learning [3] and policy optimization [5].
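The two-step procedure described above can be sketched as follows. This is a minimal illustrative implementation over the convex hull of a finite set, with a linear-scan direction search standing in for the MaxIP data-structure; the function name and the toy objective are our own choices, not code from the paper.

```python
import numpy as np

def frank_wolfe(grad, S, T=2000):
    """Minimal Frank-Wolfe over the convex hull of the rows of S.
    Step (1) is the direction search (a MaxIP over S, done here by a
    linear scan); step (2) is the convex-combination update."""
    w = S[0].copy()  # initialize at a vertex of the hull
    for t in range(T):
        g = grad(w)
        s = S[np.argmin(S @ g)]    # argmax_s <-g, s> == argmin_s <g, s>
        eta = 2.0 / (t + 2)        # standard step size schedule
        w = (1 - eta) * w + eta * s
    return w

# Toy objective: f(w) = ||w - c||^2 with c inside the hull of S.
S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
c = np.array([0.3, 0.3])
w = frank_wolfe(lambda w: 2 * (w - c), S)
assert np.linalg.norm(w - c) < 0.1
```

The `np.argmin(S @ g)` line is exactly the linear-cost step that this paper replaces with a sublinear approximate MaxIP query.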
3 Our Sublinear Iteration Cost Algorithm
In this section, we formally present our results on the sublinear iteration cost CGM algorithms. We start with the preliminary definitions of the objective function. Then, we present the guarantees on the number of iterations and the cost per iteration for our sublinear CGM algorithms to converge.
3.1 Preliminaries
We provide the notations and settings for this paper. We start with basic notations. For a positive integer $n$, we use $[n]$ to denote the set $\{1, 2, \cdots, n\}$. For a vector $x$, we use $\|x\|_2$ to denote its $\ell_2$ norm.

We say a function $f$ is convex if

$$f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle \quad \text{for all } x, y.$$

We say a function $f$ is $\beta$-smooth if

$$\|\nabla f(x) - \nabla f(y)\|_2 \leq \beta \|x - y\|_2 \quad \text{for all } x, y.$$

Given a set $S \subset \mathbb{R}^d$, we say its convex hull $\mathcal{B}(S)$ is the collection of all finite convex combinations $\sum_i a_i x_i$ with $x_i \in S$ that satisfy $a_i \geq 0$ for all $i$ and $\sum_i a_i = 1$.

Let $D_{\max}$ denote the maximum diameter of $\mathcal{B}(S)$, so that $\|x - y\|_2 \leq D_{\max}$ for all $x, y \in \mathcal{B}(S)$. We present the detailed definitions in Appendix A.
Next, we present the settings of our work. Let $S \subset \mathbb{R}^d$ denote an $n$-point finite set. Given a convex and $\beta$-smooth function $f$ defined over the convex hull $\mathcal{B}(S)$, our goal is to find a $w \in \mathcal{B}(S)$ that minimizes $f$. When $n$ is large and the dimension is high, the dominant iteration cost lies in finding the MaxIP of $-\nabla f(w)$ with respect to $S$. In this setting, the fast convergence rates of Frank-Wolfe over balls cannot be achieved. We present the detailed problem setting of the Frank-Wolfe algorithm in Appendix C.
3.2 Our Results
We present our main results with comparison to the original algorithms in Table 1. From the table, we show that with near-linear preprocessing time, our algorithms maintain the same number of iterations towards convergence while reducing the cost spent in each iteration to be sublinear in the number of possible parameters.
| Algorithm | Statement | Preprocess | #Iters | Cost per iter |
| --- | --- | --- | --- | --- |
| Frank-Wolfe | | | | |
| Ours | Theorem 3.1 | | | |
| Herding | | | | |
| Ours | Theorem 3.2 | | | |
| Policy gradient | | | | |
| Ours | Theorem 3.3 | | | |
Table 1: Comparison between the classical algorithms and our sublinear time algorithms. (1) "Frank-Wolfe" denotes the Frank-Wolfe algorithm for convex functions over a convex hull; let $\mathcal{T}_f$ denote the time for evaluating the gradient at any parameter. (2) "Herding" denotes the kernel Herding algorithm. (3) "Policy gradient" denotes the projection-free policy gradient method; let $\mathcal{T}_g$ denote the time for evaluating the policy gradient at any parameter. Let $\gamma$ denote the discount factor and $\varrho$ denote the minimum probability density of a state. Note that $n$ is the number of possible parameters, and $n^{\rho}$ is smaller than $n$ for any constant $\rho \in (0, 1)$. Let $\rho$ denote a fixed parameter determined by the LSH data-structure. $\beta$ is the smoothness factor. $D_{\max}$ denotes the maximum diameter of the convex hull. The failure probability of our algorithms is stated in the corresponding theorems.
Next, we introduce the statement of our sublinear iteration cost algorithms. We start by introducing our result for improving the running time of Frank-Wolfe.
Theorem 3.1 (Sublinear time Frank-Wolfe, informal version of Theorem D.1). Let $f$ denote a convex and $\beta$-smooth function. Let the complexity of calculating $\nabla f(x)$ be $\mathcal{T}_f$. Let $S \subset \mathbb{R}^d$ denote a set of $n$ points. Let $\mathcal{B}(S)$ denote the convex hull of $S$ with maximum diameter $D_{\max}$. Let $\rho \in (0, 1)$ denote a fixed parameter determined by the LSH data-structure. For any accuracy parameter $\epsilon > 0$, there is an iterative algorithm (Algorithm 2) that takes $O(d n^{1+o(1)})$ time in pre-processing, takes $T = O(\beta D_{\max}^2/\epsilon)$ iterations and $O(d n^{\rho} + \mathcal{T}_f)$ cost per iteration, starts from a random $w_0$ from $S$ as its initialization point, and outputs $w_T$ from $\mathcal{B}(S)$ such that

$$f(w_T) - \min_{w \in \mathcal{B}(S)} f(w) \leq \epsilon$$

holds with probability at least $1 - 1/\operatorname{poly}(n)$.
Next, we show our main result for the Herding algorithm. The Herding algorithm is widely applied in kernel methods [58]. [1] shows that the Herding algorithm is equivalent to a conditional gradient method with a least-squares loss function. Therefore, we extend our results and obtain the following statement.
Theorem 3.2 (Sublinear time Herding algorithm, informal version of Theorem E.3). Let $\mathcal{X}$ denote a feature set and $\Phi : \mathcal{X} \rightarrow \mathbb{R}^d$ denote a feature mapping. Let $D_{\max}$ denote the maximum diameter of $\Phi(\mathcal{X})$ and $\mathcal{B}$ be the convex hull of $\Phi(\mathcal{X})$. Given a distribution $p(x)$ over $\mathcal{X}$, we denote $\mu = \mathbb{E}_{x \sim p(x)}[\Phi(x)]$. Let $\rho \in (0, 1)$ denote a fixed parameter determined by the LSH data-structure. For any accuracy parameter $\epsilon > 0$, there is an iterative algorithm (Algorithm 3) that takes $O(d n^{1+o(1)})$ time in pre-processing, takes $T = O(D_{\max}^2/\epsilon)$ iterations and $O(d n^{\rho})$ cost per iteration, starts from a random $w_0$ from $\Phi(\mathcal{X})$ as its initialization point, and outputs $w_T$ from $\mathcal{B}$ such that

$$\|w_T - \mu\|_2^2 \leq \epsilon$$

holds with probability at least $1 - 1/\operatorname{poly}(n)$.
Finally, we present our result for policy gradient. Policy gradient [59] is a popular algorithm with wide applications in robotics [60] and recommendation [61]. [5] proposes a provable Frank-Wolfe method that maximizes the reward function with policy gradient. However, the optimization requires a linear scan over all possible actions, which is unscalable in complex environments. We propose an efficient Frank-Wolfe algorithm whose per iteration cost is sublinear in the number of actions. Our statement is presented below.
Theorem 3.3 (Sublinear time policy gradient, informal version of Theorem F.3). Let $\mathcal{T}_g$ denote the time for computing the policy gradient. Let $D_{\max}$ denote the maximum diameter of the action space. Let $\gamma \in (0, 1)$ denote the discount factor and $\rho \in (0, 1)$ denote a fixed parameter determined by the LSH data-structure. Let $\varrho$ denote the minimal density of states. There is an iterative algorithm (Algorithm 5) that spends $O(d n^{1+o(1)})$ time in preprocessing, takes $T$ iterations and $O(d n^{\rho} + \mathcal{T}_g)$ cost per iteration, starts from a random point $w_0$ as its initial point, and outputs $w_T$ whose average gap is at most $\epsilon$ with probability at least $1 - 1/\operatorname{poly}(n)$, where the gap is defined in Eq. (6).
4 Proof Sketch

We present the overview of the proofs in this section. We start with introducing the efficient MaxIP data-structures. Next, we show how to transform the direction search in a conditional gradient approach into a MaxIP problem. Finally, we provide proof sketches for each main statement in Section 3. The detailed proofs are presented in the supplementary material.
We present the LSH data-structures for approximate MaxIP in this section. The detailed description is presented in Appendix A. We use the reduction-based approximate MaxIP method with LSH data-structures to achieve sublinear iteration cost. Note that we choose this method due to its clear theoretical guarantee on the retrieval results. It is well-known that LSH data-structures are used for the approximate nearest neighbor problem. The following definition of approximate nearest neighbor search is very standard in the literature [62, 46, 47, 63, 64, 65, 66, 67, 68, 69, 70].
Definition 4.1 (Approximate Nearest Neighbor (ANN)). Let $\bar{c} > 1$ and $r \in (0, 2)$ denote two parameters. Given an $n$-vector set $Y \subset \mathbb{S}^{d-1}$ on a unit sphere, the objective of the $(\bar{c}, r)$-ANN is to construct a data structure that, for any query $x \in \mathbb{S}^{d-1}$ such that $\min_{y \in Y} \|y - x\|_2 \leq r$, returns a vector $z$ from $Y$ that satisfies $\|z - x\|_2 \leq \bar{c} \cdot r$.
In iterative-type optimization algorithms, the cost per iteration could be dominated by the approximate MaxIP problem (Definition 4.2), which is the dual problem of the $(\bar{c}, r)$-ANN.
Definition 4.2 (Approximate MaxIP). Let $c \in (0, 1)$ and $\tau \in (0, 1)$ denote two parameters. Given an $n$-vector dataset $Y \subset \mathbb{S}^{d-1}$ on a unit sphere, the objective of the $(c, \tau)$-MaxIP is to construct a data structure that, given a query $x \in \mathbb{S}^{d-1}$ such that $\max_{y \in Y} \langle x, y \rangle \geq \tau$, retrieves a vector $z$ from $Y$ that satisfies $\langle x, z \rangle \geq c \cdot \max_{y \in Y} \langle x, y \rangle$.
Next, we present the primal-dual connection between ANN and approximate MaxIP. Given two unit vectors $x, y$ with $\|x\|_2 = \|y\|_2 = 1$, we have $\|x - y\|_2^2 = 2 - 2\langle x, y \rangle$. Therefore, we could maximize $\langle x, y \rangle$ by minimizing $\|x - y\|_2$. Based on this connection, we present how to solve $(c, \tau)$-MaxIP using $(\bar{c}, r)$-ANN. We start with showing how to solve $(\bar{c}, r)$-ANN with LSH.
Theorem 4.3 (Andoni, Laarhoven, Razenshteyn and Waingarten [67]). Let $\bar{c} > 1$ and $r \in (0, 2)$ denote two parameters. One can solve $(\bar{c}, r)$-ANN on a unit sphere in query time $O(d \cdot n^{\rho})$ using preprocessing time and space $O(d \cdot n^{1+\rho})$, where $\rho \in (0, 1)$ is a constant determined by $\bar{c}$ and $r$.
Next, we solve $(c, \tau)$-MaxIP by solving $(\bar{c}, r)$-ANN using Theorem 4.3. We have

Corollary 4.4 (An informal statement of Corollary B.1). Let $c \in (0, 1)$ and $\tau \in (0, 1)$ denote two parameters. One can solve $(c, \tau)$-MaxIP on a unit sphere in query time $O(d \cdot n^{\rho})$, where $\rho \in (0, 1)$ is determined by $c$ and $\tau$, using LSH with both preprocessing time and space in $O(d \cdot n^{1+\rho})$.
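The primal-dual connection above can be checked numerically: for unit vectors, the identity $\|x - y\|_2^2 = 2 - 2\langle x, y \rangle$ forces the point with the largest inner product and the nearest neighbor of the query to coincide. The snippet below is a small illustrative check with our own random data.

```python
import numpy as np

# For unit vectors, ||x - y||_2^2 = 2 - 2<x, y>, so the point with the
# largest inner product is exactly the nearest neighbor of the query.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # rows on the unit sphere
q = rng.normal(size=8)
q /= np.linalg.norm(q)

maxip_idx = int(np.argmax(X @ q))                         # MaxIP solution
nn_idx = int(np.argmin(np.linalg.norm(X - q, axis=1)))    # nearest neighbor
assert maxip_idx == nn_idx
```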
In our work, we consider a generalized form of approximate MaxIP, denoted as projected approximate MaxIP.
Definition 4.5 (Projected approximate MaxIP). Let $\phi, \psi : \mathbb{R}^d \rightarrow \mathbb{R}^k$ denote two transforms. Given an $n$-vector dataset $Y \subset \mathbb{R}^d$ so that $\psi(Y) \subset \mathbb{S}^{k-1}$, the goal of the $(c, \phi, \psi, \tau)$-MaxIP is to construct a data structure that, given a query $x \in \mathbb{R}^d$ with $\phi(x) \in \mathbb{S}^{k-1}$ such that $\max_{y \in Y} \langle \phi(x), \psi(y) \rangle \geq \tau$, retrieves a vector $z \in Y$ that satisfies $\langle \phi(x), \psi(z) \rangle \geq c \cdot \max_{y \in Y} \langle \phi(x), \psi(y) \rangle$.
For details of space-time trade-offs, please refer to Appendix B. The following sections show how to use projected approximate MaxIP to accelerate the optimization algorithm by reducing the cost per iteration.
We have learned from Section 4.1 that we can solve $(c, \phi, \psi, \tau)$-MaxIP on a unit sphere using LSH for ANN. Therefore, the next step is to transform the direction search procedure in the iterative optimization algorithm into a MaxIP on a unit sphere. To achieve this, we formulate the direction search as a projected approximate MaxIP (see Definition A.5). We start with presenting a pair of transformations $\phi_0, \psi_0$ such that, given a function $g$, for any $x, y$ in a convex set $\mathcal{K}$, we have

$$\phi_0(x) = \begin{bmatrix} \nabla g(x)^\top & x^\top \nabla g(x) \end{bmatrix}^\top, \quad \psi_0(y) = \begin{bmatrix} -y^\top & 1 \end{bmatrix}^\top. \qquad (1)$$

In this way, we have $\langle \phi_0(x), \psi_0(y) \rangle = \langle x - y, \nabla g(x) \rangle$, so that

$$\arg\max_{y \in \mathcal{K}} \langle \phi_0(x), \psi_0(y) \rangle = \arg\min_{y \in \mathcal{K}} \langle y, \nabla g(x) \rangle. \qquad (2)$$

Therefore, we could transform the direction search problem into a MaxIP problem.
Next, we present a standard transformation [36] that connects the MaxIP to ANN on a unit sphere. For any $x, y \in \mathbb{R}^d$, we propose transformations $\phi_1, \psi_1$ such that

$$\phi_1(x) = \begin{bmatrix} x^\top/D_x & 0 & \sqrt{1 - \|x\|_2^2/D_x^2} \end{bmatrix}^\top, \quad \psi_1(y) = \begin{bmatrix} y^\top/D_y & \sqrt{1 - \|y\|_2^2/D_y^2} & 0 \end{bmatrix}^\top. \qquad (3)$$

Here $D_x, D_y$ are constants that make sure both $x/D_x$ and $y/D_y$ have norms at most 1. Under these transformations, both $\phi_1(x)$ and $\psi_1(y)$ have norm 1 and $\langle \phi_1(x), \psi_1(y) \rangle = \langle x, y \rangle / (D_x D_y)$.
Combining the transformations in Eq. (1) and Eq. (3), we obtain the query transform $\phi = \phi_1 \circ \phi_0$ and the data transform $\psi = \psi_1 \circ \psi_0$. Using $\phi$ and $\psi$, we transform the direction search problem in optimization into a MaxIP on a unit sphere. Moreover, given a set $S$ and a query $x$, the solution $z$ of $(c, \phi, \psi, \tau)$-MaxIP over $S$ has the property that $\langle x - z, \nabla g(x) \rangle \geq c \cdot \max_{s \in S} \langle x - s, \nabla g(x) \rangle$. Thus, we could approximate the direction search with the LSH based MaxIP data-structure.
Note that only MaxIP problems with positive inner product values could be solved by LSH. We find that the direction search problem naturally satisfies this condition. We show that if $g$ is convex, given a set $S$, we have $\max_{s \in S} \langle x - s, \nabla g(x) \rangle \geq 0$ for any $x \in \mathcal{B}(S)$, where $\mathcal{B}(S)$ is the convex hull of $S$. Thus, the maximum inner product is non-negative following Eq. (2).
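As an illustration of the unit-sphere reduction, the following sketch implements a simplified asymmetric transform in the spirit of [36], with a single appended coordinate rather than the exact pair of transformations above; `data_transform` and `query_transform` are hypothetical helper names of our own.

```python
import numpy as np

def data_transform(X):
    """Scale the data into the unit ball, then append one coordinate so
    every row lands exactly on the unit sphere (hypothetical helper)."""
    D = np.linalg.norm(X, axis=1).max()
    Xs = X / D
    # clip tiny negative values caused by floating point error
    extra = np.sqrt(np.maximum(0.0, 1.0 - np.sum(Xs**2, axis=1)))
    return np.hstack([Xs, extra[:, None]])

def query_transform(q):
    """Normalize the query and append a zero, so inner products become
    <x, q> / (D * ||q||): a monotone rescaling preserving the argmax."""
    return np.append(q / np.linalg.norm(q), 0.0)

rng = np.random.default_rng(0)
X, q = rng.normal(size=(50, 6)), rng.normal(size=6)
P, Q = data_transform(X), query_transform(q)
assert np.allclose(np.linalg.norm(P, axis=1), 1.0)   # data now on unit sphere
assert np.argmax(P @ Q) == np.argmax(X @ q)          # MaxIP solution unchanged
```

Since the transform rescales every inner product by the same positive factor, an LSH data-structure built on the transformed points returns the same (approximate) MaxIP winner as the original problem.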
We present the proof sketch for Theorem 3.1 in this section. We refer the readers to Appendix D for the detailed proofs.
Let $f$ denote a convex and $\beta$-smooth function. Let the complexity of calculating $\nabla f(x)$ be $\mathcal{T}_f$. Let $S \subset \mathbb{R}^d$ denote a set of $n$ points, and $\mathcal{B}(S)$ be the convex hull of $S$ with maximum diameter $D_{\max}$. Let $\phi, \psi$ denote the transformations defined in Section 4.2. Starting from a random vector $w_0 \in S$, our sublinear Frank-Wolfe algorithm follows the update rule that at each step,

$$w_{t+1} \leftarrow (1 - \eta_t) \cdot w_t + \eta_t \cdot s_t,$$

where $s_t$ is retrieved from the $(c, \phi, \psi, \tau)$-MaxIP data-structure.
We start with upper bounding $\langle \nabla f(w_t), s_t - w_t \rangle$. Because $s_t$ is the $(c, \phi, \psi, \tau)$-MaxIP of $w_t$ with respect to $S$, we have

$$\langle \nabla f(w_t), s_t - w_t \rangle \leq c \cdot \min_{s \in S} \langle \nabla f(w_t), s - w_t \rangle. \qquad (4)$$

For convenience of the proof, for each $t$, we define $h_t = f(w_t) - f(w^*)$, where $w^* = \arg\min_{w \in \mathcal{B}(S)} f(w)$. Next, we upper bound $h_{t+1}$ as

$$h_{t+1} \leq h_t + \eta_t \langle \nabla f(w_t), s_t - w_t \rangle + \frac{\beta \eta_t^2}{2} \|s_t - w_t\|_2^2 \leq h_t + c \eta_t \min_{s \in S} \langle \nabla f(w_t), s - w_t \rangle + \frac{\beta \eta_t^2}{2} D_{\max}^2 \leq h_t + c \eta_t \langle \nabla f(w_t), w^* - w_t \rangle + \frac{\beta \eta_t^2}{2} D_{\max}^2 \leq (1 - c \eta_t) h_t + \frac{\beta \eta_t^2}{2} D_{\max}^2, \qquad (5)$$

where the first step follows from the definition of $\beta$-smoothness, the second step follows from Eq. (4) and the definition of $D_{\max}$, the third step follows from $w^* \in \mathcal{B}(S)$, and the fourth step follows from the convexity of $f$.
Let $\eta_t = \frac{2}{c(t+2)}$. Combining this with Eq. (5), we show that

$$h_{t+1} \leq \Big(1 - \frac{2}{t+2}\Big) h_t + \frac{2 \beta D_{\max}^2}{c^2 (t+2)^2}.$$

Using induction from $1$ to $t$, we show that

$$h_{t+1} \leq \frac{2 \beta D_{\max}^2}{c^2 (t+3)}.$$

Taking the target accuracy into consideration, given a constant approximation ratio $c$, $t$ should be in $O(\beta D_{\max}^2 / \epsilon)$ so that $h_t \leq \epsilon$. Thus, we complete the proof.
Cost Per Iteration: After we take $O(d n^{1+o(1)})$ preprocessing time, the cost per iteration consists of three parts: (1) it takes $\mathcal{T}_f$ time to compute $\nabla f(w_t)$, (2) it takes $O(d)$ time to perform the transforms $\phi$ and $\psi$, (3) it takes $O(d n^{\rho})$ time to retrieve $s_t$ from the LSH data-structure. Thus, the final cost per iteration would be $O(\mathcal{T}_f + d n^{\rho})$.
Next, we show how to extend the proof to the Herding problem. Following [1], we start with defining the function $f(w) = \frac{1}{2}\|w - \mu\|_2^2$. We show that this function is convex and 1-smooth. Therefore, the Herding algorithm is equivalent to the Frank-Wolfe algorithm over the function $f$. Using the proof of Theorem 3.1 with $\beta = 1$, we show that it takes $O(D_{\max}^2/\epsilon)$ iterations to reach the $\epsilon$-optimal solution. Similar to Theorem 3.1, we show that the cost per iteration would be $O(d n^{\rho})$, as it takes $O(d)$ time to compute $\nabla f(w_t)$.
We present the proof sketch for Theorem 3.3 in this section. We refer the readers to Appendix F for the detailed proofs.
In this paper, we focus on the action-constrained Markov Decision Process (ACMDP). In this setting, we are provided with a state space $\mathcal{S}$ and an action space $\mathcal{A}$. However, at each step, we could only access a finite $n$-vector set of actions $\mathcal{C} \subset \mathcal{A}$. Let us assume that $\mathcal{C}$ remains the same at each step, and let $D_{\max}$ denote the maximum diameter of $\mathcal{C}$.
In this ACMDP, the policy is defined as $\pi_\theta$ with parameter $\theta$. Meanwhile, there exists a reward function $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. Then, we define the Q function as below,

$$Q^{\pi_\theta}(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \;\Big|\; s_0 = s,\ a_0 = a,\ \pi_\theta\Big],$$

where $\gamma \in (0, 1)$ is a discount factor.
Given a state distribution $\mu$, the objective of policy gradient is to maximize the expected cumulative reward via the policy gradient [59], denoted as $\nabla_\theta J(\theta)$.
[5] proposes an iterative algorithm that performs a MaxIP over the actions at each iteration to find
(6)
In this work, we approximate Eq. (6) using $(c, \phi, \psi, \tau)$-MaxIP with a suitable pair of transformations $\phi$ and $\psi$, so that Eq. (6) becomes a MaxIP problem. Note that we still require the transformations in Eq. (3) to generate unit vectors.
Next, we show that if we retrieve an action using -MaxIP, the gap would be lower bounded by
(7)
Combining Eq. (7) with the standard induction in [5], we upper bound the average gap as
(8)
where $\varrho$ denotes the minimal density of states in $\mathcal{S}$ and $\beta$ is the smoothness factor.
In this way, given a constant approximation factor $c$, the number of iterations required to reach a target gap increases by only a constant multiplicative factor.
Cost Per Iteration: After we take $O(d n^{1+o(1)})$ preprocessing time, the cost per iteration consists of three parts: (1) it takes $\mathcal{T}_g$ time to compute the policy gradient, (2) it takes $O(d)$ time to perform the transforms $\phi$ and $\psi$, (3) it takes $O(d n^{\rho})$ time to retrieve actions from the LSH data-structure. Thus, the final cost per iteration would be $O(\mathcal{T}_g + d n^{\rho})$.
In optimization, the gradients computed in successive iterations are not independent of each other. This creates a problem for MaxIP data-structures: if we use a vector containing the gradient as a query, the query failure probabilities across iterations are not independent, so the total failure probability cannot be union bounded in the standard way. As previous MaxIP data-structures rely on the assumption that queries are independent, the original failure analysis could not be directly applied.
This work uses a standard query quantization method to handle the adaptive query sequence in optimization. Given the known query space, we quantize it by lattices [71]; this quantization is akin to a Voronoi diagram. In this way, each query falls into a cell with a center vector, and we perform the query using the center vector of the cell. Therefore, the failure probability of the MaxIP query sequence is equivalent to the probability that some center vector fails to retrieve its approximate MaxIP solution. As the cell centers are fixed in advance and independent of the query sequence, we could union bound this probability over the finite set of centers. On the other hand, as the maximum diameter of a cell is $\lambda$, this query quantization introduces an additive $O(\lambda)$ error in the retrieved inner product. We refer the readers to Appendix G for the detailed quantization approach.
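A minimal sketch of the quantization step, with `quantize` and the cell side `lam` as illustrative choices of our own: snapping a query to its cell center bounds the perturbation by half the cell diameter, which in turn bounds the change in any retrieved inner product via Cauchy-Schwarz.

```python
import numpy as np

def quantize(q, lam):
    """Snap q to the center of its grid cell of side lam (hypothetical helper).
    Only the finite set of cell centers is ever used as a query."""
    return lam * (np.floor(q / lam) + 0.5)

lam = 0.01  # cell side length
rng = np.random.default_rng(0)
q, x = rng.normal(size=8), rng.normal(size=8)
qc = quantize(q, lam)

# The query moves by at most half the cell diameter ...
assert np.linalg.norm(q - qc) <= lam * np.sqrt(len(q)) / 2
# ... so by Cauchy-Schwarz any inner product shifts by only an O(lam) term.
assert abs(x @ q - x @ qc) <= np.linalg.norm(x) * lam * np.sqrt(len(q)) / 2
```

Because the cell centers form a fixed finite set, the failure events of the LSH queries they induce can be union bounded even though the original gradient queries are adaptive.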
In this work, we show that with the LSH based MaxIP data-structure, the cost for direction search is $O(d n^{\rho})$, where $\rho \in (0, 1)$. In Section D.2 of the supplementary material, we show that $\rho$ is a function of the constant $c$ and the parameter $\tau$ in approximate MaxIP (see Definition 4.2). Moreover, we also show in Section D.2 that LSH results in only a constant multiplicative factor increase in the number of iterations. Considering the cost per iteration and the number of iterations, we show that when our algorithms stop at the $\epsilon$-optimal solution, LSH could achieve acceleration in the overall running time. Therefore, we could set the $c$ and $\tau$ parameters to balance the accuracy-efficiency trade-off of CGM to achieve the desired running time.
5 Societal Impact

This paper discusses the theoretical foundation of data-structures for conditional gradient methods. We believe that this paper does not have a negative societal impact on the environment, privacy, or other domains.
6 Concluding Remarks
In this work, we present the first Frank-Wolfe algorithms that achieve sublinear time cost per iteration. We also extend this result to the Herding algorithm and policy gradient methods. We formulate the direction search in the Frank-Wolfe algorithm as a projected approximate maximum inner product search problem with a pair of efficient transformations. Then, we use a locality sensitive hashing data-structure to reduce the iteration cost to sublinear in the number of possible parameters. Our theoretical analysis shows that the sublinear iteration cost Frank-Wolfe algorithm preserves the same order in the number of iterations towards convergence. Moreover, we analyze and optimize the trade-off between saving iteration cost and increasing the number of iterations to achieve sublinear total running time. Furthermore, we identify the problems of existing MaxIP data-structures for cost reduction in iterative optimization algorithms and propose the corresponding solutions. We hope this work can be the starting point of future study on sublinear iteration cost algorithms for optimization.
Acknowledgements
This work was supported by National Science Foundation IIS-1652131, BIGDATA-1838177, AFOSR-YIP FA9550-18-1-0152, ONR DURIP Grant, and the ONR BRC grant on Randomized Numerical Linear Algebra. The authors would like to thank Beidi Chen for the helpful discussion on optimization. The authors would like to thank Lichen Zhang for valuable discussions about data-structures.
References
[1] Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. arXiv preprint arXiv:1203.4523, 2012.
[2] Paxton Turner, Jingbo Liu, and Philippe Rigollet. A statistical perspective on coreset density estimation. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2512–2520, 2021.
[3] Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate frank-wolfe optimization for structural svms. In International Conference on Machine Learning (ICML), pages 53–61, 2013.
[4] Robert M Freund, Paul Grigas, and Rahul Mazumder. An extended frank–wolfe method with “in-face” directions, and its application to low-rank matrix completion. SIAM Journal on optimization, 27(1):319–346, 2017.
[5] Jyun-Li Lin, Wei Hung, Shang-Hsuan Yang, Ping-Chun Hsieh, and Xi Liu. Escaping from zero gradient: Revisiting action-constrained reinforcement learning via frank-wolfe policy optimization. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI), 2021.
[6] Elad Hazan and Satyen Kale. Projection-free online learning. In 29th International Conference on Machine Learning, ICML 2012, pages 521–528, 2012.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[8] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[9] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning (ICML), pages 427–435, 2013.
[10] Dan Garber and Elad Hazan. Faster rates for the frank-wolfe method over strongly-convex sets. In International Conference on Machine Learning (ICML), pages 541–549, 2015.