Anshul Yadav

Gradient Descent vs Coordinate Descent

Date: March 06, 2019

Gradient descent is the most widely used algorithm for finding a local minimum, which for a convex function is also the global minimum. However, in certain cases computing the derivative is more expensive than computing the function value itself. In such cases, Coordinate Descent proves to be a powerful alternative: it takes steps along one coordinate direction at a time to reach the optimum. It is also important to note that in practice gradient descent and coordinate descent do not converge to the exact minimizer, so some tolerance must be maintained.

Algorithm for Gradient Descent

Suppose \( J \) is the cost function we are trying to minimize. Then,


    Repeat until convergence:
        $$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) $$ (for every j)
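
As a concrete illustration, here is a minimal NumPy sketch of this update rule for a simple quadratic cost. The particular cost function, the learning rate \( \alpha = 0.1 \), and the stopping tolerance below are illustrative choices, not part of the original discussion.

    import numpy as np

    def gradient_descent(grad_J, theta0, alpha=0.1, tol=1e-6, max_iter=10000):
        """Batch gradient descent: theta_j := theta_j - alpha * dJ/dtheta_j for every j."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            step = alpha * grad_J(theta)
            theta = theta - step
            if np.linalg.norm(step) < tol:  # stop once the update is within tolerance
                break
        return theta

    # Example: J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2, so grad J = 2 * (theta - [3, -1]).
    grad_J = lambda theta: 2.0 * (theta - np.array([3.0, -1.0]))
    print(gradient_descent(grad_J, theta0=[0.0, 0.0]))  # approaches [3, -1]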
    

Algorithm for Coordinate Descent

Consider an unconstrained optimization problem:

$$ \min_{\alpha} W(\alpha_1, \alpha_2, ..., \alpha_m) $$

where \( W \) is some function of parameters \( \alpha_i \).


    Loop until convergence:
        for \( i = 1, ..., m \):
            $$ \alpha_i := \arg\min_{\hat{\alpha_i}} W(\alpha_1, ..., \alpha_{i-1}, \hat{\alpha_i}, \alpha_{i+1}, ..., \alpha_m) $$
    

In the inner loop, we optimize \( W \) with respect to a single \( \alpha_i \) while holding all other parameters fixed. This method is particularly efficient when the inner arg min can be computed cheaply, for example in closed form, as in the sketch below.
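
As an illustration (not from the original post), the following sketch applies coordinate descent to the least-squares objective \( W(\alpha) = \lVert X\alpha - y \rVert^2 \), for which the arg min over a single coordinate has a closed form.

    import numpy as np

    def coordinate_descent_ls(X, y, n_sweeps=100, tol=1e-8):
        """Minimize W(alpha) = ||X @ alpha - y||^2 one coordinate at a time."""
        m = X.shape[1]
        alpha = np.zeros(m)
        for _ in range(n_sweeps):
            max_change = 0.0
            for i in range(m):
                r = y - X @ alpha                            # current residual
                delta = (X[:, i] @ r) / (X[:, i] @ X[:, i])  # exact minimizer over alpha_i
                alpha[i] += delta
                max_change = max(max_change, abs(delta))
            if max_change < tol:                             # converged within tolerance
                break
        return alpha

    # Usage: recover the coefficients of a noiseless least-squares problem.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    print(coordinate_descent_ls(X, y))                       # close to [1.0, -2.0, 0.5]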

L2 SVM Optimization Problem

Coordinate ascent, a minor modification of Coordinate Descent used for maximization problems, is often applied to the L2 SVM, where the original non-convex problem (NP-hard to solve exactly) is replaced by a convex approximation.

The dual of the original L2 SVM problem is given by:

$$ \max_{\alpha} W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle $$

Subject to:

$$ 0 \leq \alpha_i \leq C, \quad i = 1, 2, \ldots, m $$

$$ \sum_{i=1}^m \alpha_i y^{(i)} = 0 $$
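
As a quick numerical check (helper functions of my own, not part of the original formulation, assuming a linear kernel \( \langle x^{(i)}, x^{(j)} \rangle = x^{(i)\top} x^{(j)} \)), the dual objective and its constraints can be evaluated as follows:

    import numpy as np

    def dual_objective(alpha, X, y):
        """W(alpha) = sum_i alpha_i - 0.5 * sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>."""
        v = alpha * y          # element-wise alpha_i * y_i
        K = X @ X.T            # linear kernel matrix: K[i, j] = <x_i, x_j>
        return alpha.sum() - 0.5 * v @ K @ v

    def is_feasible(alpha, y, C, tol=1e-9):
        """Check 0 <= alpha_i <= C and sum_i alpha_i y_i = 0 (up to tolerance)."""
        return bool(np.all(alpha >= -tol) and np.all(alpha <= C + tol) and abs(alpha @ y) <= tol)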

Since updating only one \( \alpha_i \) would not maintain the constraint \( \sum_{i=1}^m \alpha_i y^{(i)} = 0 \), the problem is solved by updating two \( \alpha_i \)'s at a time. This approach forms the basis of the Sequential Minimal Optimization (SMO) algorithm, which is used to solve this convex optimization problem.
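
To see why two multipliers are needed: if \( \alpha_2, ..., \alpha_m \) are held fixed, the equality constraint determines \( \alpha_1 \) exactly, leaving nothing to optimize. With two free multipliers \( \alpha_i, \alpha_j \), the constraint instead confines them to a line:

$$ \alpha_i y^{(i)} + \alpha_j y^{(j)} = -\sum_{k \neq i,j} \alpha_k y^{(k)} = \zeta \quad \Longrightarrow \quad \alpha_i = \left( \zeta - \alpha_j y^{(j)} \right) y^{(i)}, $$

since \( (y^{(i)})^2 = 1 \). Substituting this expression into \( W(\alpha) \) reduces the pairwise update to a one-dimensional quadratic in \( \alpha_j \), which can be maximized in closed form and then clipped to satisfy \( 0 \leq \alpha_j \leq C \).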

SMO Algorithm Steps

  1. Select \( \alpha_i, \alpha_j \) (using heuristics).
  2. Hold all \( \alpha_k \)'s fixed except \( \alpha_i, \alpha_j \).
  3. Optimize \( W(\alpha) \) with respect to \( \alpha_i, \alpha_j \) (see the sketch after this list).
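
The joint update in step 3 has a closed-form solution. Below is a sketch of one such pairwise update following the standard simplified SMO formulas, written for a pair \( (i, j) \) that has already been selected; the function name, the precomputed kernel matrix K, and the omission of the selection heuristics are simplifications of mine, not part of the algorithm as stated above.

    import numpy as np

    def smo_pair_update(i, j, alpha, b, y, K, C, tol=1e-5):
        """Jointly re-optimize (alpha_i, alpha_j) of the dual, holding all other alpha_k fixed."""
        f = lambda k: (alpha * y) @ K[:, k] + b    # decision value for training example k
        E_i, E_j = f(i) - y[i], f(j) - y[j]        # prediction errors

        # Bounds on alpha_j implied by 0 <= alpha <= C and sum_k alpha_k y_k = 0.
        if y[i] != y[j]:
            L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
        else:
            L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
        eta = 2.0 * K[i, j] - K[i, i] - K[j, j]    # curvature of W along the constraint line
        if L == H or eta >= 0:
            return alpha, b                        # this pair cannot make progress

        a_i_old, a_j_old = alpha[i], alpha[j]
        alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
        if abs(alpha[j] - a_j_old) < tol:
            alpha[j] = a_j_old
            return alpha, b
        alpha[i] = a_i_old + y[i] * y[j] * (a_j_old - alpha[j])   # keeps sum_k alpha_k y_k = 0

        # Recompute the threshold b so the KKT conditions hold for the updated multipliers.
        b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] - y[j] * (alpha[j] - a_j_old) * K[i, j]
        b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] - y[j] * (alpha[j] - a_j_old) * K[j, j]
        if 0 < alpha[i] < C:
            b = b1
        elif 0 < alpha[j] < C:
            b = b2
        else:
            b = (b1 + b2) / 2.0
        return alpha, b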

Note: In this post, "gradient descent" refers to batch gradient descent.
