Anshul Yadav

Using SPSA for optimizing Neural Networks

Date: June 21, 2019

Neural networks are at the core of deep learning, but training them is usually tied to the back-propagation algorithm, which requires the derivative of the loss function with respect to the network parameters. In this post, I will show that neural networks are not limited to back-propagation: we can use Simultaneous Perturbation Stochastic Approximation (SPSA) to obtain noisy gradient estimates instead.

This technique is especially useful when the gradient of the loss function is very expensive to compute, or when the loss is not differentiable at all. Gradient descent and other popular optimizers such as Adam and RMSProp all require the gradient.

Simultaneous Perturbation Stochastic Approximation (SPSA)

Let us formulate our optimization problem as:

$$\hat{x} = \arg\min_x f(x)$$

Gradient descent is a well-known optimization algorithm for finding the minimum of a convex function. It takes steps of the form:

$$x_{t+1} = x_t - a_t \, \nabla f(x_t)$$

where $a_t$ is the learning rate at step $t$.
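
To make the update rule concrete, here is a minimal sketch of gradient descent on a toy convex function $f(x) = \|x\|^2$; the function, learning rate and iteration count are illustrative choices, not part of the original post.

    import numpy as np

    # Plain gradient descent on f(x) = ||x||^2, whose exact gradient is 2x.
    f = lambda x: np.sum(x**2)
    grad_f = lambda x: 2 * x

    x = np.array([3.0, -2.0])
    a_t = 0.1                        # constant learning rate for this toy example
    for t in range(100):
        x = x - a_t * grad_f(x)      # the update rule above
    print(x)                         # ends up close to the minimizer [0, 0]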

From first principles of calculus, the gradient $g(x) = \nabla f(x)$ can be written as a central difference:

$$g(x) = \lim_{\epsilon \to 0} \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$$

Now, SPSA takes advantage of the above equation and replaces the limit with a small but finite perturbation $\delta$, giving an approximate (noisy) gradient:

$$\hat{g}(x) = \frac{f(x+\delta) - f(x-\delta)}{2\delta}$$

When $x$ is a vector, a single random perturbation vector $\delta$ shifts all coordinates at once (hence "simultaneous perturbation"), and the division by $2\delta$ is taken element-wise.
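
To build intuition, here is a minimal sketch comparing this estimator with the exact gradient on the same toy function $f(x) = \|x\|^2$; the perturbation size and test point are my own choices.

    import numpy as np

    f = lambda x: np.sum(x**2)                            # exact gradient is 2x
    x = np.array([1.0, -2.0, 0.5])

    c = 1e-3                                              # small perturbation size
    delta = np.random.choice([-1.0, 1.0], size=x.shape)   # random +/-1 perturbation
    # Note: only two evaluations of f, regardless of the dimension of x.
    g_hat = (f(x + c * delta) - f(x - c * delta)) / (2 * c * delta)

    print(g_hat)    # a single, noisy estimate
    print(2 * x)    # exact gradient for comparison

A single estimate is noisy, but it matches the true gradient on average over the random $\delta$ (up to terms that vanish as the perturbation shrinks), which is why repeated SPSA updates still make progress.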

The reader should keep in mind that back-propagation must sweep the network in a fixed order, propagating gradients from the output layer back towards the input. SPSA imposes no such constraint: each layer's weights are perturbed and updated independently, so the layers can be visited in any order, and we often simply sweep them from top to bottom while training with SPSA.

Pseudocode for the SPSA Update


    x_0 \in \mathbb{R}^m
    for t = 1 to t_{max}:
        set a_t = \frac{a}{t^\alpha}
        set c_t = \frac{c}{t^\gamma}
        randomly sample \delta \sim \mathcal{U}(\{-1, +1\}^m)
        set x^+ = x_t + c_t \delta
        set x^- = x_t - c_t \delta
        compute \hat{g}(x_t) using:
            \hat{g}(x_t) = \frac{f(x^+) - f(x^-)}{2 c_t \delta}
        update:
            x_{t+1} = x_t - a_t \hat{g}(x_t)
    
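
Before moving to the neural-network code, here is a minimal, self-contained sketch of this loop on a toy quadratic. The constants $a$, $c$ and the decay exponents $\alpha = 0.602$, $\gamma = 0.101$ (values commonly recommended in the SPSA literature) are illustrative choices, not prescribed by the post.

    import numpy as np

    def spsa_minimize(f, x0, a=0.1, c=0.1, alpha=0.602, gamma=0.101, t_max=1000):
        # Minimize f starting from x0 using the SPSA update in the pseudocode above.
        x = np.array(x0, dtype=float)
        for t in range(1, t_max + 1):
            a_t = a / t**alpha                                 # decaying step size
            c_t = c / t**gamma                                 # decaying perturbation size
            delta = np.random.choice([-1.0, 1.0], size=x.shape)
            g_hat = (f(x + c_t * delta) - f(x - c_t * delta)) / (2 * c_t * delta)
            x = x - a_t * g_hat
        return x

    # Example: a shifted quadratic whose minimizer is [1, -3].
    f = lambda x: np.sum((x - np.array([1.0, -3.0]))**2)
    print(spsa_minimize(f, x0=[0.0, 0.0]))    # should land near [1, -3]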

Python Code


    # Assumes `import numpy as np` at module level, and a class providing
    # self.input_dim, self.output_dim, the SPSA constants self.a, self.alpha,
    # self.c, self.gamma, plus self.forward(inputs, weights) and
    # self.loss(predictions, targets).
    def train(self, inputs, targets, num_epochs, t_max=200):
        np.random.seed(1)
        # A single-layer network: one weight matrix, kept in a list so that
        # further layers can be appended later.
        weights = [np.random.randn(self.input_dim, self.output_dim)]

        for epoch in range(num_epochs):
            for l in range(len(weights)):           # run SPSA layer by layer
                for t in range(1, t_max):
                    a_t = self.a / t**self.alpha    # decaying step size
                    c_t = self.c / t**self.gamma    # decaying perturbation size
                    # Each entry of delta is -1 or +1 with equal probability
                    delta = np.random.binomial(1, 0.5, size=weights[l].shape) * 2. - 1.

                    # Build fresh perturbed copies of the *current* weights every step
                    W_p = [np.copy(w) for w in weights]
                    W_m = [np.copy(w) for w in weights]
                    W_p[l] += c_t * delta
                    W_m[l] -= c_t * delta

                    loss_p = self.loss(self.forward(inputs, W_p), targets)
                    loss_m = self.loss(self.forward(inputs, W_m), targets)

                    # SPSA gradient estimate: element-wise division by delta
                    g_hat = (loss_p - loss_m) / (2. * c_t * delta)
                    weights[l] -= a_t * g_hat

            rmse = np.sqrt(np.mean((self.forward(inputs, weights) - targets)**2))
            print("RMSE loss:", rmse)
        return weights
    
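
The method above is only a fragment of a larger class, which the post does not show. As a hypothetical sketch of the interface it assumes, a minimal wrapper could look like the following; all names and hyperparameter values here are my own assumptions.

    import numpy as np

    class SPSANet:
        """Hypothetical single-layer network trained with the `train` method above."""

        def __init__(self, input_dim, output_dim, a=0.01, c=0.01, alpha=0.602, gamma=0.101):
            self.input_dim, self.output_dim = input_dim, output_dim
            self.a, self.c, self.alpha, self.gamma = a, c, alpha, gamma

        def forward(self, inputs, weights):
            return inputs @ weights[0]                    # a single linear layer

        def loss(self, predictions, targets):
            return np.mean((predictions - targets)**2)    # SPSA only needs this scalar

        # the train(...) method shown above would also be defined here

    # Usage sketch on synthetic linear data:
    X = np.random.randn(100, 3)
    y = X @ np.random.randn(3, 1)
    net = SPSANet(input_dim=3, output_dim=1)
    # weights = net.train(X, y, num_epochs=20)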

References

[Slides of Christian Bauckhage on using SPSA for solving Neural Networks](https://www.researchgate.net/profile/Christian_Bauckhage/project/lectures-on-pattern-recognition/attachment/5a5c9cf14cde266d58831886/AS:583039802462208@1516018929842/download/lecture-18-add-on-2.pdf?context=ProjectUpdatesLog)