Using SPSA for optimizing Neural Networks
Date: June 21, 2019
Neural Networks are at the core of deep learning. But they are often assumed to be tied to the back-propagation algorithm, which requires the derivative of the loss function with respect to the network parameters. In this post, I will show that Neural Networks are not limited to back-propagation: we can use Simultaneous Perturbation Stochastic Approximation (SPSA) to obtain noisy gradient estimates instead.
This technique is especially useful when computing the gradient of the loss function is very expensive, or when the loss is not differentiable at all, since gradient descent and popular optimizers such as Adam or RMSProp all require that gradient.
Simultaneous Perturbation Stochastic Approximation (SPSA)
Let us formulate our optimization problem as:
$$\hat{x} = \arg\min_x f(x)$$
Gradient descent is a well-known algorithm for finding the minimum of a convex function. It takes steps of the form:
$$x_{t+1} = x_t - \eta \nabla f(x_t)$$
From the first-principles definition of the derivative, this gradient (written here for a scalar $x$) is the limit of a symmetric difference quotient:
$$g(x) = \lim_{\epsilon \to 0} \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$$
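As a quick numerical check (a toy example of my own, not from the referenced slides), the symmetric difference quotient converges to the true derivative as $\epsilon$ shrinks:

```python
import numpy as np

def f(x):
    return np.sin(x)  # toy function with known derivative f'(x) = cos(x)

x = 1.0
for eps in (1e-1, 1e-3, 1e-6):
    g_hat = (f(x + eps) - f(x - eps)) / (2 * eps)  # symmetric difference quotient
    print(eps, g_hat)  # estimates approach cos(1) ~ 0.5403 as eps shrinks
```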
SPSA exploits this relation: instead of taking the limit, it uses a small but finite perturbation $\delta$ to form an approximate (noisy) gradient:
$$\hat{g}(x) = \frac{f(x+\delta) - f(x-\delta)}{2\delta}$$
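For a vector-valued $x$, SPSA perturbs all coordinates at once with a random $\pm 1$ direction $\delta$ and divides elementwise by $\delta$, so a single pair of function evaluations gives a noisy estimate of the entire gradient. A minimal sketch on a toy quadratic (my own example):

```python
import numpy as np

def f(x):
    return np.sum(x**2)  # toy loss; the exact gradient is 2x

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])
c = 1e-3                                        # perturbation size
delta = rng.choice([-1.0, 1.0], size=x.shape)   # random +/-1 (Rademacher) direction
g_hat = (f(x + c * delta) - f(x - c * delta)) / (2 * c * delta)  # elementwise division
print(g_hat)   # noisy gradient estimate; equals 2x on average over delta
print(2 * x)   # exact gradient for comparison
```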
Keep in mind that back-propagation computes gradients bottom-to-top, layer by layer from the loss backwards, whereas SPSA imposes no such ordering because each layer's gradient is estimated from forward passes alone. We are therefore free to sweep the layers top-to-bottom when training with SPSA.
Pseudo Code for SPSA Update
x_0 \in \mathbb{R}^m
for t = 1 to t_{max}:
    set a_t = \frac{a}{t^\alpha}
    set c_t = \frac{c}{t^\gamma}
    randomly sample \delta \sim \mathcal{U}(\{-1, +1\}^m)
    set x^+ = x_t + c_t \delta
    set x^- = x_t - c_t \delta
    compute \hat{g}(x_t) = \frac{f(x^+) - f(x^-)}{2 c_t \delta}
    update x_{t+1} = x_t - a_t \hat{g}(x_t)
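Before moving to neural networks, the pseudo code can be exercised on a simple quadratic. Below is a minimal sketch of my own; the gains $a$ and $c$ are illustrative, while $\alpha = 0.602$ and $\gamma = 0.101$ are the values commonly recommended in the SPSA literature:

```python
import numpy as np

def f(x):
    return np.sum((x - 3.0)**2)   # toy objective, minimised at x = [3, 3, 3, 3]

rng = np.random.default_rng(1)
x = np.zeros(4)                   # x_0
a, c, alpha, gamma = 0.1, 0.1, 0.602, 0.101
t_max = 200

for t in range(1, t_max + 1):
    a_t = a / t**alpha                                # decaying step size
    c_t = c / t**gamma                                # decaying perturbation size
    delta = rng.choice([-1.0, 1.0], size=x.shape)     # random +/-1 direction
    g_hat = (f(x + c_t * delta) - f(x - c_t * delta)) / (2 * c_t * delta)
    x = x - a_t * g_hat                               # SPSA update

print(x)  # close to [3, 3, 3, 3] up to a small noise floor
```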
Python Code
def train(self, inputs, targets, num_epochs, t_max=200):
    # Initialise a single weight matrix; further layers could be appended to the list.
    np.random.seed(1)
    weights = [np.random.randn(self.input_dim, self.output_dim)]
    for epoch in range(num_epochs):
        for l in range(len(weights)):
            for t in range(1, t_max):
                a_t = self.a / t**self.alpha    # decaying step size
                c_t = self.c / t**self.gamma    # decaying perturbation size
                # Random +/-1 (Rademacher) perturbation with the shape of layer l.
                delta = np.random.binomial(1, 0.5, size=weights[l].shape) * 2. - 1
                # Re-copy from the current weights each iteration so the
                # perturbations do not accumulate.
                W_p = [np.copy(w) for w in weights]
                W_m = [np.copy(w) for w in weights]
                W_p[l] += c_t * delta
                W_m[l] -= c_t * delta
                loss_p = self.loss(self.forward(inputs, W_p), targets)
                loss_m = self.loss(self.forward(inputs, W_m), targets)
                # SPSA gradient estimate (elementwise division by delta).
                g_hat = (loss_p - loss_m) / (2 * c_t * delta)
                weights[l] -= a_t * g_hat
        # Report the root-mean-squared error once per epoch.
        rmse = np.sqrt(np.mean((self.forward(inputs, weights) - targets)**2))
        print("RMSE Loss is:", rmse)
    return weights
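The train method relies on hyperparameters (a, c, alpha, gamma) and on forward and loss helpers that are not shown in the post. The following is a hypothetical minimal skeleton of my own, sketching one way those pieces could fit together for a single linear layer; it is not the original class:

```python
import numpy as np

class SPSANet:
    """Hypothetical single-linear-layer network trained with SPSA (a sketch, not the post's original class)."""

    def __init__(self, input_dim, output_dim, a=0.1, c=0.1, alpha=0.602, gamma=0.101):
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.a, self.c = a, c                  # gain constants for the decaying schedules
        self.alpha, self.gamma = alpha, gamma  # decay exponents

    def forward(self, inputs, weights):
        # Single linear layer; a deeper network would chain several weight matrices here.
        return inputs @ weights[0]

    def loss(self, predictions, targets):
        # Mean squared error.
        return np.mean((predictions - targets)**2)

    # ... plus the train(...) method shown above.
```

With the train method added to this class, a call such as `SPSANet(input_dim=3, output_dim=1).train(X, y, num_epochs=20)` would fit a small regression problem using only forward passes.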
References
[Slides of Christian Bauckhage on using SPSA for solving Neural Networks](https://www.researchgate.net/profile/Christian_Bauckhage/project/lectures-on-pattern-recognition/attachment/5a5c9cf14cde266d58831886/AS:583039802462208@1516018929842/download/lecture-18-add-on-2.pdf?context=ProjectUpdatesLog)