Gradient Descent Algorithms
Gradient descent is an iterative first-order optimization algorithm used to find a local minimum of a differentiable function. It is the workhorse of modern machine learning.
1. First-Order Methods
Basic Gradient Descent
The update rule is simple: move in the direction opposite to the gradient, x ← x − η ∇f(x).
- η (Learning Rate): A crucial hyperparameter. Too large, and the algorithm diverges; too small, and convergence is painfully slow.
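The update rule above can be sketched in a few lines of NumPy. The function names and the example objective here are illustrative, not from any particular library:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly apply x <- x - lr * grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
# x_min converges toward 3.0
```

Try raising `lr` above 1.0 on this example to see divergence in action: each step overshoots the minimum by more than the previous error.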
Stochastic Gradient Descent (SGD)
Instead of computing the gradient for the entire dataset, SGD computes it for a single random sample (or a mini-batch).
- Benefit: Significantly faster for large datasets and adds "noise" that can help escape sharp local minima.
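A minimal mini-batch SGD sketch, fitting a line to synthetic data (the data, batch size, and learning rate are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = 2x + 1 plus a little noise.
X = rng.normal(size=(200, 1))
y = 2 * X[:, 0] + 1 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0
lr, batch = 0.05, 16
for epoch in range(200):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch):
        i = idx[start:start + batch]
        err = (w * X[i, 0] + b) - y[i]
        # Mini-batch gradients of the mean squared error.
        w -= lr * 2 * np.mean(err * X[i, 0])
        b -= lr * 2 * np.mean(err)
# w approaches 2, b approaches 1
```

Each update looks at only 16 of the 200 samples, which is where both the speedup and the gradient noise come from.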
π‘ 2. Acceleration and Adaptation
Momentum
Momentum helps accelerate SGD in the relevant direction and dampens oscillations by adding a fraction of the previous update to the current one.
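The "fraction of the previous update" idea translates directly into code. This is a sketch of the classic heavy-ball form; the test function is an elongated quadratic chosen to show the oscillation-damping effect:

```python
import numpy as np

def sgd_momentum(grad, x0, lr=0.01, beta=0.9, steps=500):
    """Momentum: a velocity term accumulates a fraction of past updates."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v - lr * grad(x)   # beta * previous update + new step
        x = x + v
    return x

# Ill-conditioned quadratic f(x) = 0.5 * (100*x0^2 + x1^2):
# plain GD oscillates along the steep axis; momentum damps this.
x_min = sgd_momentum(lambda x: np.array([100 * x[0], x[1]]), x0=[1.0, 1.0])
# x_min converges toward [0, 0]
```

With `beta=0` this reduces to plain gradient descent; values around 0.9 are the common default.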
Adam (Adaptive Moment Estimation)
Adam combines the benefits of AdaGrad (per-parameter adaptive learning rates) and RMSProp (moving averages of squared gradients). It tracks both the first moment (mean) and the second moment (uncentered variance) of the gradients.
- Why it's popular: It requires very little hyperparameter tuning and works well on most deep learning architectures.
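The two moment estimates and their bias corrections can be sketched as follows (a from-scratch illustration using the standard default hyperparameters, not any framework's implementation):

```python
import numpy as np

def adam(grad, x0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    """Adam: bias-corrected moving averages of the gradient and its square."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # first moment (mean of gradients)
    v = np.zeros_like(x)   # second moment (uncentered variance)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)          # correct startup bias toward zero
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

x_min = adam(lambda x: 2 * (x - 3), x0=np.array([0.0]))
```

Note the per-parameter division by `sqrt(v_hat)`: parameters with consistently large gradients get smaller effective steps, which is the "adaptive" part.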
3. Second-Order Methods
First-order methods only use the gradient. Second-order methods also use the Hessian (the matrix of second derivatives, ∇²f), which provides information about the curvature of the function.
Newton's Method
- Pros: Quadratic convergence (very fast near the optimum).
- Cons: Storing the Hessian costs O(n²) memory and solving with it costs O(n³) time for n parameters, which is infeasible for models with millions of parameters.
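A minimal Newton-step sketch. On a quadratic the Hessian is constant and a single step lands exactly on the minimum, which is the intuition behind the fast convergence (the example matrix is arbitrary):

```python
import numpy as np

def newton(grad, hess, x0, steps=10):
    """Newton's method: x <- x - H^{-1} grad(x), via a linear solve."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        # Solve H d = grad(x) instead of explicitly inverting H.
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Quadratic f(x) = 0.5 x^T A x - b^T x: gradient A x - b, Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = newton(lambda x: A @ x - b, lambda x: A, [0.0, 0.0], steps=1)
# one step solves A x = b exactly
```

Using `np.linalg.solve` rather than `np.linalg.inv` is the standard numerical practice, though the asymptotic cost is still cubic.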
Quasi-Newton Methods (L-BFGS)
These methods approximate the inverse Hessian rather than computing it directly. L-BFGS (Limited-memory BFGS) is widely used for optimization when the number of parameters is moderate.
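In practice L-BFGS is rarely hand-rolled; SciPy's `minimize` exposes it as the `L-BFGS-B` method. A sketch on the Rosenbrock function, a classic ill-conditioned test problem (the starting point is the conventional one):

```python
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    """Rosenbrock function: minimum at (1, 1)."""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosen_grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

res = minimize(rosen, x0=[-1.2, 1.0], jac=rosen_grad, method="L-BFGS-B")
# res.x converges to roughly [1.0, 1.0]
```

Only the gradient is supplied; the curvature information is built up internally from a limited history of gradient differences, which is what keeps the memory footprint linear in the number of parameters.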