Gradient Descent Algorithms
Gradient descent is an iterative first-order optimization algorithm used to find a local minimum of a differentiable function. It is the workhorse of modern machine learning.
1. First-Order Methods
Basic Gradient Descent
The update rule is simple: move in the direction opposite to the gradient, x ← x − η ∇f(x).
- η (Learning Rate): A crucial hyperparameter. Too large, and the algorithm diverges; too small, and convergence is painfully slow.
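The update rule above can be sketched in a few lines of NumPy. The function names and the example objective here are illustrative, not from any particular library:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly apply x <- x - lr * grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
# x_min converges toward 3.0
```

Try raising `lr` above 1.0 on this example to see divergence in action: each step overshoots the minimum by more than the previous error.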
Stochastic Gradient Descent (SGD)
Instead of computing the gradient for the entire dataset, SGD computes it for a single random sample (or a mini-batch).
- Benefit: Significantly faster for large datasets and adds "noise" that can help escape sharp local minima.
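A minimal mini-batch SGD sketch, fitting a line to synthetic data (the data, batch size, and learning rate are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = 2x + 1 plus a little noise.
X = rng.normal(size=(200, 1))
y = 2 * X[:, 0] + 1 + 0.01 * rng.normal(size=200)

w, b = 0.0, 0.0
lr, batch = 0.05, 16
for epoch in range(200):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch):
        i = idx[start:start + batch]
        err = (w * X[i, 0] + b) - y[i]
        # Mini-batch gradients of the mean squared error.
        w -= lr * 2 * np.mean(err * X[i, 0])
        b -= lr * 2 * np.mean(err)
# w approaches 2, b approaches 1
```

Each update looks at only 16 of the 200 samples, which is where both the speedup and the gradient noise come from.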
π‘ 2. Acceleration and Adaptation
Momentum
Momentum helps accelerate SGD in the relevant direction and dampens oscillations by adding a fraction of the previous update to the current one.
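The "fraction of the previous update" idea translates directly into code. This is a sketch of the classic heavy-ball form; the test function is an elongated quadratic chosen to show the oscillation-damping effect:

```python
import numpy as np

def sgd_momentum(grad, x0, lr=0.01, beta=0.9, steps=500):
    """Momentum: a velocity term accumulates a fraction of past updates."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v - lr * grad(x)   # beta * previous update + new step
        x = x + v
    return x

# Ill-conditioned quadratic f(x) = 0.5 * (100*x0^2 + x1^2):
# plain GD oscillates along the steep axis; momentum damps this.
x_min = sgd_momentum(lambda x: np.array([100 * x[0], x[1]]), x0=[1.0, 1.0])
# x_min converges toward [0, 0]
```

With `beta=0` this reduces to plain gradient descent; values around 0.9 are the common default.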
Adam (Adaptive Moment Estimation)
Adam combines the benefits of AdaGrad (per-parameter adaptive learning rates) and RMSProp (moving averages of squared gradients). It tracks both the first moment (mean) and the second moment (uncentered variance) of the gradients.
- Why it's popular: It requires very little hyperparameter tuning and works well on most deep learning architectures.
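The two moment estimates and their bias corrections can be sketched as follows (a from-scratch illustration using the standard default hyperparameters, not any framework's implementation):

```python
import numpy as np

def adam(grad, x0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    """Adam: bias-corrected moving averages of the gradient and its square."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)   # first moment (mean of gradients)
    v = np.zeros_like(x)   # second moment (uncentered variance)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)          # correct startup bias toward zero
        v_hat = v / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

x_min = adam(lambda x: 2 * (x - 3), x0=np.array([0.0]))
```

Note the per-parameter division by `sqrt(v_hat)`: parameters with consistently large gradients get smaller effective steps, which is the "adaptive" part.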
3. Second-Order Methods
First-order methods only use the gradient. Second-order methods also use the Hessian (the matrix of second derivatives, ∇²f), which provides information about the curvature of the function.
Newton's Method
- Pros: Quadratic convergence (very fast near the optimum).
- Cons: Storing the Hessian costs O(n²) memory and solving with it costs O(n³) time for n parameters, which is infeasible for models with millions of parameters.
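A minimal Newton-step sketch. On a quadratic the Hessian is constant and a single step lands exactly on the minimum, which is the intuition behind the fast convergence (the example matrix is arbitrary):

```python
import numpy as np

def newton(grad, hess, x0, steps=10):
    """Newton's method: x <- x - H^{-1} grad(x), via a linear solve."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        # Solve H d = grad(x) instead of explicitly inverting H.
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Quadratic f(x) = 0.5 x^T A x - b^T x: gradient A x - b, Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = newton(lambda x: A @ x - b, lambda x: A, [0.0, 0.0], steps=1)
# one step solves A x = b exactly
```

Using `np.linalg.solve` rather than `np.linalg.inv` is the standard numerical practice, though the asymptotic cost is still cubic.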
Quasi-Newton Methods (L-BFGS)
These methods approximate the inverse Hessian rather than computing it directly. L-BFGS (Limited-memory BFGS) is widely used for optimization when the number of parameters is moderate.
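In practice L-BFGS is rarely hand-rolled; SciPy's `minimize` exposes it as the `L-BFGS-B` method. A sketch on the Rosenbrock function, a classic ill-conditioned test problem (the starting point is the conventional one):

```python
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    """Rosenbrock function: minimum at (1, 1)."""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosen_grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])

res = minimize(rosen, x0=[-1.2, 1.0], jac=rosen_grad, method="L-BFGS-B")
# res.x converges to roughly [1.0, 1.0]
```

Only the gradient is supplied; the curvature information is built up internally from a limited history of gradient differences, which is what keeps the memory footprint linear in the number of parameters.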