Optimizers are algorithms used to adjust the attributes of a machine learning model or neural network, such as its weights and learning rate, in order to reduce the loss. This repo contains implementations of various optimizers with visualizations.
θ=θ−α⋅∇J(θ)
- Updates the model parameters frequently, so it converges in less time.
- Requires less memory, as there is no need to store the values of the loss function.
- High variance in model parameters.
- May overshoot even after reaching the global minimum.
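The update rule θ = θ − α⋅∇J(θ) can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from this repo: the helper name `sgd_step`, the toy data (noise-free samples of y = 3x), and the learning rate are all assumptions chosen for the example.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Apply one update: theta = theta - lr * grad."""
    return theta - lr * grad

# Hypothetical toy data: noise-free samples of y = 3x, fitted with one weight.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x

theta = np.array([0.0])
for epoch in range(20):
    # Visit the samples in a fresh random order and update after each one.
    for i in rng.permutation(len(x)):
        # Gradient of the per-sample loss 0.5 * (theta * x_i - y_i)**2.
        grad = (theta * x[i] - y[i]) * x[i]
        theta = sgd_step(theta, grad)
```

Because the parameters are updated after every single sample, `theta` moves toward 3 quickly, but each individual step follows only one sample's gradient, which is where the high variance comes from on noisy data.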
Momentum reduces the high variance of SGD and smooths the convergence. By accumulating past gradients, it accelerates convergence in the relevant direction and dampens fluctuations in irrelevant directions.
V(t) = γ · V(t−1) + α · ∇J(θ)
θ = θ − V(t)
- Reduces the high variance of the parameters.
- Converges faster than gradient descent.
- Adds one more hyperparameter (γ), which has to be chosen manually and carefully.
- May overshoot the minimum.
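The two momentum equations, V(t) = γV(t−1) + α·∇J(θ) and θ = θ − V(t), translate almost directly into code. The sketch below is an illustration under assumptions: the function names, the quadratic test objective J(θ) = 0.5·(θ₀² + 10·θ₁²), and the values of `lr` and `gamma` are hypothetical and not taken from this repo.

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    """One momentum update following V(t) = gamma*V(t-1) + lr*grad."""
    velocity = gamma * velocity + lr * grad
    theta = theta - velocity
    return theta, velocity

def grad_J(theta):
    # Gradient of the assumed objective J = 0.5 * (theta_0**2 + 10 * theta_1**2).
    return np.array([1.0, 10.0]) * theta

theta = np.array([5.0, 5.0])
velocity = np.zeros_like(theta)  # V(0) = 0
for _ in range(300):
    theta, velocity = momentum_step(theta, velocity, grad_J(theta))
```

The velocity term carries over a fraction γ of the previous step, so consistent gradient directions build up speed while alternating ones partly cancel out.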
RMSProp also tries to dampen the oscillations, but in a different way than momentum. It also removes the need to tune the learning rate by hand, adapting it automatically. Moreover, RMSProp chooses a different learning rate for each parameter.
V(t) = ρ · V(t−1) + (1 − ρ) · ∇J(θ)²
∇W(t) = −[α / (√V(t) + ϵ)] · ∇J(θ)
θ = θ + ∇W(t)
- Reduces the oscillations.
- Reduces overshooting.
- Adds one more hyperparameter (ρ), which has to be chosen manually and carefully.
- Slow convergence.
- Vanishing learning rate.
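As a rough sketch of the RMSProp equations, V(t) = ρV(t−1) + (1 − ρ)·∇J(θ)² followed by a step of −[α / (√V(t) + ϵ)]·∇J(θ), the update can be written as below. The function names, the quadratic test objective, and the hyperparameter values are assumptions made for this example only.

```python
import numpy as np

def rmsprop_step(theta, v, grad, lr=0.05, rho=0.9, eps=1e-8):
    """One RMSProp update with a running average of squared gradients."""
    v = rho * v + (1 - rho) * grad**2
    # Each parameter is scaled by its own sqrt(v), giving per-parameter step sizes.
    theta = theta - lr / (np.sqrt(v) + eps) * grad
    return theta, v

def grad_J(theta):
    # Gradient of the assumed objective J = 0.5 * (theta_0**2 + 10 * theta_1**2).
    return np.array([1.0, 10.0]) * theta

theta = np.array([5.0, 5.0])
v = np.zeros_like(theta)  # V(0) = 0
for _ in range(500):
    theta, v = rmsprop_step(theta, v, grad_J(theta))
```

Dividing by √V(t) normalizes each coordinate by its recent gradient magnitude, so the steep direction (θ₁) and the shallow direction (θ₀) advance at a similar pace here. With a constant α the iterate settles into a small neighborhood of the minimum rather than converging exactly.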
RMSProp and Momentum take contrasting approaches. While momentum accelerates the search in the direction of the minimum, RMSProp impedes the search in the direction of oscillations. Adam (Adaptive Moment Estimation) combines the heuristics of both Momentum and RMSProp.
m(t) = β1 · m(t−1) + (1 − β1) · ∇J(θ)
v(t) = β2 · v(t−1) + (1 − β2) · ∇J(θ)²
θ = θ − α · m(t) / (√v(t) + ϵ)
- Fast convergence.
- Rectifies the vanishing learning rate and high variance.
- Reduces the oscillations.
- Reduces overshooting.
- Computationally expensive.
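The three Adam equations for m(t), v(t), and θ can be sketched as follows. This example follows the equations exactly as written in this section, so it omits the bias-correction terms that the original Adam formulation applies to m(t) and v(t). The function names, the quadratic test objective, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def adam_step(theta, m, v, grad, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update without bias correction (as in the equations above)."""
    m = beta1 * m + (1 - beta1) * grad      # first moment: momentum-like average
    v = beta2 * v + (1 - beta2) * grad**2   # second moment: RMSProp-like average
    theta = theta - lr * m / (np.sqrt(v) + eps)
    return theta, m, v

def grad_J(theta):
    # Gradient of the assumed objective J = 0.5 * (theta_0**2 + 10 * theta_1**2).
    return np.array([1.0, 10.0]) * theta

theta = np.array([5.0, 5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for _ in range(1000):
    theta, m, v = adam_step(theta, m, v, grad_J(theta))
```

Keeping both m and v means storing two extra arrays the size of the parameters, which is the main source of the additional cost compared with plain SGD.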
Adam is the best of these optimizers: it trains neural networks in less time and more efficiently. However, SGD can beat Adam in terms of accuracy if it is given enough training time, although it takes much longer to get there.
Clone the repository, open a terminal in the same directory, and follow the instructions below.
- numpy
- matplotlib