Activation Functions
Sources:
Notations
- Suppose the input value of a neuron is $x$; the neuron is saturated when the absolute value $|x|$ is too large.
- The term "gradient", although its original meaning is the vector of all partial derivatives of a multivariate function, can also refer to the derivative when the function is univariate, or to a partial derivative when the function is multivariate.
TLDR
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don't expect much
- Don't use sigmoid
Sigmoid

The sigmoid function is often denoted as $\sigma(x) = \frac{1}{1 + e^{-x}}$.
Disadvantages:
- It's not zero-centered.
- It kills gradients when saturated. You can see from the figure that when $|x|$ is large, the gradient is almost 0.
Derivative of sigmoid

The derivative of sigmoid is:
$$\frac{d\sigma(x)}{dx} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$
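As a quick derivation sketch (standard chain rule, writing $\sigma(x) = (1 + e^{-x})^{-1}$):
$$\frac{d\sigma(x)}{dx} = -(1 + e^{-x})^{-2}\cdot(-e^{-x}) = \frac{1}{1 + e^{-x}}\cdot\frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$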
Code
```python
import torch
```
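An illustrative sketch using PyTorch's built-in `torch.sigmoid`: evaluating the function and its gradient with autograd shows the saturation at large $|x|$.

```python
import torch

# Evaluate sigmoid and its gradient at a few points via autograd.
x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)
y = torch.sigmoid(x)
y.sum().backward()

print(y)       # sigmoid values
print(x.grad)  # sigma(x) * (1 - sigma(x)); nearly 0 at x = -10 and x = 10
```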
Drawback: vanishing gradient

When $x = -10$ or $x = 10$, the gradient is almost zero.
Therefore, sigmoid "kills off" the gradients when $|x|$ is large.
Drawback: not zero-centered
Say the input to the sigmoid is $z = \sum_i w_i x_i + b$ and the loss is $L$.
Since $\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z}\, x_i$ and the inputs $x_i$ are all positive (they are, e.g., outputs of a previous sigmoid layer), every $\frac{\partial L}{\partial w_i}$ has the same sign as $\frac{\partial L}{\partial z}$.
Suppose there are two parameters $w_1$ and $w_2$. It means:
- When all $\frac{\partial L}{\partial w_i} < 0$, gradient descent increases both weights, so we can only move roughly in the northeast direction of the parameter space.
- When all $\frac{\partial L}{\partial w_i} > 0$, gradient descent decreases both weights, so we can only move roughly in the southwest direction of the parameter space.
If our goal happens to be in the northwest, we can only get there in a zig-zagging fashion, just like parallel parking in a narrow space. (Forgive my drawing.)
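A small illustrative sketch (with made-up numbers): with all-positive inputs, the gradients for $w_1$ and $w_2$ always come out with the same sign, whatever the loss is.

```python
import torch

# All-positive inputs, e.g. outputs of a previous sigmoid layer.
x = torch.tensor([0.7, 0.2])
w = torch.tensor([0.5, -1.0], requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)

z = w @ x + b
loss = (torch.sigmoid(z) - 1.0) ** 2   # an arbitrary scalar loss
loss.backward()

# dL/dw_i = dL/dz * x_i, and every x_i > 0, so both entries share one sign.
print(w.grad)
```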

Tanh

Advantage: it's zero-centered.
Disadvantage: It kills gradients when saturated.
Derivative of tanh

The derivative of the tanh function is:
$$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$$
The derivation is as follows, starting from $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$.
Applying the quotient rule (where $u = e^x - e^{-x}$ and $v = e^x + e^{-x}$):
$$\frac{d}{dx}\tanh(x) = \frac{u'v - uv'}{v^2}$$
Calculating $u' = e^x + e^{-x}$ and $v' = e^x - e^{-x}$.
Substituting these into the quotient rule:
$$\frac{d}{dx}\tanh(x) = \frac{(e^x + e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2}$$
Simplifying, we find:
$$\frac{d}{dx}\tanh(x) = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2}$$
This simplifies further to:
$$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$$
Code
```python
import torch
```
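An illustrative sketch with PyTorch's built-in `torch.tanh`: autograd reproduces $1 - \tanh^2(x)$ and shows the saturation at large $|x|$.

```python
import torch

# Evaluate tanh and its gradient via autograd.
x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)
y = torch.tanh(x)
y.sum().backward()

print(x.grad)                          # autograd gradient
print(1 - torch.tanh(x.detach())**2)   # matches 1 - tanh(x)^2
```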
ReLU

Rectified Linear Unit (ReLU), $f(x) = \max(0, x)$:
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid
Disadvantages:
- Not zero-centered output
- An annoyance: the gradient is zero when $x < 0$, so some parameters will never be updated (the "dead ReLU" problem).
Derivative of ReLU

The derivative of ReLU is:
- $1$ if $x > 0$.
- $0$ if $x < 0$.

The derivative doesn't exist at $x = 0$; in practice, frameworks such as PyTorch simply use $0$ there.
Code
```python
import torch
```
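An illustrative sketch using `torch.relu`, showing the piecewise-constant gradient:

```python
import torch

# ReLU and its gradient: 0 for x < 0, 1 for x > 0 (0 at exactly x = 0 in PyTorch).
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0], requires_grad=True)
y = torch.relu(x)
y.sum().backward()

print(y)
print(x.grad)
```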
SiLU
The SiLU (Sigmoid Linear Unit) function, also known as the Swish activation function, is defined as:
$$\text{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}$$
Derivative of SiLU
The derivative of SiLU (by the product rule) is:
$$\frac{d}{dx}\,\text{SiLU}(x) = \sigma(x) + x\,\sigma(x)\bigl(1 - \sigma(x)\bigr) = \sigma(x)\bigl(1 + x\,(1 - \sigma(x))\bigr)$$
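An illustrative sketch with PyTorch's `torch.nn.functional.silu` (available in recent PyTorch versions), checking the closed-form derivative against autograd:

```python
import torch
import torch.nn.functional as F

# SiLU(x) = x * sigmoid(x); derivative is sigmoid(x) * (1 + x * (1 - sigmoid(x))).
x = torch.tensor([-2.0, 0.0, 2.0], requires_grad=True)
y = F.silu(x)
y.sum().backward()

s = torch.sigmoid(x.detach())
print(x.grad)
print(s * (1 + x.detach() * (1 - s)))  # matches x.grad
```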
Parametric ReLU
$$f(x) = \max(\alpha x,\ x)$$
where $\alpha$ is a small positive slope for the negative region.
When $\alpha$ is a fixed constant (e.g. $\alpha = 0.01$), this is the Leaky ReLU; when $\alpha$ is a learnable parameter, it is the Parametric ReLU (PReLU).
Advantages:
- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- Will not “die”
Derivative of Parametric ReLU

The derivative of the Leaky ReLU function with respect to $x$ is:
$$f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{if } x < 0 \end{cases}$$
(As with ReLU, the derivative is undefined at $x = 0$.)
Code
```python
import torch
```
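An illustrative sketch using `torch.nn.functional.leaky_relu` with fixed $\alpha = 0.01$ (`torch.nn.PReLU` would make $\alpha$ learnable instead):

```python
import torch
import torch.nn.functional as F

# Leaky ReLU with slope 0.01 on the negative side.
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = F.leaky_relu(x, negative_slope=0.01)
y.sum().backward()

print(y)
print(x.grad)  # 0.01 where x < 0, 1 where x > 0
```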
Softplus
The softplus function is defined as:
$$\text{softplus}(x) = \log(1 + e^{x})$$
Note: the base of the logarithm is $e$, i.e. it is the natural logarithm.
Derivative of softplus

Writing $\text{softplus}(x) = \log(u)$ with inner function $u = 1 + e^{x}$:
Outer function derivative ($\log$): $\frac{d}{du}\log(u) = \frac{1}{u}$.
Inner function derivative: $\frac{du}{dx} = e^{x}$.
Applying the chain rule:
$$\frac{d}{dx}\,\text{softplus}(x) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}} = \sigma(x)$$
Code
```python
import math

def softplus(x):
    # softplus(x) = log(1 + exp(x)); log1p improves accuracy when exp(x) is small
    return math.log1p(math.exp(x))
```
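As a quick sanity check (a sketch using only the standard library), a finite-difference derivative of softplus matches the sigmoid:

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Central finite difference of softplus at an arbitrary point.
x, h = 0.7, 1e-6
numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)
print(numeric, sigmoid(x))  # approximately equal
```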
Softmax
Image from Thomas's article

The softmax function for a vector $z = (z_1, \dots, z_K)$ is defined elementwise as:
$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
The output is a $K$-dimensional vector whose entries lie in $(0, 1)$ and sum to $1$, i.e. a probability distribution.
Derivative of softmax
Recalling that for a function $f: \mathbb{R}^n \to \mathbb{R}^m$ with components $f_1, \dots, f_m$ of variables $x_1, \dots, x_n$:
The Jacobian matrix is $J \in \mathbb{R}^{m \times n}$ with entries $J_{ij} = \frac{\partial f_i}{\partial x_j}$.
Consider the softmax situation, where $s_i = \text{softmax}(z)_i$ and we need $\frac{\partial s_i}{\partial z_j}$:
- When $i = j$: $\frac{\partial s_i}{\partial z_j} = s_i\,(1 - s_j)$.
- When $i \neq j$: $\frac{\partial s_i}{\partial z_j} = -\,s_i\,s_j$.
Thus, we obtain
$$\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j), \qquad J = \operatorname{diag}(s) - s\,s^\top$$
Proof:
Case 1: $i = j$.
Using the quotient rule with $u = e^{z_i}$ and $v = \sum_{k} e^{z_k}$ (so that $\frac{\partial u}{\partial z_i} = e^{z_i}$ and $\frac{\partial v}{\partial z_i} = e^{z_i}$):
$$\frac{\partial s_i}{\partial z_i} = \frac{e^{z_i}\sum_{k} e^{z_k} - e^{z_i}\,e^{z_i}}{\left(\sum_{k} e^{z_k}\right)^2} = s_i - s_i^2 = s_i\,(1 - s_i)$$
Case 2: $i \neq j$. Here the numerator $e^{z_i}$ does not depend on $z_j$, so
$$\frac{\partial s_i}{\partial z_j} = e^{z_i}\cdot\frac{-\,e^{z_j}}{\left(\sum_{k} e^{z_k}\right)^2} = -\,s_i\,s_j$$
Code
```python
import torch

def softmax_derivative(z):
    # Jacobian of softmax: diag(s) - s s^T
    s = torch.softmax(z, dim=0)
    matrix1 = torch.diag(s)       # diag(s)
    matrix2 = torch.outer(s, s)   # s s^T
    return matrix1 - matrix2

z = torch.tensor([1.0, 2.0, 3.0])
print(softmax_derivative(z))
```
Note that in `softmax_derivative`, the Jacobian matrix is computed as `matrix1 - matrix2`, where `matrix1` is $\operatorname{diag}(s)$ and `matrix2` is the outer product $s\,s^\top$.
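As an optional check (an illustrative sketch), `torch.autograd.functional.jacobian` reproduces the same matrix:

```python
import torch
from torch.autograd.functional import jacobian

z = torch.tensor([1.0, 2.0, 3.0])
s = torch.softmax(z, dim=0)

# Autograd Jacobian of softmax vs. the analytic form diag(s) - s s^T.
auto_jac = jacobian(lambda t: torch.softmax(t, dim=0), z)
manual_jac = torch.diag(s) - torch.outer(s, s)
print(torch.allclose(auto_jac, manual_jac))  # True
```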