Activation Functions

Sources:

  1. Stanford CS231n, Lecture 6

Notations

  • Suppose the input value of a neuron is $x$; the neuron is saturated when the absolute value $|x|$ is very large.
  • The term gradient, although its original meaning is the set of all partial derivatives of a multivariate function, can refer to the derivative when the function is univariate, and to a partial derivative when the function is multivariate.

TLDR

  • Use ReLU. Be careful with your learning rates
  • Try out Leaky ReLU / Maxout / ELU
  • Try out tanh but don't expect much
  • Don't use sigmoid

Sigmoid

Figure 1

The sigmoid function is often denoted as $\sigma(\cdot)$: $\sigma(x) = \frac{1}{1 + e^{-x}}$. The range of the sigmoid function is $(0, 1)$.

Disadvantages:

  1. It's not zero-centered.
  2. It kills gradients when saturated. You can see in the figure that when $|x| = 10$, the gradient is almost 0.

Derivative of sigmoid

Figure 2

The derivative of sigmoid is: $$\frac{d\sigma(x)}{dx} = \frac{d}{dx}\left(\frac{1}{1+e^{-x}}\right) = \frac{d}{dx}(1+e^{-x})^{-1} = -(1+e^{-x})^{-2}\cdot\frac{d}{dx}(1+e^{-x}) = -(1+e^{-x})^{-2}\cdot(-e^{-x}) = \frac{e^{-x}}{(1+e^{-x})^2} = \sigma(x)(1-\sigma(x))$$ The range of the derivative of the sigmoid function is $(0, 0.25]$.

Code

import torch

def sigmoid(x):
    sigmoid.__name__ = "sigmoid(x)"
    return 1 / (1 + torch.exp(-x))

def sigmoid_derivative(x):
    '''
    The first-order derivative of `sigmoid(x)` with respect to x.
    '''
    sigmoid_derivative.__name__ = "first-order derivative of `sigmoid(x)`"
    s = sigmoid(x)
    return s * (1 - s)
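
A quick sanity check (a sketch, not part of the original notes): compare these functions against torch.sigmoid and against the gradient computed by autograd.

x = torch.linspace(-5, 5, steps=11, requires_grad=True)
y = sigmoid(x)
print(torch.allclose(y, torch.sigmoid(x)))                     # True: matches the built-in
y.sum().backward()                                             # autograd gradient of sum(sigmoid(x)) w.r.t. x
print(torch.allclose(x.grad, sigmoid_derivative(x.detach())))  # True: matches the analytical derivative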

Drawback: vanishing gradient

Figure 3

Drawback:

When $x = 10$ or $x = -10$, the gradient tends to be zero.

Therefore, sigmoid "kills off" the gradients when $|x|$ is large.
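
For instance, plugging concrete values into the sigmoid_derivative defined above (a small illustration, not in the original notes):

x = torch.tensor([-10.0, 0.0, 10.0])
print(sigmoid_derivative(x))   # roughly [4.5e-05, 0.25, 4.5e-05]: near zero at |x| = 10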

Drawback: not zero-centered

Source

Say the input to the sigmoid is $f = \sum_i w_i x_i + b$: $$\frac{\partial f}{\partial w_i} = x_i, \qquad \frac{dL}{dw_i} = \frac{dL}{df}\frac{\partial f}{\partial w_i} = \frac{dL}{df}\, x_i$$ where $L$ is the sigmoid function (i.e., $L = \sigma(f)$).

Since $\frac{dL}{df} \in (0, 0.25]$, we have $\frac{dL}{df} > 0$, so the gradient $\frac{dL}{dw_i}$ always has the same sign as $x_i$.

Suppose there are two parameters $w_1$ and $w_2$, and $x_1 > 0, x_2 > 0$ or $x_1 < 0, x_2 < 0$; then the gradients of the two dimensions are always of the same sign (i.e., either both are positive or both are negative).

It means:

  1. When all $x_i > 0$, we can only move roughly in the northeast direction of the parameter space.
  2. When all $x_i < 0$, we can only move roughly in the southwest direction of the parameter space.

If our goal happens to be in the northwest, we can only move in a zig-zagging fashion to get there, just like parallel parking in a narrow space. (forgive my drawing)

Figure 4
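
The same-sign behaviour can be reproduced with autograd on a toy example (hypothetical values, not from the original notes; $L$ is a single sigmoid neuron as above):

x = torch.tensor([2.0, 3.0])                  # all inputs positive
w = torch.tensor([0.5, -1.0], requires_grad=True)
b = torch.tensor(0.1)
L = torch.sigmoid(torch.dot(w, x) + b)        # L = sigmoid(w1*x1 + w2*x2 + b)
L.backward()
print(w.grad)                                 # both components are positive: dL/df > 0 and x_i > 0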

Tanh


Figure 5

$$\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1} = 1 - \frac{2}{e^{2x} + 1}$$ The range of the tanh function is $(-1, 1)$.

Advantage: it's zero-centered.

Disadvantage: It kills gradients when saturated.

Derivative of tanh

Figure 6

The derivative of the tanh function $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ with respect to $x$ is $\frac{d}{dx}\tanh(x) = 1 - \tanh(x)^2$. The range of the derivative of tanh is $(0, 1]$.

The derivation is: $$\frac{d}{dx}\tanh(x) = \frac{d}{dx}\left(\frac{e^x - e^{-x}}{e^x + e^{-x}}\right)$$

Applying the quotient rule (where $f(x) = e^x - e^{-x}$ and $g(x) = e^x + e^{-x}$), we get $$\frac{d}{dx}\tanh(x) = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2}$$

Calculating $f'(x)$ and $g'(x)$: $$f'(x) = \frac{d}{dx}(e^x - e^{-x}) = e^x + e^{-x}, \qquad g'(x) = \frac{d}{dx}(e^x + e^{-x}) = e^x - e^{-x}$$

Substituting these into the quotient rule: $$\frac{d}{dx}\tanh(x) = \frac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2}$$

Simplifying, we find: $$\frac{d}{dx}\tanh(x) = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2}$$

This simplifies further to: $$\frac{d}{dx}\tanh(x) = 1 - \tanh(x)^2$$

Code

import torch

def tanh(x):
    tanh.__name__ = "tanh(x)"
    numerator = torch.exp(2*x) - 1
    denominator = torch.exp(2*x) + 1
    return numerator / denominator

def tanh_derivative(x):
    '''
    The first-order derivative of `tanh(x)` with respect to x.
    '''
    tanh_derivative.__name__ = "first-order derivative of `tanh(x)`"
    return 1 - torch.tanh(x)**2
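
One caveat (my addition, not in the original notes): exp(2x) overflows for large x, so this hand-rolled tanh returns NaN there, while torch.tanh stays stable. A quick check:

x = torch.linspace(-3, 3, steps=7)
print(torch.allclose(tanh(x), torch.tanh(x)))   # True on moderate inputs
print(tanh(torch.tensor(400.0)))                # nan: exp(800) overflows to inf, giving inf/inf
print(torch.tanh(torch.tensor(400.0)))          # tensor(1.): the built-in is numerically stable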

ReLU

Figure 7

Rectified Linear Unit (ReLU): $$f(x) = \max(0, x)$$

Advantages:

  1. Does not saturate (in the positive region)
  2. Very computationally efficient
  3. Converges much faster than sigmoid/tanh in practice (e.g. 6x)
  4. Actually more biologically plausible than sigmoid

Disadvantages:

  1. Not zero-centered output
  2. An annoyance: the gradient is zero when $x \le 0$, so some parameters may never be updated (the so-called "dead ReLU" problem).

Derivative of ReLU

Figure 8

The derivative of ReLU is:

  1. $1$ if $x > 0$.
  2. $0$ if $x < 0$.

The derivative doesn't exist at $x = 0$. However, for convenience, we define the derivative to be $0$ at $x = 0$.

Code

import torch

def relu(x):
    relu.__name__ = "relu(x)"
    return torch.maximum(x, torch.tensor(0.0))

def relu_derivative(x):
    '''
    The first-order derivative of `relu(x)` with respect to x.
    '''
    relu_derivative.__name__ = "first-order derivative of `relu(x)`"
    return torch.where(x > 0, torch.tensor(1.0), torch.tensor(0.0))
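
A quick check of the $x = 0$ convention (a sketch; to my knowledge PyTorch's built-in ReLU also uses a zero gradient at exactly $x = 0$):

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)                                           # tensor([0., 0., 1.]): gradient at x = 0 is taken as 0
print(relu_derivative(torch.tensor([-1.0, 0.0, 2.0])))  # tensor([0., 0., 1.]): same convention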

SiLU

The SiLU (Sigmoid Linear Unit) function, also known as the Swish activation function, is defined as: $$\mathrm{silu}(x) = x \cdot \sigma(x)$$ where $\sigma(x)$ is the sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$

Derivative of SiLU

$$\frac{d}{dx}\mathrm{silu}(x) = \sigma(x)\left(1 + x(1 - \sigma(x))\right)$$
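
Code

The original notes do not include a code block for SiLU; the following is a minimal sketch in the same style as the other sections, checked against torch.nn.functional.silu.

import torch
import torch.nn.functional as F

def silu(x):
    silu.__name__ = "silu(x)"
    return x * torch.sigmoid(x)

def silu_derivative(x):
    '''
    The first-order derivative of `silu(x)` with respect to x.
    '''
    silu_derivative.__name__ = "first-order derivative of `silu(x)`"
    s = torch.sigmoid(x)
    return s * (1 + x * (1 - s))

# Quick check against PyTorch's built-in SiLU
x = torch.linspace(-3, 3, steps=7)
print(torch.allclose(silu(x), F.silu(x)))   # True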

Parametric ReLU

Figure 9

$$f(x) = \max(\alpha x, x)$$

where α is a small constant (typically around 0.01).

When α=0.01, it's called "leaky ReLU".

Advantages:

  1. Does not saturate
  2. Computationally efficient
  3. Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
  4. Will not "die"

Derivative of Parametric ReLU

Figure 10

The derivative of the Leaky ReLU function with respect to $x$ is: $$\frac{d}{dx}f(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{if } x \le 0 \end{cases}$$

Code

import torch

def parametric_relu(x, alpha=0.01):
    '''
    When alpha = 0.01, it becomes a leaky ReLU.
    '''
    parametric_relu.__name__ = f"parametric_relu(x) with alpha={alpha}"
    return torch.maximum(x, alpha * x)

def parametric_relu_derivative(x, alpha=0.01):
    parametric_relu_derivative.__name__ = f"first-order derivative of `parametric_relu(x)` with alpha={alpha}"
    return torch.where(x > 0, torch.tensor(1.0), torch.tensor(alpha))

Softplus

Figure 11

$$f(x) = \log(1 + e^x)$$

Note: the base of log here is e.

Derivative of softplus

Figure 12

Outer function derivative ($\log_e$): $\frac{d}{du}\log(u) = \frac{1}{u}$. Here, $u = 1 + e^x$.

Inner function derivative ($1 + e^x$): $\frac{d}{dx}(1 + e^x) = e^x$

Applying the chain rule: $$\frac{d}{dx}\log(1 + e^x) = \frac{1}{1 + e^x} \cdot e^x = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}} = \sigma(x)$$ That is, the derivative of softplus is exactly the sigmoid function.

Code

import torch

def softplus(x):
    softplus.__name__ = "softplus(x)"
    return torch.log(1 + torch.exp(x))

def softplus_derivative(x):
    '''
    The first-order derivative of `softplus(x)` with respect to x.
    '''
    softplus_derivative.__name__ = "first-order derivative of `softplus(x)`"
    return 1 / (1 + torch.exp(-x))
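
As noted in the derivation above, the derivative of softplus is just the sigmoid; a quick check (a sketch, using torch.nn.functional.softplus as the reference implementation):

import torch.nn.functional as F

x = torch.linspace(-3, 3, steps=7)
print(torch.allclose(softplus(x), F.softplus(x)))                # True
print(torch.allclose(softplus_derivative(x), torch.sigmoid(x)))  # True: d/dx softplus(x) = sigmoid(x)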

Softmax

Image from Thomas's article

Figure 13

The softmax function for a vector $x = [x_1, x_2, \ldots, x_N]^T$ is $f: \mathbb{R}^{N \times 1} \to \mathbb{R}^{N \times 1}$: $$f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$

The output is an $N$-dimensional vector whose elements are all positive and sum up to 1.
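
For example, with a toy three-element vector (my own illustration, not from the original notes):

x = torch.tensor([1.0, 2.0, 3.0])
p = torch.exp(x) / torch.exp(x).sum()
print(p)         # roughly [0.0900, 0.2447, 0.6652]: every element is positive
print(p.sum())   # 1.0 (up to floating point)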

Derivative of softmax

Recall that for a function $f: \mathbb{R}^{N \times 1} \to \mathbb{R}^{M \times 1}$, the derivative of $f$ at a point $x$, also called the Jacobian matrix, is the $M \times N$ matrix of partial derivatives.

The Jacobian matrix $J$ is defined as $J_{ik} = \frac{\partial f(x_i)}{\partial x_k}$.

Consider the softmax case, $f: \mathbb{R}^{N \times 1} \to \mathbb{R}^{N \times 1}$, $f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$. There are two cases for the derivative:

  1. When $i = k$, $J_{ik} = f(x_i)(1 - f(x_i))$
  2. When $i \ne k$, $J_{ik} = -f(x_i)f(x_k)$

Thus, we obtain $$J = \begin{bmatrix} f(x_1)(1 - f(x_1)) & -f(x_1)f(x_2) & \cdots & -f(x_1)f(x_N) \\ -f(x_2)f(x_1) & f(x_2)(1 - f(x_2)) & \cdots & -f(x_2)f(x_N) \\ \vdots & \vdots & \ddots & \vdots \\ -f(x_N)f(x_1) & -f(x_N)f(x_2) & \cdots & f(x_N)(1 - f(x_N)) \end{bmatrix}$$

Proof:

Case 1: $i = k$ $$\frac{\partial f(x_i)}{\partial x_i} = \frac{\partial}{\partial x_i}\left(\frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}\right)$$

Using the quotient rule $\frac{\partial}{\partial x}\left(\frac{u}{v}\right) = \frac{u'v - uv'}{v^2}$ where $u = e^{x_i}$ and $v = \sum_{j=1}^{N} e^{x_j}$: $$= \frac{e^{x_i}\sum_{j=1}^{N} e^{x_j} - e^{x_i}e^{x_i}}{\left(\sum_{j=1}^{N} e^{x_j}\right)^2} = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}\left(1 - \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}\right) = f(x_i)(1 - f(x_i))$$

Case 2: $i \ne k$ $$\frac{\partial f(x_i)}{\partial x_k} = \frac{\partial}{\partial x_k}\left(\frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}\right)$$ Using the quotient rule again, but now the numerator does not depend on $x_k$: $$= \frac{-e^{x_i}e^{x_k}}{\left(\sum_{j=1}^{N} e^{x_j}\right)^2} = -\frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}} \cdot \frac{e^{x_k}}{\sum_{j=1}^{N} e^{x_j}} = -f(x_i)f(x_k)$$

Code

import torch

def softmax(x):
    softmax.__name__ = "softmax(x)"
    exp_x = torch.exp(x - torch.max(x))  # subtract the max for numerical stability
    return exp_x / torch.sum(exp_x, dim=0)

def softmax_derivative(x):
    '''
    The first-order derivative of `softmax(x)` with respect to x.
    '''
    softmax_derivative.__name__ = "first-order derivative of `softmax(x)`"

    s = softmax(x)
    matrix1 = torch.diag_embed(s)  # diagonal matrix built from the 1-dimensional tensor s: if s has length N, this is an N x N matrix with the elements of s on the diagonal and zeros elsewhere
    matrix2 = s.unsqueeze(-1) * s.unsqueeze(-2)  # outer product: an N x N matrix whose element (i, j) is s[i] * s[j]
    return matrix1 - matrix2

Note that in softmax_derivative, the Jacobian matrix is computed as matrix1 - matrix2, where $$\text{matrix1} = \begin{bmatrix} f(x_1) & 0 & \cdots & 0 \\ 0 & f(x_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f(x_N) \end{bmatrix}$$

and $$\text{matrix2} = \begin{bmatrix} f(x_1)f(x_1) & f(x_1)f(x_2) & \cdots & f(x_1)f(x_N) \\ f(x_2)f(x_1) & f(x_2)f(x_2) & \cdots & f(x_2)f(x_N) \\ \vdots & \vdots & \ddots & \vdots \\ f(x_N)f(x_1) & f(x_N)f(x_2) & \cdots & f(x_N)f(x_N) \end{bmatrix}$$
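
Finally, a sanity check of softmax_derivative against autograd (a sketch, not from the original notes; torch.autograd.functional.jacobian computes the same Jacobian numerically from the softmax defined above):

x = torch.tensor([1.0, 2.0, 3.0])
J_analytic = softmax_derivative(x)                            # diag(s) - s s^T
J_autograd = torch.autograd.functional.jacobian(softmax, x)
print(torch.allclose(J_analytic, J_autograd))                 # True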