Activation functions for Artificial Neural Networks (ANN)
An Artificial Neural Network (ANN) is a machine learning model loosely based on the human brain. Like the brain, a neural network consists of a large number of connected processing units, which mimic neurons and are driven by activation functions.
An ANN consists of three kinds of layers: the input layer, one or more hidden layers, and the output layer. Each layer is made up of multiple interconnected nodes. The number of nodes in a hidden layer is our choice (more nodes add capacity, but also increase the training cost and the risk of overfitting). For the input layer it depends on the dimensions (features) of x(i), and for the output layer of a classifier it is the number of classes. Each connection has a weight associated with it that represents the strength of the connection between the two nodes.
When the neural network receives an input, each input value is multiplied by the weight of its connection and passed on to the next node. For each node in the second layer a weighted sum (here, x1.w1 + x2.w2) is computed, a bias term (b1) is added to it, and the resulting sum is passed through the activation function, which then performs the transformation.
y1 = f (x1.w1 + x2.w2 + b1)
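The formula above can be sketched in a few lines of Python. This is only an illustration: the input, weight and bias values are made up, and the sigmoid is chosen arbitrarily as the example f.

```python
import math

def sigmoid(z):
    """Logistic function, used here only as an example choice of f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Compute f(x1.w1 + x2.w2 + ... + b) for a single node."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)

# Hypothetical values for x1, x2, w1, w2 and b1 (not from the article).
x = [0.5, -1.0]
w = [0.8, 0.2]
b1 = 0.1

y1 = neuron_output(x, w, b1, sigmoid)
print(y1)  # sigmoid of the weighted sum 0.5*0.8 + (-1.0)*0.2 + 0.1
```

Any other activation function can be passed in place of sigmoid without changing neuron_output.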
The ultimate goal of an activation function is to convert the input and generate the output, which will be used as the input for the next layer. It decides whether the neuron should be activated or not.
Why use an activation function?
Including an activation function increases the complexity and it is introduced as an additional step at each neuron of the neural network. So can we do without an activation function?
Well, if we had to remove the activation function, then the output of the neural network would simply be a linear function that is unable to learn the complex patterns of the data. And if we had only linear layers in a neural network, all the layers would essentially collapse into a single linear layer and thus a "deep" neural network architecture would cease to exist and it would just be a linear classifier!
Without activation function,
y = W1.W2.W3.x = W.x
where W(i) represents the weight-bias matrix of layer i and W = W1.W2.W3 is a single matrix.
Including activation functions introduces non-linearity, which makes the network capable of finding complex patterns and prevents the layers from collapsing.
With activation function,
y = f1 (W1.f2 (W2.f3 (W3.x)))
where f(i) represents the activation function of layer i.
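The collapse of purely linear layers can be checked numerically. The sketch below uses small hypothetical 2x2 weight matrices (biases omitted for brevity) and ReLU as an example non-linearity; none of these values come from the article.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def relu(v):
    """Element-wise max(0, x), used as the example activation."""
    return [max(0.0, t) for t in v]

# Hypothetical weights; layer 3 is applied to the input first.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 0.0], [-1.0, 1.0]]
W3 = [[1.0, 1.0], [0.0, -1.0]]
x = [1.0, 2.0]

# Three linear layers applied one after another.
layered = matvec(W1, matvec(W2, matvec(W3, x)))

# The same three layers collapsed into one matrix W = W1.W2.W3.
W = matmul(W1, matmul(W2, W3))
collapsed = matvec(W, x)

# With an activation after every layer the result is different.
nonlinear = relu(matvec(W1, relu(matvec(W2, relu(matvec(W3, x))))))

print(layered, collapsed, nonlinear)
```

Here layered and collapsed are identical, so the three linear layers are no more expressive than one, while the version with activations cannot be written as a single matrix product.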
Choosing an activation function
The choice of activation function is critical as it has a large impact on the performance and capability of the model. Typically, all the hidden layers use the same activation function. The output layer, however, often uses a different activation function from the hidden layers, and the choice depends on the type of prediction the model makes.
1. Sigmoid (or Logistic) Activation Function
It's one of the most widely used non-linear activation functions. The curve is S-shaped and its values lie in the range (0,1). It is denoted by
f(z) = 1/(1+e^(-z))
where z = W.x + b is the weighted sum of the inputs plus the bias
This is mostly used in the output layer of models for binary and multi-label classification, where we have to predict a probability. The function is continuously differentiable, but its gradient is small (at most 0.25, and close to 0 for large positive or negative inputs), which causes the vanishing gradient problem in deep networks. The output isn't zero-centred.
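A minimal sketch of the sigmoid and its derivative makes the gradient issue concrete. It uses the standard identity f'(z) = f(z).(1 - f(z)); the sample inputs and the layer count of 10 are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    """Naive logistic function; math.exp overflows for very negative z."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: f(z) * (1 - f(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient peaks at z = 0 and shrinks quickly as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid(z), sigmoid_grad(z))

# Backpropagation multiplies one such factor per layer, so across
# n sigmoid layers the product of these factors is at most 0.25 ** n.
n_layers = 10  # arbitrary depth chosen for illustration
bound = 0.25 ** n_layers
print(bound)
```

Even in the best case the bound falls below one millionth after ten layers, which is why early layers of deep sigmoid networks learn slowly.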
2. Binary Step Function
This activation function is a threshold-based classifier. The neuron is activated only when the input is greater than or equal to 0.
f(x) = 0 if x<0
f(x) = 1 if x>=0
The binary step function can be used as an activation function when creating a binary classifier, but not for a multiclass classifier. Its derivative is zero everywhere (except at 0, where it is undefined), which makes gradient-based optimisation impossible.
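The effect of the zero derivative can be seen in a small sketch. The weight, input, learning rate and upstream gradient below are hypothetical values, and raising an error at 0 is just one way to represent the undefined derivative.

```python
def binary_step(x):
    """Return 1 when x >= 0, otherwise 0."""
    return 1 if x >= 0 else 0

def binary_step_grad(x):
    """Derivative of the step: 0 everywhere except at 0, where it is undefined."""
    if x == 0:
        raise ValueError("derivative is undefined at x = 0")
    return 0.0

outputs = [binary_step(v) for v in (-2.0, 0.0, 3.5)]
print(outputs)

# One gradient-descent update for a single weight (hypothetical values).
w, x_in, lr, upstream = 0.7, 2.0, 0.1, 1.0
w_new = w - lr * upstream * binary_step_grad(w * x_in) * x_in
print(w_new)  # unchanged, because the local gradient is 0
```

Whatever the loss reports upstream, the update is multiplied by zero, so the weight never moves.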
3. Linear (or Identity) Activation Function
The linear activation function returns a value proportional to the weighted sum of the inputs. With a = 1 it is called the identity activation function, because it does not change the weighted sum and returns the value directly.
f(x) = ax
where, a = constant
When non-linearities exist, this activation function alone is insufficient, though it may still be employed as the activation function on the final output nodes for regression tasks.
4. Tanh (or Hyperbolic Tangent) Activation function
Similar to sigmoid, tanh is also S-shaped, but it is shifted and rescaled so that it ranges from -1 to +1.
f(x)= (e^x - e^(-x)) / (e^x + e^(-x))
Since it's entirely differentiable, centred at zero and anti-symmetric, it is often favoured over the sigmoid function in classification tasks and RNNs. To mitigate slow learning and/or vanishing gradients, flatter variations on this function (log-log, softsign, symmetrical sigmoid, etc.) can be employed.
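The relationship between tanh and sigmoid can be verified with a short sketch. It relies on two standard identities, tanh(x) = 2.sigmoid(2x) - 1 and tanh'(x) = 1 - tanh(x)^2; the sample input 0.7 is arbitrary.

```python
import math

def sigmoid(x):
    """Logistic function, for comparison with tanh."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent written out from its exponential definition."""
    e_pos, e_neg = math.exp(x), math.exp(-x)
    return (e_pos - e_neg) / (e_pos + e_neg)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - tanh(x) ** 2

x = 0.7  # arbitrary sample input
print(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)  # the two values agree
print(tanh(-x), -tanh(x))                     # anti-symmetry around zero
print(tanh_grad(0.0), tanh_grad(5.0))         # gradient still shrinks for large |x|
```

The maximum gradient of tanh is 1 at x = 0, compared with 0.25 for the sigmoid, but it still approaches 0 for large inputs.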
5. ReLU (Rectified Linear Unit) Activation function
ReLU is the most popular function used in the hidden layers of deep learning models. It mitigates the vanishing gradient problem of sigmoid and tanh because its gradient is 1 for all positive inputs, and it introduces sparsity into the model (it doesn't activate all the neurons at the same time). The range is 0 to infinity.
f(x) = max(0, x)
As the derivative is 0 for non-positive inputs, ReLU may suffer from slow learning or even dead neurons: a neuron whose weighted sum is negative for every input receives a zero-valued gradient and cannot update its weights, which leaves it silent for the remainder of the training phase.
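A dead neuron can be reproduced with a tiny sketch. The single weight, bias, inputs, learning rate and upstream gradient are hypothetical values chosen so that the weighted sum is always negative; taking the gradient at exactly 0 to be 0 is a common convention, assumed here.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; the value at x == 0 is taken as 0 by convention."""
    return 1.0 if x > 0 else 0.0

# Hypothetical neuron with one weight and one bias.
w, b = -1.0, -0.5
inputs = [0.2, 1.0, 3.0]  # all positive, so w * x + b is always negative
lr = 0.1
upstream = 1.0            # stand-in for the gradient arriving from the loss

for x in inputs:
    z = w * x + b
    local = relu_grad(z)  # 0.0 for every input here
    w -= lr * upstream * local * x
    b -= lr * upstream * local

print(w, b)  # still -1.0 and -0.5: the neuron never updates
```

Because every update is multiplied by a zero gradient, the neuron stays at its initial parameters no matter how many inputs it sees.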
6. Leaky ReLU Activation Function
Leaky ReLU is a widely used improved version of ReLU which attempts to solve the dying neurons problem. It has a very small slope for negative values (as opposed to 0 in ReLU) so that there are no dead neurons in that region.
f(x) = 0.01(x) if x < 0
f(x) = x if x >= 0
Another variant of ReLU is the Parametric Rectified Linear Unit (PReLU), where the slope of the line for negative inputs is learned during model training (as opposed to the fixed slope of 0.01 in Leaky ReLU):
f(x) = a(x) if x < 0
f(x) = x if x >= 0
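Both variants can be written as one function with a slope parameter, as in the sketch below. The function names and the helper prelu_grad_wrt_a are illustrative, not from any particular library; the helper shows the quantity a PReLU layer would use to update its slope a.

```python
def leaky_relu(x, slope=0.01):
    """Leaky ReLU; with a learned slope this is the PReLU forward pass."""
    return x if x >= 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    """Gradient with respect to the input: never exactly 0 for negative x."""
    return 1.0 if x >= 0 else slope

def prelu_grad_wrt_a(x):
    """Gradient of PReLU with respect to its slope a: x when x < 0, else 0."""
    return x if x < 0 else 0.0

print(leaky_relu(-2.0), leaky_relu(3.0))
print(leaky_relu_grad(-2.0))         # small but non-zero
print(leaky_relu(-2.0, slope=0.25))  # PReLU with a hypothetical learned a = 0.25
print(prelu_grad_wrt_a(-2.0))
```

Since the gradient on the negative side is the slope rather than 0, a neuron with negative weighted sums can still update its weights.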
Tips:
Start with ReLU as the activation function for hidden layers and move to other functions such as Leaky ReLU, Parametric ReLU or Randomized ReLU in case of dead neurons.
Generally, avoid Sigmoid and Tanh in hidden layers due to the vanishing gradients problem.
For Multilayer Perceptrons (MLP) and Convolutional Neural Networks (CNN) use ReLU.
Recurrent Neural Networks like LSTM commonly use the Sigmoid activation for the gates and the Tanh activation for the cell state and output.
Activation functions in the output layer include Linear, Sigmoid and Softmax.
For Regression problems use linear activation function in the output layer.
For Binary and Multilabel classification use sigmoid activation function in the output layer and for Multiclass classification use softmax activation function.
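The three output-layer choices in the tips can be compared on the same raw scores. The article does not define softmax, so the sketch below uses its standard definition (exponentiate, then normalise to sum to 1); the scores are made-up values, and subtracting the maximum is a common trick for numerical stability.

```python
import math

def sigmoid(z):
    """Logistic function, applied to each output independently."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Standard softmax: exp of each score divided by the sum of exps."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs of the last layer

regression = scores                       # linear: values passed through as-is
multilabel = [sigmoid(s) for s in scores] # each class judged on its own
multiclass = softmax(scores)              # classes compete, probabilities sum to 1

print(regression)
print(multilabel, sum(multilabel))
print(multiclass, sum(multiclass))
```

The sigmoid outputs need not sum to 1, which suits multilabel problems where several classes can be true at once, while softmax forces a single distribution over the classes.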











