There are two common types of training algorithms for the weights of a neuron.
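The first minimizes the classification error between the predicted label y_hat and the true label y, where y_hat = boolean(activation >= threshold). This type of perceptron-based learning
- works best for linearly separable data and
- is guaranteed to converge in a finite number of iterations in that case (it may never settle if the data are not separable).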
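
The perceptron rule described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `train_perceptron`, the bias term `b`, the learning rate `lr`, the `max_epochs` cap, and the toy AND dataset are all assumptions added for the example.

```python
import numpy as np

def train_perceptron(X, y, threshold=0.0, lr=1.0, max_epochs=100):
    """Perceptron rule: weights change only when a sample is misclassified."""
    w = np.zeros(X.shape[1])
    b = 0.0  # bias term (an assumption; the notes only mention weights)
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            activation = np.dot(w, x_i) + b
            y_hat = int(activation >= threshold)  # thresholded prediction
            error = y_i - y_hat                   # one of -1, 0, +1
            if error != 0:
                w += lr * error * x_i
                b += lr * error
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: training has converged
            break
    return w, b

# Toy linearly separable data: logical AND (hypothetical example data)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
predictions = (X @ w + b >= 0.0).astype(int)
print(w, b, predictions)
```

On separable data such as this, the loop exits as soon as one full pass makes no mistakes. On non-separable data it would simply run until `max_epochs`.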
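The second type is the gradient descent algorithm, which minimizes the error between the activation function value and y. It is similar in nature to least squares regression. It is more robust than the perceptron rule because the error is a smooth function of the weights, so calculus gives the direction in which each weight should move to reduce it. To get that direction we differentiate the error through the activation function, which therefore has to be differentiable with respect to the weights. The perceptron's thresholded error, on the other hand, is not usefully differentiable w.r.t. the weights: the step function has zero slope everywhere except at the threshold, where it is not differentiable at all. This is why we use a function like the sigmoid, sigmoid(z) = 1 / (1 + exp(-z)), which looks like a smoothed step function, is continuously differentiable, and has a simple derivative: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).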
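
A minimal sketch of gradient descent for a single sigmoid neuron is given below. It assumes a mean squared error loss, batch updates, a bias term `b`, and the same toy AND data. The names `train_sigmoid_neuron`, `lr`, and `epochs` are hypothetical choices for illustration. The first lines also check the sigmoid derivative numerically against sigmoid(z) * (1 - sigmoid(z)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Numerical check of the derivative: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
z0, h = 0.3, 1e-5
numeric_grad = (sigmoid(z0 + h) - sigmoid(z0 - h)) / (2 * h)
analytic_grad = sigmoid(z0) * (1.0 - sigmoid(z0))

def train_sigmoid_neuron(X, y, lr=0.5, epochs=5000):
    """Batch gradient descent on 0.5 * mean((a - y)^2), with a = sigmoid(X @ w + b)."""
    w = np.zeros(X.shape[1])
    b = 0.0  # bias term (an assumption; the notes only mention weights)
    losses = []
    for _ in range(epochs):
        a = sigmoid(X @ w + b)            # smooth activation, no thresholding
        err = a - y
        losses.append(0.5 * np.mean(err ** 2))
        # Chain rule: dL/dz = (a - y) * sigmoid'(z) = (a - y) * a * (1 - a)
        delta = err * a * (1.0 - a)
        w -= lr * (X.T @ delta) / len(y)  # step against the gradient
        b -= lr * delta.mean()
    return w, b, losses

# Toy data: logical AND (hypothetical example data)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])

w, b, losses = train_sigmoid_neuron(X, y)
print(losses[0], losses[-1])
```

Unlike the perceptron rule, every sample contributes to every update here, in proportion to how far the activation is from its target.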