Multinomial Logistic Regression · Interactive · ML SS2026
In softmax regression (multinomial logistic regression), the probability that a data point belongs to each class is computed with the softmax function, rather than with the sigmoid function used in binary logistic regression.
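The softmax function takes as input a vector Z of K real numbers (Z = (z_1, z_2, …, z_K)) and returns K probabilities a_1, …, a_K:
$$a_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K$$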
All K linear scores (z1, …, zK) are computed in parallel and feed into a single SOFTMAX block that produces probabilities (a1, …, aK). ARGMAX picks the class with the highest probability:
All K classes share the same input layer; SOFTMAX normalizes the scores so ak ∈ [0, 1] and Σ ak = 1. Then ŷ = argmax ak = argmax zk.
The diagram below is the same block diagram as above, concretized to M = 2 features and K = 3 classes. Move the sliders and watch every quantity update on the graph itself — input values x, parameters bk,j (with bk,0 the bias and bk,1, bk,2 the feature weights), the linear scores zk, the softmax outputs ak (drawn as bars right at the SOFTMAX output), and the predicted class ŷ = argmax ak.
Each row k holds the bias bk,0 and the two feature weights bk,1, bk,2:
| | bk,0 | bk,1 (·x₁) | bk,2 (·x₂) |
| Class 1 | | | |
| Class 2 | | | |
| Class 3 | | | |
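The block diagram for M = 2 and K = 3 can be sketched in a few lines of Python. This is only an illustration: the parameter values in `B` and the input `x` are made up, and the function names `softmax` and `forward` are not taken from the page.

```python
import math

def softmax(z):
    """Turn a list of K real scores into K probabilities that sum to 1."""
    shift = max(z)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(B, x):
    """B[k] = [b_k0, b_k1, ..., b_kM] (bias first); x = [x_1, ..., x_M]."""
    z = [row[0] + sum(w * xi for w, xi in zip(row[1:], x)) for row in B]
    a = softmax(z)
    y_hat = max(range(len(a)), key=lambda k: a[k]) + 1  # classes are numbered from 1
    return z, a, y_hat

# Hypothetical parameter and input values, chosen only for illustration.
B = [
    [0.5, 1.0, -1.0],   # class 1: bias, weight for x1, weight for x2
    [0.0, -0.5, 0.5],   # class 2
    [-0.5, 0.2, 0.3],   # class 3
]
x = [1.0, 2.0]

z, a, y_hat = forward(B, x)
print("z =", [round(v, 3) for v in z])
print("a =", [round(p, 3) for p in a])
print("predicted class:", y_hat)
```

With these values the linear scores are roughly (-0.5, 0.5, 0.3), so class 2 receives the highest probability.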
Enter raw logit values (z) and see how the softmax function turns them into a probability distribution. The predicted class is ŷ = argmax ak.
Properties: all probabilities a_k ∈ [0, 1], and they sum to 1: Σ_{j=1}^{K} a_j = 1. Even when the input z values are negative or larger than 1, softmax produces a valid probability distribution.
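These properties are easy to check numerically. The sketch below is a minimal pure-Python softmax applied to a few made-up logit vectors, including negative and very large values. Subtracting the maximum before exponentiating is a common stability trick and an assumption of this sketch, not something the page prescribes.

```python
import math

def softmax(z):
    """Softmax of a list of real numbers, computed in a numerically stable way."""
    shift = max(z)  # exp(z_k - shift) cannot overflow; the ratios stay the same
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logit vectors: ordinary, all negative, and very large values.
examples = [
    [2.0, 1.0, 0.1],
    [-3.0, -1.0, -2.0],
    [1000.0, 999.0, 998.0],
]

for z in examples:
    a = softmax(z)
    print(z, "->", [round(p, 4) for p in a], "sum =", round(sum(a), 6))
```

In every case each output lies in [0, 1] and the outputs sum to 1, up to floating-point rounding.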
Softmax(z) can also be represented as:
The predicted class is the one with the highest softmax probability:
$$\hat{y} = \arg\max_{k} a_k$$
Because the softmax operation preserves the ordering of its arguments, we do not need to compute the softmax to determine which class has the highest probability:
$$\hat{y} = \arg\max_{k} a_k = \arg\max_{k} z_k$$
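A quick way to convince yourself of this is to compare the two argmax results directly. The scores below are made up, and `argmax` is a small helper written for this sketch.

```python
import math

def argmax(values):
    """Index of the largest element (first one in case of ties)."""
    return max(range(len(values)), key=lambda k: values[k])

z = [-1.2, 3.4, 0.7]  # hypothetical linear scores

# Plain softmax; fine here because the scores are small.
exps = [math.exp(v) for v in z]
total = sum(exps)
a = [e / total for e in exps]

same_class = argmax(z) == argmax(a)
print("argmax over z:", argmax(z), "| argmax over a:", argmax(a), "| equal:", same_class)
```

Since the exponential is strictly increasing and all entries are divided by the same positive sum, the largest score always maps to the largest probability.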
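Element k of the score vector Z(i) for training example i is a linear function of that example's M features, with bias b_{k,0} and feature weights b_{k,m}:
$$z_k^{(i)} = b_{k,0} + \sum_{m=1}^{M} b_{k,m}\, x_m^{(i)}$$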
$$\mathrm{Softmax}\big(Z^{(i)}\big) = \begin{bmatrix} a_1^{(i)} & a_2^{(i)} & \dots & a_K^{(i)} \end{bmatrix}$$
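The probability that the model assigns to the observed class of training example i (one of 1, 2, …, K) can be expressed as:
$$P\big(y^{(i)}\big) = \prod_{k=1}^{K} \big(a_k^{(i)}\big)^{y_k^{(i)}}$$
where y_k^(i) = 1 if example i belongs to class k and 0 otherwise (one-hot encoding).
The likelihood is the overall probability of the training set, obtained as the product of the probabilities of all n training examples:
$$L = \prod_{i=1}^{n} \prod_{k=1}^{K} \big(a_k^{(i)}\big)^{y_k^{(i)}}$$
Optimization by maximizing ln(L) or, equivalently, minimizing −ln(L):
$$-\ln(L) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)} \ln a_k^{(i)}$$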
This loss function is called multi-category cross entropy.
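The link between the likelihood and the cross-entropy can be illustrated with a small numeric sketch. The softmax outputs `A` and one-hot labels `Y` are made up for three training examples, and the function name `cross_entropy` is illustrative.

```python
import math

def cross_entropy(A, Y):
    """Negative log-likelihood: -sum_i sum_k y_k^(i) * ln a_k^(i)."""
    return -sum(
        y_k * math.log(a_k)
        for a, y in zip(A, Y)
        for a_k, y_k in zip(a, y)
        if y_k  # only the true class (y_k = 1) contributes to the sum
    )

# Made-up softmax outputs for n = 3 examples and K = 3 classes.
A = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
# One-hot labels: example 1 is class 1, example 2 is class 2, example 3 is class 3.
Y = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

likelihood = 0.7 * 0.8 * 0.4  # product of the probabilities of the true classes
loss = cross_entropy(A, Y)
print("likelihood:", round(likelihood, 4))
print("-ln(likelihood):", round(-math.log(likelihood), 4))
print("cross-entropy:", round(loss, 4))
```

The last two printed values agree, which is the statement that minimizing the cross-entropy is the same as maximizing the likelihood.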
The model's parameter values b_{k,m} can be estimated with gradient descent, using partial derivatives and the multivariable chain rule. As an example, for b_{1,m} and one training example:
$$\frac{\partial(-\ln L)}{\partial b_{1,m}} = -(y_1 - a_1)\, x_m$$
where (y_1 − a_1) is the error for class 1 and x_m is the feature value (with x_0 = 1 for the bias b_{1,0}).
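As a sanity check on this derivative, the sketch below compares the analytic gradient (a_k − y_k) · x_m of the single-example cross-entropy with a central finite difference. All parameter and input values are made up, and the helper names are illustrative.

```python
import math

def softmax(z):
    shift = max(z)  # numerical stability; does not change the result
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def scores(B, x):
    """z_k = b_k0 + sum_m b_km * x_m for every class k."""
    return [row[0] + sum(w * xi for w, xi in zip(row[1:], x)) for row in B]

def loss(B, x, y):
    """Cross-entropy -sum_k y_k ln a_k for one example; y is one-hot."""
    a = softmax(scores(B, x))
    return -sum(yk * math.log(ak) for yk, ak in zip(y, a))

def gradient(B, x, y):
    """Analytic gradient (a_k - y_k) * x_m, with x_0 = 1 for the bias column."""
    a = softmax(scores(B, x))
    x_ext = [1.0] + list(x)
    return [[(ak - yk) * xm for xm in x_ext] for ak, yk in zip(a, y)]

# Hypothetical values, for illustration only.
B = [[0.1, 0.2, -0.3], [0.0, -0.1, 0.4], [-0.2, 0.3, 0.1]]
x = [1.5, -0.5]
y = [1, 0, 0]  # true class is class 1

grad = gradient(B, x, y)

# Central finite difference for b_{1,1} (row 0, column 1 in zero-based indexing).
eps = 1e-6
B_plus = [row[:] for row in B]
B_minus = [row[:] for row in B]
B_plus[0][1] += eps
B_minus[0][1] -= eps
numeric = (loss(B_plus, x, y) - loss(B_minus, x, y)) / (2 * eps)

print("analytic:", grad[0][1])
print("numeric: ", numeric)
```

The two numbers should agree to several decimal places. Each column of `grad` also sums to zero across classes, because the probabilities and the one-hot label both sum to 1.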
Adjust the predicted logits (z) and pick the true class to see how the cross-entropy loss responds. Notice how confidently wrong predictions are penalized far more than uncertain ones.
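The same experiment can be run offline. The sketch below evaluates the cross-entropy for three made-up logit vectors with class 1 as the true class. The scenario names and values are assumptions chosen for illustration.

```python
import math

def softmax(z):
    shift = max(z)  # numerical stability; does not change the result
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, true_class):
    """Loss for one example: minus the log-probability of the true class."""
    return -math.log(softmax(z)[true_class])

true_class = 0  # zero-based index of class 1

# Hypothetical logits for three situations.
cases = {
    "confident and right": [5.0, 0.0, 0.0],
    "uncertain": [0.0, 0.0, 0.0],
    "confident and wrong": [0.0, 5.0, 0.0],
}

losses = {name: cross_entropy(z, true_class) for name, z in cases.items()}
for name, value in losses.items():
    print(f"{name:>20}: loss = {value:.3f}")
```

The uniform prediction costs ln 3, while the confidently wrong prediction costs several times more, since the loss grows without bound as the probability of the true class approaches 0.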
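In the gradient descent update bk,j ← bk,j − η · ∂L/∂bk,j (where L now denotes the cross-entropy loss, not the likelihood), η is called the learning rate or the learning step size.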
The learning rate and the number of epochs can be considered as hyperparameters of this method.
The diagram below combines the block diagram, the softmax, and the cross-entropy loss into one full training step for a single example, with M = 2 features and K = 3 classes. Move the sliders, pick a true class y, then press Take gradient step to apply the update bk,j ← bk,j − η · ∂L/∂bk,j.
y is the one-hot encoded label of the true class.
Editable parameters on the left, computed gradients on the right (read-only):
| | B (parameters) | | | ∂L/∂B (gradients) | | |
| | bk,0 | bk,1 | bk,2 | ∂L/∂bk,0 | ∂L/∂bk,1 | ∂L/∂bk,2 |
| Class 1 | | | | -0.26 | -0.40 | 0.13 |
| Class 2 | | | | 0.10 | 0.15 | -0.05 |
| Class 3 | | | | 0.16 | 0.25 | -0.08 |
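A single training step of this kind might look as follows in code. This is a sketch under assumptions: the parameters start at zero, and the input, label, and learning rate are made up, so the numbers will not match the table above.

```python
import math

def softmax(z):
    shift = max(z)  # numerical stability; does not change the result
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def probabilities(B, x):
    """Forward pass: linear scores followed by softmax."""
    z = [row[0] + sum(w * xi for w, xi in zip(row[1:], x)) for row in B]
    return softmax(z)

def gradient_step(B, x, y, eta):
    """Apply b_kj <- b_kj - eta * (a_k - y_k) * x_j once, with x_0 = 1 for the bias."""
    a = probabilities(B, x)
    x_ext = [1.0] + list(x)
    return [
        [b - eta * (ak - yk) * xj for b, xj in zip(row, x_ext)]
        for row, ak, yk in zip(B, a, y)
    ]

# Hypothetical starting point: all parameters zero, K = 3 classes, M = 2 features.
B = [[0.0, 0.0, 0.0] for _ in range(3)]
x = [1.5, -0.5]
y = [1, 0, 0]   # one-hot label: true class is class 1
eta = 0.1       # learning rate

loss_before = -math.log(probabilities(B, x)[0])
B_new = gradient_step(B, x, y, eta)
loss_after = -math.log(probabilities(B_new, x)[0])

print("loss before:", round(loss_before, 4))
print("loss after: ", round(loss_after, 4))
```

With a small enough learning rate the loss on this example goes down after the step, which is what the interactive demo shows when the button is pressed.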
Watch gradient descent train a 3-class softmax classifier in real time. The decision regions and loss curve update as training progresses.
Decision regions (3 classes)
Cross-entropy loss vs epoch
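For readers without access to the interactive plots, a full training run can be sketched in plain Python. Everything here is an assumption made for illustration: the synthetic three-cluster data, the seed, the learning rate, the number of epochs, and the use of full-batch gradient descent with the gradient averaged over the training set.

```python
import math
import random

def softmax(z):
    shift = max(z)  # numerical stability; does not change the result
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def scores(B, x):
    return [row[0] + sum(w * xi for w, xi in zip(row[1:], x)) for row in B]

def predict(B, x):
    z = scores(B, x)
    return max(range(len(z)), key=z.__getitem__)  # argmax over the scores

def train(X, labels, K, eta=0.1, epochs=200):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    n, M = len(X), len(X[0])
    B = [[0.0] * (M + 1) for _ in range(K)]
    losses = []
    for _ in range(epochs):
        grad = [[0.0] * (M + 1) for _ in range(K)]
        total = 0.0
        for x, label in zip(X, labels):
            a = softmax(scores(B, x))
            total -= math.log(a[label])
            x_ext = [1.0] + list(x)
            for k in range(K):
                err = a[k] - (1.0 if k == label else 0.0)  # a_k - y_k
                for m in range(M + 1):
                    grad[k][m] += err * x_ext[m] / n
        losses.append(total / n)  # mean loss before this epoch's update
        for k in range(K):
            for m in range(M + 1):
                B[k][m] -= eta * grad[k][m]
    return B, losses

# Synthetic data: three Gaussian clusters, one per class (zero-based labels).
rng = random.Random(0)
centers = [(-2.0, -2.0), (2.0, -2.0), (0.0, 2.0)]
X, labels = [], []
for label, (cx, cy) in enumerate(centers):
    for _ in range(20):
        X.append([rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)])
        labels.append(label)

B, losses = train(X, labels, K=3)
accuracy = sum(predict(B, x) == t for x, t in zip(X, labels)) / len(X)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}, training accuracy: {accuracy:.2f}")
```

The list `losses` plays the role of the loss curve, and calling `predict` on a grid of points would reproduce the decision regions.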