Softmax Regression

Multinomial Logistic Regression · Interactive · ML SS2026

Softmax Regression (Multinomial Logistic Regression)

In softmax regression, the probability that a data point belongs to each of the K classes is computed with the softmax function, which takes the place of the sigmoid function used in binary logistic regression.

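The softmax function takes as input a vector $z$ of K real numbers, $z = (z_1, z_2, \dots, z_K)$:

\[
\mathrm{softmax}(z) = \left[ \frac{e^{z_1}}{\sum_{j=1}^{K} e^{z_j}} \quad \frac{e^{z_2}}{\sum_{j=1}^{K} e^{z_j}} \quad \dots \quad \frac{e^{z_K}}{\sum_{j=1}^{K} e^{z_j}} \right] = \left[ a_1 \quad a_2 \quad \dots \quad a_K \right]
\]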
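The definition above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `softmax` and the example scores are assumptions made here, and subtracting the maximum score is a common numerical-stability trick that is not part of the formula itself (it leaves the result unchanged).

```python
import math

def softmax(z):
    """Return the softmax probabilities for a list of real-valued scores z."""
    # Subtracting max(z) scales numerator and denominator by the same factor,
    # so the result is unchanged, but exp() can no longer overflow.
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

a = softmax([2.0, 1.0, 0.1])      # illustrative scores, not from the page
print([round(p, 3) for p in a])   # [0.659, 0.242, 0.099]
print(sum(a))                     # 1.0 up to floating-point rounding
```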

📐 Block Diagram: Softmax Regression

All K linear scores ($z_1, \dots, z_K$) are computed in parallel and feed into a single SOFTMAX block that produces the probabilities ($a_1, \dots, a_K$). ARGMAX picks the class with the highest probability:

[Figure: block diagram. Inputs x₁, x₂, …, xM → K weighted sums (Σ), one per class k, with weights bₖ,₁, …, bₖ,M and bias bₖ,₀ → scores z₁, …, zK → SOFTMAX → a₁, …, aK → ARGMAX → ŷ (predict the class with the highest probability).]

All K classes share the same input layer; SOFTMAX normalizes the scores so that $a_k \in [0, 1]$ and $\sum_k a_k = 1$. Then $\hat{y} = \arg\max_k a_k = \arg\max_k z_k$.

🔬 Live Simulation: Block Diagram with Real Numbers (forward path)

The diagram below is the same block diagram as above, instantiated with M = 2 features and K = 3 classes. Move the sliders and watch every quantity update on the graph itself: the input values x, the parameters $b_{k,j}$ (with $b_{k,0}$ the bias and $b_{k,1}$, $b_{k,2}$ the feature weights), the linear scores $z_k$, the softmax outputs $a_k$ (drawn as bars right at the SOFTMAX output), and the predicted class $\hat{y} = \arg\max_k a_k$.

[Figure: live simulation of the forward path. State captured here:]

Inputs: x₁ = 1.50, x₂ = -0.50

Class 1: b₁,₀ = 0.0, b₁,₁ = 1.0, b₁,₂ = 0.0 → z₁ = 1.50 → a₁ = 0.736 (73.6%)
Class 2: b₂,₀ = 0.0, b₂,₁ = 0.0, b₂,₂ = 1.0 → z₂ = -0.50 → a₂ = 0.100 (10.0%)
Class 3: b₃,₀ = 0.5, b₃,₁ = -0.5, b₃,₂ = -0.5 → z₃ = 0.00 → a₃ = 0.164 (16.4%)

Prediction: ŷ = Class 1

⚙ Weight matrix B (3 classes × 3 params)

Each row k holds the bias $b_{k,0}$ and the two feature weights $b_{k,1}$, $b_{k,2}$ (values as shown in the diagram above):

           bₖ,₀    bₖ,₁ (·x₁)    bₖ,₂ (·x₂)
Class 1     0.0       1.0           0.0
Class 2     0.0       0.0           1.0
Class 3     0.5      -0.5          -0.5
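The forward path of the simulation can be reproduced by hand. The sketch below plugs in the parameter and input values shown in the diagram; the helper names `linear_scores` and `softmax` are illustrative choices made here, not names used by the page.

```python
import math

# Parameter matrix B from the simulation: one row per class,
# columns are [bias b_k0, weight b_k1 (for x1), weight b_k2 (for x2)].
B = [
    [0.0, 1.0, 0.0],     # class 1
    [0.0, 0.0, 1.0],     # class 2
    [0.5, -0.5, -0.5],   # class 3
]
x = [1.50, -0.50]        # inputs x1, x2

def linear_scores(B, x):
    """z_k = b_k0 + sum_m b_km * x_m for every class k."""
    return [row[0] + sum(w * xm for w, xm in zip(row[1:], x)) for row in B]

def softmax(z):
    shift = max(z)       # shift for numerical stability; result is unchanged
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = linear_scores(B, x)
a = softmax(z)
y_hat = a.index(max(a)) + 1       # classes are numbered from 1

print(z)                          # [1.5, -0.5, 0.0]
print([round(p, 3) for p in a])   # [0.736, 0.1, 0.164]
print(y_hat)                      # 1
```

The printed scores and probabilities agree with the values displayed in the diagram (z₁ = 1.50, a₁ = 0.736, and so on).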

🎛 Interactive: Softmax Calculator

Enter raw logit values (z) and see how the softmax function turns them into a probability distribution. The predicted class is $\hat{y} = \arg\max_k a_k$.

Properties: all probabilities satisfy $a_k \in [0, 1]$ and they sum to 1: $\sum_{j=1}^{K} a_j = 1$. Even when the input z values are negative or larger than 1, softmax produces a valid probability distribution.

Softmax Predictions

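The output of softmax(z) can also be read as a vector of class probabilities:

\[
\left[ P(y_1 = 1 \mid z_1) \quad P(y_2 = 1 \mid z_2) \quad \dots \quad P(y_K = 1 \mid z_K) \right]
\]

The predicted class is the one with the highest softmax probability:

\[
\hat{y} = \arg\max_{k} a_k, \qquad k \in \{1, 2, \dots, K\}
\]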

Because the softmax operation preserves the ordering among its arguments, we do not need to compute the softmax to determine which class has been assigned the highest probability:

\[
\hat{y} = \arg\max_{k} a_k = \arg\max_{k} z_k, \qquad k \in \{1, 2, \dots, K\}
\]
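A quick check of this equivalence, as a minimal sketch: the scores below are arbitrary illustrative values, and `argmax` is a small helper defined here (0-based index, first entry wins ties).

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """0-based index of the largest entry; ties go to the first one."""
    return max(range(len(values)), key=lambda k: values[k])

z = [-3.2, 0.4, 7.1, 0.39]   # arbitrary illustrative scores
a = softmax(z)

# exp is strictly increasing and every class shares the same denominator,
# so the ranking of the probabilities matches the ranking of the scores.
print(argmax(z), argmax(a))  # 2 2
```

At prediction time the softmax can therefore be skipped when only the class label is needed; it is still required whenever the probabilities themselves matter, for example in the training loss.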

Softmax Regression — Training

Element k of the score vector $z^{(i)}$ for training example i is:

\[
z_k^{(i)} = b_{k,0} + \sum_{m=1}^{M} b_{k,m} \cdot x_m^{(i)}
\]

\[
\mathrm{softmax}(z^{(i)}) = \left[ a_1^{(i)} \quad a_2^{(i)} \quad \dots \quad a_K^{(i)} \right]
\]

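With a one-hot label vector $y^{(i)}$ (exactly one $y_k^{(i)}$ equals 1 and the others are 0), the probability that the model assigns to the true class of example i can be expressed as a product in which the exponent selects the factor belonging to the true class:

\[
\prod_{k=1}^{K} P(y_k^{(i)} = 1 \mid z_k^{(i)})^{y_k^{(i)}} = \prod_{k=1}^{K} \left( a_k^{(i)} \right)^{y_k^{(i)}}
\]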

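The likelihood is the overall probability of the data, obtained as the product of the probabilities of all n training examples:

\[
L = \prod_{i=1}^{n} \prod_{k=1}^{K} \left( a_k^{(i)} \right)^{y_k^{(i)}}
\]

Taking the logarithm turns the products into sums:

\[
\ln(L) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)} \cdot \ln\left( a_k^{(i)} \right)
\]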

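The parameters are found by maximizing $\ln(L)$ or, equivalently, by minimizing $-\ln(L)$:

\[
\mathrm{Loss} = - \sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)} \cdot \ln\left( a_k^{(i)} \right)
\]

This loss function is called the multi-category cross-entropy (also known as categorical cross-entropy).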
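As a small illustration of the loss formula, the sketch below evaluates it for two examples. The probability rows in `A` and the one-hot labels in `Y` are made-up values chosen for this example, and `cross_entropy` is a name introduced here.

```python
import math

def cross_entropy(A, Y):
    """Loss = -sum_i sum_k y_k^(i) * ln(a_k^(i)), with one-hot rows in Y."""
    loss = 0.0
    for a, y in zip(A, Y):
        for a_k, y_k in zip(a, y):
            if y_k:   # terms with y_k = 0 contribute nothing to the sum
                loss -= y_k * math.log(a_k)
    return loss

# Hypothetical softmax outputs for n = 2 examples and K = 3 classes
A = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1]]
# One-hot labels: example 1 is class 1, example 2 is class 2
Y = [[1, 0, 0],
     [0, 1, 0]]

loss = cross_entropy(A, Y)
print(round(loss, 4))   # 0.5798, i.e. -ln(0.7) - ln(0.8)
```

Because the labels are one-hot, only the probability assigned to the true class of each example enters the loss.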

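The model's parameter values $b_{k,m}$ can be estimated by gradient descent, with the partial derivatives obtained from the multivariable chain rule. As an example, for $b_{1,m}$ and a single training example:

\[
\begin{aligned}
\frac{\partial \mathrm{Loss}}{\partial b_{1,m}}
&= \frac{\partial \mathrm{Loss}}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_{1,m}}
 + \frac{\partial \mathrm{Loss}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_{1,m}}
 + \dots
 + \frac{\partial \mathrm{Loss}}{\partial a_K} \cdot \frac{\partial a_K}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_{1,m}} \\
&= -(y_1 - a_1) \cdot x_m
\end{aligned}
\]

where $(y_1 - a_1)$ is the error for class 1 and $x_m$ is the feature value.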
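The intermediate steps between the chain-rule sum and the final result are not spelled out above. One way to fill them in, assuming the single-example loss $\mathrm{Loss} = -\sum_k y_k \ln(a_k)$ and a one-hot label (so that $\sum_k y_k = 1$), is the following sketch:

```latex
\begin{aligned}
\frac{\partial \mathrm{Loss}}{\partial a_k} &= -\frac{y_k}{a_k}, \qquad
\frac{\partial a_k}{\partial z_1} =
\begin{cases}
a_1 (1 - a_1) & k = 1 \\
-a_k \, a_1   & k \neq 1
\end{cases}, \qquad
\frac{\partial z_1}{\partial b_{1,m}} = x_m \\[6pt]
\frac{\partial \mathrm{Loss}}{\partial b_{1,m}}
&= \Big[ -y_1 (1 - a_1) + \sum_{k \neq 1} y_k \, a_1 \Big] \cdot x_m \\
&= \Big[ -y_1 + a_1 \sum_{k=1}^{K} y_k \Big] \cdot x_m \\
&= (a_1 - y_1) \cdot x_m = -(y_1 - a_1) \cdot x_m
\end{aligned}
```

The same argument applies to every class k, which gives the general form $(a_k - y_k) \cdot x_m$.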

🎛 Interactive: Cross-Entropy Loss

Adjust the predicted logits (z) and pick the true class to see how the cross-entropy loss responds. Notice how confidently wrong predictions are penalized far more than uncertain ones.
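The effect described above can be seen numerically. The following sketch compares three hand-picked logit vectors for the same true class; the logit values and the helper name `loss_for_true_class` are illustrative assumptions, not values taken from the interactive widget.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss_for_true_class(z, true_class):
    """Cross-entropy for one example: -ln(a_true); true_class is 0-based."""
    return -math.log(softmax(z)[true_class])

true_class = 0
confident_right = loss_for_true_class([5.0, 0.0, 0.0], true_class)
uncertain       = loss_for_true_class([0.0, 0.0, 0.0], true_class)
confident_wrong = loss_for_true_class([0.0, 5.0, 0.0], true_class)

print(round(confident_right, 3))   # 0.013
print(round(uncertain, 3))         # 1.099 (= ln 3, uniform over 3 classes)
print(round(confident_wrong, 3))   # 5.013
```

In this example, moving from an uncertain prediction to a confidently wrong one raises the loss by roughly a factor of 4.5, while a confidently correct prediction brings it close to zero.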

Gradient Descent Steps for n training examples

  1. Initialize the parameters $b_{k,m}^{[0]}$ to start the iteration.
  2. Iterate for d = 0, 1, 2, … using the gradient descent update rule:
    \[
    b_{k,m}^{[d+1]} := b_{k,m}^{[d]} - \eta \cdot \frac{\partial \mathrm{Loss}}{\partial b_{k,m}}
                     = b_{k,m}^{[d]} - \eta \cdot \sum_{i=1}^{n} \left( a_k^{(i)} - y_k^{(i)} \right) \cdot x_m^{(i)}
    \]

    η is called the learning rate or the learning step size. For the bias $b_{k,0}$, take $x_0^{(i)} = 1$.

  3. Termination criteria can include:
    • Reaching a fixed number of iterations (the number of epochs)
    • The improvement between successive iterations falling below a predefined threshold

The learning rate and the number of epochs can be considered as hyperparameters of this method.
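The steps above can be sketched as a small batch gradient descent loop in plain Python. Everything specific here is an assumption made for illustration: the toy data set, the zero initialization, the learning rate `eta=0.05`, the fixed 200 epochs as the only termination criterion, and the function names.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(B, x):
    """Probabilities a_k for one example; each row of B is [b_k0, b_k1, ..., b_kM]."""
    z = [row[0] + sum(w * xm for w, xm in zip(row[1:], x)) for row in B]
    return softmax(z)

def total_loss(B, X, Y):
    """Cross-entropy summed over all examples; Y holds 0-based class indices."""
    return -sum(math.log(forward(B, x)[y]) for x, y in zip(X, Y))

def train(X, Y, K, eta=0.05, epochs=200):
    M = len(X[0])
    B = [[0.0] * (M + 1) for _ in range(K)]        # step 1: initialize
    history = []
    for _ in range(epochs):                        # step 2: iterate
        grad = [[0.0] * (M + 1) for _ in range(K)]
        for x, y in zip(X, Y):
            a = forward(B, x)
            x_ext = [1.0] + list(x)                # x_0 = 1 multiplies the bias
            for k in range(K):
                delta = a[k] - (1.0 if k == y else 0.0)   # a_k - y_k
                for m in range(M + 1):
                    grad[k][m] += delta * x_ext[m]
        for k in range(K):
            for m in range(M + 1):
                B[k][m] -= eta * grad[k][m]        # update rule
        history.append(total_loss(B, X, Y))
    return B, history                              # step 3: stop after `epochs`

# Hypothetical toy data: three well-separated clusters, labels 0..2
X = [[2.0, 0.0], [2.5, 0.5], [0.0, 2.0], [0.5, 2.5], [-2.0, -2.0], [-2.5, -1.5]]
Y = [0, 0, 1, 1, 2, 2]

B, history = train(X, Y, K=3)
predictions = [max(range(3), key=lambda k: forward(B, x)[k]) for x in X]
print(history[0] > history[-1])   # True: the loss went down
print(predictions == Y)           # True: all training points are classified correctly
```

A threshold on the change in `history` between epochs could replace the fixed epoch count as the stopping rule; both `eta` and `epochs` are the hyperparameters mentioned above.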

🔬 Live Simulation: Forward & Backward Pass

The diagram below combines the block diagram, the softmax, and the cross-entropy loss into one full training step for a single example, with M = 2 features and K = 3 classes. Move the sliders, pick a true class y, then press Take gradient step to apply the update $b_{k,j} \leftarrow b_{k,j} - \eta \cdot \partial L / \partial b_{k,j}$.

[Figure: live simulation of the forward and backward pass. State captured here:]

Inputs: x₁ = 1.50, x₂ = -0.50; true class y = Class 1

Class 1: b₁,₀ = 0.0, b₁,₁ = 1.0, b₁,₂ = 0.0 → z₁ = 1.50 → a₁ = 0.736 (73.6%), δ₁ = -0.26
Class 2: b₂,₀ = 0.0, b₂,₁ = 0.0, b₂,₂ = 1.0 → z₂ = -0.50 → a₂ = 0.100 (10.0%), δ₂ = 0.10
Class 3: b₃,₀ = 0.5, b₃,₁ = -0.5, b₃,₂ = -0.5 → z₃ = 0.00 → a₃ = 0.164 (16.4%), δ₃ = 0.16

Prediction: ŷ = Class 1. Cross-entropy loss: L = 0.307.
Backward pass: gradients flow right → left, with ∂L/∂zₖ = aₖ − yₖ.

🏷 True class y

y is the one-hot encoded label of the true class.

⚙ Parameter matrix B and gradient ∂L/∂B

Editable parameters on the left, computed gradients on the right (read-only):

           B (parameters)              ∂L/∂B (gradients)
           bₖ,₀    bₖ,₁    bₖ,₂        ∂L/∂bₖ,₀    ∂L/∂bₖ,₁    ∂L/∂bₖ,₂
Class 1     0.0     1.0     0.0         -0.26       -0.40        0.13
Class 2     0.0     0.0     1.0          0.10        0.15       -0.05
Class 3     0.5    -0.5    -0.5          0.16        0.25       -0.08
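The gradient table can be reproduced from the values in the diagram using ∂L/∂zₖ = aₖ − yₖ, followed by a multiplication with the inputs. The sketch below does this and then applies one update step; the learning rate `eta = 0.1` is an assumed value, since the page's setting is not shown, and the variable names are chosen here.

```python
import math

B = [[0.0, 1.0, 0.0],     # class 1: [b_10, b_11, b_12]
     [0.0, 0.0, 1.0],     # class 2
     [0.5, -0.5, -0.5]]   # class 3
x = [1.50, -0.50]         # inputs x1, x2
y = [1, 0, 0]             # one-hot label: true class is Class 1
eta = 0.1                 # assumed learning rate (not given on the page)

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(B, x):
    z = [row[0] + sum(w * xm for w, xm in zip(row[1:], x)) for row in B]
    return softmax(z)

# Backward pass for a single example
a = forward(B, x)
delta = [a_k - y_k for a_k, y_k in zip(a, y)]      # dL/dz_k = a_k - y_k
x_ext = [1.0] + x                                  # x_0 = 1 for the bias column
grad = [[d * xm for xm in x_ext] for d in delta]   # dL/db_km = delta_k * x_m

for row in grad:
    print([round(g, 2) for g in row])
# [-0.26, -0.4, 0.13]
# [0.1, 0.15, -0.05]
# [0.16, 0.25, -0.08]

# One gradient descent step: b_km <- b_km - eta * dL/db_km
B_new = [[b - eta * g for b, g in zip(b_row, g_row)]
         for b_row, g_row in zip(B, grad)]
print(forward(B_new, x)[0] > a[0])   # True: the true class became more probable
```

The rounded gradients match the ∂L/∂B columns of the table above.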

🎛 Interactive: Watch Softmax Training

Watch gradient descent train a 3-class softmax classifier in real time. The decision regions and loss curve update as training progresses.

[Figure: two panels, "Decision regions (3 classes)" and "Cross-entropy loss vs. epoch".]

🎯 Key Takeaways