Softmax Regression

In softmax regression, the probability that a data point belongs to each class is computed with the softmax function, rather than with the sigmoid function used in logistic regression.

The softmax function takes as input a vector Z of K real numbers (Z = (z₁, z₂, …, z_K)):

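$$\mathrm{softmax}(Z) = \begin{bmatrix} \dfrac{e^{z_1}}{\sum_{j=1}^{K} e^{z_j}} & \dfrac{e^{z_2}}{\sum_{j=1}^{K} e^{z_j}} & \cdots & \dfrac{e^{z_K}}{\sum_{j=1}^{K} e^{z_j}} \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & \cdots & a_K \end{bmatrix}$$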

The elements of Z can be negative or greater than one, and they need not sum to 1. After applying softmax:
- each component lies in the open interval (0, 1): a₁, a₂, …, a_K ∈ (0, 1)
- the components sum to 1, so they can be interpreted as probabilities: $\sum_{j=1}^{K} a_j = 1$
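The two properties above can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `softmax` and the max-subtraction step (a common trick for numerical stability) are not part of the original text.

```python
import numpy as np

def softmax(z):
    """Map a vector of K real scores to K probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    # Subtracting max(z) does not change the result (the common factor
    # cancels between numerator and denominator) but avoids overflow in exp.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 0.5, -0.5])  # contains a negative entry and one above 1
a = softmax(z)
print(a)        # every entry lies strictly between 0 and 1
print(a.sum())  # 1.0 up to floating-point rounding
```

Without the max-subtraction, very large scores (for example 1000) would overflow in `np.exp`; with it, the result is still a valid distribution.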

All K linear scores (z₁, …, z_K) are computed in parallel and fed into a single SOFTMAX block that produces the probabilities (a₁, …, a_K). ARGMAX then picks the class with the highest probability.

[Block diagram: all K classes share the same input layer; SOFTMAX normalizes the scores so that a_k ∈ (0, 1) and Σ a_k = 1. Then ŷ = argmax a_k = argmax z_k.]

[Interactive diagram: the block diagram above, made concrete for M = 2 features and K = 3 classes. Sliders set the inputs x (defaults x₁ = 1.5, x₂ = −0.5) and the 3 × 3 parameter matrix B, in which row k holds the bias b_k,0 and the feature weights b_k,1 (multiplying x₁) and b_k,2 (multiplying x₂). The diagram displays the linear scores z_k, the softmax outputs a_k, and the predicted class ŷ = argmax a_k.]

[Interactive demo: enter raw logit values z for K classes (default K = 3) to see how the softmax function turns them into a probability distribution. The predicted class is ŷ = argmax a_k.]

Properties: all probabilities satisfy a_k ∈ (0, 1) and sum to 1: $\sum_{j=1}^{K} a_j = 1$. Even when the input z values are negative or larger than 1, softmax produces a valid probability distribution.

Softmax Predictions

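The output of softmax(Z) can also be written as a vector of class probabilities:

$$\begin{bmatrix} P(y_1 = 1 \mid z_1) & P(y_2 = 1 \mid z_2) & \cdots & P(y_K = 1 \mid z_K) \end{bmatrix}$$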

The predicted class is the one with the highest probability produced by the softmax function:

$$\hat{y} = \underset{k \in \{1, 2, \dots, K\}}{\arg\max}\; a_k$$

Because the softmax operation preserves the ordering among its arguments, we do not need to compute the softmax to determine which class has been assigned the highest probability:

$$\hat{y} = \underset{k \in \{1, 2, \dots, K\}}{\arg\max}\; a_k = \underset{k \in \{1, 2, \dots, K\}}{\arg\max}\; z_k$$
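As a quick illustration of the order-preserving property, the sketch below (assuming NumPy; the helper `softmax` and the random test vectors are illustrative, not from the original) compares the argmax of the probabilities with the argmax of the raw scores.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shifted for numerical stability
    return exp_z / exp_z.sum()

z = np.array([2.0, 0.5, -0.5])
a = softmax(z)

# +1 converts NumPy's 0-based index to the class labels 1..K used in the text
y_hat_from_probs = int(np.argmax(a)) + 1
y_hat_from_scores = int(np.argmax(z)) + 1
print(y_hat_from_probs, y_hat_from_scores)  # both are 1: z_1 is the largest score

# The same agreement holds for randomly drawn score vectors
rng = np.random.default_rng(0)
agreements = [
    np.argmax(softmax(z_rand)) == np.argmax(z_rand)
    for z_rand in rng.normal(size=(1000, 5))
]
print(all(agreements))
```

In practice this means a prediction-only code path can skip the exponentials entirely and take the argmax of the linear scores.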

Element k of the vector Z⁽ⁱ⁾ for training example i is the linear score:

$$z_k^{(i)} = b_{k,0} + \sum_{m=1}^{M} b_{k,m}\, x_m^{(i)}$$

Softmax(Z⁽ⁱ⁾) = [ a₁⁽ⁱ⁾ a₂⁽ⁱ⁾ … a_K⁽ⁱ⁾ ]
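The forward computation from features to probabilities can be sketched as follows for M = 2 and K = 3. The input values x₁ = 1.5 and x₂ = −0.5 are the ones used in the diagrams; the parameter matrix `B` is hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shifted for numerical stability
    return exp_z / exp_z.sum()

# Hypothetical parameters: row k holds (b_k,0, b_k,1, b_k,2)
B = np.array([
    [ 0.0,  1.0, -1.0],
    [ 0.5, -0.5,  0.5],
    [-0.5,  0.2,  0.3],
])

x = np.array([1.5, -0.5])            # features x_1, x_2
x_aug = np.concatenate(([1.0], x))   # prepend x_0 = 1 so the bias is part of the product

z = B @ x_aug                        # z_k = b_k,0 + b_k,1 * x_1 + b_k,2 * x_2
a = softmax(z)                       # probabilities a_1, ..., a_K
y_hat = int(np.argmax(a)) + 1        # predicted class, numbered 1..K

print("z =", z)
print("a =", a)
print("predicted class:", y_hat)
```

Prepending a constant 1 to the feature vector is a design choice that lets one matrix product cover both the bias and the feature weights.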

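With a one-hot label y⁽ⁱ⁾ (y_k⁽ⁱ⁾ = 1 for the true class and 0 for every other class), the probability that the model assigns to the observed class of example i can be expressed as:

$$\prod_{k=1}^{K} P\!\left(y_k^{(i)} = 1 \mid z_k^{(i)}\right)^{y_k^{(i)}} = \prod_{k=1}^{K} \left(a_k^{(i)}\right)^{y_k^{(i)}}$$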

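The likelihood is the overall probability of the data, obtained as the product of the probabilities of all n training examples:

$$L = \prod_{i=1}^{n} \prod_{k=1}^{K} \left(a_k^{(i)}\right)^{y_k^{(i)}}$$

Taking logarithms turns the products into sums:

$$\ln(L) = \sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)} \cdot \ln\!\left(a_k^{(i)}\right)$$

The parameters are optimized by maximizing ln(L) or, equivalently, by minimizing −ln(L):

$$\mathrm{Loss} = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)} \cdot \ln\!\left(a_k^{(i)}\right)$$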

This loss function is called multi-category cross entropy.
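To make the link between the likelihood and the loss concrete, here is a small sketch assuming NumPy. The logits `Z` and labels `Y` are made-up values for two training examples, and the function names are illustrative.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax: each row of Z becomes a probability vector."""
    Z = Z - Z.max(axis=1, keepdims=True)  # shifted for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(A, Y):
    """Loss = -sum_i sum_k y_k^(i) * ln(a_k^(i)) for one-hot rows Y."""
    return -np.sum(Y * np.log(A))

# Made-up logits for n = 2 examples and K = 3 classes
Z = np.array([[2.0, 0.5, -0.5],
              [0.0, 1.0,  0.0]])
# One-hot labels: example 1 belongs to class 1, example 2 to class 2
Y = np.array([[1, 0, 0],
              [0, 1, 0]])

A = softmax_rows(Z)
loss = cross_entropy(A, Y)
likelihood = np.prod(A ** Y)   # L = prod_i prod_k (a_k^(i))^(y_k^(i))

print("loss:", loss)
print("-ln(likelihood):", -np.log(likelihood))  # same value as the loss
```

Because each label row is one-hot, only the probability of the true class contributes to the loss for each example.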

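The model's parameter values b_k,m can be estimated by gradient descent, with the partial derivatives obtained from the multivariable chain rule. For example, for b_1,m and a single training example:

$$\frac{\partial \mathrm{Loss}}{\partial b_{1,m}} = \frac{\partial \mathrm{Loss}}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_{1,m}} + \frac{\partial \mathrm{Loss}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_{1,m}} + \cdots + \frac{\partial \mathrm{Loss}}{\partial a_K} \cdot \frac{\partial a_K}{\partial z_1} \cdot \frac{\partial z_1}{\partial b_{1,m}}$$

$$= -(y_1 - a_1) \cdot x_m$$

where (y₁ − a₁) is the error for class 1 and x_m is the feature value.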
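The intermediate steps of this simplification can be sketched as follows for a single example. The indicator notation 𝟙[k = 1] (equal to 1 when k = 1 and 0 otherwise) is introduced here for convenience, and the last step uses the assumption that the label is one-hot, so that the y_k sum to 1.

```latex
\frac{\partial \mathrm{Loss}}{\partial a_k} = -\frac{y_k}{a_k},
\qquad
\frac{\partial a_k}{\partial z_1} = a_k \left( \mathbb{1}[k = 1] - a_1 \right),
\qquad
\frac{\partial z_1}{\partial b_{1,m}} = x_m

\frac{\partial \mathrm{Loss}}{\partial b_{1,m}}
  = \sum_{k=1}^{K} \left( -\frac{y_k}{a_k} \right) a_k \left( \mathbb{1}[k = 1] - a_1 \right) x_m
  = \left( -y_1 + a_1 \sum_{k=1}^{K} y_k \right) x_m
  = (a_1 - y_1)\, x_m
  = -(y_1 - a_1)\, x_m
```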

[Interactive demo: adjust the predicted logits z (defaults z₁ = 2.0, z₂ = 0.5, z₃ = −0.5) and pick the true class y to see how the cross-entropy loss responds. Confidently wrong predictions are penalized far more than uncertain ones.]
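The effect of the true class on the loss can be reproduced with a few lines of code. This sketch assumes NumPy and uses the logits z = (2.0, 0.5, −0.5) from the demo; the helper names and the all-zero "uncertain" logit vector are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shifted for numerical stability
    return exp_z / exp_z.sum()

def loss_for(z, true_class):
    """Cross-entropy for one example, -ln(a_y), with true_class in 1..K."""
    return -np.log(softmax(z)[true_class - 1])

z = np.array([2.0, 0.5, -0.5])  # the model is fairly confident in class 1

# Loss for each possible true class: roughly 0.27, 1.77 and 2.77
losses = {k: loss_for(z, k) for k in (1, 2, 3)}

# An uncertain model (equal logits) pays ln(3), about 1.10, whatever the true class
uncertain = loss_for(np.zeros(3), 1)

for k, value in losses.items():
    print(f"true class {k}: loss = {value:.2f}")
print(f"uncertain model: loss = {uncertain:.2f}")
```

When the true class is 3, the confident model's loss is well above the loss of the uncertain model, which is the behaviour described above.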

Gradient Descent Steps for n training examples

Initialize the parameters b_k,m^[0] to start the iteration.
Iterate for d = 0, 1, 2, … using the gradient descent update rule (with x₀⁽ⁱ⁾ = 1 for the bias term b_k,0):

$$b_{k,m}^{[d+1]} := b_{k,m}^{[d]} - \eta \cdot \frac{\partial \mathrm{Loss}}{\partial b_{k,m}} = b_{k,m}^{[d]} - \eta \cdot \sum_{i=1}^{n} \left(a_k^{(i)} - y_k^{(i)}\right) \cdot x_m^{(i)}$$

η is called the learning rate or the learning step size.
Termination criteria for the iteration can include:
- Reaching a fixed number of iterations (the number of epochs)
- Stopping once the improvement between successive iterations falls below a predefined threshold

The learning rate and the number of epochs can be considered as hyperparameters of this method.
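The steps above can be sketched as a short batch gradient descent loop. This is a minimal illustration assuming NumPy, not a reference implementation: the function names, the 0-based integer label encoding, the synthetic three-blob data, and the values of `eta`, `max_epochs`, and `tol` are all assumptions. Because the gradient is a sum over all n examples, the learning rate is chosen small enough to keep the updates stable on this data.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax: each row of Z becomes a probability vector."""
    Z = Z - Z.max(axis=1, keepdims=True)  # shifted for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train(X, labels, K, eta, max_epochs=500, tol=1e-6):
    """Batch gradient descent; labels are integers 0..K-1 (hypothetical encoding)."""
    n, M = X.shape
    X_aug = np.hstack([np.ones((n, 1)), X])   # prepend x_0 = 1 for the bias b_k,0
    Y = np.eye(K)[labels]                     # one-hot labels, shape (n, K)
    B = np.zeros((K, M + 1))                  # initial parameters b_k,m^[0]
    losses = []
    for d in range(max_epochs):               # criterion 1: fixed number of epochs
        A = softmax_rows(X_aug @ B.T)         # probabilities a_k^(i), shape (n, K)
        losses.append(-np.sum(Y * np.log(A)))
        # criterion 2: stop when the loss improves by less than tol
        if d > 0 and losses[-2] - losses[-1] < tol:
            break
        grad = (A - Y).T @ X_aug              # entry (k, m): sum_i (a_k - y_k) * x_m
        B -= eta * grad                       # gradient descent update
    return B, losses

# Made-up data: three Gaussian blobs in 2-D, 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])
labels = np.repeat(np.arange(3), 50)
X = centers[labels] + 0.7 * rng.standard_normal((150, 2))

B, losses = train(X, labels, K=3, eta=0.002)
X_aug = np.hstack([np.ones((150, 1)), X])
predictions = np.argmax(X_aug @ B.T, axis=1)  # argmax of scores = argmax of probabilities
accuracy = np.mean(predictions == labels)
print(f"loss: {losses[0]:.2f} -> {losses[-1]:.2f}, training accuracy: {accuracy:.2f}")
```

Starting from all-zero parameters, every class gets probability 1/3, so the initial loss equals n · ln(3). Dividing the gradient by n (a mean instead of a sum) is a common variant that makes the choice of learning rate less dependent on the dataset size.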

[Interactive diagram: the block diagram, the softmax, and the cross-entropy loss combined into one full training step for a single example, with M = 2 features and K = 3 classes. After setting the sliders and picking a true class y, the "Take gradient step" button applies the update b_k,j ← b_k,j − η · ∂L/∂b_k,j.]

Forward pass (black, left to right): inputs x → linear scores z_k = b_k,0 + b_k,1 x₁ + b_k,2 x₂ → softmax probabilities a_k → prediction ŷ and cross-entropy loss L = −ln a_y, where a_y is the probability of the true class (here L denotes the loss for this single example, not the likelihood)
Backward pass (red): output gradient δ_k = a_k − y_k at each Σ block, then parameter gradients ∂L/∂b_k,j = δ_k · x_j (with x₀ ≡ 1 for the bias)

Default settings in the diagram: x₁ = 1.5, x₂ = −0.5, η = 0.5. The label y is the one-hot encoding of the true class.

Parameter gradients ∂L/∂B computed in the diagram (the parameter values B are editable in the diagram and are not reproduced here):

	∂L/∂b_k,0	∂L/∂b_k,1	∂L/∂b_k,2
Class 1	-0.26	-0.40	0.13
Class 2	0.10	0.15	-0.05
Class 3	0.16	0.25	-0.08
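A single training step of this kind can be sketched in a few lines. The inputs x = (1.5, −0.5) and η = 0.5 follow the diagram, while the all-zero starting matrix `B` and the choice of class 1 as the true class are assumptions made for illustration, so the numbers differ from the table above. The structure is the same: each gradient row is δ_k times (1, x₁, x₂).

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shifted for numerical stability
    return exp_z / exp_z.sum()

x = np.array([1.5, -0.5])           # inputs x_1, x_2
eta = 0.5                           # learning rate
true_class = 1                      # assumed true class, numbered 1..K
B = np.zeros((3, 3))                # hypothetical starting parameters

x_aug = np.concatenate(([1.0], x))  # x_0 = 1 for the bias
y = np.eye(3)[true_class - 1]       # one-hot label

# Forward pass: scores -> probabilities -> loss
z = B @ x_aug
a = softmax(z)
loss = -np.log(a[true_class - 1])

# Backward pass: delta_k = a_k - y_k, then dL/db_k,j = delta_k * x_j
delta = a - y
grad_B = np.outer(delta, x_aug)

# Gradient descent step and the loss after the update
B_new = B - eta * grad_B
loss_new = -np.log(softmax(B_new @ x_aug)[true_class - 1])

print("gradient:\n", grad_B)
print(f"loss before: {loss:.3f}, after: {loss_new:.3f}")
```

Since the entries of δ sum to zero, each column of the gradient matrix also sums to zero, which matches the pattern in the table.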


[Interactive demo: gradient descent training a 3-class softmax classifier in real time, with an adjustable learning rate η (default 0.10) and class separation (default 2.0). Panels: decision regions (3 classes); cross-entropy loss vs epoch.]

🎯 Key Takeaways

- One-vs-Rest trains K binary classifiers, one per class, and picks the class with the highest confidence.
- Softmax regression outputs a probability distribution over all K classes simultaneously. The probabilities sum to 1.
- The multi-category cross entropy loss is the natural extension of binary cross entropy to K classes.
- Gradient descent update rule: b_k,m ← b_k,m − η · Σᵢ (a_k⁽ⁱ⁾ − y_k⁽ⁱ⁾) · x_m⁽ⁱ⁾