# SwiGLU: The Smart Activation Function

## What is it

SwiGLU is the activation function used inside the feed-forward
network of many modern transformers. An activation function
decides how much of a signal passes through a layer. Older
activation functions such as ReLU apply one fixed rule to each
number. SwiGLU is smarter. It has a second path that acts like a
gate. The gate learns when to let information through and when to
block it.

Think of it like a water faucet. ReLU is a faucet that is either
fully open or fully closed. Nothing in between. SwiGLU is a
faucet you can turn to any position. A little open for a trickle.
Half open for moderate flow. Fully open when you need everything.
The model learns the right position for every input.

## Where is it used

In models that use it, SwiGLU sits inside every transformer
block. It replaces the older activation functions inside the
feed-forward network. Every time the model processes a token
through the FFN layer, SwiGLU decides what information to keep
and what to throw away.

```
Transformer Block:
  x → RMSNorm → Attention → +x
    → RMSNorm → SwiGLU FFN → +x
                   ^^^^^^
                   This part
```

LLaMA, PaLM, and many other models built since 2022 use SwiGLU.
GPT-2 and GPT-3 used GELU, which was the previous default. In
published comparisons SwiGLU generally matches or beats GELU.

## Why we use it instead of ReLU or GELU

ReLU is the simplest activation. It outputs zero for negative
numbers and passes positive numbers through unchanged.

```
ReLU(x): max(0, x)

ReLU(-3.2) = 0    (blocked)
ReLU(0.5)  = 0.5  (passed)
ReLU(4.1)  = 4.1  (passed)
```

The problem with ReLU is the hard cutoff at zero. Any negative
value is set to zero, and the gradient there is zero too. A
neuron whose input is negative for every training example gets no
gradient, so it stops learning and may never activate again. This
is called the dying ReLU problem. Such neurons become dead
weight.

GELU eases this by making the cutoff smooth. Instead of a hard
zero, GELU outputs small negative values for negative inputs, so
some gradient still flows.

```
GELU(-3.2) ≈ -0.002  (mostly blocked but not dead)
GELU(0.5)  ≈ 0.346   (partially passed)
GELU(4.1)  ≈ 4.100   (mostly passed)
```
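
The numbers above can be reproduced with a few lines of plain
Python. This is a minimal sketch that assumes the exact erf-based
form of GELU. The helper names `relu` and `gelu` are ours, and
libraries that use the tanh approximation will give slightly
different digits.

```python
import math

def relu(x):
    # Hard cutoff: negatives become zero, positives pass unchanged.
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.2, 0.5, 4.1):
    print(f"x={x:5.1f}  ReLU={relu(x):.3f}  GELU={gelu(x):.3f}")
```

ReLU returns exactly zero for the negative input, while GELU
returns a small negative number.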

GELU is better than ReLU, but it is still one fixed curve applied
to each number on its own. How much of a value passes depends
only on that value. There is no learned signal that lets the
model decide *this* feature should pass through more than *that*
feature for a given input.

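SwiGLU adds a gate. The input splits into two paths. One path
computes values like a normal activation. The other path computes
how much of those values to keep. The gate and the values are
computed from the same input using different learned weights. The
two paths are multiplied element by element, so which one is
called the gate is a matter of convention. Many codebases give
that name to the SiLU path instead.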

```
SwiGLU(x) = SiLU(x × W₁) ⊙ (x × W₂)

Path 1 (values): SiLU(x × W₁) → the information
Path 2 (gate):   x × W₂       → how much information to pass
⊙ = element by element multiplication of the two paths
```

The gate can output any number. If the gate outputs 0.1, the
value is reduced to ten percent. If the gate outputs 5.0, the
value is amplified five times. A negative gate flips the sign.
The model learns what to amplify and what to suppress. This
flexibility is the usual explanation for why SwiGLU tends to
outperform both ReLU and GELU at large scale.
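
To make the gating concrete for a whole vector, here is a minimal
pure-Python sketch. The helpers `matvec` and `swiglu` and the tiny
weight matrices are made up for illustration. `W1` is chosen so
that every hidden unit carries the same value, and `W2` so that the
three gates come out as 0.1, 1.0, and 5.0.

```python
import math

def silu(z):
    # SiLU (Swish): z times sigmoid(z).
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    # Row vector x (length d_in) times matrix W (d_in rows, d_out columns).
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]

def swiglu(x, W1, W2):
    values = [silu(v) for v in matvec(x, W1)]   # value path
    gates = matvec(x, W2)                       # gate path (no activation)
    return [v * g for v, g in zip(values, gates)]  # element-wise product

x = [1.0, 2.0]
W1 = [[0.5, 0.5, 0.5], [0.25, 0.25, 0.25]]  # value pre-activation is 1.0 everywhere
W2 = [[0.1, 1.0, 5.0], [0.0, 0.0, 0.0]]     # gates are 0.1, 1.0, 5.0

out = swiglu(x, W1, W2)
print([round(o, 3) for o in out])  # [0.073, 0.731, 3.655]
```

The same value, SiLU(1.0), comes out at ten percent, unchanged,
and five times larger depending only on the gate.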

## When was it invented

The paper that introduced SwiGLU, "GLU Variants Improve
Transformer", was published in 2020 by Noam Shazeer, a well-known
researcher who also co-invented the transformer. The paper
compared many activation variants and found that the gated
variants generally did better than ReLU and GELU. PaLM adopted it
in 2022. LLaMA adopted it in 2023. Now it is a standard choice.

## How it works step by step

Let us trace a single number flowing through SwiGLU.

### The setup

```
Input x = 1.5

Weights (learned during training):
W₁ = 0.8   (for the value path)
W₂ = 2.0   (for the gate path)
```

### Path 1: compute the value

First multiply the input by W₁.

```
x × W₁ = 1.5 × 0.8 = 1.2
```

Then apply SiLU. SiLU is also called the Swish function. It is
x multiplied by the sigmoid of x.

```
SiLU(1.2) = 1.2 × sigmoid(1.2)

sigmoid(1.2) = 1 / (1 + e^(-1.2))
             = 1 / (1 + 0.301)
             = 1 / 1.301
             = 0.769

SiLU(1.2) = 1.2 × 0.769 = 0.922
```

SiLU gives 0.922. This is the processed value.
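
The SiLU arithmetic above can be checked with a short sketch in
plain Python. The function names `sigmoid` and `silu` are our own
choice here, not part of any library.

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def silu(z):
    # SiLU (Swish): the input scaled by its own sigmoid.
    return z * sigmoid(z)

print(round(sigmoid(1.2), 3))  # 0.769
print(round(silu(1.2), 3))     # 0.922
print(silu(0.0))               # 0.0
```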

### Path 2: compute the gate

Simply multiply the input by W₂.

```
x × W₂ = 1.5 × 2.0 = 3.0
```

The gate value is 3.0. This scales the value up three times. The
gate is open wide.

### Combine the two paths

Multiply the value by the gate.

```
output = 0.922 × 3.0 = 2.766
```

If the gate had been smaller, like 0.1, the output would have
been 0.092. If the gate had been zero, the output would have been
zero. The gate controls how much of the value survives.
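
The whole trace fits in a few lines of Python. This is a sketch
with a hypothetical helper, `swiglu_scalar`, that treats the
weights as single numbers exactly as in the walkthrough. At full
precision the result is about 2.767. The 2.766 above comes from
rounding the value to 0.922 before multiplying.

```python
import math

def silu(z):
    # SiLU (Swish): z times sigmoid(z).
    return z / (1.0 + math.exp(-z))

def swiglu_scalar(x, w1, w2):
    value = silu(x * w1)   # Path 1: the information
    gate = x * w2          # Path 2: how much of it to pass
    return value * gate

x, w1, w2 = 1.5, 0.8, 2.0
print(f"{swiglu_scalar(x, w1, w2):.3f}")  # 2.767

# Same value, different gate settings.
value = silu(x * w1)
for gate in (3.0, 0.1, 0.0):
    print(f"gate={gate:.1f}  output={value * gate:.3f}")
```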

### What about negative inputs

Let us try an input of -2.0.

```
x = -2.0

Path 1 (value):
  x × W₁ = -2.0 × 0.8 = -1.6
  SiLU(-1.6) = -1.6 × sigmoid(-1.6)
  sigmoid(-1.6) = 1 / (1 + e^1.6) = 1 / 5.953 = 0.168
  SiLU(-1.6) = -1.6 × 0.168 = -0.269

Path 2 (gate):
  x × W₂ = -2.0 × 2.0 = -4.0

Combine:
  output = -0.269 × (-4.0) = 1.076
```

Even though the input was negative, the output is positive. That
is because both the value path and the gate path became negative,
and negative times negative equals positive. The gating mechanism
gives the model extra flexibility to transform negative signals
into positive ones when needed. A ReLU applied to the same value
path would have output zero and discarded the signal.
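
Here is a small sketch of the negative case, with a plain ReLU on
the same value path for comparison. The variable names are ours.
At full precision the SwiGLU output is about 1.075. The hand
calculation gives 1.076 because it rounds the value to -0.269
first.

```python
import math

def silu(z):
    # SiLU (Swish): z times sigmoid(z).
    return z / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

x, w1, w2 = -2.0, 0.8, 2.0

value = silu(x * w1)        # small negative number, about -0.269
gate = x * w2               # -4.0
swiglu_out = value * gate   # negative times negative is positive
relu_out = relu(x * w1)     # ReLU on the same pre-activation

print(f"SwiGLU: {swiglu_out:.3f}")  # SwiGLU: 1.075
print(f"ReLU:   {relu_out:.3f}")    # ReLU:   0.000
```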

## A tiny code example

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, expansion_factor=4):
        super().__init__()
        hidden_dim = expansion_factor * d_model
        self.w1 = nn.Linear(d_model, hidden_dim, bias=False)
        self.w2 = nn.Linear(d_model, hidden_dim, bias=False)
        self.w3 = nn.Linear(hidden_dim, d_model, bias=False)

    def forward(self, x):
        # Path 1: values processed by SiLU
        values = F.silu(self.w1(x))
        # Path 2: gates controlling how much passes
        gates = self.w2(x)
        # Combine and project back to original size
        return self.w3(values * gates)

# Test with a fixed input (the weights are randomly initialized)
d_model = 4
ffn = SwiGLU(d_model)
x = torch.tensor([[1.5, -2.0, 0.3, 4.1]])

output = ffn(x)
print(f"Input:  {x}")
print(f"Output: {output}")
print(f"Shape preserved: {x.shape == output.shape}")
```

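Run this code and you will see something like the output below.
The output values depend on the random weight initialization, so
yours will differ.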

```
Input:  tensor([[ 1.5000, -2.0000,  0.3000,  4.1000]])
Output: tensor([[-1.234,  0.567, -0.891,  2.345]])
Shape preserved: True
```

## Why the expansion factor matters

Notice the hidden dimension in the code is four times larger than
the input dimension. This is the expansion factor. The network
goes from d_model to four times d_model and back again.

```
768 → 3072 → 768
```

This expand then contract pattern gives the network room to
transform information. In the middle layer there are many more
neurons than at the input or output. This is like widening a pipe
to let more water flow through before narrowing it again. The
extra width lets the model learn more complex transformations.

SwiGLU uses three weight matrices instead of the two that ReLU
or GELU networks use. The extra matrix is for the gate. This
makes a SwiGLU FFN about fifty percent larger than a standard FFN
at the same expansion factor. For our GPT-2 scale model this adds
about twenty-eight million extra parameters. Those parameters buy
extra capacity, but they also cost memory and compute.
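
The parameter arithmetic can be sketched as follows. The layer
count of 12 is our assumption (it matches GPT-2 small and the
twenty-eight million figure), and biases are ignored. The last
lines check a common remedy, used in the original paper and in
LLaMA, of shrinking the hidden dimension to about two thirds so
that the total matches a standard FFN.

```python
d_model = 768
expansion_factor = 4
n_layers = 12  # assumption: GPT-2 small depth

hidden_dim = expansion_factor * d_model   # 3072
per_matrix = d_model * hidden_dim         # weights in one projection, no bias

standard_ffn = 2 * per_matrix             # up projection + down projection
swiglu_ffn = 3 * per_matrix               # value + gate + down projection
extra_total = (swiglu_ffn - standard_ffn) * n_layers

print(swiglu_ffn / standard_ffn)          # 1.5
print(extra_total)                        # 28311552

# Shrink the hidden dimension to two thirds to restore parity.
reduced_hidden = 2 * hidden_dim // 3      # 2048
swiglu_reduced = 3 * d_model * reduced_hidden
print(swiglu_reduced == standard_ffn)     # True
```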

## What you need to remember

SwiGLU is a gated activation function. It splits the feed-forward
network into a value path and a gate path. The gate controls how
much of each value passes through. This is more flexible than
ReLU or GELU, which apply one fixed curve to every number.

The SiLU function on the value path provides smooth nonlinearity.
The gate on the control path provides adaptive filtering. Together
they tend to outperform older activation functions at large scale.

Many modern language models use SwiGLU. It costs one extra matrix
multiplication per FFN layer in exchange for a measurable
improvement in model quality.