🗓️ [17.06] Transformer
The Dawn of a New Era
🗓️ [18.06] GPT-1
Twelve-Layer Decoder
🗓️ [18.10] BERT
Twelve-Layer Encoder
🗓️ [19.01] Transformer-XL
Longer Contexts
🗓️ [19.02] GPT-2
Forty-Eight-Layer Decoder
🗓️ [19.04] Sparse Transformer
Sparse Attention Mechanism
🗓️ [19.06] XLNet
Two-Stream Self-Attention Mechanism
🗓️ [19.07] RoBERTa
A Guide to Training BERT
🗓️ [19.09] ALBERT
A Compact Version of BERT
🗓️ [19.11] MQA
Shared Key-Value Mechanism
🗓️ [20.01] Scaling Laws
Scaling Laws for Neural Language Models
🗓️ [20.04] Longformer
Long-Document Attention Mechanism
🗓️ [20.05] GPT-3
Ninety-Six-Layer Decoder
🗓️ [20.07] BigBird
BigBird Attention Mechanism
🗓️ [21.04] RoFormer
Rotary Position Embedding