📄️ [17.06] Transformer
The Dawn of a New Era
📄️ [18.06] GPT-1
Twelve-Layer Decoder
📄️ [18.10] BERT
Twelve-Layer Encoder
📄️ [19.01] Transformer-XL
Longer Contexts
📄️ [19.02] GPT-2
Forty-Eight-Layer Decoder
📄️ [19.04] Sparse Transformer
Sparse Attention Mechanism
📄️ [19.06] XLNet
Two-Stream Attention Mechanism
📄️ [19.07] RoBERTa
A Guide to Training BERT
📄️ [19.09] ALBERT
A Compact Version of BERT
📄️ [19.11] MQA
Shared Key-Value Mechanism
📄️ [20.01] Scaling Laws
Scaling Laws for Neural Language Models
📄️ [20.04] Longformer
Long-Document Attention Mechanism
📄️ [20.05] GPT-3
Ninety-Six-Layer Decoder
📄️ [20.07] BigBird
BigBird Attention Mechanism
📄️ [21.01] Switch Transformer
Let the Experts Speak
📄️ [21.04] RoFormer
Rotary Position Embedding