📄️ [17.06] Transformer
The Dawn of a New Era
📄️ [18.06] GPT-1
Twelve-Layer Decoder
📄️ [18.10] BERT
Twelve-Layer Encoder
📄️ [19.01] Transformer-XL
Longer Contexts
📄️ [19.02] GPT-2
Forty-Eight-Layer Decoder
📄️ [19.04] Sparse Transformer
Sparse Attention Mechanism
📄️ [19.06] XLNet
Two-Stream Attention Mechanism
📄️ [19.07] RoBERTa
A Guide to Training BERT
📄️ [19.09] ALBERT
A Compact Version of BERT
📄️ [19.11] MQA
Shared Key-Value Mechanism
📄️ [20.01] Scaling Laws
Scaling Laws for Neural Language Models
📄️ [20.04] Longformer
Long-Document Attention Mechanism
📄️ [20.05] GPT-3
Ninety-Six-Layer Decoder
📄️ [20.07] BigBird
BigBird Attention Mechanism
📄️ [21.01] Switch Transformer
Let the Experts Speak
📄️ [21.04] RoFormer
Rotary Position Embedding