A high-signal researcher for the post-attention design space, especially if you care about the line of work trying to make linear-attention and Delta-rule models actually competitive in real language-model systems.
Researcher Profile
Yikang Shen
Linear transformers via the delta rule
Researcher working on efficient sequence models and multimodal RLHF
Useful because his work links two strands that usually get discussed separately: efficient sequence-model architectures on one side and multimodal alignment work on the other.
About This Page
This profile is meant to help you get oriented quickly: why this researcher matters, what to read first, and where to explore next.
Last reviewed: March 18, 2026
Known For
The ideas, systems, and research directions that make this person worth knowing.
01. Gated linear attention and Delta-rule models (see the sketch after this list)
02. Multimodal RLHF and hallucination reduction
03. Research at the boundary of systems efficiency and alignment
04. Linear transformers via the delta rule
05. Parallelizing Linear Transformers with the Delta Rule over Sequence Length
06. DeltaNet
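For orientation only, and not taken from this page: the delta rule behind items 01 and 04 through 06 writes to a fast-weight memory by first reading out what the memory currently stores for a key and then correcting it toward the new value, rather than simply adding another outer product the way plain linear attention does. Below is a minimal NumPy sketch of that recurrence, with all function and variable names hypothetical; it is an illustration under those assumptions, not any researcher's released implementation.

```python
import numpy as np

def delta_rule_attention(q, k, v, beta):
    """Sequential delta-rule linear attention (illustrative sketch).

    q, k: (T, d_k) query/key sequences (keys assumed L2-normalized)
    v:    (T, d_v) value sequence
    beta: (T,) per-step writing strengths in (0, 1)

    Returns outputs of shape (T, d_v).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))          # fast-weight memory
    out = np.empty((T, d_v))
    for t in range(T):
        v_old = S @ k[t]              # what the memory currently returns for k_t
        # Delta rule: move the stored association for k_t toward v_t.
        S = S + beta[t] * np.outer(v[t] - v_old, k[t])
        out[t] = S @ q[t]
    return out

# Toy usage: T=4 steps, d_k = d_v = 3, keys L2-normalized.
rng = np.random.default_rng(0)
k = rng.normal(size=(4, 3))
k /= np.linalg.norm(k, axis=1, keepdims=True)
q, v = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
print(delta_rule_attention(q, k, v, beta=np.full(4, 0.5)).shape)  # (4, 3)
```

Note that this loop shows only the token-by-token recurrence; the contribution of the "Parallelizing Linear Transformers with the Delta Rule over Sequence Length" paper listed above is a chunkwise reformulation that makes training efficient on parallel hardware, which this sketch does not attempt.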
Start Here
Canonical papers, project pages, or repositories that anchor this profile.
Signature Works
Additional papers, projects, or repositories that help flesh out the profile.
Supporting Sources
Additional links that help verify and flesh out this profile.
Related Researchers
People worth exploring next because they share topics, labs, or source material with this profile.
Worth knowing because he is one of the recurring names in the recent MIT line of work on linear-attention alternatives, especially where hardware-efficient training meets practical long-context sequence modeling.
Worth surfacing because he is lead author on the Gated Slot Attention paper, which is one of the clearer attempts to push the RWKV-adjacent efficient-sequence line toward stronger memory and retrieval behavior rather than stopping at architecture novelty.
A useful researcher to study for the line from classic neural NLP into today’s efficient large-model work, with papers that span early sentence models, character-aware language modeling, and current sequence-model efficiency research.
A useful person to study if you care about alignment proposals that try to make superhuman systems legible enough for humans to supervise in practice.
A good researcher to follow for the infrastructure side of frontier language models, especially mixture-of-experts scaling, instruction tuning, and the data systems that make very large models usable.