Researcher Profile
Song Han
One of the clearest researchers to follow for efficient AI systems, especially the line of work that makes large models smaller, faster, and easier to deploy without giving up too much quality.
A strong systems page because his work repeatedly shows up where inference efficiency meets usable long context, especially in attention sinks, StreamingLLM, post-training quantization, and later long-context head designs.
Organizations
Associate professor at MIT and distinguished scientist at NVIDIA
Topics
Streaming + long-context stability (attention sinks)
About This Page
This profile is meant to help you get oriented quickly: why this researcher matters, what to read first, and where to explore next.
Last reviewed: March 18, 2026
Known For
The ideas, systems, and research directions that make this person worth knowing.
01 Model compression and quantization
02 Efficient AI systems
03 Deployment-focused optimization for large models
04 Streaming + long-context stability (attention sinks; sketched in the snippet after this list)
05 Efficient Streaming Language Models with Attention Sinks (the StreamingLLM paper)
06 Long context
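Since attention sinks and StreamingLLM anchor this profile, a minimal sketch of the cache policy behind them may help: keep the key/value entries for a few initial "sink" tokens plus a rolling window of recent tokens, and evict everything in between. The function name, sink count, and window size below are illustrative assumptions, not the paper's reference implementation.

```python
# A minimal sketch of a StreamingLLM-style KV-cache policy.
# streaming_kv_positions is a hypothetical helper; the defaults
# (4 sinks, 1024-token window) are assumptions for illustration.

def streaming_kv_positions(seq_len: int, num_sinks: int = 4,
                           window: int = 1024) -> list[int]:
    """Return the token positions whose key/value entries stay cached."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))                  # everything still fits
    sinks = list(range(num_sinks))                   # initial "attention sink" tokens
    recent = list(range(seq_len - window, seq_len))  # rolling window of recent tokens
    return sinks + recent

# At step 5,000 the cache holds positions 0-3 plus 3,976-4,999:
print(len(streaming_kv_positions(5_000)))  # -> 1028, constant from here on
```

The point of the design is that cache size stays constant no matter how long the stream runs, while the retained sink tokens keep attention numerically stable.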
Start Here
Canonical papers, project pages, or repositories that anchor this profile.
Signature Works
Additional papers, projects, or repositories that help flesh out the profile.
Supporting Sources
Additional links that help verify and flesh out this profile.
Related Researchers
People worth exploring next because they share topics, labs, or source material with this profile.
A high-signal researcher for the systems side of modern AI, especially where reinforcement learning, memory-efficient large-model training, and long-context inference meet.
A strong researcher to follow for efficient and long-context LLM systems, especially where inference tricks and memory management make large models practical to run.
A strong researcher to study for the modern NLP stack because his work spans denoising pretraining, retrieval-augmented generation, and later long-context inference tricks rather than only one phase of the language-model pipeline.
A high-signal researcher for the latency and systems side of modern language models, especially where clever decoding tricks turn frontier models into usable products.
An important systems page because he is one of the named authors of speculative decoding, a technique that became part of the mainstream conversation about making large-model inference materially faster without changing the output distribution.
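Since this blurb hinges on speculative decoding, here is a minimal sketch of its accept/reject step, with toy samplers standing in for real models; draft_probs, target_probs, and the vocabulary size are assumptions for illustration. In a real system the target model scores all k draft tokens in one batched forward pass, which is where the speedup comes from; the sketch only shows the correctness-preserving sampling rule.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size (illustrative assumption)

def draft_probs(prefix):
    # Stand-in for the small, fast draft model's next-token distribution.
    return rng.dirichlet(np.ones(VOCAB))

def target_probs(prefix):
    # Stand-in for the large target model's next-token distribution.
    return rng.dirichlet(np.ones(VOCAB))

def speculative_step(prefix, k=4):
    """Propose up to k draft tokens, accepting or rejecting each one so the
    final samples exactly match the target model's distribution."""
    out = list(prefix)
    for _ in range(k):
        q = draft_probs(out)
        x = int(rng.choice(VOCAB, p=q))           # cheap draft proposal
        p = target_probs(out)                     # target's view of the same step
        if rng.random() < min(1.0, p[x] / q[x]):  # accept with prob min(1, p/q)
            out.append(x)
        else:
            residual = np.maximum(p - q, 0.0)     # resample from the leftover mass
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            break                                 # stop at the first rejection
    return out

print(speculative_step([0]))
```

When drafts are usually accepted, several tokens emerge per target-model pass, which is the latency win the blurb above refers to.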