Guangxuan Xiao 肖光烜

I am a Member of Technical Staff at Thinking Machines Lab, where I work on pre-training science.

I received my Ph.D. from MIT EECS, advised by Prof. Song Han. My research focuses on efficient algorithms and systems for large foundation models.

Previously, I graduated from Tsinghua University with a B.Eng. in Computer Science and a B.Econ. in Finance (with honors), and spent 2020–2021 as a visiting researcher at Stanford University.


Blog

Linear: O(d)
Softmax: O(e^d)
The Memory Capacity of Attention

How much information can attention mechanisms store? Using relative error analysis, we show that the capacity of linear attention scales linearly with head dimension, while that of softmax attention scales exponentially.
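A minimal sketch of the linear-attention half of this claim, treating the recurrent state as an associative memory built from outer products (the head dimension and pair counts below are illustrative, not taken from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (illustrative)

def linear_attn_recall_error(n_pairs):
    # Store n_pairs (key, value) associations in a single d x d state
    # via outer products: M = sum_i v_i k_i^T, as in linear attention.
    K = rng.standard_normal((n_pairs, d)) / np.sqrt(d)
    V = rng.standard_normal((n_pairs, d))
    M = V.T @ K  # d x d recurrent state
    # Retrieve the value for the first stored key and report relative error.
    v_hat = M @ K[0] / (K[0] @ K[0])
    return np.linalg.norm(v_hat - V[0]) / np.linalg.norm(V[0])

# With random keys, recall stays accurate while n_pairs << d and
# degrades once the number of stored pairs exceeds the O(d) capacity.
err_few = linear_attn_recall_error(4)
err_many = linear_attn_recall_error(512)
```

The cross-talk terms v_j (k_j · k_0) that corrupt retrieval are exactly why the capacity of this state is tied to d: only about d random keys can be near-orthogonal in d dimensions.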

D_eff = W · |ln(ε)| / |ln(1−α)|
Why Stacking Sliding Windows Can't See Very Far

A mathematical explanation of why sliding window attention's effective receptive field is O(W) rather than the theoretical O(L·W): regardless of depth L, information dilution and exponential signal decay through residual connections cap how far information can propagate.
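The formula on the card can be evaluated directly; a small sketch (the values of W, α, and ε below are illustrative, and the symbol names follow the formula as written):

```python
import math

def effective_receptive_field(W, alpha, eps):
    """D_eff = W * |ln(eps)| / |ln(1 - alpha)|.

    W: sliding window size; alpha: per-layer mixing rate into the
    residual stream; eps: threshold below which a signal counts as lost.
    """
    return W * abs(math.log(eps)) / abs(math.log(1 - alpha))

# Doubling the window doubles D_eff, while depth L never enters the
# formula: the reach is O(W), not O(L * W).
d1 = effective_receptive_field(W=1024, alpha=0.5, eps=1e-3)
d2 = effective_receptive_field(W=2048, alpha=0.5, eps=1e-3)
```

Note that D_eff is linear in W and only logarithmic in the tolerance ε, so shrinking ε buys very little extra reach compared with widening the window.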

SNR = Δμ · √(d / (2B))
Statistics behind Block Sparse Attention

A statistical model showing how block sparse attention stays both efficient and accurate by exploiting the learned similarity gap between relevant and irrelevant blocks.
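Reading the card's formula with d/(2B) as the grouping, a quick sketch of how the SNR moves with block size (the values of Δμ, d, and B below are illustrative):

```python
import math

def block_score_snr(delta_mu, d, B):
    # SNR = delta_mu * sqrt(d / (2B)): the learned similarity gap
    # delta_mu between relevant and irrelevant blocks, measured against
    # the noise of block-averaged scores at head dimension d.
    return delta_mu * math.sqrt(d / (2 * B))

# Under this formula, SNR falls as block size B grows and rises with
# head dimension d, framing the efficiency/accuracy trade-off.
snr_small_block = block_score_snr(delta_mu=0.5, d=128, B=16)
snr_large_block = block_score_snr(delta_mu=0.5, d=128, B=64)
```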

softmax([sink, a₁, ..., aₜ])
How Attention Sinks Keep Language Models Stable

We discovered that attention sinks, where models park unused attention on initial tokens, are crucial for language model stability. Without them, models fail catastrophically when processing long conversations; with attention sinks, they maintain stable performance across millions of tokens.
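A toy illustration of the softmax on the card: because softmax weights must sum to one, a head with nothing useful to attend to is still forced to spread its full budget over content tokens, whereas a sink slot can absorb that surplus (all scores below are made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Content tokens this head has no use for: uniformly low scores.
content_scores = np.array([-4.0, -4.1, -3.9, -4.0])

# Without a sink, the attention budget is still spent on them.
no_sink = softmax(content_scores)

# With a sink token prepended (score 0 here, purely illustrative),
# the head can park almost all of its attention mass on the sink.
with_sink = softmax(np.concatenate([[0.0], content_scores]))
```

In the no-sink case every content token receives roughly a quarter of the attention despite its low score; with the sink, the content tokens together receive only a few percent.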

Selected Research

FlashMoBA
Optimizing Mixture of Block Attention
arXiv 2025
[paper] [code]
StreamingVLM
StreamingVLM: Real-Time Understanding for Infinite Video Streams
ICLR 2026
[paper] [code]
XAttention
XAttention: Block Sparse Attention with Antidiagonal Scoring
ICML 2025
[paper] [code]
DuoAttention
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
ICLR 2025
[paper] [code] [demo]
StreamingLLM
Efficient Streaming Language Models with Attention Sinks
ICLR 2024
[paper] [code] [MIT News] [NVIDIA TensorRT-LLM] [on iPhone]
SmoothQuant
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
ICML 2023
[paper] [code] [NVIDIA TensorRT-LLM]
FastComposer
FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
IJCV 2024
[website] [paper] [code]

Education

Massachusetts Institute of Technology

2022 – 2025
Ph.D. in Computer Science
S.M. in Computer Science
Thesis: Efficient Algorithms and Systems for Large Language Models

Tsinghua University

2018 – 2022
B.Eng. in Computer Science
B.Econ. in Finance (Second Major)

Work Experience

Thinking Machines Lab

2025 – Present
Member of Technical Staff
San Francisco, CA
Pre-training.

NVIDIA

2024 – 2025
Research Intern
Santa Clara, CA
with Song Han · Researched efficient large language models.

Meta

2023
Research Scientist Intern
Menlo Park, CA
with Mike Lewis · Developed efficient streaming language models.