Last summer I stared at the scGPT benchmarks for a long time.
100 thousand cells, a 12-layer Transformer, a respectable parameter count — cell-type classification at 0.95 accuracy, zero-shot annotation working. Every metric was screaming this is year zero of single-cell foundation models. But there was one voice in my head I couldn’t talk myself out of —
You captured the cell’s now. But the story of a tumor is about the future.
A question that woke me up at night
The tumor immune microenvironment (TIME) is the deciding factor in immunotherapy. Give the same PD-1 inhibitor to two different patients, and one tumor gets surrounded and demolished by immune cells in a month, while the other actually progresses faster. Why? Because TIME is not a snapshot. It is a running, second-by-second dynamical system in which a dozen cell types are constantly negotiating with each other.
The scGPT family does something brilliant — it compresses the cell as it is right now into a high-dimensional vector. But what a clinician actually wants to know is:
What does this patient’s immune landscape look like next month? After immune checkpoint blockade (ICB)? After switching therapies?
That’s a world model problem, not a representation-learning problem.
Stealing an old idea from deep learning
I didn’t invent the term “world model.” Yann LeCun, David Ha, the DeepMind crowd — they’ve been playing with it since 2018. Give an agent an internalized simulator of its environment and let it roll out actions in its head. Genie, SIMA, Sora — they all walk this road.
But they were doing robotics and video. No one had seriously tried to drop the framing onto cell populations.
The bet I made with my collaborators is exactly this: TIME dynamics can be learned autoregressively by a trajectory-aware Transformer. Given the cell-state distribution at time t, predict t+1. Get the prediction good enough, and a single model can do four things that no one has yet done together:
- Trajectory reconstruction: turn sparsely sampled real-world data into continuous trajectories
- Cross-cancer transfer: do the immune dynamics learned on lung cancer tell us how melanoma will respond?
- Counterfactual reasoning: if we give anti-CTLA-4 instead of anti-PD-1, which branch of the TIME tree does this patient walk down?
- Clinical decision support: ICB response prediction across 14 independent cohorts, AUC 0.891
Each one of these has been done individually. Doing all four at once, on the same backbone, with the same loss — nobody.
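To make "given the distribution at t, predict t+1" concrete, here is a deliberately tiny sketch. It is not the ImmuneWorld architecture: a fixed row-stochastic matrix stands in for the trajectory-aware Transformer, and the three cell states, the two treatment arms, and every number are hypothetical. What it does show is the autoregressive loop, where each prediction is fed back in as the next input, and how a counterfactual is simply a second rollout from the same starting point under different dynamics.

```python
import numpy as np

def step(p_t, transition):
    """One step: map the composition at time t to a predicted composition at t+1."""
    p_next = p_t @ transition          # row-stochastic matrix keeps total mass at 1
    return p_next / p_next.sum()       # renormalize to guard against rounding drift

def rollout(p0, transition, n_steps):
    """Autoregressive rollout: each prediction becomes the next input."""
    traj = [np.asarray(p0, dtype=float)]
    for _ in range(n_steps):
        traj.append(step(traj[-1], transition))
    return np.stack(traj)              # shape: (n_steps + 1, n_cell_types)

# Hypothetical three-state toy: fractions of [effector, exhausted, suppressive] cells.
p0 = np.array([0.5, 0.3, 0.2])

# Hypothetical transition matrices for two made-up treatment arms (rows sum to 1).
T_ARM_A = np.array([[0.90, 0.05, 0.05],
                    [0.20, 0.70, 0.10],
                    [0.05, 0.05, 0.90]])
T_ARM_B = np.array([[0.70, 0.25, 0.05],
                    [0.05, 0.85, 0.10],
                    [0.05, 0.05, 0.90]])

traj_a = rollout(p0, T_ARM_A, n_steps=4)   # one branch
traj_b = rollout(p0, T_ARM_B, n_steps=4)   # counterfactual branch, same start
```

In a real model the transition would be learned and conditioned on the patient and the therapy; the loop around it would look much the same.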
Why this is hard
The hardest technical problem is not model size. It’s heterogeneity.
When you train an LLM, every token is a token. When you train on ImageNet, every image is RGB. But TIME data looks like this:
A tumor slice from one patient might have 18 cell subtypes; the next patient has 23, and 5 of them are subtypes the first patient didn’t have. Different labs annotate using different naming conventions. The same cell that’s labeled “exhausted CD8+” today might become “TCF1+ progenitor exhausted” next week as the technology updates.
We spent three months aligning these naming conventions and rewriting a unified cell ontology. This is not sexy work. It will not become a main figure in the NC paper. But without it, everything else we trained on top would be a castle in the air.
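A toy sketch of what label alignment looks like in spirit. The synonym table, the canonical names, and the helper functions below are invented for illustration and are not taken from the real ontology or pipeline. The idea is to normalize each lab-specific label, look it up against known spellings, and flag anything unmapped for a human to review rather than guessing.

```python
import re

# Hypothetical synonym table: canonical term -> lab-specific spellings.
SYNONYMS = {
    "CD8 T cell, exhausted": ["exhausted CD8+", "CD8+ Tex", "CD8_exhausted"],
    "CD8 T cell, progenitor exhausted": ["TCF1+ progenitor exhausted", "CD8+ Tpex"],
}

def normalize(label):
    """Lowercase and collapse punctuation/whitespace so spelling variants collide."""
    return re.sub(r"[^a-z0-9+]+", " ", label.lower()).strip()

# Reverse index: normalized spelling -> canonical term.
LOOKUP = {normalize(s): canon for canon, syns in SYNONYMS.items() for s in syns}

def harmonize(labels):
    """Return (canonical labels, with None for unknowns; list of unmapped raw labels)."""
    mapped, unmapped = [], []
    for raw in labels:
        canon = LOOKUP.get(normalize(raw))
        if canon is None:
            unmapped.append(raw)       # leave for manual curation
        mapped.append(canon)
    return mapped, unmapped

mapped, unmapped = harmonize(
    ["Exhausted_CD8+", "TCF1+ Progenitor-Exhausted", "mystery cluster 7"]
)
```

Exact string matching like this only covers the easy cases. Labels whose meaning shifts as the technology updates still need a person to decide where they belong.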
What I mostly worked on was that ontology layer and the training-data pipeline. I’m not the first author on this NC submission. I’m fourth, with three co-first authors carrying the load above me. And I’m at peace with that. Because I know better than most that the 12.4M-cell pretraining only ran because somebody did the upstream, ugly, no-glory data-alignment work, not because somebody wrote a clever attention layer.
What I’m actually betting on
I’m not betting on whether ImmuneWorld gets accepted. It’s already in major revision at NC — that’s no longer in my hands.
I’m betting on the path: the next decade of biomedical AI will not be defined by bigger LLMs. It will be defined by the people who finally stitch dynamical-systems modeling and single-cell data together. Once that framing holds, the same backbone can run from single-cell flow → spatial transcriptomics → multi-omics → clinical decision support.
I made a bet: spend the scarcest resource a third-year undergraduate has — time — on building an early instance of that framing.
What’s next
ImmuneWorld is just the opening. I’m working on RST2G (a spatiotemporal Transformer for DCE-MRI breast cancer segmentation, already published in CBS) with Tsinghua SIGS; on a multimodal TCM Four-Diagnosis agent with Longhua Hospital, Shanghai University of TCM (CBS, first author, in preparation); and on a bone tumor epidemiology review for CA: A Cancer Journal for Clinicians with NUS Yong Loo Lin School of Medicine (co-first author, listed later in the order; submission planned for summer 2026). These projects look unrelated. Underneath, they are the same question:
How do you feed high-dimensional, multimodal, time-structured clinical data to a model that can actually make decisions?
I don’t know if I’m betting right. But I do know that putting all my chips on a research taste I believe in makes me more excited to wake up than any safe grad-school checklist ever did.
——
This is my ImmuneWorld working note. If you’re also doing single-cell foundation models / world models / clinical AI, my inbox is open.