Why I'm Building a World Simulator for the Tumor Immune Microenvironment

Last summer I stared at the scGPT benchmarks for a long time.

100 thousand cells, a 12-layer Transformer, a respectable parameter count: cell-type classification at 0.95 accuracy, zero-shot annotation working. Every metric was screaming that this is year zero of single-cell foundation models. But there was one voice in my head I couldn’t talk myself out of:

You captured the cell’s now. But the story of a tumor is about the future.

A question that woke me up at night

The tumor immune microenvironment (TIME) is the deciding factor in immunotherapy. Give the same PD-1 inhibitor to two different patients, and one tumor gets surrounded and demolished by immune cells in a month, while the other actually progresses faster. Why? Because TIME is not a snapshot. It is a running, second-by-second dynamical system in which a dozen cell types are constantly negotiating with each other.

The scGPT family does something brilliant — it compresses the cell as it is right now into a high-dimensional vector. But what a clinician actually wants to know is:

What does this patient’s immune landscape look like next month? After ICB? After switching therapies?

That’s a world model problem, not a representation-learning problem.

Stealing an old idea from deep learning

I didn’t invent the term “world model.” Yann LeCun, David Ha, the DeepMind crowd — they’ve been playing with it since 2018. Give an agent an internalized simulator of its environment and let it roll out actions in its head. Genie, SIMA, Sora — they all walk this road.

But they were doing robotics and video. No one had seriously tried to drop the framing onto cell populations.

The bet I made with my collaborators is exactly this: TIME dynamics can be learned autoregressively by a trajectory-aware Transformer. Given the cell-state distribution at time t, predict the distribution at t+1. Get the prediction good enough, and a single model can do four things:

Trajectory reconstruction — turn sparsely-sampled real-world data into continuous trajectories
Cross-cancer transfer — does the immune dynamics learned on lung cancer tell us how melanoma will respond?
Counterfactual reasoning — if we don’t give anti-PD1, but give anti-CTLA4 instead, which branch of the TIME tree does this patient walk down?
Clinical decision support — ICB response prediction across 14 independent cohorts, AUC 0.891

Each one of these has been done individually. As far as I know, nobody has done all four at once, on the same backbone, with the same loss.
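
To make the "given t, predict t+1" framing concrete, here is a toy sketch of an autoregressive rollout and a counterfactual comparison. It is not the ImmuneWorld architecture: the trajectory-aware Transformer is replaced by a plain least-squares linear step model, the three "cell types" are just fractions that sum to 1, and the matrices `P_A` and `P_B` (dynamics under two hypothetical therapies) are made-up numbers chosen only for illustration.

```python
import numpy as np

def simulate(P, start, n_steps):
    """Generate a toy trajectory of cell-type fractions under transition matrix P."""
    states = [np.asarray(start, dtype=float)]
    for _ in range(n_steps):
        states.append(states[-1] @ P)
    return np.stack(states)

def fit_step_model(states):
    """Least-squares fit of W so that states[t] @ W approximates states[t+1]."""
    X, Y = states[:-1], states[1:]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def rollout(W, start, n_steps):
    """Autoregressive rollout: each prediction is fed back in as the next input."""
    state = np.asarray(start, dtype=float)
    out = [state]
    for _ in range(n_steps):
        state = np.clip(state @ W, 0.0, None)  # fractions cannot be negative
        state = state / state.sum()            # keep the composition normalized
        out.append(state)
    return np.stack(out)

# Made-up "ground truth" dynamics under two hypothetical therapies (rows sum to 1).
P_A = np.array([[0.90, 0.08, 0.02],
                [0.05, 0.85, 0.10],
                [0.02, 0.08, 0.90]])
P_B = np.array([[0.70, 0.25, 0.05],
                [0.10, 0.60, 0.30],
                [0.05, 0.15, 0.80]])

# Learn one step model per therapy from an observed toy trajectory.
observed_start = np.array([0.7, 0.2, 0.1])
W_a = fit_step_model(simulate(P_A, observed_start, 20))
W_b = fit_step_model(simulate(P_B, observed_start, 20))

# Counterfactual question: same starting composition, two different branches.
new_patient = np.array([0.5, 0.3, 0.2])
branch_a = rollout(W_a, new_patient, 5)
branch_b = rollout(W_b, new_patient, 5)
print("after 5 steps under therapy A:", branch_a[-1].round(3))
print("after 5 steps under therapy B:", branch_b[-1].round(3))
```

The only part that carries over to the real problem is the loop in `rollout`: the model's own output becomes its next input, so trajectory reconstruction and counterfactual branching are both just rollouts from different starting points or under different conditions.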

Why this is hard

The hardest technical problem is not model size. It’s heterogeneity.

When you train an LLM, every token is a token. When you train on ImageNet, every image is RGB. But TIME data looks like this:

A tumor slice from one patient might have 18 cell subtypes; the next patient has 23, and 5 of them are subtypes the first patient didn’t have. Different labs annotate using different naming conventions. The same cell that’s labeled “exhausted CD8+” today might become “TCF1+ progenitor exhausted” next week as the technology updates.

We spent three months aligning these naming conventions and rewriting a unified cell ontology. This is not sexy work. It will not become a main figure in the NC paper. But without it, everything else we trained on top would be a castle in the air.
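
For a sense of what that alignment work looks like in its simplest form, here is a minimal sketch. Everything in it is hypothetical: the synonym table, the unified term names, and the choice to keep "exhausted" and "progenitor exhausted" as separate terms are illustrative assumptions, not the ontology we actually built. The idea is only that every lab-specific label gets normalized and mapped to one shared vocabulary, so each patient becomes a vector of the same length, and anything unmapped is surfaced for manual review instead of being silently dropped.

```python
from collections import Counter

# Hypothetical synonym table: normalized lab-specific label -> unified term.
UNIFIED = {
    "exhausted cd8+": "CD8_T_exhausted",
    "cd8 tex": "CD8_T_exhausted",
    "tcf1+ progenitor exhausted": "CD8_T_progenitor_exhausted",
    "treg": "CD4_T_regulatory",
    "regulatory t cell": "CD4_T_regulatory",
}

# Fixed, sorted vocabulary so every patient vector has the same layout.
VOCAB = sorted(set(UNIFIED.values()))

def normalize(label):
    """Lowercase, turn underscores into spaces, and collapse repeated whitespace."""
    return " ".join(label.lower().replace("_", " ").split())

def harmonize(labels):
    """Return (count vector over VOCAB, list of labels with no unified term)."""
    counts = Counter()
    unmapped = []
    for raw in labels:
        term = UNIFIED.get(normalize(raw))
        if term is None:
            unmapped.append(raw)  # flag for manual curation
        else:
            counts[term] += 1
    return [counts[term] for term in VOCAB], unmapped

# Two toy patients annotated by labs with different naming conventions.
patient_a = ["Exhausted CD8+", "Treg", "CD8_Tex"]
patient_b = ["TCF1+ progenitor exhausted", "Regulatory T cell", "Mystery myeloid"]

vec_a, missing_a = harmonize(patient_a)
vec_b, missing_b = harmonize(patient_b)
print(VOCAB)
print(vec_a, missing_a)
print(vec_b, missing_b)
```

A lookup table like this is the easy part. The three months went into deciding what the entries should be.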

What I mostly worked on was that ontology layer and the training-data pipeline. I’m not the first author on this NC submission; I’m fourth, behind three co-first authors who carried the heaviest load. And I’m at peace with that, because I know better than most that the 12.4M-cell pretraining only ran because somebody did the upstream, ugly, no-glory data-alignment work, not because somebody wrote a clever attention layer.

What I’m actually betting on

I’m not betting on whether ImmuneWorld gets accepted. It’s already in major revision at NC — that’s no longer in my hands.

I’m betting on the path: the next decade of biomedical AI will not be defined by bigger LLMs. It will be defined by the people who finally stitch dynamical-systems modeling and single-cell data together. Once that framing holds, the same backbone can run from single-cell flow → spatial transcriptomics → multi-omics → clinical decision support.

I made a bet: spend the scarcest resource a third-year undergraduate has — time — on building an early instance of that framing.

What’s next

ImmuneWorld is just the opening. I’m also working on RST2G (a spatiotemporal Transformer for DCE-MRI breast cancer segmentation, already published in CBS) with Tsinghua SIGS; on a multimodal TCM Four-Diagnosis agent with Longhua Hospital, Shanghai University of TCM (CBS, first author, in preparation); and on a bone tumor epidemiology review for CA: A Cancer Journal for Clinicians with NUS Yong Loo Lin School of Medicine (co-first author, listed later in the author order; submission planned for summer 2026). These projects look unrelated. Underneath, they are the same question:

How do you feed high-dimensional, multimodal, time-structured clinical data to a model that can actually make decisions?

I don’t know if I’m betting right. But I do know that putting all my chips on a research taste I believe in makes me more excited to wake up than any safe grad-school checklist ever did.

——

This is my ImmuneWorld working note. If you’re also doing single-cell foundation models / world models / clinical AI, my inbox is open.