The proactive model is a world model, not a VLM

Blog 03 sketched a proactive VLM: predict the scene's trajectories, flag the low-likelihood ones, and notify a human. It works as far as it goes. But there's a quieter assumption baked in that I want to challenge:

Why is "notify a human" the only action?

A VLM, no matter how proactive, can only ever speak — flag that a worrying future is coming. The loop only closes when something also acts to prevent it. And once you accept that the model needs to act, it is no longer a VLM — it's a world model with a policy, and the language head becomes a side channel, not the headline.

What a world model gets you that a VLM doesn't

Blog 03's system already predicts the future — that's the whole trajectory story. A world model adds the part that actually changes outcomes: predicting what is about to happen under different action choices, so you can pick one:

"If no one intervenes in the next 8 s, the baby reaches the knife."
"If the autonomous car holds its current trajectory, it cuts off the pedestrian in the crosswalk in 2.4 s."
"If the ICU patient's HR continues this trend, they desaturate in ~90 s."

Blog 03's notifier can flag "this is heading somewhere bad." A world model goes one step further: "it will be bad in 90 s and here are the three actions whose rollouts avoid it." That's qualitatively different. Proactivity in the strong sense isn't noticing early; it's acting early enough that the bad state never arrives.

Which means: the headline output of a proactive model isn't a natural-language alert. It's an action distribution conditioned on rolled-out futures.
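
To make "an action distribution conditioned on rolled-out futures" concrete, here is a minimal Python sketch built on the baby-and-knife example. Everything in it is an assumption for illustration: the 1-D toy dynamics in `rollout`, the action names, the effort costs, the harm cost, and the softmax temperature are hypothetical stand-ins for a learned world model and a learned policy.

```python
import math

# Hypothetical action set and per-action effort costs (illustrative values only).
ACTIONS = ("wait", "speak_warning", "block_path", "remove_hazard")
EFFORT = {"wait": 0.0, "speak_warning": 0.1, "block_path": 0.5, "remove_hazard": 1.0}
HARM_COST = 10.0  # assumed cost of the child reaching the hazard


def rollout(distance_m, speed_mps, horizon_s, action):
    """Toy rollout: seconds until contact under `action`, or None if no contact within the horizon."""
    if action == "remove_hazard":
        return None                    # nothing left to reach
    if action == "block_path":
        speed_mps = 0.0                # assumed: approach stops entirely
    elif action == "speak_warning":
        speed_mps *= 0.5               # assumed: a warned caregiver slows the approach
    if speed_mps <= 0.0:
        return None
    t_contact = distance_m / speed_mps
    return t_contact if t_contact <= horizon_s else None


def action_distribution(distance_m, speed_mps, horizon_s, temperature=1.0):
    """Score every action by its rolled-out cost, then softmax over negative cost."""
    costs = {}
    for action in ACTIONS:
        t_contact = rollout(distance_m, speed_mps, horizon_s, action)
        harm = 0.0 if t_contact is None else HARM_COST
        costs[action] = harm + EFFORT[action]
    best = min(costs.values())         # subtract the minimum for numerical stability
    weights = {a: math.exp(-(c - best) / temperature) for a, c in costs.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}, costs


if __name__ == "__main__":
    dist, costs = action_distribution(distance_m=2.0, speed_mps=0.25, horizon_s=10.0)
    for action in ACTIONS:
        print(f"{action:14s} cost={costs[action]:5.1f}  p={dist[action]:.4f}")
```

With the child 2 m from the knife and closing at 0.25 m/s (the 8 s case above), `wait` carries the full harm cost, so almost all of the probability mass moves to the interventions whose rollouts avoid contact.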

Speech is one action; action is the full set

In a world-model-as-agent framing, the model has a learned policy π(a | z) over latent state z. The action a is whatever the system can do. Sometimes that's:

a motor command (Dream-to-Fly, DayDreamer — Wu 2022)
a vehicle trajectory (GAIA-1 — Hu 2023, Vista — Gao 2024)
an MCTS move (MuZero — Schrittwieser 2019)
a tool invocation
a natural-language utterance ("knife on the floor — please move it")
all of the above

Speech becomes one channel of action rather than the only one. For a surveillance camera in a public space, speech might be the only physically available action, and that's fine. For a home robot, an operating-room (OR) assistant, or an autonomous vehicle, the action set is much wider and speech is the narrow special case.

The world model doesn't care which channel; it picks whichever one minimizes predicted future cost / regret. The VLM head becomes the natural-language codec for the speech channel, not the seat of proactivity.
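
One way to write the channel-agnostic choice down, as a sketch rather than any particular system's objective. The symbols are assumptions introduced here: $\mathcal{A}$ is the union of all action channels, $c$ is a per-step cost on latent states, $H$ is the planning horizon, and $p(\cdot \mid z_0, a)$ is the world model's rollout distribution from the current latent state $z_0$.

```latex
a^{*} = \arg\min_{a \in \mathcal{A}} \;
        \mathbb{E}_{z_{1:H} \sim p(\cdot \mid z_0, a)}
        \left[ \sum_{t=1}^{H} c(z_t) \right],
\qquad
\mathcal{A} = \mathcal{A}_{\text{motor}} \cup \mathcal{A}_{\text{tool}} \cup \mathcal{A}_{\text{speech}}
```

Under this reading, speech is simply the case where the minimizer lands in $\mathcal{A}_{\text{speech}}$, and a learned policy π(a | z) can be seen as amortizing the argmin instead of solving it at every step.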

The proactive loop is the world-model loop, full stop

The textbook world-model RL loop already is the proactive loop:

[Figure: the textbook world-model loop. Proactivity falls out of it automatically.]

Dreamer, MuZero, GAIA-1-as-simulator, Cosmos-as-substrate, and V-JEPA 2-AC all instantiate this loop with different solver choices (CEM, MPPI, MCTS, or a learned actor). Proactivity is not an additional capability you bolt on; it is what the loop is.

Blog 03's "outlier gate → notifier" is a degenerate version of this loop: the only available action is "notify a human," and it rolls out just the one future it's drifting toward instead of the futures under each possible intervention. Expand the action set, evaluate those interventions, and you get back the full world-model agent.
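
The loop, and the way blog 03's notifier falls out of it as a special case, can be sketched in a few lines of Python. This is a toy under loud assumptions: `model_step` is a hand-written 1-D stand-in for a learned dynamics model, the real environment is assumed to match it exactly, the constants are hypothetical, and brute-force enumeration stands in for the solvers named above (CEM, MPPI, MCTS, or a learned actor).

```python
from itertools import product

DRIFT = 0.5            # hypothetical: the state drifts toward the hazard each step
HAZARD = 3.0           # hypothetical: state >= HAZARD is the bad state
ACTIONS = (0.0, 1.0)   # 0.0 = do nothing, 1.0 = intervene (pushes the state back)


def model_step(x, u):
    """Stand-in for a learned dynamics model: predict the next state."""
    return x + DRIFT - u


def imagined_cost(x, actions):
    """Cost of one imagined future: small effort per intervention, large penalty per hazard step."""
    cost = 0.0
    for u in actions:
        x = model_step(x, u)
        cost += 0.1 * u
        if x >= HAZARD:
            cost += 100.0
    return cost


def plan(x, horizon=6):
    """Enumerate every action sequence and return the cheapest imagined one."""
    return min(product(ACTIONS, repeat=horizon), key=lambda seq: imagined_cost(x, seq))


def run_loop(x0, steps=8, horizon=6):
    """Observe, imagine, act on the first planned step, then replan (receding horizon)."""
    x, trajectory = x0, [x0]
    for _ in range(steps):
        u = plan(x, horizon)[0]
        x = model_step(x, u)   # toy assumption: the world behaves exactly like the model
        trajectory.append(x)
    return trajectory


def notifier_only(x0, horizon=6):
    """Degenerate case: roll out only the no-intervention future and flag it if it turns bad."""
    return imagined_cost(x0, (0.0,) * horizon) >= 100.0


if __name__ == "__main__":
    print("notifier flags:", notifier_only(1.0))
    print("agent trajectory:", run_loop(1.0))
```

Starting from `x0 = 1.0`, `notifier_only` flags the no-intervention future, while `run_loop` intervenes just often enough that the state never reaches `HAZARD`. The only difference between the two is the size of the action set and whether the rollouts are compared.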

Where the VLM actually fits

The VLM doesn't disappear — it has a real role, just not the central one:

As a notifier / interface layer: the world model's internal latent state isn't human-readable. A VLM head converts state + planned-action into natural language for any human in the loop.
As a comprehension layer for human instructions: humans speak. The VLM translates "watch the kid, don't let her near the stove" into a constraint the world model can integrate into its cost function.
As a source of prior knowledge: the world model doesn't know what a "knife" is — its latent encodes appearance and dynamics. The VLM grounds latent regions to language concepts (so the notifier can produce the word "knife" rather than "latent dimension 374").
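
The comprehension role is the easiest one to make concrete. Below is a minimal sketch of how an instruction like "watch the kid, don't let her near the stove" might end up inside a cost function. The `KeepAwayConstraint` structure, its field names, and the numeric values are all hypothetical; the VLM's parsing step is not shown and is replaced by a hand-written constraint.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class KeepAwayConstraint:
    """Hypothetical structured form a VLM might emit for a spoken instruction."""
    subject: str           # entity to watch, e.g. "child"
    hazard: str            # entity to keep the subject away from, e.g. "stove"
    min_distance_m: float  # assumed safety radius
    weight: float          # assumed penalty per metre of violation


def constraint_cost(constraint, positions):
    """Penalty that grows linearly once the subject is inside the safety radius."""
    sx, sy = positions[constraint.subject]
    hx, hy = positions[constraint.hazard]
    distance = math.hypot(sx - hx, sy - hy)
    violation = max(0.0, constraint.min_distance_m - distance)
    return constraint.weight * violation


def total_cost(base_cost, constraints, positions):
    """Task cost plus every language-derived constraint, evaluated on a predicted state."""
    return base_cost + sum(constraint_cost(c, positions) for c in constraints)


# Hand-written stand-in for the VLM's output.
STOVE_RULE = KeepAwayConstraint(subject="child", hazard="stove", min_distance_m=1.5, weight=10.0)

if __name__ == "__main__":
    far = {"child": (0.0, 0.0), "stove": (3.0, 4.0)}
    near = {"child": (3.0, 3.0), "stove": (3.0, 4.0)}
    print(total_cost(0.0, [STOVE_RULE], far))
    print(total_cost(0.0, [STOVE_RULE], near))
```

The design point is that the language model never has to plan. It only has to emit something the planner can score, and the world model evaluates that score on every imagined future.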

The split is clean:

World model = engine (predicts, plans, acts).
VLM = interface (translates between humans and the engine).

Calling the VLM "the proactive system" is like calling the steering wheel the car.

Honest fork: passive vs active proactive

There's a real subdivision that I undersold in blog 03:

Class	What it can do	Typical domains	Architecture
Passive proactive	Observe, predict, alert. Cannot intervene physically.	Surveillance, public-space safety, medical monitoring outside of automated care.	World model + VLM notifier; speech is the only action channel.
Active proactive	Observe, predict, act. Closes its own loop.	Robots, vehicles, automated medical devices, home assistants with effectors.	World model + actor/planner. VLM is one channel among many.

Passive-proactive is a real, large product surface, and the predict-and-notify story in blog 03 is the right architecture for it. But the interesting frontier — the one where "proactive" stops meaning "fast alert" and starts meaning "the bad state never happened" — lives in the active-proactive column. And it's run by world models, not VLMs.

Why this matters for who you bet on

This reframing predicts where the next decade's leverage is:

World Labs, Wayve, Toyota Research, Physical Intelligence, Skild, and Figure are building world models. They will own active-proactive applications: home robots, AVs, factories, OR robots. Their VLMs are downstream tooling.
OpenAI, Anthropic, Google, and xAI are building VLMs / multimodal LLMs. Their natural surface is passive-proactive: monitoring, alerting, knowledge work. To enter active-proactive they have to acquire or build world-model stacks, which is why efforts such as Cosmos (NVIDIA), Genie 2 (Google DeepMind), and V-JEPA 2 (Meta) are appearing at large labs.

The takeaway

Blog 03: proactive monitoring = predict the scene's trajectories, flag the outliers, notify a human.
Blog 04 (this one): the notifier is a side channel; the world model with an action policy is the actual proactive system. VLMs sit on top of world models, not the other way around.

Stated plainly: the substrate of any system that anticipates the future and acts on the anticipation is a world model; everything else (notifiers, LLMs, VLMs) is interface.