Blog 03 sketched a proactive VLM: surprise signal + verbalizer + scene-conditional priors. It works as far as it goes. But there's a quieter assumption baked in that I want to challenge:

Why is "speak" the action?

A VLM, no matter how proactive, can only ever say that something happened. The loop closes only when some part of the system also does something about it. And once you accept that the model needs to act, it is no longer a VLM: it is a world model with a policy, and the language head becomes a side channel, not the headline.

What a world model gets you that a VLM doesn't

The defining capability of a world model is predicting future state. Not just "what is happening" (a VLM does that) but what is about to happen under different action choices:

  • "If no one intervenes in the next 8 s, the toddler reaches the knife."
  • "If the autonomous car holds its current trajectory, it cuts off the pedestrian in the crosswalk in 2.4 s."
  • "If the ICU patient's HR continues this trend, they desaturate in ~90 s."

A VLM with a surprise trigger can flag "something is off" right now. A world model can flag "something will be off in 90 s and here are the three actions whose rollouts avoid it." That's qualitatively different. Proactivity in the strong sense isn't noticing early; it's acting early enough that the bad state never arrives.

Which means: the headline output of a proactive model isn't a natural-language alert. It's an action distribution conditioned on rolled-out futures.
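One way to make "an action distribution conditioned on rolled-out futures" concrete is a softmax over negative rollout costs. The sketch below is a toy built on the toddler example, not any particular system's method: the 1-D state (distance to the knife), the three candidate actions, their approach speeds, `rollout_cost`, and the temperature are all assumptions invented for illustration.

```python
import math

# Hypothetical toy world: the state is the toddler's distance to the knife (metres).
# Each candidate action implies an approach speed in m/s; the numbers are made up.
APPROACH_SPEED = {"wait": 0.25, "speak_alert": 0.10, "move_knife": 0.0}

def rollout_cost(distance_m, action, horizon_s=8.0, dt=0.5):
    """Roll the toy dynamics forward; cost is 1.0 if contact happens within the horizon, else 0.0."""
    d = distance_m
    for _ in range(int(horizon_s / dt)):
        d -= APPROACH_SPEED[action] * dt
        if d <= 0.0:
            return 1.0
    return 0.0

def action_distribution(distance_m, temperature=0.2):
    """Softmax over negative rollout costs: lower predicted cost means higher probability."""
    costs = {a: rollout_cost(distance_m, a) for a in APPROACH_SPEED}
    weights = {a: math.exp(-c / temperature) for a, c in costs.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

near = action_distribution(distance_m=1.5)
far = action_distribution(distance_m=3.0)
```

With these made-up numbers, waiting from 1.5 m leads to contact inside the 8 s horizon, so almost all probability mass moves to the two interventions. From 3.0 m every rollout is safe and the distribution is uniform.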

Speech is one action; action is the full set

In a world-model-as-agent framing, the model has a learned policy π(a | z) over latent state z. The action a is whatever the system can do. Sometimes that's:

  • a motor command (Dream-to-Fly, DayDreamer — Wu 2022)
  • a vehicle trajectory (GAIA-1 — Hu 2023, Vista — Gao 2024)
  • an MCTS move (MuZero — Schrittwieser 2019)
  • a tool invocation
  • a natural-language utterance ("knife on the floor — please move it")
  • all of the above

Speech becomes one channel of action rather than the only one. For a surveillance camera in a public space the speech channel might be the only physically available action, and that is fine. For a home robot, an OR assistant, or an autonomous vehicle, the action set is much wider and speech is the narrow special case.

The world model doesn't care which channel; it picks whichever channel minimizes predicted future surprise / cost / regret. The VLM head becomes the natural-language codec for the speech channel, not the seat of proactivity.
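A minimal sketch of that channel-agnostic choice, assuming each candidate action already carries a predicted cost from its rollout. The `Channel` enum, the `Action` dataclass, and the cost numbers are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Channel(Enum):
    MOTOR = "motor"
    TRAJECTORY = "trajectory"
    TOOL = "tool"
    SPEECH = "speech"

@dataclass(frozen=True)
class Action:
    channel: Channel
    payload: str
    predicted_cost: float  # stand-in for the cost of this action's rollout

def select_action(candidates, available_channels):
    """Pick the lowest-predicted-cost action among the channels this deployment physically has."""
    feasible = [a for a in candidates if a.channel in available_channels]
    if not feasible:
        raise ValueError("no candidate action uses an available channel")
    return min(feasible, key=lambda a: a.predicted_cost)

candidates = [
    Action(Channel.MOTOR, "pick up knife", predicted_cost=0.05),
    Action(Channel.SPEECH, "knife on the floor, please move it", predicted_cost=0.30),
]

# Same candidates, different deployments: only the available channels change.
home_robot = select_action(candidates, {Channel.MOTOR, Channel.SPEECH})
camera = select_action(candidates, {Channel.SPEECH})
```

The selection rule is identical in both deployments. The surveillance camera ends up speaking only because speech is the one channel it has.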

The proactive loop is the world-model loop, full stop

The textbook world-model RL loop already is the proactive loop:

[Figure: the textbook world-model planning loop. Observe, encode to latent state, roll the latent dynamics forward, evaluate rollouts under cost, pick and execute the best action, and feed the consequence back to the observation step. Proactivity falls out of it automatically.]

Dreamer, MuZero, GAIA-1-as-simulator, Cosmos-as-substrate, V-JEPA 2-AC — they all instantiate this loop with different solver choices (CEM, MPPI, MCTS, learned actor). Proactivity falls out of the loop automatically. It is not an additional capability you bolt on; it is what the loop is.
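The loop described above can be sketched in a few lines. This is a toy under loud assumptions: the latent is a single position, the "solver" is exhaustive search over three constant actions (a stand-in for CEM, MPPI, MCTS, or a learned actor), and the real world is assumed to follow the model's dynamics exactly. Every name and constant here is hypothetical.

```python
HAZARD_POS = 2.0             # hypothetical hazard boundary
ACTIONS = (0.0, -0.5, -1.0)  # hypothetical action set: coast, hold position, back off
DRIFT = 0.5                  # the world drifts toward the hazard at 0.5 units/s
DT = 0.5                     # seconds per step

def encode(observation):
    """Stand-in encoder: the latent state is just the observed position."""
    return float(observation)

def dynamics(z, a):
    """Stand-in latent dynamics: constant drift plus the chosen action."""
    return z + (DRIFT + a) * DT

def step_cost(z, a):
    """Large penalty inside the hazard region, small penalty for effort."""
    return (100.0 if z >= HAZARD_POS else 0.0) + abs(a)

def plan(z, horizon=4):
    """Roll each candidate action forward over the horizon and return the cheapest."""
    def rollout_cost(a):
        zi, total = z, 0.0
        for _ in range(horizon):
            zi = dynamics(zi, a)
            total += step_cost(zi, a)
        return total
    return min(ACTIONS, key=rollout_cost)

def run(steps=12, start=0.0):
    """Closed loop: observe, encode, plan, execute, repeat. Returns the visited positions."""
    world, trace = start, [start]
    for _ in range(steps):
        z = encode(world)           # observe and encode
        a = plan(z)                 # roll out, evaluate, pick
        world = dynamics(world, a)  # execute; the result is the next observation
        trace.append(world)
    return trace

trace = run()
```

With these numbers the agent coasts from 0.0 to 1.0. At that point the coasting rollout first reaches the hazard within the 4-step horizon, so it switches to holding position and never gets closer. Nothing in the code is labeled "proactive"; the early intervention comes from re-planning against rollouts at every step.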

Blog 03's "surprise gate → verbalizer" is a degenerate version of this loop where the only available action is "say something" and the rollout horizon is essentially zero. Strip the constraints, expand the action set, push the horizon out, and you get back to the full world-model agent.
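To see the degenerate case concretely, here is a sketch that writes a threshold-style surprise gate next to a two-action planner with no rollout. Both functions are my assumptions about what such a gate looks like, not a transcription of blog 03: in particular, the threshold form and the cost assignments are chosen for illustration.

```python
def surprise_gate(surprise, threshold=1.0):
    """Hypothetical gate: speak whenever the current surprise exceeds a threshold."""
    return "speak" if surprise > threshold else "silent"

def zero_horizon_planner(surprise, threshold=1.0):
    """The same decision written as a planner with two actions and no rollout.

    The cost of staying silent is the current surprise; the cost of speaking is a
    fixed interruption cost equal to the threshold. No future state is predicted.
    """
    costs = {"silent": surprise, "speak": threshold}
    return min(costs, key=costs.get)  # ties resolve to "silent", matching the gate

samples = [0.0, 0.5, 1.0, 1.5, 3.0]
decisions = [(surprise_gate(s), zero_horizon_planner(s)) for s in samples]
```

The two functions agree on every input, which is the sense in which the gate is a planner with a two-element action set and a horizon of zero. Adding actions and replacing the current surprise with a rolled-out cost recovers the general loop.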

Where the VLM actually fits

The VLM doesn't disappear — it has a real role, just not the central one:

  1. As a verbalizer / interface layer: the world model's internal latent state isn't human-readable. A VLM head converts state + planned-action into natural language for any human in the loop.
  2. As a comprehension layer for human instructions: humans speak. The VLM translates "watch the kid, don't let her near the stove" into a constraint the world model can integrate into its cost function.
  3. As a source of prior knowledge: the world model doesn't know what a "knife" is — its latent encodes appearance and dynamics. The VLM grounds latent regions to language concepts (so the verbalizer can produce the word "knife" rather than "latent dimension 374").
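As a sketch of the second role, here is how a spoken instruction might end up as a term in the cost function. The VLM itself is replaced by a lookup-table stub, and the `KeepAway` constraint form, the distances, and the weight are all hypothetical; a real system would need a far richer constraint language.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KeepAway:
    """Hypothetical constraint form: keep `subject` at least `min_distance_m` from `hazard`."""
    subject: str
    hazard: str
    min_distance_m: float
    weight: float

def parse_instruction(text):
    """Stub standing in for the VLM; a real system would not use a lookup table."""
    table = {
        "watch the kid, don't let her near the stove":
            KeepAway(subject="kid", hazard="stove", min_distance_m=1.0, weight=50.0),
    }
    return table[text]

def base_cost(state):
    """Placeholder task cost; zero here so the constraint term is easy to see."""
    return 0.0

def constrained_cost(state, constraints):
    """Task cost plus a hinge penalty for each violated keep-away constraint."""
    total = base_cost(state)
    for c in constraints:
        distance = state["distance_m"][(c.subject, c.hazard)]
        total += c.weight * max(0.0, c.min_distance_m - distance)
    return total

constraint = parse_instruction("watch the kid, don't let her near the stove")
safe = constrained_cost({"distance_m": {("kid", "stove"): 2.5}}, [constraint])
risky = constrained_cost({"distance_m": {("kid", "stove"): 0.4}}, [constraint])
```

The planner never sees the sentence. It only sees a cost that rises as predicted states bring the child within a metre of the stove, so rollouts that end there are penalized.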

The fork is clean:

  • World model = engine (predicts, plans, acts).
  • VLM = interface (translates between humans and the engine).

Calling the VLM "the proactive system" is like calling the steering wheel the car.

Honest fork: passive vs active proactive

There's a real subdivision that I undersold in blog 03:

| Class | What it can do | Architecture |
| --- | --- | --- |
| Passive proactive | Observe, predict, alert. Cannot intervene physically. Surveillance, public-space safety, medical monitoring outside of automated care. | World model + VLM verbalizer; speech is the only action channel. |
| Active proactive | Observe, predict, act. Closes its own loop. Robots, vehicles, automated medical devices, home assistants with effectors. | World model + actor/planner. VLM is one channel among many. |

Passive-proactive is a real, large product surface, and the VLM-centric story in blog 03 is the right architecture for it. But the interesting frontier — the one where "proactive" stops meaning "fast alert" and starts meaning "the bad state never happened" — lives in the active-proactive column. And it's run by world models, not VLMs.

Why this matters for who you bet on

This reframing predicts where the next decade's leverage is:

  • World Labs, Wayve, Toyota Research, Physical Intelligence, Skild, Figure are building world models. They will own active-proactive applications: home robots, AVs, factories, OR robots. Their VLMs are downstream tooling.
  • OpenAI, Anthropic, Google, xAI are building VLMs / multimodal LLMs. Their natural surface is passive-proactive: monitoring, alerting, knowledge work. To enter active-proactive they have to acquire or build world-model stacks (which is exactly why Cosmos, Genie 2, V-JEPA 2 are happening at the big labs).

This is also why the JEPA + Neural-ODE direction from blogs 01 and 02 is load-bearing: an active-proactive system needs continuous, structured, multi-sensor latent dynamics to plan rollouts that are physically realistic — exactly what continuous-time predictive world models give you and discrete-token VLMs do not.
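A small sketch of what "continuous" buys you in practice: the latent can be queried at whatever timestamps the sensors and the planner need, not only on a fixed token grid. The vector field `f` below is a hand-written stand-in for a learned one, the integrator is classic fourth-order Runge-Kutta, and the query times are invented; none of this is the specific model from blogs 01 and 02.

```python
import math

def f(z, a, k=1.0):
    """Hand-written stand-in for a learned vector field: dz/dt = -k*z + a."""
    return -k * z + a

def rk4_step(z, a, h):
    """One classic fourth-order Runge-Kutta step of size h."""
    k1 = f(z, a)
    k2 = f(z + 0.5 * h * k1, a)
    k3 = f(z + 0.5 * h * k2, a)
    k4 = f(z + h * k3, a)
    return z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def rollout_at(z0, a, query_times, max_step=0.01):
    """Integrate from t=0 and return the latent at each (sorted, irregular) query time."""
    out, z, t = [], z0, 0.0
    for tq in query_times:
        n = max(1, math.ceil((tq - t) / max_step))  # substeps needed to reach tq
        h = (tq - t) / n
        for _ in range(n):
            z = rk4_step(z, a, h)
        t = tq
        out.append(z)
    return out

# Hypothetical irregular timestamps: a camera frame, a faster sensor tick, two planning horizons.
times = [0.033, 0.1, 0.47, 2.4]
states = rollout_at(z0=1.0, a=0.0, query_times=times)
```

With `a=0.0` this field has the closed-form solution `exp(-t)`, which makes the rollout easy to check. A learned field would have no such closed form, but the querying pattern would be the same.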

The arc through blogs 01–04

  • Blog 01: world models should evolve continuously.
  • Blog 02: that's most defensible in physical / multi-sensor / control regimes.
  • Blog 03: proactive monitoring = world model + verbalizer wired to a surprise signal.
  • Blog 04 (this one): the verbalizer is a side channel; the world model with an action policy is the actual proactive system. VLMs sit on top of world models, not the other way around.

The thesis tightens with each blog. By blog 04 the claim is no longer "continuous-time is interesting." It is that the substrate of any system that anticipates the future and acts on that anticipation is a continuous-time latent world model; everything else (verbalizers, LLMs, VLMs) is interface.