Does AI understand what it sees well enough to predict what happens next?
On MPMWorld, a benchmark of rich 2D physical simulations, we compare two approaches — vision–language code generation and video diffusion — to test whether models can infer dynamics from video and extrapolate forward in time.
To study the ability to infer physical dynamics from videos and extrapolate them forward in time, we assemble a dataset of 2D Material Point Method (MPM) physical simulations covering rich physical phenomena such as deformable objects, fluids, and emitters. We study code generation and video diffusion approaches on this dataset, identifying their strengths and weaknesses by varying the amount of physically relevant side information. The code generation model gives a working demonstration of automatic synthesis of MPM simulations. It struggles to infer physical parameters from visual input but, relative to video diffusion, produces physically plausible extrapolations forward in time. The video diffusion model identifies geometric properties from visual input more reliably, but its extrapolations are physically implausible.
We study a core capability of physical reasoning: given a short observation of a scene — a video prefix v≤t — can a model infer the underlying dynamics and extrapolate forward, predicting what happens next? Success means recovering a forward simulation that continues faithfully beyond the frames we show.
On MPMWorld, each simulation is split into an observation segment (v≤t) and a held-out extrapolation. A video diffusion model predicts extrapolation frames directly in pixel space; a vision–language model instead writes an executable MPM simulation program and runs it forward to produce the continuation.
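The split described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_observation` and the convention that `t` counts the observed frames are assumptions, not part of the released benchmark code.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def split_observation(frames: Sequence[Frame], t: int) -> Tuple[List[Frame], List[Frame]]:
    """Split a rendered simulation into an observed prefix and a held-out continuation.

    `t` is the number of observed frames (an assumed convention for v<=t).
    """
    if not 0 < t < len(frames):
        raise ValueError("t must leave at least one frame on each side of the split")
    return list(frames[:t]), list(frames[t:])


# Stand-in "frames": integers instead of images, just to show the split.
observation, held_out = split_observation(list(range(10)), t=4)
print(len(observation), len(held_out))
```

Both model families would see only `observation`; `held_out` is used solely for scoring the predicted continuation.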
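Beyond the observation video, models may also receive structured scene information, and we vary how much is given or withheld. Structured inputs include scene configuration (colors, physics parameters, and other settings), object positions, and material information (type, parameter values, etc.). We evaluate four input regimes: maximal info (the full scene specification), minimal info (the observation video only), no materials (object positions given, material labels hidden), and no positions (materials given, initial object positions hidden).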
A VLM-generated simulation program is executed by an explicit MPM solver, so its extrapolations inherit conservation of mass and object permanence by construction: simulated particles cannot spontaneously vanish or appear. Video diffusion has no such guarantee; we regularly see hallucinated motion, disappearing objects, abrupt color shifts, and temporal drift that grows worse over long horizons.
This pattern shows up in the evaluation metrics from our paper. Motion accuracy (W-MAE) measures whether a rollout has similar temporal activity to ground truth; object preservation (collapse score) tracks whether foreground regions remain visible; temporal stability (anomaly rate) flags abrupt frame-to-frame appearance jumps; color composition (CTV) compares foreground color distributions; and shape overlap (mask IoU) measures spatial alignment of foreground regions.
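As a concrete illustration of the shape-overlap metric, the sketch below computes intersection-over-union for a single pair of foreground masks. It is a generic IoU, not necessarily the exact implementation used in the paper: how foreground masks are extracted from frames, how per-frame values are aggregated, and the handling of two empty masks are all assumptions here.

```python
from typing import List

Mask = List[List[int]]  # 1 (or True) marks a foreground pixel


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection-over-union of two same-shaped foreground masks."""
    if len(pred) != len(truth) or any(len(p) != len(q) for p, q in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for row_pred, row_truth in zip(pred, truth):
        for p, q in zip(row_pred, row_truth):
            intersection += int(bool(p) and bool(q))
            union += int(bool(p) or bool(q))
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


predicted = [[1, 1, 0],
             [0, 0, 0]]
ground_truth = [[0, 1, 1],
                [0, 0, 0]]
print(mask_iou(predicted, ground_truth))  # one shared pixel out of three covered
```

A rollout whose objects start in the wrong place scores low on this measure even if the subsequent motion is physically plausible.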
Below we show selected examples where code generation outperforms video diffusion. Each card compares Ground Truth · Code Gen · Video Gen under the same input condition, with the relevant metric wins labelled underneath.
In each example, the Video Gen column begins with observation frames copied from ground truth, then switches to model-generated extrapolation. Once playback reaches predicted frames, look for a small orange Predicted pill in the top-right corner of that video, plus an orange inset border around the panel.
Video diffusion predicts extrapolation frames directly in pixel space, which tends to preserve approximate object placement, spatial overlap, and color composition more reliably than VLM-generated simulation programs. VLM code sometimes places objects at wrong initial positions or draws incorrect material boundaries, even when the overall dynamics look plausible.
Below we show selected examples where video diffusion outperforms code generation. Each card compares Ground Truth · Code Gen · Video Gen under the same input condition, with the relevant metric wins labelled underneath.
These examples illustrate why the two approaches are complementary rather than competing: on the same scene, code generation and video diffusion often excel on different metric classes. A simple prefix-quality routing rule — use VLM when the generated prefix matches observed frames, otherwise video diffusion — consistently outperforms either model alone.
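The routing rule can be sketched as follows. The text above does not specify how the prefix match is measured, so the distance function (`mean_abs_diff`), the `threshold` parameter, and the flattened-list frame representation are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # a flattened grayscale frame; purely illustrative


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Hypothetical frame distance: mean absolute pixel difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def route_by_prefix_quality(
    observed: Sequence[Frame],
    code_prefix: Sequence[Frame],
    code_rollout: Sequence[Frame],
    video_rollout: Sequence[Frame],
    threshold: float,
    distance: Callable[[Frame, Frame], float] = mean_abs_diff,
) -> Tuple[str, Sequence[Frame]]:
    """Use the code-generated rollout if its re-simulated prefix matches the observation."""
    if not observed or len(observed) != len(code_prefix):
        raise ValueError("prefix and observation must be non-empty and equally long")
    prefix_error = sum(distance(o, c) for o, c in zip(observed, code_prefix)) / len(observed)
    if prefix_error <= threshold:
        return "code", code_rollout
    return "video", video_rollout


observed = [[0.0, 1.0], [0.0, 1.0]]
good_prefix = [[0.0, 0.9], [0.1, 1.0]]   # close to the observation
bad_prefix = [[1.0, 0.0], [1.0, 0.0]]    # far from the observation
code_rollout = [[0.0, 1.0]]
video_rollout = [[0.5, 0.5]]

print(route_by_prefix_quality(observed, good_prefix, code_rollout, video_rollout, 0.1)[0])  # code
print(route_by_prefix_quality(observed, bad_prefix, code_rollout, video_rollout, 0.1)[0])   # video
```

The appeal of this design is that it needs no ground-truth continuation: the observed frames alone decide which rollout to trust.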
Below we show selected examples where each model wins on a different set of metrics. Each card compares Ground Truth · Code Gen · Video Gen under the same input condition.
VLM outputs change substantially with input quality; video diffusion outputs do not. When material labels or object positions are withheld, the VLM's rollouts show corresponding mismatches, suggesting that it genuinely uses structured scene information. Video diffusion rollouts remain nearly identical across input regimes, indicating reliance on short-term visual extrapolation rather than the provided specifications.
Below we show selected candidates side by side under all four input regimes. Maximal info gives the full scene specification; minimal info gives only the observation video; no materials reveals object positions but hides material labels; no positions reveals materials but hides where objects start. Comparing columns within a row shows how each model responds when specific physical information is present or missing.
On some scenes both models produce high-quality extrapolations — neither clearly dominates on the metrics we highlight. These cases illustrate settings where either approach may be reasonable.
Below we show selected examples where both models perform well. Each card compares Ground Truth · Code Gen · Video Gen under the same input condition.
Aggregate metrics from the held-out test split. Below we summarize how code generation and video diffusion compare across input conditions, materials, and prediction horizon — complementing the qualitative examples above.
Normalized scores across evaluation metrics and input regimes. Lower is better for W-MAE, CTV, anomaly rate, and object collapse; higher is better for mIoU. The rightmost column averages across metrics. Code generation is stronger on temporal stability and long-horizon consistency; video diffusion leads on spatial overlap (mIoU).
Moving-average W-MAE over continuation frames. Video diffusion motion error grows faster with prediction horizon than code generation, reflecting accumulated temporal drift in pixel-space extrapolation.
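The smoothing behind a curve like this can be sketched as a simple sliding-window mean over per-frame errors. The per-frame error values and the window size below are made up for illustration, and the paper's exact windowing may differ.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Mean of each run of `window` consecutive values, via a running sum."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    smoothed = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        running += values[i] - values[i - window]
        smoothed.append(running / window)
    return smoothed


# Hypothetical per-frame motion errors over a continuation.
per_frame_error = [1, 2, 3, 4, 5]
print(moving_average(per_frame_error, window=3))  # [2.0, 3.0, 4.0]
```

Plotting the smoothed series against frame index makes it easier to compare how quickly error grows with horizon for each model.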
Performance change when material information is removed from the prompt, by material family. VLMs degrade on elastic, snow, and sand scenes; liquids are less affected. Video diffusion models (VDMs) change little: they rely mainly on pixels, not structured material specs.
Average normalized score by material family (zero = model-family mean). Code generation is strongest on liquids and snow, weakest on sand. Video diffusion struggles most on elastic/plastic scenes where coherent trajectories must persist over time.
Each MPMWorld scene is released as a program (executable Taichi MPM solver), a structured scene description (physics parameters, object geometry, and visualization settings), and a rendered simulation video. Below is one representative scene: snow and liquid blocks fall through two circular obstacles onto a rotating fan.