MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics

Does AI understand what it sees well enough to predict what happens next?

On MPMWorld, a benchmark of rich 2D physical simulations, we compare two approaches — vision–language code generation and video diffusion — to test whether models can infer dynamics from video and extrapolate forward in time.

Abstract

To study the ability to infer physical dynamics from videos and extrapolate them forward in time, we assemble a dataset of 2D Material Point Method (MPM) physical simulations covering rich physical phenomena such as deformable objects, fluids, and emitters. We study code generation and video diffusion approaches on this dataset, identifying their strengths and weaknesses by varying the amount of physically relevant side information. The code generation model gives a working demonstration of automatic synthesis of MPM simulations. It struggles to infer physical parameters from visual input but, relative to video diffusion, produces physically plausible extrapolations forward in time. The video diffusion model identifies geometric properties from visual input more reliably but produces physically implausible extrapolations.

The Task: Infer and Extrapolate Physical Dynamics

We study a core capability of physical reasoning: given a short observation of a scene — a video prefix v≤t — can a model infer the underlying dynamics and extrapolate forward, predicting what happens next? Success means recovering a forward simulation that continues faithfully beyond the frames we show.

Code Gen vs. Video Gen on MPMWorld

On MPMWorld, each simulation is split into an observation segment (v≤t) and a held-out extrapolation. A video diffusion model predicts extrapolation frames directly in pixel space; a vision–language model instead writes an executable MPM simulation program and runs it forward to produce the continuation.
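As a minimal sketch of the split described above, the snippet below cuts a frame sequence into an observed prefix and a held-out continuation. The function name `split_observation` and the convention that the prefix holds the first `t` frames are assumptions made for illustration; the benchmark's actual indexing may differ.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def split_observation(frames: Sequence[Frame], t: int) -> Tuple[List[Frame], List[Frame]]:
    """Split a simulation video into an observed prefix and a held-out continuation.

    Assumption: the prefix is the first `t` frames; everything after is held out.
    """
    if not 0 < t < len(frames):
        raise ValueError("t must leave at least one frame on each side of the split")
    return list(frames[:t]), list(frames[t:])


# Toy example: frame names stand in for image arrays.
frames = [f"frame_{i:03d}" for i in range(10)]
observed, held_out = split_observation(frames, t=4)
```

Both model families would see only `observed`; `held_out` is reserved for scoring the extrapolation.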

Beyond the observation video, models may also receive structured scene information — and we vary how much is given vs. withheld. Structured inputs include scene configuration (colors, physics parameters, and other settings), object positions, and material information (type, parameter values, etc.). We evaluate four input regimes:

Legend: Given = given to the model; Withheld = masked out.
Scene config covers colors, physics, and settings. Material info covers type, parameters, etc.

Regime         Observation video   Scene config   Object positions   Material info
Maximal info   Given               Given          Given              Given
Minimal info   Given               Withheld       Withheld           Withheld
No materials   Given               Given          Given              Withheld
No positions   Given               Given          Withheld           Given
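The four regimes can be read as masks over the structured inputs. The sketch below encodes them that way; the names `InputRegime`, `REGIMES`, and `build_model_inputs`, and the dictionary keys, are hypothetical and only illustrate the given/withheld pattern, not the benchmark's actual prompt format.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class InputRegime:
    """Which structured inputs are given (True) or withheld (False)."""
    scene_config: bool
    object_positions: bool
    material_info: bool


# The observation video is given in every regime, so it is not a flag here.
REGIMES: Dict[str, InputRegime] = {
    "maximal_info": InputRegime(scene_config=True, object_positions=True, material_info=True),
    "minimal_info": InputRegime(scene_config=False, object_positions=False, material_info=False),
    "no_materials": InputRegime(scene_config=True, object_positions=True, material_info=False),
    "no_positions": InputRegime(scene_config=True, object_positions=False, material_info=True),
}


def build_model_inputs(video: Any, scene: Dict[str, Any], regime_name: str) -> Dict[str, Any]:
    """Keep the video and only the structured fields the regime allows."""
    regime = REGIMES[regime_name]
    inputs: Dict[str, Any] = {"observation_video": video}
    for key in ("scene_config", "object_positions", "material_info"):
        if getattr(regime, key):
            inputs[key] = scene[key]
    return inputs


# Toy scene with placeholder values (not taken from the dataset).
scene = {
    "scene_config": {"gravity": -9.8},
    "object_positions": [(0.3, 0.7)],
    "material_info": [{"type": "snow"}],
}
no_materials_inputs = build_model_inputs("video.mp4", scene, "no_materials")
```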

Experimental Highlights

When does code generation excel?

A VLM-generated simulation program is executed by an explicit MPM solver, so its extrapolations inherit conservation of mass and object permanence by construction: simulated particles cannot spontaneously vanish or appear. Video diffusion has no such guarantee; we regularly see hallucinated motion, disappearing objects, abrupt color shifts, and temporal drift that grows worse over long horizons.

We observe this pattern in the evaluation metrics from our paper. Motion accuracy (W-MAE) measures whether a rollout has similar temporal activity to ground truth; object preservation (collapse score) tracks whether foreground regions remain visible; temporal stability (anomaly rate) flags abrupt frame-to-frame appearance jumps; color composition (CTV) compares foreground color distributions; and shape overlap (mask IoU) measures spatial alignment of foreground regions.
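Of the metrics above, mask IoU has a standard definition, so it is the easiest to illustrate. The sketch below computes intersection-over-union for one pair of binary foreground masks; how foreground is segmented from a frame, and the choice to return 1.0 when both masks are empty, are assumptions here rather than details from the paper.

```python
from typing import Sequence


def mask_iou(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Intersection-over-union of two binary foreground masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, g in zip(pred_row, target_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention (assumed): two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Toy 2x3 masks: one pixel overlaps, three pixels are foreground in either mask.
pred_mask = [[1, 1, 0],
             [0, 0, 0]]
true_mask = [[0, 1, 1],
             [0, 0, 0]]
iou = mask_iou(pred_mask, true_mask)
```

A per-video score would then average this value over the extrapolated frames.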

Below we show selected examples where code generation outperforms video diffusion. Each card compares Ground Truth · Code Gen · Video Gen under the same input condition, with the relevant metric wins labelled underneath.

In each example, the Video Gen column begins with observation frames copied from ground truth, then switches to model-generated extrapolation. Once playback reaches predicted frames, look for a small orange Predicted pill in the top-right corner of that video, plus an orange inset border around the panel.

[Video comparison, no-materials regime: Ground Truth, Code Gen, Video Gen]
Code Gen wins on: color composition, overall score, object preservation, temporal stability.
Notes:
Success demo for the VLM.
The VLM also beats the VDM on CTV, composite score, collapse score, and anomaly rate.
The VDM hallucinates the object positions (elastic).
Positional attempt 03 is still better than the VDM on W-MAE and especially on collapse score, but worse on mask IoU.

When does video diffusion excel?

Video diffusion predicts extrapolation frames directly in pixel space, which tends to preserve approximate object placement, spatial overlap, and color composition more reliably than VLM-generated simulation programs. VLM code sometimes places objects at wrong initial positions or draws incorrect material boundaries, even when the overall dynamics look plausible.

Below we show selected examples where video diffusion outperforms code generation. Each card compares Ground Truth · Code Gen · Video Gen under the same input condition, with the relevant metric wins labelled underneath.

[Video comparison, no-positions regime: Ground Truth, Code Gen, Video Gen]
Video Gen wins on: color composition, shape overlap, overall score.
Notes:
The VDM wins in the positional case (attempt 03), with metric wins on CTV, mask IoU, and others.
In the material case, the VLM is better on W-MAE and mask IoU.
The VDM has an easier time getting the right starting positions.

Complementary strengths

These examples illustrate why the two approaches are complementary rather than competing: on the same scene, code generation and video diffusion often excel on different metric classes. A simple prefix-quality routing rule — use VLM when the generated prefix matches observed frames, otherwise video diffusion — consistently outperforms either model alone.
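The routing rule above can be sketched as a threshold on how well the VLM program's rendered prefix reproduces the observed frames. The error measure (mean absolute pixel difference), the threshold value, and the names `prefix_error` and `route` are assumptions for illustration; the source does not specify how the prefix match is measured.

```python
from typing import Sequence

Frame = Sequence[float]  # one flattened frame, pixel values in [0, 1]


def prefix_error(observed: Sequence[Frame], generated: Sequence[Frame]) -> float:
    """Mean absolute pixel difference between observed and generated prefix frames."""
    if len(observed) != len(generated):
        raise ValueError("prefixes must have the same number of frames")
    total = 0.0
    count = 0
    for obs, gen in zip(observed, generated):
        for o, g in zip(obs, gen):
            total += abs(o - g)
            count += 1
    if count == 0:
        raise ValueError("prefixes must contain at least one pixel")
    return total / count


def route(observed: Sequence[Frame], vlm_prefix: Sequence[Frame], threshold: float = 0.05) -> str:
    """Pick the VLM rollout if its prefix matches the observation, else video diffusion.

    The default threshold is a hypothetical value, not one reported in the paper.
    """
    return "vlm" if prefix_error(observed, vlm_prefix) <= threshold else "vdm"


# Toy two-frame prefixes with two pixels each.
observed_prefix = [[0.0, 1.0], [0.5, 0.5]]
good_vlm_prefix = [[0.0, 0.9], [0.5, 0.5]]   # close to the observation
bad_vlm_prefix = [[1.0, 0.0], [0.0, 1.0]]    # far from the observation

choice_good = route(observed_prefix, good_vlm_prefix)
choice_bad = route(observed_prefix, bad_vlm_prefix)
```

The appeal of this design is that it needs no ground-truth continuation: the decision uses only frames the model was already shown.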

Below we show selected examples where each model wins on a different set of metrics. Each card compares Ground Truth · Code Gen · Video Gen under the same input condition.

[Video comparison, no-positions regime: Ground Truth, Code Gen, Video Gen]
Code Gen wins on: motion accuracy, object preservation.
Video Gen wins on: shape overlap.
Notes:
Success demo for the VLM.
The VLM also beats the VDM on CTV, composite score, collapse score, and anomaly rate.
The VDM hallucinates the object positions (elastic).
Positional attempt 03 is still better than the VDM on W-MAE and especially on collapse score, but worse on mask IoU.

The role of physical input information

VLM outputs change substantially with input quality; video diffusion outputs do not. When material labels or object positions are withheld, the VLM's rollouts show corresponding mismatches, suggesting it genuinely uses the structured scene information. Video diffusion rollouts remain nearly identical across input regimes, indicating reliance on short-term visual extrapolation rather than the provided specifications.

Below we show selected candidates side by side under all four input regimes. Maximal info gives the full scene specification; minimal info gives only the observation video; no materials reveals object positions but hides material labels; no positions reveals materials but hides where objects start. Comparing columns within a row shows how each model responds when specific physical information is present or missing.

[Video grid, "Input Sensitivity": rows are Code Gen and Video Gen; columns are Ground Truth followed by the four input regimes]
Maximal info: full scene specification
Minimal info: video frames only
No materials: positions given, material hidden
No positions: material given, positions hidden
Notes:
Input types: positional attempt 01, maximal attempt 01, minimal attempt 12, material attempt 01.

Both models do well

On some scenes both models produce high-quality extrapolations — neither clearly dominates on the metrics we highlight. These cases illustrate settings where either approach may be reasonable.

Below we show selected examples where both models perform well. Each card compares Ground Truth · Code Gen · Video Gen under the same input condition.

[Video comparison, maximal-info regime: Ground Truth, Code Gen, Video Gen]
Both models perform well.
Notes:
All very accurate.

Quantitative Results

Aggregate metrics from the held-out test split. Below we summarize how code generation and video diffusion compare across input conditions, materials, and prediction horizon — complementing the qualitative examples above.

Overall comparison

Normalized scores across evaluation metrics and input regimes. Lower is better for W-MAE, CTV, anomaly rate, and object collapse; higher is better for mIoU. The rightmost column averages across metrics. Code generation is stronger on temporal stability and long-horizon consistency; video diffusion leads on spatial overlap (mIoU).

[Figure: heatmap comparing VLM and VDM performance across input conditions and metrics]
VLM vs. VDM across input conditions (maximal info, minimal info, no materials, no positions, frames only). Values averaged over the held-out test split.
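One way to put metrics with opposite directions on a common scale is to flip the sign of the lower-is-better ones and standardize each metric across runs before averaging. The sketch below does this with z-scores; the normalization actually used for the heatmap is not stated in this section, so the z-score choice, the metric keys, and the toy numbers are all assumptions.

```python
from statistics import mean, pstdev
from typing import Dict

# Directions as stated in the text: lower is better for these, higher for mIoU.
LOWER_IS_BETTER = {"w_mae", "ctv", "anomaly_rate", "collapse"}


def normalized_scores(runs: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Orient every metric so higher is better, z-score it across runs, then average."""
    metrics = sorted(next(iter(runs.values())))
    out: Dict[str, Dict[str, float]] = {name: {} for name in runs}
    for metric in metrics:
        sign = -1.0 if metric in LOWER_IS_BETTER else 1.0
        oriented = {name: sign * values[metric] for name, values in runs.items()}
        mu = mean(oriented.values())
        sd = pstdev(oriented.values()) or 1.0  # avoid dividing by zero on ties
        for name, value in oriented.items():
            out[name][metric] = (value - mu) / sd
    for name in out:
        out[name]["average"] = mean(out[name][m] for m in metrics)
    return out


# Made-up raw values, purely to exercise the function.
toy_runs = {
    "vlm": {"w_mae": 0.1, "miou": 0.5},
    "vdm": {"w_mae": 0.3, "miou": 0.7},
}
scores = normalized_scores(toy_runs)
```

In this toy case each model leads on one metric, so the averaged column comes out at zero for both, which mirrors the complementary pattern described in the text.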

Error grows with horizon

Moving-average W-MAE over continuation frames. Video diffusion motion error grows faster with prediction horizon than code generation, reflecting accumulated temporal drift in pixel-space extrapolation.

[Figure: moving-average W-MAE over continuation frames for VLM and VDM]
W-MAE over extrapolation frames (top-1 per model, test split average).
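The smoothing in this plot can be sketched as a sliding-window mean over a per-frame error series. The window size and the assumption that a W-MAE value is available for every continuation frame are illustrative; how W-MAE itself is computed is defined in the paper, not here.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Sliding-window mean; returns len(values) - window + 1 points."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    smoothed = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the new frame's error, drop the oldest one.
        running += values[i] - values[i - window]
        smoothed.append(running / window)
    return smoothed


# Toy per-frame errors that grow with the prediction horizon.
per_frame_error = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed_error = moving_average(per_frame_error, window=3)
```

Comparing the slopes of the smoothed curves for the two models is what shows which one accumulates error faster.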

Sensitivity to material info

Performance change when material information is removed from the prompt, by material family. VLMs degrade on elastic, snow, and sand scenes; liquids are less affected. VDMs change little — they rely mainly on pixels, not structured material specs.

[Figure: performance change after removing material information, by material family]
Δ performance relative to full-configuration setting, grouped by material family.

Performance by material

Average normalized score by material family (zero = model-family mean). Code generation is strongest on liquids and snow, weakest on sand. Video diffusion struggles most on elastic/plastic scenes where coherent trajectories must persist over time.

[Figure: average normalized performance by material family for VLM and VDM]
Normalized performance aggregated across tasks and metrics, grouped by material family.

Dataset Example

Each MPMWorld scene is released as a program (executable Taichi MPM solver), a structured scene description (physics parameters, object geometry, and visualization settings), and a rendered simulation video. Below is one representative scene: snow and liquid blocks fall through two circular obstacles onto a rotating fan.

Simulation output