How to Implement GRASP for Robust Long-Horizon Planning with World Models
Introduction
Planning over long horizons with learned world models is powerful but fragile. As models scale to predicting high-dimensional visual sequences, the planning optimization becomes ill-conditioned and prone to poor local minima. GRASP (Gradient-based Planning for World Models at Longer Horizons) addresses this by lifting the trajectory into virtual states that are optimized in parallel, injecting stochasticity for exploration, and reshaping gradients to avoid brittle signals from vision models. This guide walks through implementing GRASP step by step.

What You Need
- A trained world model P_θ(s_{t+1} | s_{t-h:t}, a_t) that predicts the next state conditioned on a window of past states and the current action.
- An action space A (discrete or continuous).
- State observations (images, latent vectors, or proprioception data).
- Optimization library (e.g., PyTorch, JAX) with automatic differentiation.
- Hyperparameters: planning horizon H, number of virtual states N, exploration noise scale σ, gradient clipping threshold τ.
- Baseline planner for comparison (optional).
Step-by-Step Guide
Step 1: Set Up and Parallelize the Trajectory with Virtual States
The core idea is to treat the planned trajectory as a sequence of virtual states that are optimized in parallel across time. This avoids sequential dependency and speeds up gradient-based planning.
- 1a. Define virtual states: Let v_1, v_2, ..., v_H be the virtual states corresponding to time steps 1 to H. Each virtual state is a parameter in the same space as the world model's state representation (e.g., latent vectors). Initialize them with random noise or by rolling out the world model with random actions.
- 1b. Parallel optimization objective: Instead of sequentially minimizing prediction error, define a loss that constrains each virtual state to be a valid next-state transition. For each time step t, the loss term is L(v_{t+1}, P_θ(v_t, a_t)), where L is a distance metric (e.g., MSE). Optimize all v_t and a_t simultaneously using gradient descent.
- 1c. Exploit parallelism: Since the loss is additive across time, compute gradients in parallel for all t. This reduces wall-clock time and improves convergence, especially for long horizons (H > 50).
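As a concrete sketch of Step 1, the lifted objective and its gradients can be written so that every transition residual is evaluated in one vectorized step. A toy linear transition stands in for the learned P_θ here; `A`, `B`, `goal`, and all dimensions are illustrative assumptions, not part of GRASP itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the world model P_theta: a linear transition
# s_next = A @ s + B @ a. A real implementation would call a neural net.
D, A_DIM, H = 4, 2, 10
A = 0.9 * np.eye(D)
B = 0.1 * rng.normal(size=(D, A_DIM))
goal = np.ones(D)                      # hypothetical task: reach `goal`

# Virtual states v[0..H] and actions a[0..H-1]; v[0] is the fixed current state.
v = rng.normal(size=(H + 1, D))
v[0] = 0.0
a = np.zeros((H, A_DIM))

def lifted_loss(v, a):
    # All H transition residuals are evaluated at once (time-parallel).
    resid = v[1:] - (v[:-1] @ A.T + a @ B.T)   # v_{t+1} - P_theta(v_t, a_t)
    return 0.5 * (resid ** 2).sum() + 0.5 * ((v[-1] - goal) ** 2).sum()

def lifted_grads(v, a):
    resid = v[1:] - (v[:-1] @ A.T + a @ B.T)
    gv = np.zeros_like(v)
    gv[1:] += resid                    # d/dv_{t+1} of each residual term
    gv[:-1] -= resid @ A               # d/dv_t (rows are A^T resid_t)
    gv[-1] += v[-1] - goal             # terminal goal cost
    gv[0] = 0.0                        # the initial state is not optimized
    ga = -resid @ B                    # d/da_t (rows are -B^T resid_t)
    return gv, ga
```

The point is that `lifted_loss` touches each time step independently, so its gradients can be computed in parallel across the whole horizon; plain gradient descent on `v` and `a` then drives both the transition residuals and the terminal cost down.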
Step 2: Inject Stochasticity into State Iterates for Exploration
To escape poor local minima, GRASP adds controlled noise directly to the virtual states during optimization. This is different from action noise and targets the latent space.
- 2a. Add Gaussian noise: At each gradient update step, perturb the virtual states with random noise drawn from N(0, σ²) scaled by the current step size. This encourages exploration of alternative trajectories.
- 2b. Schedule noise annealing: Start with a higher σ (e.g., σ = 0.1 for roughly unit-scale latents) and decay it over iterations. This gives early exploration and later refinement.
- 2c. Resample every K steps: Optionally, resample noise every K optimization steps to avoid settling into one noisy region. Combine with gradient clipping to keep updates stable.
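A minimal sketch of the noisy update rule in Step 2; the quadratic gradient `g = v` stands in for the real planning gradient, and `sigma0`, `K`, and the linear annealing schedule are illustrative choices, not prescribed values:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigma_schedule(i, n_iters, sigma0=0.1):
    # 2b: linear annealing -- full noise at the start, zero after half the steps.
    return sigma0 * max(0.0, 1.0 - 2.0 * i / n_iters)

lr, K, n_iters = 1e-2, 25, 400
v = rng.normal(size=(10, 4))            # virtual states for H=10, D=4
eps = rng.normal(size=v.shape)

for i in range(n_iters):
    if i % K == 0:                      # 2c: resample noise every K steps
        eps = rng.normal(size=v.shape)
    g = v                               # toy gradient of 0.5 * ||v||^2
    g = np.clip(g, -1.0, 1.0)           # 2c: clip to keep updates stable
    sigma = sigma_schedule(i, n_iters)
    v = v - lr * g + lr * sigma * eps   # 2a: perturb iterates, scaled by lr
```

Because the perturbation is scaled by the step size, early iterations explore widely while late, low-σ iterations refine the trajectory toward a local optimum.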
Step 3: Reshape Gradients to Avoid Brittle Vision Signals
High-dimensional vision models produce noisy or saturated gradients that mislead planning. GRASP reshapes gradients by bypassing the vision model's gradient path and using a cleaner gradient from a low-dimensional latent loss.
- 3a. Use a gradient-truncation trick: Instead of propagating gradients through the entire world model (which includes vision encoders/decoders), compute the gradient of the loss with respect to the latent representation only. This avoids vanishing/exploding gradients from image reconstruction.
- 3b. Implement a stop-gradient detour: For the action path, compute a surrogate gradient that directly updates actions based on the virtual state loss, without flowing through the vision model. This keeps action gradients clean and prevents the optimizer from being misled by high-frequency noise in visual features.
- 3c. Normalize gradient magnitudes: Use per-parameter gradient clipping to ensure that gradients from different modalities (actions vs. states) are balanced. This prevents any single component from dominating the update.
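The per-modality balancing in 3c can be sketched as a small helper. The function names and the choice of an L2-norm clip are assumptions for illustration, not the method's exact recipe:

```python
import numpy as np

def clip_by_norm(g, tau=1.0, eps=1e-8):
    # Rescale a gradient so its L2 norm is at most tau; applied per modality
    # so neither states nor actions dominate the joint update.
    n = np.linalg.norm(g)
    return g if n <= tau else g * (tau / (n + eps))

def reshape_grads(g_states, g_actions, tau=1.0):
    # 3a/3b are assumed to have already produced `g_states` from the
    # latent-space loss only (no vision decoder in the path) and `g_actions`
    # via the stop-gradient detour. Here we only balance their magnitudes.
    return clip_by_norm(g_states, tau), clip_by_norm(g_actions, tau)
```

With this, a huge state gradient cannot drown out a small action gradient inside a shared optimizer step, and vice versa.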
Step 4: Run the Optimizer and Monitor Convergence
- 4a. Choose an optimizer: Adam typically works well. Set learning rate between 1e-3 and 1e-2, depending on horizon length. Use gradient clipping threshold τ = 1.0.
- 4b. Iterate: Repeat steps 1-3 for T iterations (e.g., 100-500). Each iteration involves: compute loss over virtual states, add noise (Step 2), compute reshaped gradients (Step 3), update parameters.
- 4c. Decode the final trajectory: After optimization, the virtual states v_t are not observations themselves; either decode them using the world model's decoder to get images, or read off the corresponding actions a_t for execution.
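Putting Steps 1-3 together, the outer loop might look like the following sketch. `loss_grads` is any callable returning (state, action) gradients, and every hyperparameter default is illustrative rather than prescribed:

```python
import numpy as np

def grasp_plan(loss_grads, v0, H, a_dim, n_iters=300, lr=1e-2,
               sigma0=0.1, K=25, tau=1.0, seed=0):
    """Jointly optimize virtual states v[1..H] and actions a[0..H-1] (sketch)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(H + 1, v0.shape[0]))
    v[0] = v0                                  # current state stays fixed
    a = np.zeros((H, a_dim))
    eps = rng.normal(size=v.shape)
    for i in range(n_iters):
        if i % K == 0:                         # Step 2: resample noise
            eps = rng.normal(size=v.shape)
        gv, ga = loss_grads(v, a)              # Step 1 loss, Step 3 reshaping
        gv = np.clip(gv, -tau, tau)            # elementwise clip at tau
        ga = np.clip(ga, -tau, tau)
        sigma = sigma0 * max(0.0, 1.0 - 2.0 * i / n_iters)
        gv[0] = 0.0                            # never move the initial state
        v = v - lr * gv + lr * sigma * eps     # Step 2: noisy state iterates
        a = a - lr * ga
    return v, a                                # execute a[0], or decode v
```

In a receding-horizon setup you would execute only `a[0]`, observe the new state, and replan, warm-starting from the previous solution.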
Tips and Best Practices
- Hyperparameter tuning: The noise scale σ should be roughly 10% of the typical state noise in your environment. Start with larger values and anneal to zero over half the optimization steps.
- Horizon length: GRASP works best for horizons longer than 20. For very long horizons (>200), consider chunking the trajectory into overlapping windows and planning each window separately.
- Visual validation: Always visualize the decoded planned trajectory qualitatively. GRASP should produce smooth, physically plausible sequences; if you see jitter, reduce noise or increase gradient clipping.
- Comparison baseline: Compare against a standard gradient-based planner (e.g., direct backpropagation through the world model) and a model-free method (e.g., random shooting) to verify improvements.
- Computational cost: Parallelization adds memory overhead that scales with horizon length (one virtual state per time step). If memory is limited, trade off batch size against horizon length.
- Stochasticity vs. diversity: Too much noise can destabilize the optimization. Use the stop-gradient detour (Step 3b) to keep action gradients stable even with large noise.
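For the chunking tip above, a small helper can split a long horizon into overlapping planning windows; the function name and the window/overlap defaults are hypothetical:

```python
def chunk_horizon(H, window=50, overlap=10):
    # Split the range [0, H) into windows of `window` steps, each overlapping
    # the previous one by `overlap` steps, so adjacent plans can be stitched.
    assert 0 <= overlap < window
    spans, t, step = [], 0, window - overlap
    while t + window < H:
        spans.append((t, t + window))
        t += step
    spans.append((t, min(t + window, H)))   # final (possibly shorter) window
    return spans
```

Each window is then planned separately with GRASP; the overlap region lets you blend consecutive plans or warm-start each window from the previous solution.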