Abstract
Scaling video predictive models to higher resolutions typically
requires retraining from scratch, incurring substantial computational
costs. We present Walsh-Hadamard Spectral Bridging (WHSB), a
zero-parameter method that transfers learned video dynamics from
a low-resolution predictive model to a high-resolution decoder without
any additional training of the predictor. WHSB is grounded in the
mathematical observation that the latent space of video world models
exhibits a nested group structure under the Walsh-Hadamard transform
over $\mathbb{Z}_2^n$: the dynamics learned at coarse resolution correspond
to the low-frequency Walsh spectrum, which is a strict subset of the
spectrum at finer resolutions. We construct a bridge operator that
performs forward and inverse Walsh-Hadamard transforms with spectral
truncation (downsampling) and zero-padding (upsampling), requiring
no learnable parameters. Experiments on KTH and UCF101 video
datasets demonstrate that WHSB surpasses the full high-resolution
training baseline at a 4× resolution span (64 → 256px), achieving a
cross-resolution prediction ratio C/B = 1.10–1.17. The zero-parameter
Walsh bridge consistently outperforms a learnable linear bridge with
8,352 parameters on UCF101 (1.04× advantage), and achieves competitive
performance on KTH. Our results suggest that the latent representations
of video predictive models possess an intrinsic Walsh spectral
nesting structure, enabling zero-cost cross-resolution transfer and
challenging the prevailing assumption that resolution-specific retraining
is necessary.
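
To make the bridge operator concrete, the following is a minimal one-dimensional sketch of spectral truncation and zero-padding under the Walsh-Hadamard transform. It is not the authors' implementation: the function names (`fwht`, `ifwht`, `bridge_down`, `bridge_up`) are hypothetical, and it assumes natural (Hadamard) coefficient ordering, power-of-two lengths, and that the "low-frequency" modes are the block-constant Walsh functions, with a normalization chosen so that truncation reduces to block averaging. The abstract does not specify the ordering, normalization, or latent layout used in WHSB.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform, natural (Hadamard) order."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly over pairs of elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def ifwht(coeffs):
    """Inverse transform: the same butterfly, scaled by 1/n."""
    n = len(coeffs)
    return [c / n for c in fwht(coeffs)]


def bridge_down(fine, coarse_len):
    """Spectral truncation (assumed form): keep only the Walsh modes that are
    constant on blocks of len(fine) // coarse_len samples, then invert at the
    coarse length. Both lengths are assumed to be powers of two."""
    stride = len(fine) // coarse_len
    spectrum = fwht(fine)
    # In natural order, the block-constant modes sit at multiples of stride.
    kept = [spectrum[i * stride] / stride for i in range(coarse_len)]
    return ifwht(kept)


def bridge_up(coarse, fine_len):
    """Zero-padding (assumed form): embed the coarse spectrum into the nested
    subset of the fine spectrum and leave every other coefficient at zero."""
    stride = fine_len // len(coarse)
    spectrum = [0.0] * fine_len
    for i, c in enumerate(fwht(coarse)):
        spectrum[i * stride] = c * stride
    return ifwht(spectrum)


fine = [1, 3, 5, 7, 2, 2, 4, 4]
coarse = bridge_down(fine, 4)        # block means of adjacent pairs
restored = bridge_up(coarse, 8)      # each coarse value repeated twice
```

Under these assumptions, `bridge_down` is equivalent to averaging over blocks and `bridge_up` to repeating each coarse value, so `bridge_down(bridge_up(y), len(y))` recovers `y` exactly. This is one way to see the nesting property: the coarse spectrum occupies a fixed subset of the fine spectrum, and neither direction needs learnable parameters.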








