AI and machine learning algorithms are becoming increasingly good at predicting next actions in videos. The very best can anticipate fairly accurately where a baseball might travel after it’s been pitched, or the appearance of a road miles from a starting position. Building on this progress, a novel approach proposed by researchers at Google, the University of Michigan, and Adobe advances the state of the art with large-scale models that generate high-quality videos from only a few frames. All the more impressive, it does so without relying on techniques like optical flow (the pattern of apparent motion of objects, surfaces, or edges in a scene) or landmarks, unlike previous methods.
“In this work, we investigate whether we can achieve high quality video predictions … by just maximizing the capacity of a standard neural network,” wrote the researchers in a preprint paper describing their work. “To the best of our knowledge, this work is the first to perform a thorough investigation on the effect of capacity increases for video prediction.”
The team’s baseline model builds on an existing stochastic video generation (SVG) architecture, with a component that models the inherent uncertainty in future predictions. They separately trained and tested several versions of the model against data sets tailored to three prediction categories: object interactions, structured motion, and partial observability. For the first task, object interactions, the researchers selected 256 videos from a corpus of videos of a robot arm interacting with towels. For the second, structured motion, they sourced clips from Human 3.6M, a corpus containing clips of humans performing actions like sitting on a chair. As for the partial observability task, they used the open source KITTI driving data set, which consists of footage from front-facing car dashboard cameras.
The team conditioned every model on between two and five input video frames and had the models predict between five and ten frames into the future during training. They did so at a low resolution (64 by 64 pixels) for all tasks and additionally at a higher resolution (128 by 128 pixels) for the object interactions task. During testing, the models generated up to 25 frames.
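To make that setup concrete, here is a minimal Python sketch of how a clip might be divided into the frames a model is conditioned on and the frames it is asked to predict. It is an illustration only, not the researchers’ code: the function name `split_clip`, the 30-frame placeholder clip, and the specific counts chosen are assumptions, and the paper’s per-task settings are not given here beyond the ranges above.

```python
def split_clip(frames, n_context, n_future):
    """Split a clip into conditioning frames and the frames to be predicted.

    `frames` is any sequence of frames in time order. The name and signature
    are hypothetical, chosen for illustration.
    """
    if n_context < 1 or n_future < 1:
        raise ValueError("need at least one context frame and one future frame")
    if n_context + n_future > len(frames):
        raise ValueError("clip is too short for the requested split")
    context = frames[:n_context]                      # frames the model sees
    target = frames[n_context:n_context + n_future]   # frames it must predict
    return context, target


# Placeholder clip: frame indices stand in for 64x64 (or 128x128) images.
clip = list(range(30))

# Training-style split (assumed values within the stated ranges):
# condition on 2 frames, predict the next 10.
train_context, train_target = split_clip(clip, n_context=2, n_future=10)

# Test-style split: condition on 5 frames, roll out the 25-frame maximum.
test_context, test_target = split_clip(clip, n_context=5, n_future=25)

print(len(train_context), len(train_target))  # 2 10
print(len(test_context), len(test_target))    # 5 25
```

The point of the sketch is that the prediction horizon at test time (up to 25 frames) is longer than anything the models were asked to produce during training (at most ten), so the test measures how well predictions hold up beyond the training horizon.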
The researchers report that one of the largest models was preferred 90.2%, 98.7%, and 99.3% of the time by evaluators recruited through Amazon Mechanical Turk with respect to the object interactions, structured motion, and partial observability tasks, respectively. Qualitatively, the team notes that it crisply depicted human arms and legs and made “very sharp predictions that looked realistic in comparison to the ground truth.”
“Our experiments confirm the importance of recurrent connections and modeling stochasticity [or randomness] in the presence of uncertainty (e.g., videos with unknown action or control),” wrote the paper’s coauthors. “We also find that maximizing the capacity of such models improves the quality of video prediction. We hope our work encourages the field to push along similar directions in the future – i.e., to see how far we can get … for achieving high quality video prediction.”