Image
alt="The “Diffusion Forcing” method can sort through noisy data and reliably predict the next steps in a task, helping a robot complete manipulation tasks, for example. In one experiment, it helped a robotic arm rearrange toy fruits into target spots on circular mats despite starting from random positions and visual distractions (Credits: Mike Grimmett/MIT CSAIL)."
CSAIL article

In the current AI zeitgeist, sequence models have skyrocketed in popularity for their ability to analyze data and predict what to do next. For instance, you’ve likely used next-token prediction models like ChatGPT, which anticipate each word (token) in a sequence to form answers to users’ queries. There are also full-sequence diffusion models like Sora, which convert words into dazzling, realistic visuals by successively “denoising” an entire video sequence

Image
The Grasping Neural Process uses limited interaction data to help robots understand unclear objects in real-time (Credits: Alex Shipps/MIT CSAIL).
CSAIL article

When robots come across unfamiliar objects, they struggle to account for a simple truth: Appearances aren’t everything. They may attempt to grasp a block, only to find out it’s a literal piece of cake. The misleading appearance of that object could lead the robot to miscalculate physical properties like the object’s weight and center of mass, using the wrong grasp and applying more force than needed.