Unlike chatbots or image generators, robots must operate in real time. While a robot is “thinking”, the world around it evolves according to physical laws, so delays between inputs and outputs have a tangible impact on performance. For a language model, the difference between fast and slow generation is the difference between a satisfied and an annoyed user; for a vision-language-action model (VLA), it could be the difference between a robot handing you a hot coffee and spilling it in your lap. While VLAs have achieved promising results in open-world generalization, they can be slow to run. Like their cousins in language and vision, these models have billions of parameters and require heavy-duty GPUs. For edge devices like mobile robots, that means even more latency from network communication between a centralized inference server and the robot.¹
To build a real-time system with VLAs, we are going to need some form of asynchrony: that is, we must let a model think about its future actions while executing a previous one. Action chunking — where a robot outputs and executes a sequence of multiple actions for each inference call — provides a good starting point. Our VLAs all use a chunk size of 50 actions, corresponding to 1 second of real time. However, chunking alone is not enough. When we switch chunks, the new actions might not “agree” with the old ones, causing discontinuities and unsafe accelerations. Trying to naively smooth over these discontinuities is not guaranteed to produce valid actions, and can have disastrous consequences.
The aforementioned disastrous consequences. These results come from attempting real-time inference with temporal ensembling, a popular prior method introduced in ACT (Zhao et al., 2023), to smooth over the discontinuities.
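For reference, here is a minimal sketch of that ensembling scheme, assuming chunks are stored in a dict keyed by the timestep at which they were generated; the function name and the weighting constant are illustrative rather than taken from ACT's implementation:

```python
import numpy as np

def temporal_ensemble(predicted_chunks, t, m=0.01):
    """Temporal ensembling in the style of ACT (Zhao et al., 2023):
    every past chunk that predicted an action for the current timestep
    `t` contributes, weighted by w_i = exp(-m * i) with w_0 assigned to
    the oldest prediction. Averaging can blend two incompatible
    strategies into an action that is valid under neither, which is
    what goes wrong in the videos above."""
    actions, weights = [], []
    for t0, chunk in sorted(predicted_chunks.items()):  # oldest chunk first
        if t0 <= t < t0 + len(chunk):
            weights.append(np.exp(-m * len(actions)))   # i = rank among contributors
            actions.append(chunk[t - t0])
    weights = np.asarray(weights) / np.sum(weights)
    return np.sum(np.stack(actions) * weights[:, None], axis=0)
```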
As a result, in π0, π0-FAST, and π0.5, we did not use a real-time strategy.² We executed actions synchronously, meaning that we would finish executing one chunk, wait for model inference, and then begin executing the next one. This way, the robot started each chunk from rest, and we circumvented the issues that arise from switching chunks while in motion. However, the resulting pauses between chunks are harmful in their own right, because nothing like them appears in the training data. Not to mention: they slow things down, are ugly to look at, and discourage us from scaling up the size of our models.
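For concreteness, here is what that synchronous loop looks like as a minimal sketch, with a hypothetical `policy`/`robot` interface standing in for our actual controller stack:

```python
import time

CONTROL_HZ = 50  # 50 actions per chunk at 50Hz: one chunk covers 1 second

def run_synchronous(policy, robot):
    """Synchronous execution: finish a chunk, come to rest, block on
    model inference, then begin the next chunk. The blocking call to
    `policy.infer` is the visible pause between chunks."""
    while True:
        obs = robot.observe()
        chunk = policy.infer(obs)        # robot sits idle during inference
        for action in chunk:             # then executes 1 second of motion
            robot.send_action(action)
            time.sleep(1.0 / CONTROL_HZ)
```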
To solve these problems, we developed an algorithm that we call real-time chunking (RTC). It enables real-time execution without discontinuities, and it works on any diffusion- or flow-based VLA — including π0.5 — with no training-time changes. We found that RTC significantly reduced execution time across all the tasks we tested. It was also very robust to latency, even when we injected artificial delays to push latencies far above normal. RTC completed dynamic and precise tasks, like striking a match or plugging in an Ethernet cable, with inference delays of more than 300 milliseconds.
When there is an inference delay, real-time execution requires careful consideration. While a new chunk is generated (red), the previous chunk (green) continues executing. If the new chunk is substantially different — which it often is — switching to the new chunk results in disaster.
The core challenge of real-time execution is to maintain consistency between action chunks. By the time the model has generated a new chunk, the previous one has already been partially executed. Without a specialized algorithm, the new chunk might be totally incompatible with the robot's current trajectory — the model might be reacting to new information, or it might just be sampling a different “strategy” from its learned distribution of behaviors. Thus, we must somehow make use of the overlap: the timesteps for which we still know the remaining actions of the previous chunk. A good real-time algorithm should produce a new chunk that is consistent with these overlapping actions while preserving the model's reactivity to new information and its ability to make intelligent decisions.
Our key insight is to pose real-time chunking as an inpainting problem. Let's say that our model takes 3 controller timesteps to produce an action chunk after receiving an observation. The first 3 actions of a new chunk can't be executed, since those timesteps will have already passed by the time the new chunk is available. Thus, it makes sense to “freeze” those actions to the values from the previous chunk that we know will be executed; our goal is then to fill in the remainder of the new chunk, much like inpainting a section of an image that has been removed. Depending on the chunk size and how often we run inference, there will be some number of actions beyond the first 3 that also overlap with the previous chunk. Rather than ignoring them completely and starting fresh, it also makes sense to partially attend to these middle actions — encouraging the model to keep a consistent strategy, but allowing it to make updates based on new information.
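One way to make this concrete is a per-timestep weight schedule over the new chunk: 1 means frozen to the previous chunk, 0 means generated freshly. The sketch below uses an illustrative linear ramp for the partially-attended middle region; the exact schedule in the full method may differ, but the structure (frozen prefix, decaying middle, unconstrained tail) is the idea described above:

```python
import numpy as np

def rtc_weights(horizon=50, delay=3, overlap=25):
    """Per-timestep weights over the new chunk. The first `delay`
    actions are fully frozen, since those timesteps pass before the
    new chunk arrives; the rest of the overlap ramps down, so the
    model partially attends to the previous chunk's middle actions;
    actions past the overlap are unconstrained."""
    w = np.zeros(horizon)
    w[:delay] = 1.0                   # frozen: guaranteed to execute
    ramp = overlap - delay
    if ramp > 0:                      # partially attended middle region
        w[delay:overlap] = np.linspace(1.0, 0.0, ramp + 2)[1:-1]
    return w
```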
Luckily, diffusion and flow models happen to be really good at image inpainting, even without being trained for it. By adapting these algorithms to our setting and adding our “partial attention” idea on top, we can solve the chunk consistency problem without any training-time changes. This means we can easily apply our method directly on top of models like π0 and π0.5, benefiting from all the research that goes into training while executing in real time!
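Here is a minimal sketch of what that looks like for a flow-matching policy, using the weights from above and a RePaint-style blending step. `velocity_model` is a hypothetical stand-in for the model's learned vector field, and this illustrates the inpainting idea rather than reproducing the exact procedure from the paper:

```python
import numpy as np

def inpaint_chunk(velocity_model, obs, prev_chunk, weights, num_steps=10):
    """Generate a new chunk that agrees with `prev_chunk` wherever
    `weights` are high. Convention: `x` starts as pure noise at t = 0
    and is carried to actions at t = 1 by Euler steps along the
    learned velocity field. After each step, the previous chunk is
    re-noised to the current time and blended in, in the spirit of
    training-free image inpainting."""
    x = np.random.randn(*prev_chunk.shape)       # pure noise at t = 0
    dt = 1.0 / num_steps
    w = weights[:, None]                         # broadcast over action dims
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_model(obs, x, t)   # Euler step along the flow
        # Point on the noise-to-data path for the previous chunk at t + dt:
        noise = np.random.randn(*prev_chunk.shape)
        prev_t = (1.0 - (t + dt)) * noise + (t + dt) * prev_chunk
        x = w * prev_t + (1.0 - w) * x           # pull frozen/overlap region back
    return x  # at t = 1, fully frozen timesteps equal prev_chunk exactly
```

In use, the robot keeps executing the previous chunk while this runs; once the new chunk arrives, its first few actions match what was executed in the meantime, so the switch is seamless.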
We designed a set of experiments to see how real-time chunking performs compared to synchronous inference. RTC should be faster, of course, since it eliminates the inference pauses between chunks. However, we also hypothesized that RTC would help with precision — those pauses change the dynamics of the robot in a way that is not accounted for by the model. Therefore, we tested two short, highly precise tasks (lighting a candle with a match and plugging in an Ethernet cable) in addition to our standard longer-horizon tasks.
Part of the motivation for real-time chunking is to handle higher inference delays than we have right now. Whether from scaling up model size, running inference on the cloud, or supporting different embodiments, latencies will get higher, and handling them will become increasingly critical. Therefore, we also studied how our policies performed with +100ms and +200ms of artificially injected latency. At +200ms, the total inference delay on a mobile manipulator is more than one-third of a second.
| Component | Latency (mobile) | Latency (static) |
| --- | --- | --- |
| Model | 97ms | 97ms |
| Network | 21ms | 6.9ms |
| Image resize | 11ms | 1.4ms |
| Other | 9.7ms | 3.0ms |
| Total | 139ms | 108ms |
Evaluating the “speed” of a policy is not trivial. We don't want to reward failure, no matter how speedy, nor do we want to reward succeeding quickly but only rarely. We devised a throughput metric by dividing each task into substeps — e.g., folding one item of clothing — and then defining throughput for an episode as the proportion of substeps completed successfully divided by the duration of the episode. As shown below, synchronous inference suffers greatly from increasing inference delays, whereas the performance of real-time chunking remains completely unchanged up to +200ms.
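In code, the metric is simply the following (the function and argument names are ours, for illustration):

```python
def throughput(substeps_done, substeps_total, episode_seconds):
    """Fraction of a task's substeps completed successfully, per
    second of episode time. A fast failure scores 0; a slow success
    scores low; only fast, successful episodes score high."""
    return (substeps_done / substeps_total) / episode_seconds

# e.g., folding 4 of 5 items of clothing in a 200-second episode:
# throughput(4, 5, 200.0) -> 0.004 task-fractions per second
```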
Of course, we expected RTC to be faster, since it eliminates the pauses between chunks. To test our hypothesis about precision, we also measured policies by another metric: controller steps, or equivalently, elapsed time with the inference pauses removed. This does not affect RTC, but it gives synchronous inference an apparent boost. As shown below, the two methods often reach similar final scores when measured by controller steps; however, RTC tends to make more progress earlier in the episode, indicating that it makes fewer mistakes. Check out the full paper for more in-depth results!
VLAs and other robot foundation models must handle the physical world in real time. RTC provides a very simple and effective strategy for real-time inference with current VLAs, and a surprising amount of resilience even to high inference delays. There is a lot more to do here: future robot systems will need to make complex inferences at multiple levels of abstraction and multiple time scales, plan out complex and quick dynamic movements, and pause to “think harder” as needed. Getting this right will become more and more important as models increase in size, and the scale we need for true physical intelligence will likely necessitate more sophisticated mechanisms for real-time inference.
If you are interested in joining us to work on these and other problems to bring physical intelligence into the real world, get in touch! For researchers interested in our work, collaborations, or other queries, please write to research@physicalintelligence.company.
¹ Edge GPUs will improve over time, but as robot datasets grow in size, so will the best VLAs. We don't want to be forever limited to the size of a model we can fit onto a mobile robot's onboard computer.

² We did actually use an early version of real-time chunking for some parts of some videos, but not for any quantitative results. See if you can spot which videos have inference pauses and which don't!