SurgWM: Learning Physics from Unlabeled Surgical Video
The Simulation Gap
Reinforcement Learning (RL) has shown incredible promise in games (like my previous work at Bhoos Games). However, applying RL to robotic surgery faces a massive hurdle: the lack of a faithful simulator.
Hand-coding the physics of soft-tissue deformation is incredibly difficult. Ideally, we want an agent to learn directly from video of real surgeries. But real surgical video is unlabeled—we don't know exactly what action the surgeon took at every millisecond.
Introducing Surgical Vision World Model (SurgWM)
In our latest paper, we propose a way to learn a "World Model" purely from observation.
How it works
Inspired by the Genie framework, we use a VQ-VAE to compress visual observations into discrete tokens. The core innovation is an unsupervised Latent Action Model.
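To make the tokenization step concrete, here is a minimal PyTorch sketch of the VQ idea: a toy encoder maps each frame to a grid of latent vectors, and each vector is snapped to its nearest codebook entry to produce discrete tokens. The layer sizes, codebook size, and resolution are illustrative assumptions, not the architecture from the paper, and the commitment/codebook losses needed for training are omitted.

```python
# Minimal sketch of VQ-VAE-style tokenization (illustrative shapes, not SurgWM's
# actual architecture). A toy conv encoder maps a frame to a grid of latents,
# and each latent is replaced by the index of its nearest codebook vector.
import torch
import torch.nn as nn

class VideoTokenizer(nn.Module):
    def __init__(self, codebook_size=1024, embed_dim=64):
        super().__init__()
        # (B, 3, 128, 128) -> (B, embed_dim, 16, 16)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, frames):
        z = self.encoder(frames)                     # (B, D, H, W) continuous latents
        B, D, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, D)  # one latent vector per spatial cell
        dists = torch.cdist(flat, self.codebook.weight)
        tokens = dists.argmin(dim=1).view(B, H, W)   # discrete token ids
        return tokens
```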
Instead of needing ground-truth labels (e.g., "moved tool left"), the model infers discrete actions that explain the transition between two frames.
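Below is a hedged sketch of what such a latent action model could look like: a small network compares the representations of frame t and frame t+1 and picks one of a handful of learned discrete actions that best explains the change. The action vocabulary size, network widths, and the plain argmax (a real implementation would need a straight-through or VQ-style training objective) are all assumptions for illustration.

```python
# Sketch of an unsupervised latent action model (assumed design, not the exact
# SurgWM module): infer which discrete action best explains the transition
# from frame t to frame t+1, with no ground-truth action labels.
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    def __init__(self, token_dim=64, num_actions=8):
        super().__init__()
        # Maps the concatenated (frame_t, frame_t+1) features to action logits
        self.pair_encoder = nn.Sequential(
            nn.Linear(2 * token_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )
        # Learned embedding per discrete action, consumed by the dynamics model
        self.action_codebook = nn.Embedding(num_actions, token_dim)

    def forward(self, feat_t, feat_t1):
        # feat_t, feat_t1: (B, token_dim) pooled per-frame representations
        logits = self.pair_encoder(torch.cat([feat_t, feat_t1], dim=-1))
        action_id = logits.argmax(dim=-1)  # hard choice shown for clarity only
        return action_id, self.action_codebook(action_id)
```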
Why this matters
By optimizing the latent space to disentangle tool movement from background noise, we ensure the inferred actions are semantically meaningful. This lets us generate action-controllable surgical video.
We can now ask the model, "Show me what happens if the tool moves here," and it generates a physically plausible tissue deformation. This opens the door to training autonomous surgical agents entirely inside a dreamed world.
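As a rough illustration of how that dreamed world could be rolled out, the loop below feeds the current frame tokens and a user-chosen latent action into a learned dynamics model and repeats. `tokenizer`, `dynamics`, and `decode` are hypothetical stand-ins for trained components; none of these names or signatures come from the paper.

```python
# Hypothetical dream rollout: imagine a trajectory by repeatedly predicting the
# next frame's tokens from the current tokens plus a chosen latent action.
# `tokenizer`, `dynamics`, and `decode` are stand-ins for trained components.
import torch

@torch.no_grad()
def dream_rollout(tokenizer, dynamics, decode, first_frame, action_ids):
    tokens = tokenizer(first_frame)           # discrete tokens of the real start frame
    imagined = []
    for action_id in action_ids:              # e.g. the latent index for "move tool left"
        tokens = dynamics(tokens, action_id)  # predict the next frame's tokens
        imagined.append(decode(tokens))       # render the imagined frame to pixels
    return imagined                           # trajectory generated without touching a robot
```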