We introduce HoMeR, an imitation learning framework for mobile manipulation that combines whole-body control with hybrid action modes that handle both long-range and fine-grained motion, enabling effective performance on realistic in-the-wild tasks. At its core is a fast, kinematics-based whole-body controller that maps desired end-effector poses to coordinated motion across the mobile base and arm. Within this reduced end-effector action space, HoMeR learns to switch between absolute pose predictions for long-range movement and relative pose predictions for fine-grained manipulation, offloading low-level coordination to the controller and focusing learning on task-level decisions. We deploy HoMeR on a holonomic mobile manipulator with a 7-DoF arm in a real home. We compare HoMeR to baselines without hybrid actions or whole-body control across 3 simulated and 3 real household tasks, such as opening cabinets, sweeping trash, and rearranging pillows. Across tasks, HoMeR achieves an overall success rate of 79.17% using just 20 demonstrations per task, outperforming the next best baseline by 29.17% on average. HoMeR is also compatible with vision-language models and can leverage their internet-scale priors to better generalize to novel object appearances, layouts, and cluttered scenes. In summary, HoMeR moves beyond tabletop settings and demonstrates a scalable path toward sample-efficient, generalizable manipulation in everyday indoor spaces.
We demonstrate the range of tasks teleoperable with our whole-body controller in a real home. The controller uses a MuJoCo- and Mink-based IK solver with velocity, posture, and collision constraints—identical to the one used for all autonomous rollouts.
Our controller avoids self-collisions between the arm, base, and camera mounts by modeling each component in the MuJoCo MJCF (the camera mounts are approximated as cylinders) and enforcing velocity-based collision constraints during IK optimization.
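To make the idea of a velocity-level whole-body controller concrete, here is a minimal numpy sketch, not the actual MuJoCo/Mink solver: a planar holonomic base plus a 2-link arm share one Jacobian, and damped least squares maps end-effector error to coordinated base and joint velocities under a crude velocity limit. All names, link lengths, and gains are illustrative assumptions.

```python
import numpy as np

L1, L2 = 0.4, 0.3  # illustrative link lengths

def fk(q):
    """End-effector position for q = [base_x, base_y, joint1, joint2]."""
    xb, yb, t1, t2 = q
    return np.array([xb + L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     yb + L1 * np.sin(t1) + L2 * np.sin(t1 + t2)])

def jacobian(q):
    """Whole-body Jacobian: base columns (identity) and arm columns together."""
    _, _, t1, t2 = q
    s1, c1 = np.sin(t1), np.cos(t1)
    s12, c12 = np.sin(t1 + t2), np.cos(t1 + t2)
    return np.array([[1.0, 0.0, -L1 * s1 - L2 * s12, -L2 * s12],
                     [0.0, 1.0,  L1 * c1 + L2 * c12,  L2 * c12]])

def wbc_step(q, target, dt=0.05, damping=1e-2, vmax=1.0):
    """One velocity-level IK step: task-space error -> base+joint velocities."""
    err = target - fk(q)
    J = jacobian(q)
    # Damped least squares keeps the solve well-conditioned near singularities.
    dq = J.T @ np.linalg.solve(J @ J.T + damping * np.eye(2), err)
    dq = np.clip(dq / dt, -vmax, vmax) * dt  # crude per-DoF velocity limit
    return q + dq

q = np.array([0.0, 0.0, 0.5, 0.5])
target = np.array([1.5, 0.4])  # beyond the arm's reach: the base must translate
for _ in range(200):
    q = wbc_step(q, target)
```

Because the base DoFs sit in the same Jacobian as the arm joints, the solver naturally trades off driving versus reaching; the real controller additionally imposes posture and collision constraints inside the QP.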
In order to train HoMeR, we require mode and salient point annotations for each demonstration. We use a custom web-based UI to label control modes, allowing annotators to scrub through episodes and segment them into keypose (orange) and dense (gray) modes. Above, we annotate the Cabinet task by labeling the reaching motion toward the handle as keypose, and the grasping and opening phase as dense.
To train HoMeR's keypose policy, we provide supervision in the form of salient point annotations. Using a 3D UI, annotators can select a task-relevant point in the scene—such as the cabinet handle. This point is used to supervise the policy’s predicted saliency map and offset-based actions, helping it focus on meaningful regions to reach.
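A hedged sketch of how a single clicked point could be converted into saliency and offset labels (the label shapes and the softmax temperature are assumptions, not the paper's exact recipe): the annotation induces a soft distribution over cloud points, and the offset label is measured from the nearest cloud point to the keypose target.

```python
import numpy as np

def saliency_supervision(cloud, salient_pt, keypose_pos, temp=0.05):
    """Build saliency + offset labels from an annotated salient point.

    cloud: (N, 3) scene points; salient_pt: (3,) annotated 3D point;
    keypose_pos: (3,) target end-effector position for this segment.
    """
    d = np.linalg.norm(cloud - salient_pt, axis=1)
    # Soft saliency target: points near the annotation get high probability.
    logits = -d / temp
    saliency = np.exp(logits - logits.max())
    saliency /= saliency.sum()
    anchor = cloud[np.argmin(d)]    # nearest cloud point to the click
    offset = keypose_pos - anchor   # offset-based action label
    return saliency, offset

cloud = np.array([[0.0, 0, 0], [1.0, 0, 0], [0.0, 1, 0]])
s, off = saliency_supervision(cloud, np.array([0.9, 0, 0]),
                              np.array([1.2, 0, 0.1]))
```

Supervising an offset from a salient point, rather than a raw pose, is what later lets the same policy head be conditioned on externally provided keypoints.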
Using the data collected and annotated above, we train HoMeR, a framework that combines a hybrid IL policy with a whole-body controller for execution.
We evaluate HoMeR on six diverse tasks using only 20 demonstrations each. To isolate key design factors, we compare against baselines that differ along two axes:
All baselines use the same training data but differ in action space or control strategy.
| Method | Hybrid Actions | Whole-Body Control | Action Space |
| --- | --- | --- | --- |
| DP (B+A) | ✗ | ✗ | Dense base + end-effector delta actions |
| DP (WBC) | ✗ | ✓ | Dense end-effector delta actions |
| HoMeR (B+A) | ✓ | ✗ | Base and end-effector keyposes + dense base and end-effector delta actions |
| HoMeR (Ours) | ✓ | ✓ | End-effector keyposes + dense end-effector delta actions |
DP (B+A): Dense-only Diffusion Policy with decoupled base-arm delta prediction.
DP (WBC): Dense-only Diffusion Policy with EE delta prediction, executed through WBC.
HoMeR (B+A): Hybrid keypose+delta model without WBC.
HoMeR: Full method with hybrid actions and WBC for coordinated base-arm control.
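The distinction between the methods above boils down to how a policy output reaches the controller. A minimal sketch of the hybrid dispatch, with a stand-in controller interface (`track` is a hypothetical method, not the real API): keypose outputs are absolute end-effector poses handed directly to the whole-body controller, while dense outputs are deltas applied to the current pose.

```python
import numpy as np

class _RecordingWBC:
    """Stand-in controller: records the last commanded end-effector pose."""
    def __init__(self):
        self.last_target = None
    def track(self, pose):
        self.last_target = pose

def hybrid_step(mode, action, ee_pose, wbc):
    """Dispatch one hybrid policy output to the whole-body controller."""
    if mode == "keypose":
        target = np.asarray(action)            # absolute pose command
    else:                                      # "dense"
        target = ee_pose + np.asarray(action)  # relative delta on current pose
    wbc.track(target)
    return target

wbc = _RecordingWBC()
ee = np.zeros(3)
t1 = hybrid_step("keypose", [0.5, 0.2, 0.9], ee, wbc)   # long-range reach
t2 = hybrid_step("dense", [0.01, 0.0, -0.02], t1, wbc)  # fine-grained refine
```

Either way the controller sees only an end-effector target, which is what lets the same policy drive both long-range base motion and precise manipulation.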
This is a long-horizon, multi-phase task where the robot must reach the cabinet handle, open the cabinet door, then pick up a TV remote and place it on the TV stand.
Both DP (B+A) and DP (WBC) struggle with this task due to their reliance on dense-only actions. Without an abstracted notion of keyposes, reliably reaching the cabinet handle is difficult.
HoMeR (B+A) performs slightly better by leveraging a base keypose to approach the cabinet and an arm keypose to reach the handle, then switching to dense actions from there. However, even slight base misalignments complicate the subsequent arm manipulation.
HoMeR performs best: it first predicts a keypose to reach the cabinet, then switches to dense actions to fetch the remote. The whole-body controller (WBC) enables smooth, coordinated base-arm motion, reducing failures due to poor base positioning while enabling precise object manipulation.
A key advantage of HoMeR is the ability to condition the keypose policy on externally provided 3D keypoints—enabling flexible goal specification and better generalization. We introduce HoMeR-Cond, which takes as input 3D keypoints generated by the vision-language model Molmo based on a language description of the task.
To further improve robustness, HoMeR-Cond is trained on point clouds without color and augmented with randomly generated distractor points to simulate clutter and occlusions. We evaluate it on four challenging Cube variants with changes in size, distractors, and appearance.
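The distractor augmentation described above can be sketched as follows; the blob-sampling scheme and its scale are illustrative assumptions, not the paper's exact parameters. Small Gaussian clusters are appended around random existing points to mimic clutter, and colors are dropped entirely.

```python
import numpy as np

def add_distractors(cloud, n_extra, rng, scale=0.1):
    """Append random distractor clusters to a colorless point cloud.

    cloud: (N, 3) xyz points. Samples n_extra points as Gaussian blobs
    around randomly chosen existing points to simulate clutter/occlusion.
    """
    centers = cloud[rng.integers(len(cloud), size=n_extra)]
    blobs = centers + rng.normal(scale=scale, size=(n_extra, 3))
    return np.concatenate([cloud, blobs], axis=0)

rng = np.random.default_rng(0)
cloud = rng.uniform(size=(256, 3))
aug = add_distractors(cloud, n_extra=64, rng=rng)
```

Because the policy is conditioned on an external salient point, such injected clutter teaches it to ignore geometry unrelated to that point rather than memorize clean training scenes.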
Both HoMeR and HoMeR-Cond-NoAugs (the conditioned variant trained without point cloud augmentations) succeed in simple settings but degrade in the presence of distractors or novel objects. In contrast, HoMeR-Cond remains robust, demonstrating the value of combining salient point conditioning with point cloud augmentations.
Below, we show that while HoMeR is trained on a single TV stand cabinet and a specific pillow–couch configuration, the learned policy achieves non-zero success on novel cabinets and a different pillow/couch pair at test time. HoMeR still struggles considerably on the new couch/pillow setup, likely due to the markedly different wall color and lighting conditions.
Although achieving truly zero-shot generalization in unconstrained real-world environments remains a significant challenge, we are encouraged by these promising initial results and by HoMeR's potential to leverage VLM knowledge to handle substantially different furniture geometry and room layouts in the future.