HoMeR: Learning In-the-Wild Mobile Manipulation via Hybrid Imitation and Whole-Body Control

1Stanford University, 2University of Cambridge

Autonomous Rollouts


Above: HoMeR tackles long-horizon, precise, spatially varied tasks in the real world and
in simulation, having been trained on only 20 demonstrations per task. Videos are played at 7× speed.


TLDR: We introduce HoMeR (Hybrid Whole-Body Policies for Mobile Robots),
which combines a hybrid IL agent with a fast, kinematics-based whole-body controller for sample-efficient, generalizable mobile manipulation in-the-wild 🏡.

Abstract

We introduce HoMeR, an imitation learning framework for mobile manipulation that combines whole-body control with hybrid action modes that handle both long-range and fine-grained motion, enabling effective performance on realistic in-the-wild tasks. At its core is a fast, kinematics-based whole-body controller that maps desired end-effector poses to coordinated motion across the mobile base and arm. Within this reduced end-effector action space, HoMeR learns to switch between absolute pose predictions for long-range movement and relative pose predictions for fine-grained manipulation, offloading low-level coordination to the controller and focusing learning on task-level decisions. We deploy HoMeR on a holonomic mobile manipulator with a 7-DoF arm in a real home. We compare HoMeR to baselines without hybrid actions or whole-body control across 3 simulated and 3 real household tasks such as opening cabinets, sweeping trash, and rearranging pillows. Across tasks, HoMeR achieves an overall success rate of 79.17% using just 20 demonstrations per task, outperforming the next best baseline by 29.17% on average. HoMeR is also compatible with vision-language models and can leverage their internet-scale priors to better generalize to novel object appearances, layouts, and cluttered scenes. In summary, HoMeR moves beyond tabletop settings and demonstrates a scalable path toward sample-efficient, generalizable manipulation in everyday indoor spaces.



Data Collection

Whole Body Teleoperation on a Diverse Set of Household Tasks

We demonstrate the range of tasks teleoperable with our whole-body controller in a real home. The controller uses a MuJoCo- and Mink-based IK solver with velocity, posture, and collision constraints—identical to the one used for all autonomous rollouts.
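The controller described above solves a constrained IK problem in MuJoCo/Mink; as a minimal illustration of the core idea, the sketch below maps a 6-DoF end-effector pose error to joint velocities over the combined base + arm degrees of freedom via damped least squares. The Jacobian, DoF layout, and gains are illustrative assumptions, not the paper's actual solver.

```python
import numpy as np

def whole_body_ik_step(jacobian, ee_error, damping=1e-2, max_vel=1.0):
    """One damped-least-squares IK step mapping a 6-DoF end-effector
    pose error to joint velocities for the combined base + arm.

    jacobian: (6, n) end-effector Jacobian over all DoFs
              (e.g. 3 planar base DoFs + 7 arm joints -> n = 10).
    ee_error: (6,) stacked position + orientation error of the EE.
    Returns an (n,) joint-velocity command, clipped to max_vel.
    """
    J, e = np.asarray(jacobian), np.asarray(ee_error)
    n = J.shape[1]
    # Damped pseudoinverse: (J^T J + lambda^2 I)^-1 J^T e
    qdot = np.linalg.solve(J.T @ J + damping**2 * np.eye(n), J.T @ e)
    return np.clip(qdot, -max_vel, max_vel)

# Toy example: 6x10 Jacobian where only base-x translation moves the EE in x.
J = np.zeros((6, 10)); J[0, 0] = 1.0
e = np.array([0.5, 0, 0, 0, 0, 0])   # 0.5 m EE error along x
qdot = whole_body_ik_step(J, e)
```

Because base and arm DoFs live in one Jacobian, the solver naturally trades off base and arm motion; the real controller additionally imposes velocity, posture, and collision constraints inside the optimization.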

Whole Body Teleoperation: Self-Collision Avoidance

Our controller avoids self-collisions between the arm, base, and camera mounts by modeling each component—including the mounts as cylinders—in the MuJoCo MJCF and enforcing velocity-based collision constraints during IK optimization.
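One common way to express such velocity-based collision constraints is a "velocity damper" that bounds the approach speed of a collision pair as it nears a minimum distance. The sketch below enforces the bound by projecting out the violating velocity component; the specific pair distances, Jacobians, and gains are hypothetical, and the real controller handles this inside the IK optimization rather than as a post-hoc correction.

```python
import numpy as np

def apply_collision_damper(qdot, dist, dist_jac, d_min=0.02, xi=0.5, dt=0.02):
    """Velocity-based collision constraint (a simplified velocity damper).

    Limits the approach speed of one collision pair so its signed
    distance `dist` cannot shrink toward `d_min` faster than allowed:
        dist_jac @ qdot >= -xi * (dist - d_min) / dt
    If the commanded qdot violates this, remove just enough of the
    approaching component (along dist_jac) to satisfy the bound.
    """
    qdot = np.asarray(qdot, dtype=float)
    bound = -xi * (dist - d_min) / dt        # most negative allowed d(dist)/dt
    ddist = float(np.dot(dist_jac, qdot))    # predicted d(dist)/dt
    if ddist < bound:
        jn = float(np.dot(dist_jac, dist_jac))
        if jn > 0:
            qdot = qdot + ((bound - ddist) / jn) * np.asarray(dist_jac)
    return qdot

# Toy 1-DoF example: commanded motion approaches an obstacle too fast,
# so the damper slows it to the maximum allowed approach rate.
safe = apply_collision_damper(np.array([-1.0]), dist=0.03, dist_jac=np.array([1.0]))
```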

Data Annotation: Mode Labeling








Training HoMeR requires mode and salient-point annotations for each demonstration. We use a custom web-based UI to label control modes, allowing annotators to scrub through episodes and segment them into keypose (orange) and dense (gray) modes. Above, we annotate the Cabinet task by labeling the reaching motion toward the handle as keypose, and the grasping and opening phase as dense.
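Concretely, segment annotations from the UI can be expanded into per-timestep mode labels for training. The segment tuple format below is an illustrative assumption about the annotation output, not the project's actual schema.

```python
import numpy as np

# Hypothetical annotation format: (start, end, mode) segments from the
# labeling UI, expanded into a per-timestep mode array for training.
KEYPOSE, DENSE = 0, 1

def expand_mode_labels(num_steps, segments):
    """segments: list of (start, end, mode) tuples, end exclusive."""
    modes = np.full(num_steps, DENSE, dtype=int)
    for start, end, mode in segments:
        modes[start:end] = mode
    return modes

# Cabinet example: reach the handle via keypose, then grasp/open densely.
modes = expand_mode_labels(100, [(0, 40, KEYPOSE), (40, 100, DENSE)])
```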

Data Annotation: Salient Point Labeling

To train HoMeR's keypose policy, we provide supervision in the form of salient point annotations. Using a 3D UI, annotators can select a task-relevant point in the scene—such as the cabinet handle. This point is used to supervise the policy’s predicted saliency map and offset-based actions, helping it focus on meaningful regions to reach.
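At inference time, this offset-from-salient-point parameterization can be decoded roughly as follows: take the highest-scoring point in the predicted saliency map and add the predicted 3D offset. This is a sketch under assumed shapes and names; the actual policy heads and decoding details may differ.

```python
import numpy as np

def decode_keypose_position(points, saliency_logits, offset):
    """Decode a target end-effector position from a per-point saliency
    map plus a predicted 3D offset from the salient point.

    points:          (N, 3) point cloud.
    saliency_logits: (N,) per-point scores.
    offset:          (3,) predicted displacement from the salient point.
    """
    salient_idx = int(np.argmax(saliency_logits))
    return points[salient_idx] + np.asarray(offset)

# Toy scene: point 1 (say, the cabinet handle) is most salient, and the
# policy asks to stop 10 cm above it.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 0.5], [0.3, 0.1, 0.9]])
logits = np.array([0.1, 3.2, 0.5])
target = decode_keypose_position(pts, logits, offset=[0.0, 0.0, 0.1])
```

Supervising the saliency map with the annotated point (rather than regressing the pose directly) is what lets an external module, such as a VLM, swap in its own salient point at test time.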


Method Overview

Using the data collected and annotated above, we train HoMeR, a framework that
combines a hybrid IL policy with a whole-body controller for execution:



  • Keypose Policy: Uses third-person point clouds to predict 6-DoF end-effector poses and next control mode for long-range motion. Rather than directly regressing a target pose, the policy learns to predict a task-relevant salient point in the scene and outputs a relative offset from this point to the desired position. Separate learnable tokens predict end-effector orientation, gripper state, and control mode. The policy can optionally be conditioned on an externally specified salient point—e.g., from a vision-language model—for dynamic and interpretable goal specification (HoMeR-Cond).
  • Dense Policy: A Diffusion Policy that uses RGB images (third-person and wrist) to predict relative 6-DoF delta actions for fine-grained manipulation once near objects.
  • Whole-Body Controller (WBC): Converts end-effector actions into joint commands for the mobile base and arm, enabling smooth, constraint-aware execution.
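The interplay of the three components above can be sketched as a simple execution loop: the keypose policy proposes an absolute end-effector pose plus the next control mode, the dense policy streams relative deltas, and the WBC turns either into coordinated base + arm motion. All components here are illustrative stubs, not the real policies.

```python
# Minimal sketch of HoMeR's hybrid execution loop (all stubs are illustrative).
KEYPOSE, DENSE = "keypose", "dense"

def run_episode(keypose_policy, dense_policy, wbc, obs, max_steps=10):
    mode, trace = KEYPOSE, []
    for _ in range(max_steps):
        if mode == KEYPOSE:
            # Absolute 6-DoF target pose + predicted next mode.
            target_pose, mode = keypose_policy(obs)
            wbc.track(target_pose)               # long-range motion
        else:
            delta = dense_policy(obs)            # relative 6-DoF delta
            wbc.track_relative(delta)            # fine-grained motion
        trace.append(mode)
    return trace

class StubWBC:
    """Records commands in place of a real whole-body controller."""
    def __init__(self): self.commands = []
    def track(self, pose): self.commands.append(("abs", pose))
    def track_relative(self, delta): self.commands.append(("rel", delta))

def keypose_policy(obs):    # reach phase: go to the handle, then switch to dense
    return [0.5, 0.0, 0.8], DENSE

def dense_policy(obs):      # manipulation phase: small end-effector deltas
    return [0.01, 0.0, 0.0]

wbc = StubWBC()
trace = run_episode(keypose_policy, dense_policy, wbc, obs=None, max_steps=4)
```

Note that the learned policies only ever emit end-effector targets; deciding how much the base versus the arm should move is left entirely to the WBC.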

Experiments

We evaluate HoMeR on six diverse tasks using only 20 demonstrations each. To isolate key design factors, we compare against baselines that differ along two axes:

  • Hybrid vs. Dense-Only: Switching between keypose and dense actions vs. using only dense actions.
  • Whole-Body Control vs. Decoupled Base+Arm: Coordinated control using a WBC vs. separate base and arm motion.

All baselines use the same training data but differ in action space or control strategy.

Baselines Overview

| Method | Hybrid Actions | Whole-Body Control | Action Space |
|---|---|---|---|
| DP (B+A) | ✗ | ✗ | Dense base + end-effector delta actions |
| DP (WBC) | ✗ | ✓ | Dense end-effector delta actions |
| HoMeR (B+A) | ✓ | ✗ | Base and end-effector keyposes + dense base and end-effector delta actions |
| HoMeR (Ours) | ✓ | ✓ | End-effector keyposes + dense end-effector delta actions |


DP (B+A): Dense-only Diffusion Policy with decoupled base-arm delta prediction.
DP (WBC): Dense-only Diffusion Policy with EE delta prediction, executed through WBC.
HoMeR (B+A): Hybrid keypose+delta model without WBC.
HoMeR: Full method with hybrid actions and WBC for coordinated base-arm control.


Benchmarking Results

HoMeR Rollouts

Cube (18/20)

Dishwasher (13/20)

Cabinet (17/20)

Pillow Rearrangement (16/20)

TV Remote Retrieval (15/20)

Sweeping Trash (16/20)

Representative Baseline Comparison: TV Remote Task

This is a long-horizon, multi-phase task where the robot must reach the cabinet handle, open the cabinet door, then pick up a TV remote and place it on the TV stand.

Both DP (B+A) and DP (WBC) struggle with this task due to their reliance on dense-only actions. Without an abstracted notion of keyposes, reliably reaching the cabinet handle is difficult.

DP (B+A) - (7/20)

DP (WBC) - (3/20)

HoMeR (B+A) performs slightly better by leveraging a base keypose to approach the cabinet and an arm keypose to reach the handle, switching to dense actions from there. However, even slight base misalignments complicate the subsequent arm manipulation.
HoMeR performs best: it first predicts a keypose to reach the cabinet, then switches to dense actions to fetch the remote. The whole-body controller (WBC) enables smooth, coordinated base-arm motion, reducing failures due to poor base positioning while enabling precise object manipulation.

HoMeR (B+A) - (10/20)

HoMeR (Ours) - (15/20)


Quantitative Generalization Results

A key advantage of HoMeR is the ability to condition the keypose policy on externally provided 3D keypoints—enabling flexible goal specification and better generalization. We introduce HoMeR-Cond, which takes as input 3D keypoints generated by the vision-language model Molmo based on a language description of the task.

To further improve robustness, HoMeR-Cond is trained on point clouds without color and augmented with randomly generated distractor points to simulate clutter and occlusions. We evaluate it on four challenging Cube variants with changes in size, distractors, and appearance.
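The distractor-point augmentation can be sketched as follows: sample random points in (a slightly expanded version of) the scene's bounding box and append them to the colorless cloud. The sampling scheme and parameters are assumptions for illustration; the actual augmentation details may differ.

```python
import numpy as np

def add_distractor_points(points, num_distractors=128, margin=0.3, rng=None):
    """Augment a colorless point cloud with random distractor points to
    simulate clutter and occlusions during training.

    points: (N, 3) array. Distractors are sampled uniformly in the
    scene's axis-aligned bounding box, expanded by `margin` per side.
    """
    rng = rng or np.random.default_rng()
    lo = points.min(axis=0) - margin
    hi = points.max(axis=0) + margin
    distractors = rng.uniform(lo, hi, size=(num_distractors, 3))
    return np.concatenate([points, distractors], axis=0)

# Toy cloud of 512 points at the origin, padded with 64 distractors.
cloud = np.zeros((512, 3))
augmented = add_distractor_points(cloud, num_distractors=64,
                                  rng=np.random.default_rng(0))
```

Training on such clouds forces the saliency head to rely on the conditioned salient point rather than memorizing the clean scene geometry.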

Both HoMeR and HoMeR-Cond-NoAugs succeed in simple settings but degrade in the presence of distractors or novel objects. In contrast, HoMeR-Cond remains robust, demonstrating the value of combining salient point conditioning with point cloud augmentations.

HoMeR-Cond Rollouts

Smaller/Larger Cube (14/20)

Cube w/ Distractors (15/20)

New Cube Specified (16/20)


Qualitative Generalization Results

Below, we show that while HoMeR is trained on a single TV stand cabinet and a specific pillow–couch configuration, the learned policy achieves non-zero success on novel cabinets and a different pillow/couch pair at test time. We note that HoMeR still struggles quite a bit on the new couch/pillow setup, likely due to the vastly different wall color and lighting conditions.

Unseen TV Remote / Pillow Scenarios

Although achieving truly zero-shot generalization in unconstrained real-world environments remains a significant challenge, we are encouraged by HoMeR's promising initial results and potential to leverage VLM knowledge in the future to tackle truly different furniture geometry or room layouts.