Abstract
We present R+X, a framework which enables robots to learn skills from long, unlabelled first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then conditions an in-context imitation learning technique on this behaviour to execute the skill. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods.
Key Idea: R+X learns skills from a long, unlabelled, first-person video of a human performing everyday tasks, using foundation models to retrieve demonstrations and to compute and execute the desired actions.
Record Anywhere, from Multiple Views
R+X can learn from videos recorded in a wide variety of environments. We test it by recording everyday tasks in multiple rooms, multiple buildings, and even outdoors. We also show that it can use different views, including a chest-mounted camera, a head-mounted camera, or a third-person camera.
Deploy Immediately to Novel Environments and Objects
Skills learned from videos can generalise to novel environments filled with distractors, and even to unseen test objects.
R+X: Overview
Example R+X Policy Rollouts
We teach a robot 12 everyday skills simply by recording a video of a user performing these tasks in everyday environments, without providing any form of labels. Given a language command, R+X can execute these tasks without any training or finetuning, by leveraging the abilities of recent Vision Language Models and employing a few-shot in-context imitation learning model.
(The gripper trajectory starts at red and progresses to blue)
Hard Spatial and Language Generalisation
When comparing R+X to monolithic language-conditioned policies as baselines, we noticed two main sources of difference in performance: the ability to generalise spatially and to nuanced language. These differences mostly arise from (1) R+X's ability to extract semantic keypoints by comparing the retrieved video clips, and (2) R+X's use of Gemini, a large Vision Language Model, to perform retrieval, which allows it to inherit Gemini's language and vision understanding.
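To illustrate how such VLM-based retrieval can be set up, here is a minimal sketch using the google-generativeai Python client; the frame sampling, prompt wording, model choice, and clip indexing below are illustrative assumptions rather than R+X's exact procedure, which is described in the paper.

```python
# Minimal sketch of VLM-based clip retrieval. Assumptions: one (or a few)
# representative frames per clip, a simple numbered-image prompt, and the
# gemini-1.5-flash model; none of these are R+X's exact choices.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # assumed: user-provided key
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model choice

def retrieve_clips(command, clip_frame_paths):
    """clip_frame_paths: list of lists of image paths, one list per candidate clip."""
    parts = [f"A user asks a robot: '{command}'.\n"
             "Each numbered group of images below comes from a different video clip "
             "of a human performing a household task. "
             "Reply with the numbers of the clips that show the commanded task."]
    for i, frames in enumerate(clip_frame_paths):
        parts.append(f"Clip {i}:")
        parts.extend(Image.open(p) for p in frames)
    response = model.generate_content(parts)
    return response.text  # e.g. "2, 7" -> parsed into clip indices downstream

# Hypothetical usage:
# retrieve_clips("put the cloth in the basket", [["clip0_frame.jpg"], ["clip1_frame.jpg"]])
```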
Here we show example trajectories generated by R+X and by R3M-DiffLang, highlighting how R+X succeeds even when target objects are placed on top of other objects (creating an out-of-distribution spatial configuration), and when receiving a nuanced language command. More details can be found in the paper.
Robustness to Distractors
As we show in our paper, R+X is more robust to distractors than the baselines. This property emerges from our method's ability to first retrieve examples of the task being completed, and then to find a set of semantically and geometrically meaningful keypoints by matching DINO descriptors that are common across the frames of the retrieved videos. We show results and example trajectories with and without distractors on two tasks, along with the keypoints extracted by our method on the "put cloth in basket" task.
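As a rough illustration of the matching step, the sketch below assumes per-frame DINO patch descriptors have already been extracted (e.g. with DINOv2) as (N, D) arrays; the similarity threshold and the "must match in every frame" rule are simplifying assumptions, not R+X's exact procedure.

```python
import numpy as np

def consistent_keypoints(descriptor_sets, sim_threshold=0.7):
    """Return indices of descriptors in the first (reference) frame that have a
    cosine-similarity match above `sim_threshold` in every other retrieved frame.

    descriptor_sets: list of (N_i, D) arrays of DINO patch descriptors, one per frame.
    """
    ref = descriptor_sets[0]
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    keep = np.ones(len(ref), dtype=bool)
    for other in descriptor_sets[1:]:
        other = other / np.linalg.norm(other, axis=1, keepdims=True)
        sim = ref @ other.T                       # (N_ref, N_other) cosine similarities
        keep &= sim.max(axis=1) >= sim_threshold  # must have a match in this frame too
    return np.nonzero(keep)[0]

# Toy usage with random stand-ins for real DINO features:
rng = np.random.default_rng(0)
sets = [rng.normal(size=(200, 384)) for _ in range(4)]
print(consistent_keypoints(sets, sim_threshold=0.2))
```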
Mapping Hand to Gripper
Given a clip retrieved from our human video, we leverage a recent hand-detection model, HaMeR, to extract the 21 joints parametrising the MANO hand model. From these, based on the task, we use a heuristic to translate hand movement into gripper movement.
To align the gripper with the hand pose, we employ different heuristics depending on the kind of "action" the hand is performing, such as grasping, pushing, or pressing. The action itself is determined by Gemini during the retrieval phase. More information can be found in our Supplementary Material.
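As a concrete but deliberately simplified illustration of such a heuristic for the grasping case, the sketch below maps the 21 detected hand joints to a parallel-jaw gripper pose. The joint indices (0 = wrist, 4 = thumb tip, 8 = index fingertip) follow a common 21-keypoint hand convention, and the specific construction is an assumption rather than the exact rule used in R+X.

```python
import numpy as np

WRIST, THUMB_TIP, INDEX_TIP = 0, 4, 8  # assumed 21-keypoint hand joint convention

def grasp_heuristic(joints):
    """joints: (21, 3) array of 3D hand joint positions from the hand detector.

    Returns a gripper position, a 3x3 orientation matrix, and an opening width,
    following a simple grasping heuristic (illustrative, not R+X's exact rule).
    """
    centre = 0.5 * (joints[THUMB_TIP] + joints[INDEX_TIP])  # grasp point between fingertips
    close_axis = joints[INDEX_TIP] - joints[THUMB_TIP]      # jaws close along thumb -> index
    close_axis /= np.linalg.norm(close_axis)
    approach = centre - joints[WRIST]                        # approach roughly from the wrist
    approach -= approach.dot(close_axis) * close_axis        # orthogonalise w.r.t. closing axis
    approach /= np.linalg.norm(approach)
    normal = np.cross(close_axis, approach)                  # complete a right-handed frame
    rotation = np.stack([close_axis, normal, approach], axis=1)
    width = np.linalg.norm(joints[INDEX_TIP] - joints[THUMB_TIP])
    return centre, rotation, width
```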
Stabilising a First Person Video
Because the camera is attached to the user's chest, its pose moves from frame to frame, making it difficult to represent the scene as a fixed point cloud and to express the hand trajectory in that frame of reference. As classic Structure-from-Motion techniques tend to fail when objects move in the scene (such as the user's arm), we design a different pipeline: we first segment out the arm and look for stable objects such as tables, walls, and floors, and then leverage TAPIR, a keypoint-tracking method, to compute the relative transformation between frames and thereby recover the camera motion.
Note that we do not explicitly use these point clouds as inputs to our models; rather, we extract a set of sparse visual 3D keypoints as described in the paper.
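To make the frame-to-frame alignment step concrete, here is a minimal sketch that fits a rigid transform to tracked background keypoints; it assumes the TAPIR tracks have already been back-projected to 3D using depth, and uses a standard Kabsch/SVD fit rather than R+X's exact implementation.

```python
import numpy as np

def rigid_transform(pts_a, pts_b):
    """Estimate rotation R and translation t such that R @ p_a + t ~= p_b,
    from corresponding 3D background keypoints in two frames (Kabsch/SVD).

    pts_a, pts_b: (N, 3) arrays of the same tracked static points in frames A and B.
    """
    ca, cb = pts_a.mean(axis=0), pts_b.mean(axis=0)
    H = (pts_a - ca).T @ (pts_b - cb)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cb - R @ ca
    return R, t

# Chaining these frame-to-frame transforms recovers the camera motion, which lets
# every point cloud and hand pose be expressed in a single fixed reference frame.
```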
Retrieved Video
Point Cloud before Stabilisation
Gripper's trajectory before Stabilisation
Point Cloud after Stabilisation
Gripper's trajectory after Stabilisation
Preprocessing the Human Video
After recording the video of a human performing everyday tasks, we process it to remove unnecessary frames, i.e., frames that do not contain human hands. We do this automatically by leveraging HaMeR to detect the frames where hands are present. This leaves us with one long, unlabelled video made of smaller clips concatenated together, each containing only frames where a human interacts with objects using their hands.
The above frames are extracted from the video below, an example of the data we record in R+X. As recording a single, tens-of-minutes-long video emulating household tasks in a lab setting is challenging, we instead recorded several videos like the one below, each tackling several tasks in a single take. We then process them as described above, obtaining a list of video clips, and concatenate them all together. Since the parts where hands are not visible are removed, the result is essentially identical to a single, tens-of-minutes-long take, with the difference that "resetting" the tasks is easier between takes.
The above clip, as shown in our visualisation above, is automatically processed into the following clips, each containing a visible hand.
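As a small illustration of this filtering step, the sketch below splits a video into clips of contiguous frames where a hand was detected; the per-frame hand_present flags stand in for the output of the hand detector, and the exact clip boundaries used by R+X are not reproduced here.

```python
def split_into_clips(hand_present):
    """hand_present: list of booleans, one per frame (e.g. derived from HaMeR detections).

    Returns a list of (start, end) frame indices (end exclusive), one per clip in
    which a hand is visible; frames without hands are discarded.
    """
    clips, start = [], None
    for i, present in enumerate(hand_present):
        if present and start is None:
            start = i                     # a new clip begins
        elif not present and start is not None:
            clips.append((start, i))      # the current clip ends
            start = None
    if start is not None:
        clips.append((start, len(hand_present)))
    return clips

# Example: frames 2-4 and 7-8 contain hands.
print(split_into_clips([False, False, True, True, True, False, False, True, True]))
# -> [(2, 5), (7, 9)]
```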