CLIPort: What and Where Pathways for Robotic Manipulation
Mohit Shridhar, Lucas Manuelli, Dieter Fox
CoRL 2021
CLIPort is an end-to-end imitation-learning agent that can learn a single language-conditioned policy for various tabletop tasks. The framework combines the broad semantic understanding (what) of CLIP with the spatial precision (where) of TransporterNets to learn generalizable skills from limited training demonstrations.
For the latest updates, see: cliport.github.io
Clone Repo:
git clone https://github.com/cliport/cliport.git
Setup virtualenv and install requirements:
# setup virtualenv with whichever package manager you prefer
virtualenv -p $(which python3.8) --system-site-packages cliport_env
source cliport_env/bin/activate
pip install --upgrade pip
cd cliport
pip install -r requirements.txt
export CLIPORT_ROOT=$(pwd)
python setup.py develop
Note: You might need versions of torch==1.7.1 and torchvision==0.8.2 that are compatible with your CUDA and hardware.
A quick tutorial on evaluating a pre-trained multi-task model.
Download a pre-trained checkpoint for multi-language-conditioned trained with 1000 demos:
sh scripts/quickstart_download.sh
Generate a small test set of 10 instances for stack-block-pyramid-seq-seen-colors inside $CLIPORT_ROOT/data:
python cliport/demos.py n=10 \
task=stack-block-pyramid-seq-seen-colors \
mode=test
This will take a few minutes to finish.
Evaluate the best validation checkpoint for stack-block-pyramid-seq-seen-colors on the test set:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=10 \
train_demos=1000 \
exp_folder=cliport_quickstart \
checkpoint_type=test_best \
update_results=True \
disp=True
If you are on a headless machine, turn off the visualization with disp=False.
You can evaluate the same multi-language-conditioned model on other tasks. First generate a val set for the task and then specify eval_task=<task_name> with mode=val and checkpoint_type=val_missing (the quickstart doesn't include validation results for all tasks; download all task results from here).
Check out affordance.ipynb to visualize affordance predictions of cliport on various tasks.
Download center-of-mass (COM) corrected Google Scanned Objects:
sh scripts/google_objects_download.sh
Credit: Google.
This Google Drive Folder contains pre-trained multi-language-conditioned checkpoints for n=1,10,100,1000 and validation/test result JSONs for all tasks. The *val-results.json files contain the name of the best checkpoint (from validation) to be evaluated on the test set.
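As a rough illustration of how a results file could be used to pick a checkpoint, the sketch below loads a JSON file and returns the highest-scoring entry. The layout (checkpoint name mapped to a dict with a `mean_reward` field), the helper name `best_checkpoint`, and the example file name are assumptions made for illustration; inspect your own *val-results.json before relying on it.

```python
import json
import os
import tempfile


def best_checkpoint(results_path, score_key="mean_reward"):
    """Return (checkpoint_name, score) for the highest-scoring entry.

    Assumes a hypothetical layout: {"<ckpt name>": {"mean_reward": float, ...}}.
    """
    with open(results_path) as f:
        results = json.load(f)
    # Ignore entries that do not carry the score field.
    scored = {name: entry[score_key] for name, entry in results.items()
              if isinstance(entry, dict) and score_key in entry}
    if not scored:
        raise ValueError(f"no entries with '{score_key}' in {results_path}")
    name = max(scored, key=scored.get)
    return name, scored[name]


if __name__ == "__main__":
    # Made-up numbers, only to exercise the function.
    demo = {"steps=100000.ckpt": {"mean_reward": 0.71},
            "steps=200000.ckpt": {"mean_reward": 0.84}}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example-val-results.json")
        with open(path, "w") as f:
            json.dump(demo, f)
        print(best_checkpoint(path))
```

If your files use a different field name, pass it through `score_key`.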
Note: Google Drive might complain about bandwidth restrictions. I recommend using rclone with API access enabled.
Evaluate the best validation checkpoint on the test set:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=10 \
train_demos=100 \
exp_folder=cliport_exps \
checkpoint_type=test_best \
update_results=True \
disp=True
A single NVIDIA GPU with 8.5 to 9.5GB memory should be sufficient for training and evaluation.
Tested with:
The following is a guide for training everything from scratch. All tasks follow a 4-phase workflow:
1. Generate train, val, and test datasets with demos.py
2. Train agents with train.py
3. Run validation with eval.py to find the best checkpoint on val tasks and save *val-results.json
4. Evaluate the best checkpoint in *val-results.json on test tasks with eval.py
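The four phases can be laid out as a sequence of commands. The sketch below only builds the argument lists and prints them; it does not run anything. The helper name `workflow_commands` is hypothetical, and the flags are copied from the single-task commands in this README rather than taken from the scripts themselves.

```python
import shlex


def workflow_commands(task, n_train=1000, n_eval=100, exp_folder="exps"):
    """Build the four phases as argument lists (nothing is executed here)."""
    cmds = []
    # Phase 1: generate train, val, and test datasets.
    for mode, n in (("train", n_train), ("val", n_eval), ("test", n_eval)):
        cmds.append(["python", "cliport/demos.py",
                     f"n={n}", f"task={task}", f"mode={mode}"])
    # Phase 2: train the agent.
    cmds.append(["python", "cliport/train.py",
                 f"train.task={task}",
                 "train.agent=cliport",
                 "train.attn_stream_fusion_type=add",
                 "train.trans_stream_fusion_type=conv",
                 "train.lang_fusion_type=mult",
                 f"train.n_demos={n_train}",
                 "train.n_steps=201000",
                 f"train.exp_folder={exp_folder}",
                 "dataset.cache=False"])
    # Phase 3: validate all checkpoints. Phase 4: test the best one.
    for mode, ckpt in (("val", "val_missing"), ("test", "test_best")):
        cmds.append(["python", "cliport/eval.py",
                     f"eval_task={task}",
                     "agent=cliport",
                     f"mode={mode}",
                     f"n_demos={n_eval}",
                     f"train_demos={n_train}",
                     f"checkpoint_type={ckpt}",
                     f"exp_folder={exp_folder}"])
    return cmds


if __name__ == "__main__":
    for cmd in workflow_commands("stack-block-pyramid-seq-seen-colors"):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell from $CLIPORT_ROOT, or passed to `subprocess.run` if you prefer to drive the phases from Python.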
Generate a train set of 1000 demonstrations for stack-block-pyramid-seq-seen-colors inside $CLIPORT_ROOT/data:
python cliport/demos.py n=1000 \
task=stack-block-pyramid-seq-seen-colors \
mode=train
You can also do a sequential sweep with -m and comma-separated params task=towers-of-hanoi-seq-seen-colors,stack-block-pyramid-seq-seen-colors. Use disp=True to visualize the data generation.
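To make the sweep semantics concrete, the sketch below shows how comma-separated values could expand into one set of overrides per run. `expand_sweep` is a hypothetical helper that only approximates multirun-style behaviour; it is not the launcher's actual implementation.

```python
from itertools import product


def expand_sweep(overrides):
    """Expand 'key=a,b' overrides into one list of 'key=value' per run."""
    keys, choices = [], []
    for item in overrides:
        key, _, value = item.partition("=")
        keys.append(key)
        # A value without commas yields a single choice.
        choices.append(value.split(","))
    # One run per combination, in the order the values were written.
    return [[f"{k}={v}" for k, v in zip(keys, combo)]
            for combo in product(*choices)]


if __name__ == "__main__":
    sweep = [
        "n=1000",
        "task=towers-of-hanoi-seq-seen-colors,stack-block-pyramid-seq-seen-colors",
        "mode=train",
    ]
    for run in expand_sweep(sweep):
        print("python cliport/demos.py " + " ".join(run))
```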
Run generate_dataset.sh to generate the full dataset and save it to $CLIPORT_ROOT/data:
sh scripts/generate_dataset.sh data
Note: This script is not parallelized and will take a long time (maybe days) to finish. The full dataset requires ~1.6TB of storage, which includes both language-conditioned and demo-conditioned (original TransporterNets) tasks. It's recommended that you start with single-task training if you don't have enough storage space.
Make sure you have a train (n demos) and val (100 demos) set for the task you want to train on.
Train a cliport agent with 1000 demonstrations on the stack-block-pyramid-seq-seen-colors task for 200K iterations:
python cliport/train.py train.task=stack-block-pyramid-seq-seen-colors \
train.agent=cliport \
train.attn_stream_fusion_type=add \
train.trans_stream_fusion_type=conv \
train.lang_fusion_type=mult \
train.n_demos=1000 \
train.n_steps=201000 \
train.exp_folder=exps \
dataset.cache=False
Iteratively evaluate all the checkpoints on val and save the results in exps/<task>-train/checkpoints/<task>-val-results.json:
python cliport/eval.py eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=val \
n_demos=100 \
train_demos=1000 \
checkpoint_type=val_missing \
exp_folder=exps
Choose the best checkpoint from validation to run on the test set and save the results in exps/<task>-train/checkpoints/<task>-test-results.json:
python cliport/eval.py eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=100 \
train_demos=1000 \
checkpoint_type=test_best \
exp_folder=exps
Train multi-task models by specifying task=multi-language-conditioned, task=multi-attr-packing-box-pairs-unseen-colors, etc.
python cliport/train.py train.task=multi-language-conditioned \
train.agent=cliport \
train.attn_stream_fusion_type=add \
train.trans_stream_fusion_type=conv \
train.lang_fusion_type=mult \
train.n_demos=1000 \
train.n_steps=601000 \
dataset.cache=False \
train.exp_folder=exps \
dataset.type=multi
Important: You need to generate the full dataset of tasks specified in dataset.py before multi-task training or modify the list of tasks here.
Run validation with a trained multi-language-conditioned multi-task model on stack-block-pyramid-seq-seen-colors:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=val \
n_demos=100 \
train_demos=1000 \
checkpoint_type=val_missing \
type=single \
exp_folder=exps
Evaluate the best checkpoint on the test set:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=100 \
train_demos=1000 \
checkpoint_type=test_best \
type=single \
exp_folder=exps
To save high-resolution videos of agent executions, set record.save_video=True:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=10 \
train_demos=100 \
exp_folder=cliport_exps \
checkpoint_type=test_best \
update_results=True \
disp=True \
record.save_video=True
This will save videos inside ${model_dir}/${exp_folder}/${eval_task}-${agent}-n${train_demos}-train/videos/.
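For a concrete sense of where the videos land, the sketch below fills in the path template with the values from the command above. `video_dir` is a hypothetical helper, and the `model_dir` value in the example is an assumed placeholder; substitute whatever your configuration uses.

```python
from pathlib import PurePosixPath


def video_dir(model_dir, exp_folder, eval_task, agent, train_demos):
    """Fill in the template from this README for one evaluation run."""
    run_name = f"{eval_task}-{agent}-n{train_demos}-train"
    return PurePosixPath(model_dir) / exp_folder / run_name / "videos"


if __name__ == "__main__":
    print(video_dir(
        model_dir="/path/to/cliport",  # assumed placeholder
        exp_folder="cliport_exps",
        eval_task="stack-block-pyramid-seq-seen-colors",
        agent="cliport",
        train_demos=100,
    ))
```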
Note: Rendering at high-resolutions is super slow and will take a long time to finish.
The train sets of *seen and *unseen tasks are identical, and only the val and test sets differ for purposes of evaluating generalization performance. So you might not need two duplicate train sets or train two separate models.
An input resolution of 320 x 160 might be too small for some tasks with tiny objects, especially for packing Google objects. Larger resolutions might help improve legibility.
For packing-seen-google-object-* tasks, the multi-task models are never trained on the full seen split of Google Scanned Objects. So a single-task model trained on packing-seen-google-object-* will have seen more objects than the comparable multi-task model.
Install Docker and NVIDIA Docker.
Modify docker_build.py and docker_run.py to your needs.
Build the image:
python scripts/docker_build.py
Start container:
python scripts/docker_run.py --nvidia_docker
cd ~/cliport
Use scripts/docker_run.py --headless if you are on a headless machine like a remote server or cloud instance.
It depends on the complexity of the task. With 5-10 demonstrations the agent should start to do something useful, but it will often make mistakes by picking the wrong object. For robustness you probably need 50-100 demonstrations. A good way to gauge how much data you might need is to set up a simulated version of the problem and evaluate agents trained with 1, 10, 100, and 1000 demonstrations.
This means either there is some sort of bias in the dataset that the agent is exploiting, or you don't have enough training data. Also make sure that the task is doable: if a referred attribute is barely legible in the input, then it's going to be hard for the agent to figure out what you mean.
CLIPort does not predict height values. You can either: (1) come up with a heuristic based on the heightmap to determine the height position, or (2) train a simple MLP like in TransportNets-6DOF to predict z-values.
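Option (1) could look something like the sketch below, which reads the height from a small neighbourhood around the predicted pixel. This is one possible heuristic, not something CLIPort ships: the function name `pick_height`, the window size, and the use of the maximum are all assumptions, and the heightmap is represented as plain nested lists to keep the example self-contained.

```python
def pick_height(heightmap, u, v, window=2, offset=0.0):
    """Estimate z at pixel (u, v) as the max height in a small window.

    heightmap: list of rows of floats (height above the table), indexed [row][col].
    u, v: row and column of the predicted pixel.
    window: half-size of the square neighbourhood, in pixels.
    offset: optional clearance added to the estimate.
    """
    rows, cols = len(heightmap), len(heightmap[0])
    # Clamp the window to the image bounds.
    r0, r1 = max(0, u - window), min(rows, u + window + 1)
    c0, c1 = max(0, v - window), min(cols, v + window + 1)
    patch = [heightmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return max(patch) + offset


if __name__ == "__main__":
    # A flat 5x5 table with one raised pixel, purely for illustration.
    hm = [[0.0] * 5 for _ in range(5)]
    hm[2][3] = 0.04
    print(pick_height(hm, 2, 2, window=1))
```

Taking the maximum is conservative for placing; for picking you may prefer the median of the patch or a negative `offset` so the gripper reaches the surface.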
Note that CLIPort is not doing "object detection". CLIPort fine-tunes CLIP's representations to "detect actions" in SE(2). CLIP by itself has no understanding of actions or affordances; recognizing and localizing objects (e.g. detecting hammer) does not tell you anything about how to manipulate them (e.g. grasping the hammer by the handle).
The default settings should work well. Recently, though, I have been playing around with using FiLM (Perez et al., 2017) to fuse language features, inspired by BC-0 (Jang et al., 2021). Qualitatively, it seems like FiLM is better for reading text etc., but I haven't conducted a full quantitative analysis. Try it out yourself with train.agent=two_stream_clip_film_lingunet_lat_transporter (non-residual FiLM).
Ideally, you should create a validation set with heldout instances and then choose the checkpoint with the lowest translation and rotation errors. You can also reuse the training instances but swap the language instructions with unseen goals.
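A minimal sketch of that selection step follows. It assumes you have already collected per-instance translation and rotation errors for each checkpoint on held-out data; the function names and the weighted sum used to combine the two errors are illustrative choices, not part of the CLIPort codebase.

```python
import math


def translation_error(pred_xy, true_xy):
    """Euclidean distance between predicted and ground-truth positions."""
    return math.hypot(pred_xy[0] - true_xy[0], pred_xy[1] - true_xy[1])


def rotation_error(pred_theta, true_theta):
    """Smallest absolute angle difference in radians, wrapped to [0, pi]."""
    diff = (pred_theta - true_theta + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)


def select_checkpoint(errors, rot_weight=1.0):
    """Pick the checkpoint with the lowest mean combined error.

    errors: {ckpt_name: [(trans_err, rot_err), ...]} on held-out instances.
    rot_weight: assumed trade-off between the two error types.
    """
    def score(name):
        pairs = errors[name]
        return sum(t + rot_weight * r for t, r in pairs) / len(pairs)
    return min(errors, key=score)


if __name__ == "__main__":
    errs = {
        "ckpt_a": [(translation_error((10, 10), (13, 14)), rotation_error(0.1, 0.0))],
        "ckpt_b": [(translation_error((10, 10), (10, 12)), rotation_error(0.2, 0.0))],
    }
    print(select_checkpoint(errs))
```

Translation and rotation errors have different units, so `rot_weight` should be tuned to how much orientation matters for the task.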
By default, training samples are augmented with SE(2) rotations sampled from N(0, 60 deg). For tasks with rotational symmetries (like moving pieces on a chessboard) you need to be careful with this rotation augmentation parameter.
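To illustrate what such an augmentation involves, the sketch below samples a rotation angle from a zero-mean Gaussian and builds the homogeneous transform that rotates image points about a pivot. It is not the repository's augmentation code: the function names are hypothetical, and treating 60 deg as the standard deviation is an assumption about the N(0, 60 deg) notation.

```python
import math
import random


def sample_se2_rotation(pivot, sigma_deg=60.0, rng=random):
    """Sample theta ~ N(0, sigma) and return (theta, 3x3 matrix).

    The matrix rotates 2D points about `pivot` = (x, y).
    sigma_deg is assumed to be a standard deviation in degrees.
    """
    theta = rng.gauss(0.0, math.radians(sigma_deg))
    c, s = math.cos(theta), math.sin(theta)
    px, py = pivot
    # Equivalent to T(pivot) @ R(theta) @ T(-pivot).
    matrix = [
        [c, -s, px - c * px + s * py],
        [s, c, py - s * px - c * py],
        [0.0, 0.0, 1.0],
    ]
    return theta, matrix


def transform_point(matrix, point):
    """Apply a 3x3 homogeneous transform to a 2D point."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


if __name__ == "__main__":
    rng = random.Random(0)
    theta, m = sample_se2_rotation((80.0, 160.0), rng=rng)
    print(math.degrees(theta), transform_point(m, (90.0, 160.0)))
```

For a task with rotational symmetry, lowering `sigma_deg` (or setting it to 0 to disable rotation) is one way to keep the augmentation from producing scenes whose instructions no longer match.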
This work uses code from the following open-source projects and datasets:
Original: https://github.com/google-research/ravens
License: Apache 2.0
Changes: All PyBullet tasks are directly adapted from the Ravens codebase. The original TransporterNets models were reimplemented in PyTorch.
Original: https://github.com/openai/CLIP
License: MIT
Changes: Minor modifications to CLIP-ResNet50 to save intermediate features for skip connections.
Original: Dataset
License: Creative Commons BY 4.0
Changes: Fixed center-of-mass (COM) to be geometric-center for selected objects.
Original: https://github.com/milesial/Pytorch-UNet/
License: GPL 3.0
Changes: Used as is in unet.py. Note: This part of the code is GPL 3.0.
CLIPort
@inproceedings{shridhar2021cliport,
title = {CLIPort: What and Where Pathways for Robotic Manipulation},
author = {Shridhar, Mohit and Manuelli, Lucas and Fox, Dieter},
booktitle = {Proceedings of the 5th Conference on Robot Learning (CoRL)},
year = {2021},
}
CLIP
@article{radford2021learning,
title={Learning transferable visual models from natural language supervision},
author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
journal={arXiv preprint arXiv:2103.00020},
year={2021}
}
TransporterNets
@inproceedings{zeng2020transporter,
title={Transporter networks: Rearranging the visual world for robotic manipulation},
author={Zeng, Andy and Florence, Pete and Tompson, Jonathan and Welker, Stefan and Chien, Jonathan and Attarian, Maria and Armstrong, Travis and Krasin, Ivan and Duong, Dan and Sindhwani, Vikas and others},
booktitle={Proceedings of the 4th Conference on Robot Learning (CoRL)},
year= {2020},
}
Please file an issue with the issue tracker.