NaviLLM

[CVPR 2024] The code for paper 'Towards Learning a Generalist Model for Embodied Navigation'

Towards Learning a Generalist Model for Embodied Navigation

Duo Zheng¹,²*, Shijia Huang¹*, Lin Zhao³, Yiwu Zhong¹ and Liwei Wang¹‡

*Equal contribution. ‡Corresponding author.

¹The Chinese University of Hong Kong
²Shanghai AI Laboratory
³Centre for Perceptual and Interactive Intelligence
 

Building a generalist agent that can interact with the world is an ultimate goal for humans, thus spurring the research for embodied navigation, where an agent is required to navigate according to instructions or respond to queries. Despite the major progress attained, previous works primarily focus on task-specific agents and lack generalizability to unseen scenarios. Recently, LLMs have presented remarkable capabilities across various fields, and provided a promising opportunity for embodied navigation. Drawing on this, we propose the first generalist model for embodied navigation, NaviLLM. It adapts LLMs to embodied navigation by introducing schema-based instruction. The schema-based instruction flexibly casts various tasks into generation problems, thereby unifying a wide range of tasks. This approach allows us to integrate diverse data sources from various datasets into the training, equipping NaviLLM with a wide range of capabilities required by embodied navigation. We conduct extensive experiments to evaluate the performance and generalizability of our model. The experimental results demonstrate that our unified model achieves state-of-the-art performance on CVDN, SOON, and ScanQA. Specifically, it surpasses the previous state-of-the-art method by a significant margin of 29% in goal progress on CVDN. Moreover, our model also demonstrates strong generalizability and presents impressive results on unseen tasks, e.g., embodied question answering and 3D captioning.

Updates

  • Feb 24, our paper is accepted to CVPR 2024 Poster (Highlight).
  • Dec 13, we release the model checkpoints at this link.
  • Dec 6, the processed data and features can be found here.
  • Dec 5, our paper is released.
  • Nov 28, we make our code public.

Features

  • Support multiple VLN tasks (CVDN, SOON, R2R, and REVERIE), 3D QA (ScanQA) and LLaVA instruction tuning in a multi-task framework.
  • Allow adding new tasks by customizing the dataset and agent classes.
  • Enable flexible prompt design for each task.

Method


We propose schema-based instruction and design a series of schemas (e.g., descriptions of tasks, visual observations, and navigation history) based on the characteristics of embodied tasks. Benefiting from this design, we are able to train a unified model on data collected for diverse tasks, thereby enabling our model to address a wide spectrum of tasks, ranging from vision-language navigation and object localization to 3D question answering, trajectory summarization, and embodied question answering.
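As a rough illustration of how schema-based instruction casts a navigation step into a single generation problem, the sketch below composes hypothetical schema fields (task description, visual observation, history) into one text prompt; the field names and wording are illustrative, not the exact templates used by NaviLLM.

# Illustrative sketch: the schema names and wording below are hypothetical,
# not the exact templates used by NaviLLM.
def build_prompt(task_desc, num_candidates, history_len, output_hint):
    obs_text = " ".join(f"({i}) <candidate view {i}>" for i in range(num_candidates))
    hist_text = " ".join(f"<visited view {t}>" for t in range(history_len))
    return (
        f"Task: {task_desc}\n"
        f"Observation: {obs_text}\n"
        f"History: {hist_text}\n"
        f"{output_hint}"
    )

print(build_prompt(
    task_desc="Follow the instruction: 'Walk past the sofa and stop at the kitchen door.'",
    num_candidates=3,
    history_len=2,
    output_hint="Select the next viewpoint from the candidates.",
))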

Experiments


With only a single model, NaviLLM achieves new state-of-the-art results simultaneously on multiple benchmarks, i.e., CVDN, SOON, and ScanQA, and demonstrates performance comparable to the latest models on R2R and REVERIE. Additionally, it won first place on the CVDN leaderboard and second place on the ScanQA leaderboard.

Installation

  1. Install the Matterport3D simulator. Please add the simulator path to your PYTHONPATH.
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
  2. Set up the Java Development Kit (JDK) if you want to enable METEOR when evaluating ScanQA. Otherwise, please comment out the related code.
export JAVA_HOME=$jdk_path
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
  3. Create the conda environment and install the requirements.
conda create --name navillm python=3.8.16
conda activate navillm
pip install -r requirements.txt
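After these steps, you can quickly confirm that the simulator is visible from the new environment (a minimal check, assuming the build directory was added to PYTHONPATH as in step 1):

# Quick sanity check, run inside the navillm environment: the Matterport3D simulator
# builds a Python module named MatterSim, visible once Matterport3DSimulator/build is on PYTHONPATH.
import MatterSim
print("MatterSim found at:", MatterSim.__file__)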

Data Processing

The data directory is structured as follows. Please download the processed data and features from OneDrive.

data
├── connectivity
├── CVDN
├── LLaVA
├── SOON
├── R2R
├── REVERIE
├── EQA
├── eva_features
│   ├── mp3d_EVA02-CLIP-L-14-336.hdf5
│   ├── scanqa_EVA02-CLIP-L-14-336.hdf5
│   └── coco_EVA02-CLIP-L-14-336.hdf5
├── obj_features
│   ├── reverie_obj_feat
│   └── soon_obj_feat
└── models
    └── Vicuna-7B

1. Original Datasets

  • R2R & REVERIE & SOON: we use the annotation provided by DUET.
  • CVDN: The annotation could be downloaded from the official repository.
  • ScanQA: Please download the annotation and frames extracted from ScanNet here.
  • LLaVA: LLaVA-detail-23k is used for instruction following.
  • Augmented Data from R2R and REVERIE: We utilize the augmented data generated by DUET.

2. Image Features

The image features are extracted with EVA-CLIP-02-Large (428M). We also provide the scripts used for extracting features from MP3D, ScanQA, and COCO at scripts/data_tools. To use EVA-CLIP-02, please install the corresponding environment following the instructions in the original repository.

cd scripts/data_tools
sh extract_features_mp3d.sh         # for Matterport3D
#   sh extract_features_scanqa.sh   # for ScanQA
#   sh extract_features_coco.sh     # for COCO
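To sanity-check the downloaded or extracted features before training, the HDF5 files can be inspected with h5py. A minimal sketch, assuming the top-level entries are plain datasets (the actual key layout inside the files may differ):

# Minimal sketch for inspecting the pre-extracted EVA-CLIP features with h5py;
# assumes the top-level entries are datasets, which may differ from the actual layout.
import h5py

with h5py.File("data/eva_features/mp3d_EVA02-CLIP-L-14-336.hdf5", "r") as f:
    keys = list(f.keys())
    print(len(keys), "entries, first keys:", keys[:3])
    sample = f[keys[0]]
    print("feature shape:", sample.shape, "dtype:", sample.dtype)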

3. Object Features

We leverage the object features extracted with ViT-B/16 by HM3DAutoVLN, and put the processed features of REVERIE and SOON at data/obj_features. Alternatively, you can disable object features by removing the --enable_og flag.

4. Models

The LLM is built upon Vicuna-7B-v1.1. Please download the pre-trained model and put it at data/models. Using Vicuna-7B-v0 leads to a certain degree of performance degradation compared with the original results (see issue #7).
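To verify that the downloaded weights are complete, you can try loading them with Hugging Face transformers; this is a sketch assuming the checkpoint is stored in the standard Hugging Face format at data/models/Vicuna-7B.

# Sketch: load the Vicuna-7B backbone to verify the files at data/models/Vicuna-7B;
# the path and storage format are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "data/models/Vicuna-7B"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path)
print("loaded:", model.config.model_type, "with", model.num_parameters(), "parameters")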

Model Checkpoints

We release the model checkpoints and corresponding training logs as follows.

| Model                  | Log  | Time (days) | CVDN GP | SOON SR | SOON SPL | R2R SR | R2R SPL | REVERIE SR | REVERIE SPL | ScanQA EM | ScanQA Rouge-L |
|------------------------|------|-------------|---------|---------|----------|--------|---------|------------|-------------|-----------|----------------|
| model_without_pretrain | here | ~1.5        | 5.91    | 35.44   | 28.09    | 67     | 58      | 44.56      | 36.63       | 23.3      | 38.2           |
| model_with_pretrain    | here | ~3          | 6.16    | 38.33   | 29.24    | 67     | 59      | 42.15      | 35.68       | 22.1      | 37.6           |

Previous works have consistently shown notable improvements after pre-training on augmented data from R2R and REVERIE. However, in our experiments, we find only a slight enhancement on R2R, CVDN, and SOON after pre-training. We speculate that data quality may play a more crucial role than quantity for our method.

Training & Inference

1. Pretraining: The model is trained for 10,000 steps with a batch size of 64 in the pre-training stage, where we perform teacher-forcing training on the combined dataset from CVDN, SOON, R2R, REVERIE, ScanQA, and augmented data from R2R and REVERIE.

sh scripts/pretrain.sh

2. Multi-task Tuning with Pretraining: The model is trained for 5,000 steps with a batch size of 64 in the multi-task fine-tuning stage, where we alternate between teacher forcing and student forcing on the combined dataset from CVDN, SOON, R2R, REVERIE, ScanQA, and LLaVA-23k.

sh scripts/multi_w_pretrain.sh
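For reference, the two supervision modes alternated here differ only in which action is executed to roll the trajectory forward; the loss is always computed against the ground-truth action. Below is a schematic sketch, not the repository's actual training loop, and agent/env are hypothetical interfaces.

# Schematic sketch of teacher forcing vs. student forcing; `agent` and `env` are
# hypothetical interfaces, not classes from this repository.
def rollout(agent, env, instruction, student_forcing):
    obs, history, total_loss = env.reset(instruction), [], 0.0
    while not env.done():
        logits = agent.predict(instruction, obs, history)  # scores over candidate viewpoints
        gt_action = env.ground_truth_action()
        total_loss += agent.loss(logits, gt_action)        # always supervise with the expert action
        # Teacher forcing executes the expert action; student forcing executes the model's
        # own sample, so the agent learns to recover from its own mistakes.
        action = agent.sample(logits) if student_forcing else gt_action
        obs = env.step(action)
        history.append(action)
    return total_loss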

3. Multi-task Tuning without Pretraining:

Since the performance of direct multi-task finetuning is comparable to the two-stage training, we recommend multi-task finetuning without pretraining here. It takes approximately 20 hours with 8 Nvidia A100 GPUs.

sh scripts/multi_wo_pretrain.sh

4. Inference: During the testing phase, we employ a sampling strategy with a temperature of 0.01 for action generation in the SOON and REVERIE tasks, to encourage more exploration. For other tasks, we opt for a greedy strategy in generating actions.

sh scripts/evaluation/eval_cvdn.sh  # eval_soon.sh/eval_r2r.sh/eval_reverie.sh/eval_scanqa.sh
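For reference, the difference between the greedy and sampling strategies over candidate-action logits can be sketched as follows (illustrative code, not the repository's implementation):

# Illustrative sketch of greedy decoding vs. low-temperature sampling over
# candidate-action logits; not the repository's implementation.
import numpy as np

def select_action(logits, temperature=None):
    if temperature is None:                          # greedy: CVDN, R2R, ScanQA, ...
        return int(np.argmax(logits))
    scaled = logits / temperature                    # e.g. temperature=0.01 for SOON/REVERIE
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

print(select_action(np.array([1.2, 0.7, 2.3])))          # -> 2 (greedy)
print(select_action(np.array([1.2, 0.7, 2.3]), 0.01))    # -> almost always 2 (near-greedy sampling)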

Acknowledgements

We would like to thank Matterport3D for their contributions to the open-source platform and community. Additionally, this work benefits from DUET, HM3DAutoVLN, and VLN-SIG. Thanks for their awesome works!

Citation

If you find our NaviLLM useful for your research, please consider giving this repository a star and citing our paper as follows:

@misc{zheng2023learning,
      title={Towards Learning a Generalist Model for Embodied Navigation}, 
      author={Duo Zheng and Shijia Huang and Lin Zhao and Yiwu Zhong and Liwei Wang},
      year={2023},
      eprint={2312.02010},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}