[NeurIPS2022] Egocentric Video-Language Pretraining
TL;DR: We pioneer Egocentric Video-Language Pretraining, spanning the pretraining dataset, model, and development benchmark; the resulting pretrained model exhibits strong performance on five downstream tasks across three egocentric datasets.
conda env create -f environment.yml
source activate egovlp
You can skip the source video download if pretraining is not required.
Follow the guideline here and download the following to {PATH_TO_EGO4D}: manifest.csv and the benchmark metadata, e.g., nlq_train.json for NLQ.
Create the dataset directory and add a soft link by ln -s {PATH_TO_EGO4D} dataset/ego4d
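As a quick sanity check (a minimal sketch assuming the layout above; adjust paths if your setup differs), you can verify the soft link and manifest before moving on:

```python
# Minimal sanity check for the Ego4D layout assumed in this README.
import os

root = "dataset/ego4d"  # soft link to {PATH_TO_EGO4D}
assert os.path.isdir(root), "dataset/ego4d missing -- did you create the soft link?"
assert os.path.isfile(os.path.join(root, "manifest.csv")), "manifest.csv not found"
print("Ego4D layout looks OK")
```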
For efficient pretraining, we compress the videos in the following way: resize them with utils/video_resize.py, then chunk them into shorter segments with utils/video_chunk.py.
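As a rough illustration of what these two scripts do (a sketch only: the 256-pixel short side and 600-second segments are assumptions, check the scripts for the values actually used), the preprocessing is roughly equivalent to:

```python
# Sketch of the video compression pipeline; the real utils/video_resize.py and
# utils/video_chunk.py may use different parameters or tooling.
import subprocess

def resize_short_side(src, dst, short_side=256):
    # Scale so that the shorter side equals `short_side`, keeping aspect ratio.
    vf = (f"scale='if(gt(iw,ih),-2,{short_side})':"
          f"'if(gt(iw,ih),{short_side},-2)'")
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vf", vf, dst], check=True)

def chunk_video(src, dst_pattern, segment_sec=600):
    # Split a long video into fixed-length segments without re-encoding.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-c", "copy",
                    "-f", "segment", "-segment_time", str(segment_sec),
                    "-reset_timestamps", "1", dst_pattern], check=True)

# Example: resize_short_side("video.mp4", "video_256.mp4")
#          chunk_video("video_256.mp4", "video_256_%03d.mp4")
```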
Download the EgoClip metadata from here and put it at dataset/egoclip.csv.
For the usage of EgoClip, please see our dataloader data_loader/EgoClip_EgoMCQ_dataset.py. The data format of EgoClip is:
import pandas as pd
metadata = pd.read_csv('dataset/egoclip.csv', sep='\t', error_bad_lines=False)  # on pandas >= 1.3, use on_bad_lines='skip' instead
print(metadata.shape[0])
print(metadata.iloc[0])
# Out:
3847723 # Num of clips for EgoClip
clip_idx 0 # the idx of clip
video_uid 001e3e4e-2743-47fc-8564-d5efd11f9e90 # the uid of source video
video_dur 128.033333 # the duration of source video
narration_source narration_pass_1 # the source of annotator
narration_ind 0 # the idx of narration
narration_time 3.3445 # the narration timestamp
clip_start 2.967651 # the start timestamp of clip
clip_end 3.721266 # the end timestamp of clip
clip_text #C C picks a bag of clothes from the floor # the narration of clip
tag_verb [93] # the verb idx of the narration
tag_noun [192, 115, 12] # the noun idx of the narration
^ The terms tag_verb and tag_noun are used for the EgoNCE pretraining objective, which considers synonyms. For example, pick, collect, and gather all belong to the verb parent with idx 93: take_(pick,_grab,_get). The mapping dictionary can be found here.
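A minimal sketch (not the repo's implementation) of how these tags can define extra positives for an EgoNCE-style objective, assuming two clips in a batch count as positives when they share at least one verb parent and one noun parent:

```python
# Hedged sketch: build a positive-pair mask from tag_verb / tag_noun columns.
import ast
import torch

def egonce_positive_mask(tag_verbs, tag_nouns):
    """tag_verbs / tag_nouns: lists of strings such as '[93]' or '[192, 115, 12]'."""
    verbs = [set(ast.literal_eval(v)) for v in tag_verbs]
    nouns = [set(ast.literal_eval(n)) for n in tag_nouns]
    n = len(verbs)
    mask = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            # Positive if the two narrations share a verb parent AND a noun parent.
            if verbs[i] & verbs[j] and nouns[i] & nouns[j]:
                mask[i, j] = 1.0
    return mask  # consumed by the contrastive loss to mark positive pairs
```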
Download the EgoMCQ metadata and put it at dataset/egomcq.json. EgoMCQ is evaluated in two settings: inter-video and intra-video. For the usage of EgoMCQ, please see our dataloader data_loader/EgoClip_EgoMCQ_dataset.py.
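At evaluation time, each EgoMCQ question is answered by picking the candidate clip most similar to the text query; a minimal sketch (names are illustrative, and the actual evaluation lives in the dataloader and training code):

```python
# Hedged sketch: score one multiple-choice question from embeddings.
import torch
import torch.nn.functional as F

def answer_mcq(text_emb, cand_embs):
    """text_emb: (D,) text embedding; cand_embs: (N, D) candidate clip embeddings."""
    sims = F.cosine_similarity(text_emb.unsqueeze(0), cand_embs, dim=-1)  # (N,)
    return sims.argmax().item()  # index of the predicted clip
```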
This code is built on PyTorch with DistributedDataParallel (DDP). We pretrain EgoVLP on 4 nodes, each with 8 A100 GPUs (10 epochs in about two days).
Train on EgoClip: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --master_addr $CHIEF_IP --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_egoclip.py --config configs/pt/egoclip.json
Test on EgoMCQ: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --master_addr $CHIEF_IP --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_egoclip.py --config configs/eval/egomcq.json
Monitor the EgoMCQ curve during pretraining: tensorboard --logdir results --bind_all
Download the pretrained checkpoints and put them under pretrained/.
^ This checkpoint is used for the EPIC-Kitchens, NLQ, MQ, OSCC, and PNR tasks, but not for Charades-Ego. Since we found that VLP (CC3M+WebVid2M, EgoClip) always degrades significantly on Charades-Ego after the first epoch, we evaluate Charades-Ego using the first-pretraining-epoch weights of EgoVLP, EgoVLP_PT_EPO1.
^^ You can use our checkpoint to power other egocentric video benchmarks. :)
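Loading a checkpoint for reuse is straightforward; a minimal sketch (the file name and the 'state_dict' key are assumptions, adapt them to the checkpoint you downloaded):

```python
# Hedged sketch: inspect and reuse a pretrained EgoVLP checkpoint.
import torch

ckpt = torch.load("pretrained/egovlp.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # some checkpoints store the weights directly
print(f"{len(state_dict)} tensors, first key: {next(iter(state_dict))}")
# model.load_state_dict(state_dict, strict=False)  # load into your own model definition
```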
Download the EPIC-Kitchens dataset and prepare the videos and metadata under dataset/epic-kitchens/.
Model | Mode | # Frames | Video-Text PT | Weights | mAP (V2T) | mAP (T2V) | mAP (Avg) | nDCG (V2T) | nDCG (T2V) | nDCG (Avg) |
---|---|---|---|---|---|---|---|---|---|---|
EgoVLP | Zero-shot | 4 | EgoClip w/ EgoNCE | EgoVLP_PT_BEST | 19.4 | 13.9 | 16.6 | 24.1 | 22.0 | 23.1 |
EgoVLP | Fine-tuning w/ MI-MM | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_EPIC | 49.9 | 40.5 | 45.0 | 60.9 | 57.9 | 59.4 |
EgoVLP+ | Fine-tuning w/ Adaptive-MI-MM + Dual-softmax | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_EPIC+ | 53.8 | 40.9 | 47.4 | 63.3 | 59.6 | 61.4 |
^ EgoVLP+ denotes our submission to the Multi-Instance Retrieval@EPIC-Kitchens Challenge 2022, which is equipped with the Adaptive MI-MM loss and Dual-softmax for prediction.
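For reference, dual-softmax is an inference-time re-weighting of the text-video similarity matrix; below is one common formulation as a sketch (the temperature and the exact variant used in the challenge submission may differ):

```python
# Hedged sketch of dual-softmax re-weighting for retrieval inference.
import torch

def dual_softmax(sim, temp=0.01):
    """sim: (num_text, num_video) similarity matrix."""
    # Element-wise product of the softmax over texts and the softmax over videos.
    return torch.softmax(sim / temp, dim=0) * torch.softmax(sim / temp, dim=1)
```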
Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_epic.py --config configs/ft/epic.json
Test: python3 run/test_epic.py
Download the Charades-Ego videos and metadata and put them under dataset/charades/; then preprocess the metadata with utils/charades_meta.py.
Model | Mode | # Frames | Video-Text PT | Weights | mAP |
---|---|---|---|---|---|
EgoVLP | Zero-shot | 16 | EgoClip w/ EgoNCE | EgoVLP_PT_EPO1 | 25.0 |
EgoVLP | Fine-tuning w/ InfoNCE | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_CHARADES | 32.1 |
Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_charades.py --config configs/ft/charades.json
Test: python3 run/test_charades.py
Extract features for NLQ with run/test_nlq.py.
To extract only text features, run python3 run/test_nlq.py --subsample 'text', or use our pretrained text encoder.
^ We provide our VSLNet codebase, which adapts EgoVLP features, as an example; you can refer to its data loader and text encoder.
^ Our EgoVLP brings consistent improvement over multiple NLQ challenge baselines.
Model | Video-Text Pre-extracted Features | R@1, IoU=0.3 | R@5, IoU=0.3 | R@1, IoU=0.5 | R@5, IoU=0.5 |
---|---|---|---|---|---|
VSLNet | SlowFast + BERT | 5.45 | 10.74 | 3.12 | 6.63 |
VSLNet | EgoVLP | 10.84 | 18.84 | 6.81 | 13.45 |
CONE | SlowFast + BERT | 10.40 | 22.74 | 5.03 | 11.87 |
CONE | EgoVLP | 14.15 | 30.33 | 8.18 | 18.02 |
Extract video features for MQ with run/test_mq.py.
^ We provide our VSGN codebase, which adapts EgoVLP features, as an example; you can refer to its data loader.
^ Our EgoVLP brings consistent improvement over multiple MQ challenge baselines.
Model | Video Pre-extracted Features | R@1, IoU=0.5 | R@5, IoU=0.5 | mAP |
---|---|---|---|---|
VSGN | SlowFast | 25.16 | 46.18 | 6.03 |
VSGN | EgoVLP | 30.14 | 51.98 | 11.39 |
ActionFormer | SlowFast + Omnivore | 33.46 | - | 17.17 |
ActionFormer | SlowFast + Omnivore + EgoVLP | 36.84 | - | 20.90 |
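Here, "SlowFast + Omnivore + EgoVLP" denotes feature-level fusion of the pre-extracted clip features; a hedged sketch (file names and dimensions are purely illustrative) of channel-wise concatenation before the temporal detector:

```python
# Illustrative only: fuse pre-extracted per-clip features by concatenation.
import torch

slowfast = torch.load("feats/slowfast.pt")  # (T, D1), hypothetical path
omnivore = torch.load("feats/omnivore.pt")  # (T, D2), hypothetical path
egovlp = torch.load("feats/egovlp.pt")      # (T, D3), hypothetical path
fused = torch.cat([slowfast, omnivore, egovlp], dim=-1)  # (T, D1 + D2 + D3)
```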
Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_oscc.py --config configs/ft/oscc.json
Model | Video-Text Pretrained | OSCC Acc % |
---|---|---|
TimeSformer | ImageNet Init. | 70.3 |
TimeSformer | EgoVLP | 73.9 |
Train: python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 run/train_pnr.py --config configs/ft/pnr.json
Model | Video-Text Pretrained | PNR Err % |
---|---|---|
TimeSformer | ImageNet Init. | 0.616 |
TimeSformer | EgoVLP | 0.622 |
^ We found the effect of VLP is minor on the PNR task.
If you find our work helpful, please cite our paper.
@article{kevin2022egovlp,
title={Egocentric Video-Language Pretraining},
author={Lin, Kevin Qinghong and Wang, Alex Jinpeng and Soldan, Mattia and Wray, Michael and Yan, Rui and Xu, Eric Zhongcong and Gao, Difei and Tu, Rongcheng and Zhao, Wenzhe and Kong, Weijie and others},
journal={arXiv preprint arXiv:2206.01670},
year={2022}
}
This repo is maintained by Kevin. Questions and discussions are welcome via [email protected].
We are happy to merge results and code if you transfer our EgoVLP to other egocentric tasks or datasets.
This codebase is based on Frozen.
Thanks to Alex for the help with DDP and Mattia for the help with NLQ and MQ benchmarks.
MIT