Semantic-Assisted Object Clustering for Multi-Modal Referring Video Segmentation

Yong Liu, Zhuoyan Luo, Yicheng Xiao, Yitong Wang, Shuyan Li, Xiu Li, Yujiu Yang, and Yansong Tang

📖 Abstract

This paper focuses on the Multi-modal Referring Video Segmentation task, in which a well-optimized model recognizes and segments the target objects referred to by given guidance signals, e.g., a language description. Early approaches model this task as a sequence prediction problem; lacking a global view of the video content, they struggle to exploit inter-frame relationships effectively. Some recent works perform temporal modeling with a vanilla attention mechanism. However, the condensed visual representation tends to carry noisy target information due to occlusion or motion blur, and unrestricted non-local operations spread this noise across the whole sequence and interfere with the extraction of global representations. To address this issue, we present the Semantic-assisted Object Cluster network (SOC) and its improved version, SOC++. Our method unifies temporally selective interaction and cross-modal alignment to achieve video-level understanding. In SOC++, a proxy-assisted multi-modal fusion module performs preliminary bidirectional activation. A semantic integration module with a progressive frame-to-video structure then facilitates joint space learning across modalities and time steps. Because potentially noisy visual embeddings would impair the overall representation of target objects under unconstrained inter-frame interaction, we perform tendentious video aggregation at this stage by emphasizing the indicative role of informative frames with lower entropy. A multi-modal query contrastive supervision is also used to help construct a well-aligned joint space at the video level. Moreover, to combine the advantages of high-level video information and the low-level details of each frame, we introduce a dynamic query fusion module that jointly updates these embeddings.
We conduct extensive experiments on popular referring video segmentation benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations.
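
To make the idea of emphasizing lower-entropy frames more concrete, here is a minimal sketch in plain Python. It is not the SOC++ implementation: it assumes each frame comes with a vector of scores (the hypothetical `frame_logits`) whose softmax entropy indicates how ambiguous that frame is, and it simply weights frames by a softmax over negative entropy. The function names and the `temperature` parameter are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weighted_aggregate(frame_embeddings, frame_logits, temperature=1.0):
    """Average per-frame embeddings into one video-level embedding.

    Frames whose score distribution has lower entropy (assumed here to be
    the more informative frames) receive larger aggregation weights.
    """
    entropies = [entropy(softmax(logits)) for logits in frame_logits]
    # Lower entropy -> larger weight, normalized so the weights sum to 1.
    weights = softmax([-h / temperature for h in entropies])
    dim = len(frame_embeddings[0])
    video_embedding = [
        sum(w * emb[d] for w, emb in zip(weights, frame_embeddings))
        for d in range(dim)
    ]
    return video_embedding, weights


# Toy input: frame 0 has a peaked (confident) score vector, frame 1 a flat one.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
video, weights = entropy_weighted_aggregate(embeddings, logits)
```

In this toy input the first frame ends up with the larger weight, so the aggregated embedding leans toward it; a smaller `temperature` makes that preference sharper.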

📗 Framework

🛠️ Environment Setup

Please see the official code of our conference version for environment setup and data preparation: https://github.com/RobertLuo1/NeurIPS2023_SOC. Note that we convert MeViS into the same format as Ref-YouTube-VOS.

Backbone Model

Please put the pretrained backbone checkpoints in the "pretrained" folder.

pretrained
├── pretrained_swin_transformer
└── pretrained_roberta

For the pretrained_swin_transformer folder, download Video-Swin-Base.
For the pretrained_roberta folder, download config.json, pytorch_model.bin, tokenizer.json, and vocab.json from Hugging Face (roberta-base).
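
Before training, it can help to confirm that the folder layout above is complete. The following sketch is a hypothetical helper (not part of this repository) that checks for the four roberta-base files listed above and for a non-empty pretrained_swin_transformer folder. The exact file name of the Video-Swin-Base checkpoint is not stated here, so the check does not assume one.

```python
from pathlib import Path

# File names listed above for the roberta-base text encoder.
ROBERTA_FILES = ("config.json", "pytorch_model.bin", "tokenizer.json", "vocab.json")


def missing_pretrained_items(root="pretrained"):
    """Return a list of problems; an empty list means the layout looks complete."""
    root = Path(root)
    problems = []

    swin_dir = root / "pretrained_swin_transformer"
    # The checkpoint file name is unknown, so only require a non-empty folder.
    if not swin_dir.is_dir() or not any(swin_dir.iterdir()):
        problems.append(f"{swin_dir}: no Video-Swin-Base checkpoint found")

    roberta_dir = root / "pretrained_roberta"
    for name in ROBERTA_FILES:
        if not (roberta_dir / name).is_file():
            problems.append(f"{roberta_dir / name}: missing")
    return problems


if __name__ == "__main__":
    for problem in missing_pretrained_items():
        print(problem)
```

Running the file from the repository root prints one line per missing item and nothing when everything is in place.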

Pretrained Checkpoints

The trained models for A2D-Sentences and MeViS are available here.

🚀 Training

MeViS: run the script ./scripts/train_mevis.sh.
```
bash ./scripts/train_mevis.sh
```
A2D-Sentences: run the script ./scripts/train_a2d.sh.
```
bash ./scripts/train_a2d.sh
```
Ref-YouTube-VOS: run the script ./scripts/train_ytb.sh.
```
bash ./scripts/train_ytb.sh
```

Evaluation

A2D-Sentences

Run the script ./scripts/eval_a2d.sh and remember to specify the checkpoint_path in the config file.

MeViS

Run the script ./scripts/infer_mevis.sh and remember to specify the checkpoint_path in the config file.

Ref-YouTube-VOS

Run the script ./scripts/infer_ref_ytb.sh and remember to specify the checkpoint_path in the config file.
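
The training and evaluation commands above can also be launched from Python. The sketch below is a hypothetical wrapper, not a file in this repository: the script paths are the ones named in this README, while the short dataset keys (`a2d`, `mevis`, `ref_ytb`) and the function names are assumptions made for illustration. Note that the `checkpoint_path` argument is only a sanity check that the file exists; the scripts still read checkpoint_path from the config file, so it must be set there as well.

```python
import subprocess
from pathlib import Path

# Script paths as named in this README; the dataset keys are made up for this sketch.
SCRIPTS = {
    ("train", "a2d"): "./scripts/train_a2d.sh",
    ("train", "mevis"): "./scripts/train_mevis.sh",
    ("train", "ref_ytb"): "./scripts/train_ytb.sh",
    ("eval", "a2d"): "./scripts/eval_a2d.sh",
    ("eval", "mevis"): "./scripts/infer_mevis.sh",
    ("eval", "ref_ytb"): "./scripts/infer_ref_ytb.sh",
}


def build_command(mode, dataset):
    """Return the shell command (as an argument list) for a mode/dataset pair."""
    try:
        script = SCRIPTS[(mode, dataset)]
    except KeyError:
        raise ValueError(f"unknown mode/dataset pair: {mode!r}, {dataset!r}") from None
    return ["bash", script]


def run(mode, dataset, checkpoint_path=None):
    """Run the matching script from the repository root.

    For evaluation, fail early if the checkpoint file does not exist.
    This does not write the path into the config file.
    """
    if mode == "eval":
        if checkpoint_path is None or not Path(checkpoint_path).is_file():
            raise FileNotFoundError(f"checkpoint not found: {checkpoint_path}")
    return subprocess.run(build_command(mode, dataset), check=True)
```

For example, `run("train", "mevis")` is equivalent to `bash ./scripts/train_mevis.sh`, and `run("eval", "a2d", "path/to/checkpoint")` refuses to start if the given checkpoint file is missing.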

Acknowledgement

Code in this repository is built upon several public repositories. Thanks to the authors of ReferFormer and MTTR for their wonderful work.
