WonJun Moon

Minority-Oriented Vicinity Expansion with Attentive Aggregation for Video Long-Tailed Recognition (AAAI23)

2023-06-24T00:00:00+00:00

WonJun Moon Hyun Seok Seong Jae-Pil Heo

Sungkyunkwan University

Abstract

A dramatic increase in real-world video volume with extremely diverse and emerging topics naturally forms a long-tailed video distribution in terms of their categories, and it spotlights the need for Video Long-Tailed Recognition (VLTR). In this work, we summarize the challenges in VLTR and explore how to overcome them. The challenges are: (1) it is impractical to re-train the whole model for high-quality features, (2) acquiring frame-wise labels requires extensive cost, and (3) long-tailed data triggers biased training. Yet, most existing works for VLTR unavoidably utilize image-level features extracted from pretrained models which are task-irrelevant, and learn by video-level labels. Therefore, to deal with such (1) task-irrelevant features and (2) video-level labels, we introduce two complementary learnable feature aggregators. Learnable layers in each aggregator are to produce task-relevant representations, and each aggregator is to assemble the snippet-wise knowledge into a video representative. Then, we propose Minority-Oriented Vicinity Expansion (MOVE) that explicitly leverages the class frequency into approximating the vicinity distributions to alleviate (3) biased training. By combining these solutions, our approach achieves state-of-the-art results on large-scale VideoLT and synthetically induced Imbalanced-MiniKinetics200. With VideoLT features from ResNet-50, it attains 18% and 58% relative improvements on head and tail classes over the previous state-of-the-art method, respectively.

Method

Aggregators

The weak-supervision problem (one class label per multi-frame video) cannot be treated as a noisy-label problem, because training is biased under long-tailed distributions.
Pretrained features need to be fine-tuned for the downstream task.
Thus, we use aggregators to combine frame-level features into a video prototype. (Two aggregators are used to minimize the information loss during aggregation.)
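
As a rough illustration of what a learnable aggregator does, the sketch below pools frame-level features into a single video-level vector using softmax attention weights. The scoring vector `w`, the function names, and the toy numbers are assumptions made for illustration only; the two aggregators used in the paper are not specified in this summary and may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_aggregate(frames, w):
    """Pool frame features (T x D) into one D-dimensional video prototype.

    `w` is a hypothetical learnable scoring vector of length D; each frame's
    attention score is its dot product with `w`.
    """
    scores = [sum(f * wi for f, wi in zip(frame, w)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    prototype = [
        sum(wt * frame[d] for wt, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return prototype, weights

# Toy input: three frames with two-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, -0.5]
prototype, weights = attentive_aggregate(frames, w)
```

Frames that score higher under `w` contribute more to the prototype, so training `w` with the video-level label is one way the pooled feature can become task-relevant.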

Minority Oriented Vicinity Expansion

A long-tailed distribution causes biased training.
To expand and extend the boundaries of the minority classes, we use dynamic extrapolation and calibrated interpolation.
Dynamic extrapolation is performed between two prototypes of the same instance to avoid generating outliers.
To diversify only the tail classes, we propose a dynamic frame sampler.
Calibrated interpolation is then applied to extend the boundaries of the tail classes.
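
The two operations can be pictured with a minimal sketch. Everything concrete here is an assumption: the extrapolation rule `a + lam * (a - b)`, the frequency-based label calibration, and all names and numbers are hypothetical stand-ins rather than the formulas used in the paper, and the dynamic frame sampler is left out.

```python
def extrapolate(proto_a, proto_b, lam):
    """Push proto_a away from proto_b: a + lam * (a - b).

    Both prototypes are assumed to come from the same video, so the new
    point stays close to that instance.
    """
    return [a + lam * (a - b) for a, b in zip(proto_a, proto_b)]

def calibrated_interpolate(x_i, x_j, n_i, n_j, lam):
    """Mix two samples and shift the label weight toward the rarer class.

    n_i and n_j are the training-set counts of the two classes. This
    calibration rule is an illustrative assumption.
    """
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    # Weight each class's share by the *other* class's count, so the rarer
    # class receives a larger label weight than plain interpolation gives it.
    w_i = lam * n_j
    w_j = (1.0 - lam) * n_i
    label_i = w_i / (w_i + w_j)
    return mixed, label_i, 1.0 - label_i

# Two prototypes of the same video (toy values).
expanded = extrapolate([1.0, 2.0], [0.0, 1.0], lam=0.5)

# A head-class sample (900 training videos) mixed with a tail-class one (100).
head, tail = [4.0, 0.0], [0.0, 4.0]
mixed, y_head, y_tail = calibrated_interpolate(head, tail, n_i=900, n_j=100, lam=0.5)
```

With equal mixing in feature space, the label in this sketch still leans toward the tail class, which is one way class frequency can widen the region assigned to minority classes.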

Experiment

In addition to VideoLT, a large-scale long-tailed video dataset, we introduce Imbalanced-MiniKinetics200 and run experiments on it.

Dataset (Imbalanced-MiniKinetics200)

Imbalanced-MiniKinetics200 consists of 200 classes extracted from Kinetics-400.
Features and original files are available through the Google Drive link: Imb.MiniKinetics200 Dataset

Bibtex

@inproceedings{moon2023minority,
  title={Minority-Oriented Vicinity Expansion with Attentive Aggregation for Video Long-Tailed Recognition},
  author={Moon, WonJun and Seong, Hyun Seok and Heo, Jae-Pil},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={37},
  number={2},
  pages={1931--1939},
  year={2023}
}

Link

Arxiv Paper Code Video

Query-Dependent Video Representation for Moment Retrieval and Highlight Detection (CVPR23)

2023-06-24T00:00:00+00:00

WonJun Moon*¹ SangEek Hyun*¹ SangUk Park² Dongchan Park² Jae-Pil Heo¹

Sungkyunkwan University¹ Pyler²

* : Equal Contribution

Abstract

Recently, video moment retrieval and highlight detection (MR/HD) are being spotlighted as the demand for video understanding is drastically increased. The key objective of MR/HD is to localize the moment and estimate clip-wise accordance level, i.e., saliency score, to the given text query. Although the recent transformer-based models brought some advances, we found that these methods do not fully exploit the information of a given query. For example, the relevance between text query and video contents is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. As we observe the insignificant role of a given query in transformer architectures, our encoding module starts with cross-attention layers to explicitly inject the context of text query into video representation. Then, to enhance the model’s capability of exploiting the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn, encourages the model to estimate precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor which adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building the query-dependent representation for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on QVHighlights, TVSum, and Charades-STA datasets.

Method

Cross-Attentive Transformer Encoder

Self-attention alone is not enough to relate the two modalities, since consecutive frames are visually very similar and there is a modality gap between video and text.
Cross-attention explicitly ensures that textual information is embedded in the video representation.
Fusing multi-modal features with a cross-attention layer enables the use of a temporal conditional query, which is a temporal version of the one in DAB-DETR.
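
To make the cross-attention step concrete, here is a minimal single-head sketch in which video clips act as queries and text tokens act as keys and values. The learned projection matrices, multiple heads, and residual connections of a real transformer layer are omitted, and all names and toy values are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_attention(video_clips, text_tokens):
    """Each clip (query) attends over the text tokens (keys and values).

    Returns one text-conditioned vector per clip.
    """
    dim = len(text_tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for clip in video_clips:
        weights = softmax([dot(clip, tok) / scale for tok in text_tokens])
        outputs.append([
            sum(w * tok[d] for w, tok in zip(weights, text_tokens))
            for d in range(dim)
        ])
    return outputs

# Toy input: two clips and two text tokens, all two-dimensional.
video_clips = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]
fused = cross_attention(video_clips, text_tokens)
```

Because every output is a weighted sum of text tokens, the query information is present in each clip representation by construction, which self-attention over a concatenated sequence does not guarantee.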

Learning from Negative Relationship

Negative-pair learning encourages the model to learn the general relationship between video and text.
We expect higher involvement of the text information, since video-text similarity becomes more distinguishable.
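
One simple way to picture negative-pair learning is sketched below: each video is paired with the query of another sample in the batch, and the predicted saliency for such pairs is penalized when it is high. The cyclic-shift pairing, the loss form, and the function names are assumptions for illustration, not the exact procedure of the paper; the sketch also assumes that queries from different samples are unrelated.

```python
import math

def make_negative_pairs(videos, queries):
    """Pair each video with the query of the next sample (cyclic shift)."""
    if len(queries) < 2:
        raise ValueError("need at least two samples to form mismatched pairs")
    shifted = queries[1:] + queries[:1]
    return list(zip(videos, shifted))

def negative_saliency_loss(saliency_logits):
    """Mean of -log(1 - sigmoid(s)) over the clips of a negative pair.

    Uses the identity -log(1 - sigmoid(s)) = log(1 + exp(s)); very large
    logits would overflow math.exp, which this sketch does not guard against.
    """
    return sum(math.log1p(math.exp(s)) for s in saliency_logits) / len(saliency_logits)

pairs = make_negative_pairs(["v0", "v1", "v2"], ["q0", "q1", "q2"])
low = negative_saliency_loss([-5.0, -5.0])   # model already predicts low saliency
high = negative_saliency_loss([5.0, 5.0])    # model wrongly predicts high saliency
```

The loss is small when an irrelevant pair receives low saliency and large otherwise, so the model cannot score clips well without looking at the query.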

Input-Adaptive Saliency Predictor

Naively using a single MLP as the saliency predictor cannot cover the diverse nature of video-text pairs.
Instead, we define a saliency token that is adaptively transformed into an input-dependent saliency token via attention layers.
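
The idea of an input-dependent saliency token can be sketched as follows: a shared token first attends over the clip features of the current input, and the adapted token is then compared with each clip to produce a score. The single attention step, the residual addition, the dot-product scoring, and all names are assumptions made for this illustration; the actual predictor may be built differently.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def adapt_saliency_token(token, clip_feats):
    """One attention step: the token is the query, clips are keys and values.

    The attended context is added back to the token (residual connection).
    """
    scale = math.sqrt(len(token))
    weights = softmax([dot(token, c) / scale for c in clip_feats])
    context = [
        sum(w * c[d] for w, c in zip(weights, clip_feats))
        for d in range(len(token))
    ]
    return [t + x for t, x in zip(token, context)]

def saliency_scores(token, clip_feats):
    """Score each clip by its scaled dot product with the adapted token."""
    adapted = adapt_saliency_token(token, clip_feats)
    scale = math.sqrt(len(token))
    return [dot(adapted, c) / scale for c in clip_feats]

token = [1.0, 0.0]  # hypothetical learned saliency token
clips = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores = saliency_scores(token, clips)
```

Unlike a fixed MLP head, the scoring vector here changes with the input, so the criterion for what counts as salient is set per video-query pair.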

Experiment

Bibtex

@inproceedings{moon2023query,
  title={Query-dependent video representation for moment retrieval and highlight detection},
  author={Moon, WonJun and Hyun, Sangeek and Park, SangUk and Park, Dongchan and Heo, Jae-Pil},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={23023--23033},
  year={2023}
}

Link

Arxiv Paper Code Video

Difficulty-Aware Simulator for Open Set Recognition (ECCV22)

2022-10-22T00:00:00+00:00

Abstract

Open set recognition (OSR) assumes unknown instances appear out of the blue at the inference time. The main challenge of OSR is that the response of models for unknowns is totally unpredictable. Furthermore, the diversity of open set makes it harder since instances have different difficulty levels. Therefore, we present a novel framework, DIfficulty-Aware Simulator (DIAS), that generates fakes with diverse difficulty levels to simulate the real world. We first investigate fakes from generative adversarial network (GAN) in the classifier’s viewpoint and observe that these are not severely challenging. This leads us to define the criteria for difficulty by regarding samples generated with GANs having moderate-difficulty. To produce hard-difficulty examples, we introduce Copycat, imitating the behavior of the classifier. Furthermore, moderate- and easy-difficulty samples are also yielded by our modified GAN and Copycat, respectively. As a result, DIAS outperforms state-of-the-art methods with both metrics of AUROC and F-score.

Link

Arxiv Paper Code Video

Tailoring Self-Supervision for Supervised Learning (ECCV22)

2022-10-22T00:00:00+00:00

Abstract

Recently, it is shown that deploying a proper self-supervision is a prospective way to enhance the performance of supervised learning. Yet, the benefits of self-supervision are not fully exploited as previous pretext tasks are specialized for unsupervised representation learning. To this end, we begin by presenting three desirable properties for such auxiliary tasks to assist the supervised objective. First, the tasks need to guide the model to learn rich features. Second, the transformations involved in the self-supervision should not significantly alter the training distribution. Third, the tasks are preferred to be light and generic for high applicability to prior arts. Subsequently, to show how existing pretext tasks can fulfill these and be tailored for supervised learning, we propose a simple auxiliary self-supervision task, predicting localizable rotation (LoRot). Our exhaustive experiments validate the merits of LoRot as a pretext task tailored for supervised learning in terms of robustness and generalization capability.
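
The abstract does not spell out how LoRot works. Reading "localizable rotation" as rotating only a local region of the image and asking the model to predict that rotation is an assumption of the sketch below, as are the patch position, the patch size, and the use of the rotation index `k` as the auxiliary label.

```python
def rotate90(patch):
    """Rotate a square 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]

def localized_rotation(image, top, left, size, k):
    """Return a copy of `image` with one square patch rotated k * 90 degrees.

    The patch covers rows top..top+size-1 and columns left..left+size-1;
    the rest of the image is left untouched.
    """
    out = [row[:] for row in image]
    patch = [row[left:left + size] for row in image[top:top + size]]
    for _ in range(k % 4):
        patch = rotate90(patch)
    for r in range(size):
        out[top + r][left:left + size] = patch[r]
    return out

# Toy 4x4 "image" whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
k = 1  # auxiliary label: the patch is rotated by 90 degrees
augmented = localized_rotation(image, top=0, left=0, size=2, k=k)
```

Because only a small region changes, the augmented image stays close to the training distribution, which matches the second desirable property listed in the abstract.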

Link

Arxiv Paper Code Video