Description
Hi all,
Thanks for sharing the repository. I have a question about training deimv2_hgnetv2_n on my own dataset, which has 1 target class. The dataset contains about 20k images, and 6k of them have no target objects (so no bounding boxes). I noticed that after a certain number of epochs (82 in my case), training gets stuck until timeout, and volatile GPU utilization in nvidia-smi shows 100% on all of my GPUs. However, if I disable shuffling in training (shuffle: False in train_dataloader), the model trains past that epoch but quickly overfits (which is expected). Do you have any idea why this happens and how to fix it? I read in another thread that it might be caused by too many samples having no bbox, but I would like to hear your take. I am training on 4 GPUs with a batch size of 32. My config is included below:
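For reference, here is how I counted the images without boxes. This is a minimal stdlib-only sketch assuming standard COCO-style annotations; `count_empty_images` is just a helper name I made up, not part of the repo:

```python
import json

def count_empty_images(coco_json_path):
    """Count images in a COCO-style annotation file that have no annotations."""
    with open(coco_json_path) as f:
        coco = json.load(f)
    # image ids that appear in at least one annotation
    annotated = {ann["image_id"] for ann in coco.get("annotations", [])}
    total = len(coco["images"])
    empty = sum(1 for img in coco["images"] if img["id"] not in annotated)
    return total, empty
```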
```yaml
__include__: [
  '../dataset/dataset_custom.yml',
  '../runtime.yml',
  '../base/dataloader.yml',
  '../base/optimizer.yml',
  '../base/deimv2.yml'
]

output_dir: ./outputs/deimv2_hgnetv2_n_custom

HGNetv2:
  name: 'B0'
  return_idx: [2, 3]
  freeze_at: -1
  freeze_norm: False
  use_lab: True

HybridEncoder:
  in_channels: [512, 1024]
  feat_strides: [16, 32]
  # intra
  hidden_dim: 128
  use_encoder_idx: [1]
  dim_feedforward: 512
  # cross
  expansion: 0.34
  depth_mult: 0.5
  version: 'dfine'

DEIMTransformer:
  feat_channels: [128, 128]
  feat_strides: [16, 32]
  hidden_dim: 128
  num_levels: 2
  num_points: [6, 6]
  num_layers: 3
  eval_idx: -1
  # FFN
  dim_feedforward: 512

optimizer:
  type: AdamW
  params:
    -
      params: '^(?=.*backbone)(?!.*norm|bn).*$'
      lr: 0.0004
    -
      params: '^(?=.*backbone)(?=.*norm|bn).*$'
      lr: 0.0004
      weight_decay: 0.
    -
      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn|bias)).*$'
      weight_decay: 0.
  lr: 0.0008
  betas: [0.9, 0.999]
  weight_decay: 0.0001

# Increase to search for the optimal ema
epoches: 160  # 160 = 148 + 12

## Our LR-Scheduler
flat_epoch: 84  # 4 + epoch // 2, e.g., 40 = 4 + 72 / 2
no_aug_epoch: 12
lr_gamma: 1.0

## Our DataAug
train_dataloader:
  dataset:
    transforms:
      policy:
        epoch: [4, 78, 148]  # list
  collate_fn:
    ema_restart_decay: 0.9999
    base_size_repeat: ~
    mixup_epochs: [4, 78]
    stop_epoch: 148
    copyblend_prob: 0.4
    copyblend_epochs: [4, 78]  # CP half
    area_threshold: 100
    num_objects: 3
    with_expand: True
    expand_ratios: [0.1, 0.25]
  shuffle: True
  num_workers: 4

DEIMCriterion:
  matcher:
    # new matcher
    change_matcher: True
    iou_order_alpha: 4.0
  matcher_change_epoch: 136
```
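In case it helps narrow things down: if the empty images are indeed the cause, one workaround I could try is dropping them from the annotation file before training. A minimal stdlib-only sketch, assuming standard COCO-style annotations (`filter_empty_coco` is my own helper name, not part of the repo):

```python
import json

def filter_empty_coco(src_path, dst_path):
    """Write a copy of a COCO-style annotation file that keeps only
    the images having at least one annotation (i.e. at least one bbox)."""
    with open(src_path) as f:
        coco = json.load(f)
    annotated = {ann["image_id"] for ann in coco.get("annotations", [])}
    coco["images"] = [img for img in coco["images"] if img["id"] in annotated]
    with open(dst_path, "w") as f:
        json.dump(coco, f)
    return len(coco["images"])  # number of images kept
```

This throws away the negative samples, though, so I would prefer to understand the actual cause of the hang.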