Training stuck after certain number of epochs with "shuffle=True" #34

@mastiche

Description

Hi all,

Thanks for sharing the repository. I have a question about training deimv2_hgnetv2_n on my own dataset, which has one target class and about 20k images, 6k of which contain no target objects (and therefore no bounding boxes). After a certain number of epochs (82 in my case), training hangs until the distributed timeout, and every GPU shows 100% volatile GPU-util in nvidia-smi. However, if I disable shuffling (shuffle: False in train_dataloader), training gets past that epoch but quickly overfits, which is expected. Do you have any idea what causes this and how to fix it? I read in another thread that it might be related to too many samples having no bounding boxes, but I would like to hear your take. I am training on 4 GPUs with a batch size of 32. My config is included below:

```yaml
__include__: [
  '../dataset/dataset_custom.yml',
  '../runtime.yml',
  '../base/dataloader.yml',
  '../base/optimizer.yml',
  '../base/deimv2.yml'
]

output_dir: ./outputs/deimv2_hgnetv2_n_custom

HGNetv2:
  name: 'B0'
  return_idx: [2, 3]
  freeze_at: -1
  freeze_norm: False
  use_lab: True

HybridEncoder:
  in_channels: [512, 1024]
  feat_strides: [16, 32]

  # intra
  hidden_dim: 128
  use_encoder_idx: [1]
  dim_feedforward: 512

  # cross
  expansion: 0.34
  depth_mult: 0.5

  version: 'dfine'

DEIMTransformer:
  feat_channels: [128, 128]
  feat_strides: [16, 32]
  hidden_dim: 128
  num_levels: 2
  num_points: [6, 6]

  num_layers: 3
  eval_idx: -1

  # FFN
  dim_feedforward: 512

optimizer:
  type: AdamW
  params:
    -
      params: '^(?=.*backbone)(?!.*norm|bn).*$'
      lr: 0.0004
    -
      params: '^(?=.*backbone)(?=.*norm|bn).*$'
      lr: 0.0004
      weight_decay: 0.
    -
      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn|bias)).*$'
      weight_decay: 0.

  lr: 0.0008
  betas: [0.9, 0.999]
  weight_decay: 0.0001

# Increase to search for the optimal ema
epoches: 160 # 160 = 148 + 12

## Our LR-Scheduler
flat_epoch: 84   # 4 + epoch // 2, i.e., 84 = 4 + 160 // 2
no_aug_epoch: 12
lr_gamma: 1.0

## Our DataAug
train_dataloader:
  dataset:
    transforms:
      policy:
        epoch: [4, 78, 148]   # list

  collate_fn:
    ema_restart_decay: 0.9999
    base_size_repeat: ~
    mixup_epochs: [4, 78]
    stop_epoch: 148
    copyblend_prob: 0.4
    copyblend_epochs: [4, 78]   # CP half
    area_threshold: 100
    num_objects: 3
    with_expand: True
    expand_ratios: [0.1, 0.25]

  shuffle: True
  num_workers: 4

DEIMCriterion:
  matcher:
    # new matcher
    change_matcher: True
    iou_order_alpha: 4.0
    matcher_change_epoch: 136
```
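For what it's worth, if the hang really is triggered by batches containing no ground-truth boxes, one workaround I've seen suggested is to drop annotation-free images from the training split before training (keeping them for validation). A minimal sketch, assuming COCO-format annotations; the function name and dict layout here are illustrative, not part of this repository:

```python
def filter_empty_images(coco: dict) -> dict:
    """Return a copy of a COCO-style annotation dict that keeps only
    images referenced by at least one annotation."""
    # Collect the ids of all images that have at least one annotation.
    annotated_ids = {ann["image_id"] for ann in coco["annotations"]}
    # Keep only those images; annotations and categories are unchanged.
    kept_images = [img for img in coco["images"] if img["id"] in annotated_ids]
    return {**coco, "images": kept_images}


if __name__ == "__main__":
    coco = {
        "images": [{"id": 1}, {"id": 2}, {"id": 3}],
        "annotations": [{"id": 10, "image_id": 1, "bbox": [0, 0, 10, 10]}],
        "categories": [{"id": 0, "name": "target"}],
    }
    filtered = filter_empty_images(coco)
    print(len(filtered["images"]))  # prints 1 (images 2 and 3 are dropped)
```

Note that dropping all 6k empty images changes the negative-sample distribution, so background false positives may increase; keeping a fraction of them is a possible middle ground.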
