Training stuck after certain number of epochs with "shuffle=True" #34

@mastiche

Description

Hi all,

Thanks for sharing the repository. I have a question about training deimv2_hgnetv2_n on my own dataset, which has one target class and about 20k images, 6k of which contain no target objects (and therefore no bounding boxes). After a certain number of epochs (82 in my case), training hangs until the distributed timeout, and every GPU shows 100% volatile GPU-util in nvidia-smi. However, if I disable shuffling (shuffle: False in train_dataloader), training gets past that epoch but quickly overfits, which is expected. Do you have any idea what causes this and how to fix it? I read in another thread that it might be related to too many samples having no bounding boxes, but I would like to hear your take. I am training on 4 GPUs with a batch size of 32. My config is included below:

```yaml
__include__: [
  '../dataset/dataset_custom.yml',
  '../runtime.yml',
  '../base/dataloader.yml',
  '../base/optimizer.yml',
  '../base/deimv2.yml'
]

output_dir: ./outputs/deimv2_hgnetv2_n_custom

HGNetv2:
  name: 'B0'
  return_idx: [2, 3]
  freeze_at: -1
  freeze_norm: False
  use_lab: True

HybridEncoder:
  in_channels: [512, 1024]
  feat_strides: [16, 32]

  # intra
  hidden_dim: 128
  use_encoder_idx: [1]
  dim_feedforward: 512

  # cross
  expansion: 0.34
  depth_mult: 0.5

  version: 'dfine'

DEIMTransformer:
  feat_channels: [128, 128]
  feat_strides: [16, 32]
  hidden_dim: 128
  num_levels: 2
  num_points: [6, 6]

  num_layers: 3
  eval_idx: -1

  # FFN
  dim_feedforward: 512

optimizer:
  type: AdamW
  params:
    -
      params: '^(?=.*backbone)(?!.*norm|bn).*$'
      lr: 0.0004
    -
      params: '^(?=.*backbone)(?=.*norm|bn).*$'
      lr: 0.0004
      weight_decay: 0.
    -
      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn|bias)).*$'
      weight_decay: 0.

  lr: 0.0008
  betas: [0.9, 0.999]
  weight_decay: 0.0001

# Increase to search for the optimal ema
epoches: 160 # 160 = 148 + 12

## Our LR-Scheduler
flat_epoch: 84   # 4 + epoch // 2, i.e., 84 = 4 + 160 // 2
no_aug_epoch: 12
lr_gamma: 1.0

## Our DataAug
train_dataloader:
  dataset:
    transforms:
      policy:
        epoch: [4, 78, 148]   # list

  collate_fn:
    ema_restart_decay: 0.9999
    base_size_repeat: ~
    mixup_epochs: [4, 78]
    stop_epoch: 148
    copyblend_prob: 0.4
    copyblend_epochs: [4, 78]   # CP half
    area_threshold: 100
    num_objects: 3
    with_expand: True
    expand_ratios: [0.1, 0.25]

  shuffle: True
  num_workers: 4

DEIMCriterion:
  matcher:
    # new matcher
    change_matcher: True
    iou_order_alpha: 4.0
    matcher_change_epoch: 136
```
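For what it's worth, if the hang really is triggered by batches containing no ground-truth boxes, one workaround I've seen suggested is to drop annotation-free images from the training split before training (keeping them for validation). A minimal sketch, assuming COCO-format annotations; the function name and dict layout here are illustrative, not part of this repository:

```python
def filter_empty_images(coco: dict) -> dict:
    """Return a copy of a COCO-style annotation dict that keeps only
    images referenced by at least one annotation."""
    # Collect the ids of all images that have at least one annotation.
    annotated_ids = {ann["image_id"] for ann in coco["annotations"]}
    # Keep only those images; annotations and categories are unchanged.
    kept_images = [img for img in coco["images"] if img["id"] in annotated_ids]
    return {**coco, "images": kept_images}


if __name__ == "__main__":
    coco = {
        "images": [{"id": 1}, {"id": 2}, {"id": 3}],
        "annotations": [{"id": 10, "image_id": 1, "bbox": [0, 0, 10, 10]}],
        "categories": [{"id": 0, "name": "target"}],
    }
    filtered = filter_empty_images(coco)
    print(len(filtered["images"]))  # prints 1 (images 2 and 3 are dropped)
```

Note that dropping all 6k empty images changes the negative-sample distribution, so background false positives may increase; keeping a fraction of them is a possible middle ground.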
