Sudden training failure (High RMSE, NaN RMSE pcutoff, mAP/mAR=0) partway through training network

### Is there an existing issue for this?

- [x] I have searched the existing issues

### Operating System

Windows 11 Pro


### DeepLabCut version

3.0.0rc9

### What engine are you using?

pytorch

### DeepLabCut mode

multi animal

### Device type

Intel(R) Core(TM) i9-14900HX (2.20 GHz)

### Bug description 🐛

After successfully training previous multi-animal projects on DLC 2.3.5, I created a project on a new device with DLC 3.0 and got as far as training the new network. Because we intend to use this pose estimation for later analysis in SimBA, we were encouraged to manually label ~1000 images (33 per video from 32 videos). We trained the network for 950 epochs with a batch size of 8 (further details below), but evaluation showed very poor results (test RMSE of 85.63 and test mAP/mAR of 0.00), and the tracklets were empty during analysis.

While experimenting with the training, I found that partway through (around 80 epochs, sometimes as few as 40) the slow improvement of the network abruptly stops: the RMSE begins to increase, the RMSE pcutoff becomes NaN (though in network evaluation it is usually a high number rather than NaN), and the mAP/mAR drop to 0.00. This persists for the rest of training, up to whatever maximum number of epochs was set; training never throws an error or stops.
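
To pin down the exact epoch where this happens, a minimal Python sketch like the one below could scan a saved copy of the training output. It assumes the log uses the same `Epoch N/M` and `metrics/test.<name>: <value>` lines shown under "Relevant log output"; the function name `first_collapsed_epoch` is illustrative and not part of DeepLabCut.

```python
import math
import re

# Patterns assume the log format shown in this issue's "Relevant log output".
EPOCH_RE = re.compile(r"Epoch (\d+)/\d+")
METRIC_RE = re.compile(r"metrics/test\.(\w+):\s+(\S+)")


def first_collapsed_epoch(log_text):
    """Return the first epoch whose rmse_pcutoff is NaN or whose mAP is 0, else None."""
    epoch = None
    for line in log_text.splitlines():
        epoch_match = EPOCH_RE.search(line)
        if epoch_match:
            epoch = int(epoch_match.group(1))
            continue
        metric_match = METRIC_RE.search(line)
        if metric_match is None or epoch is None:
            continue
        name = metric_match.group(1)
        value = float(metric_match.group(2))  # float("nan") parses to NaN
        if name == "rmse_pcutoff" and math.isnan(value):
            return epoch
        if name == "mAP" and value == 0.0:
            return epoch
    return None


# Sample text copied from the end-of-training output in this issue.
sample = """Epoch 950/950 (lr=1e-05), train loss 0.00642, valid loss 0.66643
Model performance:
  metrics/test.rmse:          78.88
  metrics/test.rmse_pcutoff:    nan
  metrics/test.mAP:            0.00
  metrics/test.mAR:            0.00
"""
print(first_collapsed_epoch(sample))  # prints 950
```

Run on the full training log, this would report the first evaluated epoch showing either symptom, which may help correlate the collapse with a learning-rate change or a specific snapshot.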

### Steps To Reproduce

1. Create a multi-animal DLC project on 3.0 with two animals and 8 body parts per animal, then proceed past labelling to creating a training dataset and training the network.
2. Use the following training, evaluation and analysis configuration:

- TrainingFraction: 0.95
- iteration: 0
- default_net_type: resnet_50
- default_augmenter: albumentations
- default_track_method: ellipse
- snapshotindex: -1
- detector_snapshotindex: -1
- batch_size: 8
- 950 epochs, saved every 50 epochs

3. Observe the failure (40+ epochs into training).
4. Later, run Evaluate Network and observe the errors below. I don't know whether these are related to the training issue, but they are downstream problems that I haven't encountered in previous multi-animal DLC analyses, and I'd appreciate any insight.
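
For scale, the numbers in step 2 and the bug description imply roughly the following dataset split and step count. This is a back-of-the-envelope sketch: it assumes every extracted frame was labeled, that the split truncates to a whole number of images, and that the last partial batch is kept, none of which is confirmed to match DeepLabCut's actual behavior.

```python
import math

# Values taken from this issue.
frames_per_video = 33
n_videos = 32
training_fraction = 0.95
batch_size = 8
epochs = 950

n_labeled = frames_per_video * n_videos

# Assumption: the train split is truncated to a whole number of images.
n_train = int(n_labeled * training_fraction)
n_test = n_labeled - n_train

# Assumption: the final, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(n_labeled, n_train, n_test)    # prints 1056 1003 53
print(steps_per_epoch, total_steps)  # prints 126 119700
```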

### Relevant log output

```shell
Original (950 epochs) output at end of training: 
Epoch 950/950 (lr=1e-05), train loss 0.00642, valid loss 0.66643
Model performance:
  metrics/test.rmse:          78.88
  metrics/test.rmse_pcutoff:    nan
  metrics/test.mAP:            0.00
  metrics/test.mAR:            0.00

Original training (950 epochs) evaluation: 
INFO:console:Evaluation results for DLC_Resnet50_HW_ResidentIntruderDLC_1Aug7shuffle1_snapshot_760-results.csv (pcutoff: 0.6):
INFO:console:train rmse                              30.23
train rmse_pcutoff                      43.44
train mAP                                0.03
train mAR                                0.01
train id_head_Ear_left_accuracy          0.49
train id_head_Ear_right_accuracy         0.49
train id_head_Nose_accuracy              0.50
train id_head_Center_accuracy            0.48
train id_head_Lateral_left_accuracy      0.48
train id_head_Lateral_right_accuracy     0.48
train id_head_Tail_base_accuracy         0.49
train id_head_Tail_end_accuracy          0.50
test rmse                               85.63
test rmse_pcutoff                       61.84
test mAP                                 0.00
test mAR                                 0.00
test id_head_Ear_left_accuracy           0.51
test id_head_Ear_right_accuracy          0.52
test id_head_Nose_accuracy               0.51
test id_head_Center_accuracy             0.49
test id_head_Lateral_left_accuracy       0.48
test id_head_Lateral_right_accuracy      0.48
test id_head_Tail_base_accuracy          0.52
test id_head_Tail_end_accuracy           0.49


Evaluate Network error #1: 
:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\deeplabcut\pose_estimation_pytorch\data\postprocessor.py:489: RuntimeWarning: invalid value encountered in cast
  heatmap_indices = np.rint(individual_keypoints).astype(int)

Evaluate Network error #2 (Only when I check Plot) 
Traceback (most recent call last):
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\deeplabcut\gui\tabs\evaluate_network.py", line 235, in evaluate_network
    _ = launch_napari(image_dir)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\deeplabcut\gui\widgets.py", line 46, in launch_napari
    viewer.open(files, plugin=plugin, stack=stack)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\napari\components\viewer_model.py", line 1092, in open
    self._add_layers_with_plugins(
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\napari\components\viewer_model.py", line 1292, in _add_layers_with_plugins
    layer_data, hookimpl = read_data_with_plugins(
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\napari\plugins\io.py", line 77, in read_data_with_plugins
    res = _npe2.read(paths, plugin, stack=stack)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\napari\plugins\_npe2.py", line 63, in read
    layer_data, reader = io_utils.read_get_reader(
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\npe2\io_utils.py", line 66, in read_get_reader
    return _read(
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\npe2\io_utils.py", line 165, in _read
    read_func = rdr.exec(
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\npe2\manifest\contributions\_readers.py", line 61, in exec
    callable_ = super().exec(args=args, kwargs=kwargs, _registry=_registry)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\npe2\manifest\utils.py", line 61, in exec
    return self.get_callable(reg)(*args, **kwargs)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\napari_deeplabcut\_reader.py", line 79, in get_folder_parser
    layers.extend(read_images(images))
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\napari_deeplabcut\_reader.py", line 112, in read_images
    return [(imread(path), params, "image")]
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\dask_image\imread\__init__.py", line 48, in imread
    with pims.open(sfname) as imgs:
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\pims\api.py", line 161, in open
    return ImageSequence(sequence, **kwargs)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\pims\image_sequence.py", line 68, in __init__
    tmp = self.imread(self._filepaths[0], **self.kwargs)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\pims\image_sequence.py", line 85, in imread
    return imread(filename, **kwargs)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\skimage\_shared\utils.py", line 328, in fixed_func
    return func(*args, **kwargs)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\skimage\io\_io.py", line 82, in imread
    img = call_plugin('imread', fname, plugin=plugin, **plugin_args)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\skimage\_shared\utils.py", line 538, in wrapped
    return func(*args, **kwargs)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\skimage\io\manage_plugins.py", line 254, in call_plugin
    return func(*args, **kwargs)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\skimage\io\_plugins\imageio_plugin.py", line 11, in imread
    out = np.asarray(imageio_imread(*args, **kwargs))
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\imageio\v3.py", line 53, in imread
    with imopen(uri, "r", **plugin_kwargs) as img_file:
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\imageio\core\imopen.py", line 113, in imopen
    request = Request(uri, io_mode, format_hint=format_hint, extension=extension)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\imageio\core\request.py", line 249, in __init__
    self._parse_uri(uri)
  File "C:\Users\itilton\AppData\Local\anaconda3\envs\DEEPLABCUT\lib\site-packages\imageio\core\request.py", line 409, in _parse_uri
    raise FileNotFoundError("No such file: '%s'" % fn)
FileNotFoundError: No such file: 'C:\Users\itilton\Desktop\HW_ResidentIntruderDLC_1-haoyu-2025-08-07\evaluation-results-pytorch\iteration-0\HW_ResidentIntruderDLC_1Aug7-trainset95shuffle1\LabeledImages_DLC_Resnet50_HW_ResidentIntruderDLC_1Aug7shuffle1_snapshot_760\Test-237_d1_pursuit-img086.png'
```
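
For the `FileNotFoundError` in error #2, one possible sanity check is to confirm whether the labeled image was ever written and how long its path is. The sketch below is a hypothetical helper (`describe_missing_file` is not part of DeepLabCut or napari). The 260-character threshold is the classic Windows `MAX_PATH` limit and is only an assumption about a possible cause, since long-path support may be enabled on a given machine.

```python
import os

# Classic Windows MAX_PATH limit (includes the terminating null character).
# Whether it applies here depends on system settings; this is an assumption.
WINDOWS_MAX_PATH = 260


def describe_missing_file(path_str):
    """Collect basic facts about a path that a reader failed to open."""
    parent = os.path.dirname(path_str)
    parent_exists = os.path.isdir(parent)
    png_count = 0
    if parent_exists:
        # Count PNG files that actually exist next to the expected file.
        png_count = sum(
            1 for name in os.listdir(parent) if name.lower().endswith(".png")
        )
    return {
        "exists": os.path.exists(path_str),
        "parent_exists": parent_exists,
        "png_count_in_parent": png_count,
        "path_length": len(path_str),
        "exceeds_max_path": len(path_str) >= WINDOWS_MAX_PATH,
    }


# Path copied from the traceback above.
missing = (
    r"C:\Users\itilton\Desktop\HW_ResidentIntruderDLC_1-haoyu-2025-08-07"
    r"\evaluation-results-pytorch\iteration-0"
    r"\HW_ResidentIntruderDLC_1Aug7-trainset95shuffle1"
    r"\LabeledImages_DLC_Resnet50_HW_ResidentIntruderDLC_1Aug7shuffle1_snapshot_760"
    r"\Test-237_d1_pursuit-img086.png"
)
for key, value in describe_missing_file(missing).items():
    print(f"{key}: {value}")
```

If the parent folder exists but contains no PNG files, the plotting step likely never wrote its output; if the path length is at or over the limit, moving the project to a shorter base path might be worth trying.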

### Anything else?

I noticed that a similar issue was brought up in #2697, but in my case training starts strong and improves until it suddenly fails (instead of starting with RMSE pcutoff NaN and mAP/mAR=0), and I also have very high RMSE. I'm not sure whether the two are related or actually different.

I also found that the labelling on the images in the evaluation-results-pytorch folder doesn't resemble the labelling from earlier in the workflow: there is a variety of colors for different body parts. I've attached 1) the image shown when one clicks 'check labels' under label frames and 2) an image from the evaluation results.

<img width="221" height="449" alt="Image" src="https://github.com/user-attachments/assets/205afe0a-c1f6-4b47-8201-da1229782974" />
<img width="232" height="444" alt="Image" src="https://github.com/user-attachments/assets/d1678ab7-8bb3-4f34-a24f-f34cd371bf56" />

Thank you for any help or insight with this issue!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/DeepLabCut/DeepLabCut/blob/master/CODE_OF_CONDUCT.md)

Sudden training failure (High RMSE, NaN RMSE pcutoff, mAP/mAR=0) partway through training network #3089
