
Enable precision-preserving mel training + 16-bit W&B logging for unconditional example #1

Open
turian wants to merge 11 commits into main from bigvgan_wandb

Conversation


@turian turian commented Nov 8, 2025

Summary

  • Add _prepare_sample_images/_log_sample_images helpers plus --image_bit_depth so sample grids can be emitted as true uint16 PNGs (with 8-bit previews for TensorBoard) and uploaded losslessly to W&B, including a temp-file workaround so wandb.Image accepts 16-bit payloads. WANDB_AUDIO_HOOK lets us call back into a user-provided module:function to attach generated audio to the same step.
    Paths: examples/unconditional_image_generation/train_unconditional.py:76, 256, 788.
  • Introduce --preserve_input_precision and a matching transform pipeline that skips the default .convert("RGB") cast, keeps planar uint16 data via PILToTensor, and only normalizes after we’ve enforced three channels. This lets us feed mel PNGs without data degradation or redundant quantization.
    Paths: examples/unconditional_image_generation/train_unconditional.py:394, 597-625.
  • Document the new flag in the unconditional README so users know how to opt into 16-bit logging and how the previews behave across TensorBoard vs. W&B.
    Path: examples/unconditional_image_generation/README.md:45.
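
The uint16/uint8 scaling described in the first bullet can be sketched as follows. This is a hypothetical reimplementation for illustration (the function name and signature mirror the summary above, not the helper's actual code):

```python
import numpy as np

def prepare_sample_images(images, bit_depth=8):
    """Sketch of the scaling step: NHWC floats in [0, 1] become
    uint8 or uint16 arrays; for 16-bit output an 8-bit preview is
    also produced for TensorBoard. Illustrative only."""
    images = np.clip(np.asarray(images, dtype=np.float32), 0.0, 1.0)
    if bit_depth == 16:
        full = (images * 65535.0).round().astype(np.uint16)
        preview = (images * 255.0).round().astype(np.uint8)
        return full, preview
    full = (images * 255.0).round().astype(np.uint8)
    return full, full  # 8-bit output doubles as its own preview
```

The key point is that the float-to-integer quantization happens exactly once, at the requested bit depth, so the W&B upload sees the full 16-bit range while TensorBoard gets a cheap 8-bit view of the same batch.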

Details

  1. Sample logging upgrades (train_unconditional.py:76-177, 788-839)
    • _prepare_sample_images scales NHWC floats into uint8/uint16 arrays and produces a TensorBoard-safe 8-bit preview when the requested --image_bit_depth is 16.
    • _log_sample_images routes to the chosen tracker: TensorBoard sees the preview tensor, while W&B either uploads uint8 arrays directly or encodes each uint16 frame to disk via Pillow before creating wandb.Image objects. Cleanup is handled even on failure.
    • When WANDB_AUDIO_HOOK=package.module:fn_name and --logger=wandb, the generated numpy images/metadata are passed to that callback. Any dict it returns is merged into the log payload, so BigVGAN (or other vocoders) can push aligned audio without modifying diffusers core code.
  2. Precision-preserving dataloader (train_unconditional.py:597-625)
    • Keeping mel PNGs in 16-bit space previously forced a lossy image.convert("RGB"). The new --preserve_input_precision flag switches to precision_augmentations, which runs PILToTensor → _ensure_three_channels → ConvertImageDtype(torch.float32) before spatial ops. Palette images still get promoted once, but standard uint16 PNGs stay untouched until normalization.
  3. Docs & ergonomics (README.md:45)
    • Quick-start instructions now mention --image_bit_depth 16, clarify that previews remain 8-bit, and point W&B users to the Files/Artifacts tab for the high-precision grids. This keeps the branch self-documenting for downstream researchers.
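
A small round-trip check illustrates what the precision-preserving path buys. This is a standalone sketch (not code from the PR): a uint16 "mel" frame survives a 16-bit PNG save/load exactly, while any cast down to 8 bits, as the default RGB path implies, collapses 65,536 levels into 256:

```python
import io

import numpy as np
from PIL import Image

# A synthetic 16-bit "mel" frame with values well beyond the 8-bit range.
mel = np.linspace(0, 65535, 64, dtype=np.uint16).reshape(8, 8)

# Lossless path: uint16 data round-trips through a 16-bit grayscale PNG.
buf = io.BytesIO()
Image.fromarray(mel).save(buf, format="PNG")  # Pillow writes mode "I;16" as 16-bit PNG
buf.seek(0)
restored = np.asarray(Image.open(buf)).astype(np.uint16)

# Illustrative 8-bit cast: this is the quantization the flag avoids.
eight_bit = (mel // 257).astype(np.uint8)
```

After the round trip, `restored` is bit-for-bit identical to `mel`, whereas `eight_bit` can represent at most 256 distinct levels.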

How To Use

  1. Quantization-safe training:

    accelerate launch examples/unconditional_image_generation/train_unconditional.py \
      --train_data_dir=…/mels_png --resolution 128 --image_bit_depth 16 \
      --preserve_input_precision --logger wandb
  2. Optional W&B audio:

    export WANDB_AUDIO_HOOK=scripts.audio_hooks:log_bigvgan_audio

    (or your own module). The hook receives images, epoch, global_step, and args, and returns a dict of additional metrics/files to merge into the log payload.
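
  The module:function string can be resolved with a few lines of importlib; a sketch of that resolution plus a minimal hook skeleton (the hook body below is hypothetical, standing in for a real vocoder call):

```python
import importlib
import os

def resolve_hook(spec):
    """Resolve a 'package.module:fn_name' string to a callable,
    mirroring the WANDB_AUDIO_HOOK format described above (a sketch,
    not the script's actual loader)."""
    module_name, _, fn_name = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, fn_name)

# Hypothetical hook: receives the generated images plus step metadata
# and returns extra entries to merge into the W&B log payload.
def log_bigvgan_audio(images, epoch, global_step, args):
    # ...run the vocoder on each mel image, build audio objects here...
    return {"audio/num_clips": len(images)}

hook = resolve_hook("os.path:join")  # stdlib example of the spec format
```

  Returning a plain dict keeps the contract simple: whatever keys the hook emits are logged at the same W&B step as the image grid.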

Testing

  • accelerate launch … --image_bit_depth 16 --logger=tensorboard (verifies TB preview pipeline).
  • accelerate launch … --logger=wandb --image_bit_depth 16 --preserve_input_precision with a WANDB_AUDIO_HOOK pointing at our BigVGAN helper to confirm 16-bit uploads + audio payloads.

