Enable precision-preserving mel training + 16‑bit W&B logging for unconditional example#1
Summary

- Add `--image_bit_depth` so sample grids can be emitted as true uint16 PNGs (with 8-bit previews for TensorBoard) and uploaded losslessly to W&B, including a temp-file workaround so `wandb.Image` accepts 16-bit payloads. A `WANDB_AUDIO_HOOK` environment variable lets us call back into a user-provided `module:function` to attach generated audio to the same step. Paths: `examples/unconditional_image_generation/train_unconditional.py:76, 256, 788`.
- Add `--preserve_input_precision` and a matching transform pipeline that skips the default `.convert("RGB")` cast, keeps planar uint16 data via `PILToTensor`, and only normalizes after we have enforced three channels. This lets us feed mel PNGs without data degradation or redundant quantization. Paths: `examples/unconditional_image_generation/train_unconditional.py:394, 597-625`.
- Document the new flags in the example README. Path: `examples/unconditional_image_generation/README.md:45`.

Details
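The 16-bit grid preparation summarized above can be sketched roughly as follows. This is a hypothetical standalone helper, not the PR's actual `_prepare_sample_images`; the 8-bit preview here is derived by dropping the low byte, which is one plausible way to keep TensorBoard happy:

```python
import numpy as np

def prepare_sample_images(images, bit_depth=8):
    """Scale NHWC floats in [0, 1] to uint8 or uint16 arrays.

    Illustrative sketch only; returns (full-precision grid, 8-bit preview).
    """
    images = np.clip(images, 0.0, 1.0)
    if bit_depth == 16:
        full = (images * 65535).round().astype(np.uint16)
        # TensorBoard only renders 8-bit images, so derive a preview
        # by dropping the low byte of each uint16 sample.
        preview = (full >> 8).astype(np.uint8)
        return full, preview
    full = (images * 255).round().astype(np.uint8)
    return full, full
```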
- Sampling and logging (`train_unconditional.py:76-177, 788-839`): `_prepare_sample_images` scales NHWC floats into uint8/uint16 arrays and produces a TensorBoard-safe 8-bit preview when the requested `--image_bit_depth` is 16. `_log_sample_images` routes to the chosen tracker: TensorBoard sees the preview tensor, while W&B either uploads uint8 arrays directly or encodes each uint16 frame to disk via Pillow before creating `wandb.Image` objects. Cleanup is handled even on failure. With `WANDB_AUDIO_HOOK=package.module:fn_name` and `--logger=wandb`, the generated numpy images and metadata are passed to that callback. Any dict it returns is merged into the log payload, so BigVGAN (or other vocoders) can push aligned audio without modifying diffusers core code.
- Dataset transforms (`train_unconditional.py:597-625`): the default pipeline casts every image with `image.convert("RGB")`. The new `--preserve_input_precision` flag switches to precision augmentations, which run `PILToTensor` → `_ensure_three_channels` → `ConvertImageDtype(torch.float32)` before spatial ops. Palette images still get promoted once, but standard uint16 PNGs stay untouched until normalization.
- README (`README.md:45`): document `--image_bit_depth 16`, clarify that previews remain 8-bit, and point W&B users to the Files/Artifacts tab for the high-precision grids. This keeps the branch self-documenting for downstream researchers.

How To Use
Quantization-safe training: run the example with `--preserve_input_precision` and `--image_bit_depth 16`.
Optional W&B audio:
`export WANDB_AUDIO_HOOK=scripts.audio_hooks:log_bigvgan_audio` (or your own module) so the hook receives `images`, `epoch`, `global_step`, and `args`, and returns a dict of additional metrics/files.
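A callback with that signature might look like the sketch below. The vocoder call is a placeholder (not a real BigVGAN API), and the silent waveform stands in for actual synthesis; only the signature and the returned dict follow the contract described above:

```python
import numpy as np

def log_bigvgan_audio(images, epoch, global_step, args):
    """Sketch of a WANDB_AUDIO_HOOK callback (illustrative only).

    Receives the generated numpy images plus step metadata and returns a
    dict that the training script merges into its W&B log payload.
    """
    import wandb  # imported lazily; only needed when --logger=wandb is active

    payload = {}
    for i, mel in enumerate(images):
        # A real hook would vocode the mel here (e.g. waveform = vocoder(mel));
        # we substitute a short silent clip for illustration.
        waveform = np.zeros(16000, dtype=np.float32)
        payload[f"audio/sample_{i}"] = wandb.Audio(
            waveform,
            sample_rate=16000,
            caption=f"epoch {epoch}, step {global_step}",
        )
    return payload
```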
Testing
- `accelerate launch … --image_bit_depth 16 --logger=tensorboard` (verifies the TensorBoard preview pipeline).
- `accelerate launch … --logger=wandb --image_bit_depth 16 --preserve_input_precision` with a `WANDB_AUDIO_HOOK` pointing at our BigVGAN helper to confirm 16-bit uploads and audio payloads.