
[Feat] FP8 per tensor quant support #4043

Open

Datta0 wants to merge 3 commits into unslothai:main from Datta0:fp8_per_tensor

Conversation

@Datta0 (Collaborator) commented Feb 13, 2026

Fixes: #3862
We reuse the existing block-quant matmul code, but set the block size to the full tensor size so that a single scale covers the whole weight matrix.
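
For concreteness, here is a minimal PyTorch sketch of the idea (not the PR's actual code): per-tensor quantization is just block quantization where the single block spans the whole tensor, so one scalar scale covers the entire weight matrix. The helper names per_tensor_quantize_fp8 and per_tensor_dequantize are hypothetical.

    import torch

    def per_tensor_quantize_fp8(w: torch.Tensor):
        # One scalar scale for the whole tensor: block size == tensor size.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max            # 448.0 for e4m3fn
        scale = w.abs().max().clamp(min = 1e-12) / fp8_max
        w_fp8 = (w / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return w_fp8, scale

    def per_tensor_dequantize(w_fp8: torch.Tensor, scale: torch.Tensor):
        # Dequantize by broadcasting the single scale over the whole matrix.
        return w_fp8.to(torch.float32) * scale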

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Datta0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for per-tensor FP8 quantization within the unsloth framework. By enabling the use of a single scaling factor for an entire weight tensor, it simplifies the quantization process for specific models and broadens the applicability of FP8. The changes ensure that the dequantization and linear layer operations correctly interpret and apply these new per-tensor scales, integrating seamlessly with existing block and row quantization methods.

Highlights

  • Per-tensor FP8 Quantization Support: Implemented support for per-tensor FP8 quantization, allowing a single scale value to be applied across an entire weight matrix for simplified quantization scenarios.
  • Enhanced Weight Dequantization: Modified the weight_dequant function to correctly handle per-tensor scales, distinguishing them from the existing row-quantized and block-quantized scales (see the shape-based sketch after this list).
  • Updated FP8BlockQuantLinear Forward Pass: The FP8BlockQuantLinear forward pass was updated to detect per-tensor scales and expand them into the appropriate block scale format for internal processing, while preserving the original scale for the backward pass.
  • Refined FP8 Linear Dispatch Logic: Adjusted the fp8_linear dispatch logic to correctly route per-tensor quantized operations through the fp8_block_quant_linear path, ensuring proper handling alongside block-quantized operations.
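
As a rough illustration of the dequantization highlight above, the three layouts can be distinguished purely by the scale's shape. This is a hedged sketch assuming per-tensor scales arrive as a single element, row scales as a 1-D tensor, and block scales as a 2-D grid; dequant_by_scale_shape is an illustrative name, not unsloth's weight_dequant.

    import torch

    def dequant_by_scale_shape(w_fp8: torch.Tensor, scale: torch.Tensor,
                               block_size: int = 128) -> torch.Tensor:
        w = w_fp8.to(torch.float32)
        if scale.numel() == 1:
            # Per-tensor: one scalar scale covers the whole weight matrix.
            return w * scale
        if scale.ndim == 1:
            # Row-quantized: one scale per output row, broadcast across columns.
            return w * scale.view(-1, 1)
        # Block-quantized: one scale per (block_size x block_size) tile.
        N, K = w.shape
        scale_full = scale.repeat_interleave(block_size, dim = 0)[:N]
        scale_full = scale_full.repeat_interleave(block_size, dim = 1)[:, :K]
        return w * scale_full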


Changelog
  • unsloth/kernels/fp8.py
    • Added explicit handling for per-tensor scales in the weight_dequant function.
    • Refactored w8a8_block_fp8_matmul_triton to improve argument passing and introduce default block size handling.
    • Modified FP8BlockQuantLinear.forward to expand per-tensor scales to a block scale shape and preserve the original scale for the backward pass (see the expansion sketch after this changelog).
    • Updated fp8_linear to correctly dispatch per-tensor and block-quantized FP8 operations based on scale properties.
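
To make the forward-pass change concrete, the sketch below shows one way a scalar per-tensor scale could be broadcast into the (ceil(N/block), ceil(K/block)) grid that a block-quantized matmul kernel expects, so the existing block path can be reused unchanged. The function name and the grid layout are assumptions, not the PR's implementation.

    import math
    import torch

    def expand_per_tensor_scale(scale: torch.Tensor, weight_shape, block_size: int = 128):
        # Hypothetical helper: broadcast one scalar scale into a per-block scale grid.
        N, K = weight_shape
        grid_n = math.ceil(N / block_size)
        grid_k = math.ceil(K / block_size)
        # Every block shares the same scale, so the grid is the scalar repeated.
        return scale.reshape(1, 1).expand(grid_n, grid_k).contiguous()

    # Example: a (4096, 4096) weight with block_size 128 yields a (32, 32) scale grid.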

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds support for FP8 per-tensor quantization by updating the logic in unsloth/kernels/fp8.py. The changes correctly handle scalar weight scales in weight_dequant and FP8BlockQuantLinear, and route per-tensor quantized operations to the appropriate functions. While the implementation for the forward pass is sound, I've identified a critical issue in the backward pass where the block_size is not being used, potentially leading to incorrect gradients for block-quantized weights with non-default block sizes. I've also noted a minor point of confusion in a comment.

I am having trouble creating individual review comments, so my feedback is included below.

unsloth/kernels/fp8.py (351-353)

critical

The block_size is no longer saved to the context. While it's true that ctx.block_size was unused in the backward pass, this points to a potential bug. The backward pass calls weight_dequant, which in turn calls weight_dequant_block with a hardcoded default block_size of 128. If a non-default block_size is used in the forward pass (e.g., from weight.block_size), the dequantization in the backward pass will be incorrect, leading to wrong gradients.

To fix this, block_size should be saved to the context and the backward pass should be updated to use it for correct dequantization. This might require changes to weight_dequant and weight_dequant_block to accept and use the block_size.
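
A minimal sketch of that suggestion, assuming weight_dequant can accept a block_size keyword; the autograd class and the block_fp8_matmul call are illustrative stand-ins rather than the PR's actual FP8BlockQuantLinear:

    import torch

    class _FP8LinearSketch(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, weight_fp8, weight_scale, block_size):
            ctx.save_for_backward(weight_fp8, weight_scale)
            ctx.block_size = block_size                    # preserve for the backward pass
            # out = x @ W.T computed blockwise in FP8 (hypothetical kernel wrapper).
            return block_fp8_matmul(x, weight_fp8, weight_scale, block_size)

        @staticmethod
        def backward(ctx, grad_output):
            weight_fp8, weight_scale = ctx.saved_tensors
            # Dequantize with the block_size actually used in forward,
            # not a hardcoded default of 128.
            w = weight_dequant(weight_fp8, weight_scale, block_size = ctx.block_size)
            grad_x = grad_output @ w                       # weight is frozen, only x needs grad
            return grad_x, None, None, None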

unsloth/kernels/fp8.py (309-310)

medium

The comment at line 309 is misleading: it states that the original scale is saved before any transformation, but original_weight_scale is updated on line 332 if the scale is transposed. A more accurate comment would clarify that this variable holds the scale to be used in the backward pass.

        # Save the scale for the backward pass.
        original_weight_scale = weight_scale

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 400591fcf6




Development

Successfully merging this pull request may close these issues.

[Bug] Unable to train Devstral2
