
Vectorize Adler32 #124409

Open — stephentoub wants to merge 1 commit into dotnet:main from stephentoub:vectorizeadler

Conversation

@stephentoub
Member

No description provided.

Copilot AI review requested due to automatic review settings February 13, 2026 23:30
@github-actions github-actions bot added the needs-area-label label (an area label is needed to ensure this gets routed to the appropriate area owners) Feb 13, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the System.IO.Hashing Adler-32 implementation to use hardware intrinsics for vectorized processing (where available) and extends the test suite to validate correctness across a wider set of input shapes.

Changes:

  • Add SIMD-accelerated Adler-32 update paths for Arm (AdvSimd) and x86 (Vector128 / AVX2 / AVX-512 BW), with scalar fallback.
  • Refactor scalar update to share ModBase / NMax constants and introduce a scalar tail helper for vector paths.
  • Add new test coverage for many input lengths, max-byte data, and incremental append chunking, validated against a reference Adler-32 implementation.
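The tests described above validate against a reference Adler-32 implementation. A minimal sketch of such an oracle (per-byte modular reduction, `MOD_BASE = 65521`) looks like the following — the function name and structure here are illustrative, not the PR's actual test code; Python's stdlib `zlib.adler32` serves as an independent cross-check:

```python
import zlib

MOD_BASE = 65521  # largest prime below 2^16, the Adler-32 modulus

def adler32_reference(data: bytes, seed: int = 1) -> int:
    """Straightforward per-byte Adler-32, reducing after every byte."""
    s1 = seed & 0xFFFF
    s2 = (seed >> 16) & 0xFFFF
    for b in data:
        s1 = (s1 + b) % MOD_BASE
        s2 = (s2 + s1) % MOD_BASE
    return (s2 << 16) | s1

# Sanity check against the stdlib implementation
assert adler32_reference(b"hello world") == zlib.adler32(b"hello world")
```

Because it reduces after every byte, this form can never overflow, which makes it a trustworthy oracle for the overflow-stressing (all-0xFF) test inputs.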

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/libraries/System.IO.Hashing/src/System/IO/Hashing/Adler32.cs Introduces vectorized Adler-32 update implementations (Arm/x86) with feature detection and scalar fallback.
src/libraries/System.IO.Hashing/tests/Adler32Tests.cs Adds reference-based tests targeting multiple length boundaries, overflow-stressing inputs, and incremental append behavior.

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @dotnet/area-system-io-hashing, @bartonjs, @vcsjones
See info in area-owners.md if you want to be subscribed.

@stephentoub
Member Author

🤖 Copilot Code Review — PR #124409

Holistic Assessment

Motivation: The Adler-32 implementation added in #123601 was scalar-only. Adler-32 is used in zlib/deflate, so vectorization is a well-motivated performance improvement for a hot-path algorithm. The PR is justified.

Approach: The implementation follows the well-known approach from zlib/chromium of decomposing the Adler-32 s2 weighted sum into position-weighted sums computed via SIMD multiply-add intrinsics, with a prefix-sum accumulator tracking inter-block s1 contributions. Four vectorized paths are provided (AdvSimd, SSE2/SSSE3, AVX2, AVX-512BW) plus a widening fallback for Vector128 without SSSE3. This matches the pattern used by other vectorized hash implementations in this library (Crc32.Vectorized.cs, Crc64.Vectorized.cs, XxHashShared.cs).
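The block decomposition at the heart of this approach can be checked in a few lines. For one block of n bytes processed without per-byte reduction, the scalar recurrence `s1 += b[i]; s2 += s1` telescopes to `s1' = s1 + sum(b)` and `s2' = s2 + n*s1 + Σ(n-i)*b[i]` — a plain byte sum plus a position-weighted sum, which is exactly what a SIMD multiply-add computes. A sketch (names are illustrative, not the PR's identifiers):

```python
def scalar_block(s1, s2, block):
    # Byte-at-a-time Adler update (modular reduction omitted for clarity)
    for b in block:
        s1 += b
        s2 += s1
    return s1, s2

def decomposed_block(s1, s2, block):
    # Same update expressed as: block-size * s1 + position-weighted sum
    n = len(block)
    s2 += n * s1 + sum((n - i) * b for i, b in enumerate(block))
    s1 += sum(block)
    return s1, s2

block = bytes(range(40, 72))  # one 32-byte block
assert scalar_block(1, 0, block) == decomposed_block(1, 0, block)
```

The weighted sum is the part the intrinsics accelerate; the `n * s1` term is what the prefix-sum accumulator tracks across blocks.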

Summary: ✅ LGTM. After extensive mathematical verification of all four vectorized paths (tracing the s1/s2 accumulation through multi-block examples), the code is correct. The test coverage is thorough with a reference implementation, and the patterns are consistent with the rest of the codebase. One minor suggestion below. No blocking issues found.


Detailed Findings

✅ Vectorized s2 Accumulation — Verified correct across all paths

I traced through the prefix-sum logic (the vps/vs3 accumulators) for each path with a 2-block example:

  • Vector128/ARM (vps pattern): vps starts at s1 * blocks, accumulates old vs1 values, then vps << 5 gives the inter-block contribution. vs2 is seeded with s2 (x86) or added via s2 += (ARM). Both produce s2_new = s2_old + BlockSize*blocks*s1 + Σ(BlockSize * prefix_byte_sums) + Σ(weighted_sums). ✅
  • Vector256/512 (vs3 pattern): vs1 starts at CreateScalar(s1) (not zero), vs3 accumulates old vs1 values including the initial s1. vs3 << 5 (or << 6 for 512) produces the same inter-block contribution. Mathematically equivalent to the vps pattern. ✅
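The vps pattern traced above can be modeled end-to-end in scalar code. This sketch (assumed names, BLOCK = 32, hence the `<< 5`) checks the multi-block identity against the plain byte-at-a-time recurrence:

```python
BLOCK = 32

def scalar(s1, s2, data):
    for b in data:
        s1 += b
        s2 += s1
    return s1, s2

def vps_model(s1, s2, data):
    # Model of the vectorized accumulators (modular reduction omitted)
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    vps = s1 * len(blocks)   # seeded with s1 * block count
    vs1 = 0                  # running byte sum across blocks
    weighted = 0             # per-block position-weighted sums
    for blk in blocks:
        vps += vs1           # accumulate the OLD vs1 before updating it
        vs1 += sum(blk)
        weighted += sum((BLOCK - i) * b for i, b in enumerate(blk))
    s2 += (vps << 5) + weighted   # vps << 5 == BLOCK * vps
    s1 += vs1
    return s1, s2

data = bytes((i * 7 + 3) & 0xFF for i in range(BLOCK * 4))
assert scalar(5, 9, data) == vps_model(5, 9, data)
```

The shift distributes the inter-block `BLOCK * s1` contributions in one multiply, matching the `s2_new` formula stated above.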

✅ Vector512 Weight Correction — Correct

The weights vector repeats 32..1 for both 256-bit halves. sumLo << 5 adds 32 * sum(lower_32_bytes) to compensate: the lower half needs weights 64..33 but gets 32..1, so adding 32 × byte_sum for each lower byte bridges the gap. The upper half already has the correct weights 32..1 for positions 32..63. ✅
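The weight-bridging argument is easy to verify numerically. This hypothetical model applies weights 32..1 to both 32-byte halves and adds `32 * sum(lower half)`, which must equal the true weights 64..1 over the full 64-byte block:

```python
data = bytes((i * 13 + 5) & 0xFF for i in range(64))

# Ground truth: weights 64..1 across all 64 positions
true_weighted = sum((64 - i) * b for i, b in enumerate(data))

lo, hi = data[:32], data[32:]
w32 = lambda blk: sum((32 - i) * b for i, b in enumerate(blk))

# Repeated 32..1 weights on both halves, plus the 32*sum(lo) correction
corrected = w32(lo) + w32(hi) + (sum(lo) << 5)   # sumLo << 5 == 32 * sum(lo)

assert corrected == true_weighted
```

Per lower-half position i, the needed weight 64-i decomposes as (32-i) + 32, which is exactly what the correction supplies.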

✅ Vector128 Non-SSSE3 Fallback — Correct and overflow-safe

The widening fallback sums widened ushort values from four 8-element vectors. Each ushort lane accumulates at most 4 byte values (max 4 × 255 = 1020), well within ushort range. WeightedSumWidening128 produces int16 products (max 255 × 32 = 8160 < 32767) then widens to int32 for safe accumulation. ✅
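The bounds cited above, spelled out as a worst-case model (all-0xFF input, the maximum byte value):

```python
# Four widened byte vectors accumulated lane-wise into ushort lanes
lanes = [0] * 8
for _ in range(4):              # four 8-element vectors per accumulation
    for lane in range(8):
        lanes[lane] += 0xFF     # worst-case byte value
assert max(lanes) == 1020 <= 0xFFFF    # fits a ushort lane with room to spare

# Largest byte * weight product held as int16 before widening to int32
assert 0xFF * 32 == 8160 <= 0x7FFF
```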

✅ Endianness Guard — Appropriate

The BitConverter.IsLittleEndian check before vectorized paths is correct. Vector.LoadUnsafe maps memory bytes to vector lane indices in memory order, but the position-weighted tap vectors assume little-endian lane numbering (byte[0] at lane 0). On big-endian platforms, this mapping may differ, so falling back to scalar is the safe choice. ✅

✅ AVX2 Lane Independence — Verified

Avx2.MultiplyAddAdjacent (VPMADDUBSW) operates on 128-bit lanes independently. The weight vector [32..17 | 16..1] correctly places weights within each lane: lower lane processes bytes 0..15 with weights 32..17, upper lane processes bytes 16..31 with weights 16..1. No cross-lane issue. ✅
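A toy model (assumed semantics) of the per-lane weight placement: each 128-bit lane multiplies its own 16 bytes by its own 16 weights, so the `[32..17 | 16..1]` layout yields the true weights 32..1 across positions 0..31:

```python
def per_lane_weighted(bytes32, weights32):
    # Process each 16-byte lane with its own weights, like VPMADDUBSW
    total = 0
    for lane in range(2):
        b = bytes32[lane * 16:(lane + 1) * 16]
        w = weights32[lane * 16:(lane + 1) * 16]
        total += sum(bi * wi for bi, wi in zip(b, w))
    return total

data = bytes((i * 11 + 1) & 0xFF for i in range(32))
weights = list(range(32, 16, -1)) + list(range(16, 0, -1))  # [32..17 | 16..1]

assert per_lane_weighted(data, weights) == sum((32 - i) * b for i, b in enumerate(data))
```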

✅ Test Coverage — Thorough

Tests cover: (1) various lengths hitting all vector width boundaries and tail transitions, (2) all-0xFF stress testing for accumulator overflow safety, (3) incremental append with varying chunk sizes, all validated against a clean reference implementation that applies modular reduction per byte. Good use of [Theory] with [InlineData]. ✅
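The incremental-append property exercised by point (3) can be sketched with the stdlib's running-seed form of Adler-32, which models Append semantics: hashing in chunks of any size must match hashing in one shot.

```python
import zlib

data = bytes((i * 31 + 7) & 0xFF for i in range(1000))
one_shot = zlib.adler32(data)

for chunk in (1, 3, 16, 33, 100):
    h = zlib.adler32(b"")          # initial Adler-32 state (1)
    for i in range(0, len(data), chunk):
        h = zlib.adler32(data[i:i + chunk], h)   # feed the running state back in
    assert h == one_shot
```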

💡 Missing benchmark data

The PR description doesn't include benchmark numbers. While the algorithmic improvement is well-established from zlib implementations, having BenchmarkDotNet results would help quantify the speedup for the .NET implementation specifically. This is a nice-to-have for the PR description, not a blocker — the approach is proven.

@stephentoub
Member Author

@EgorBot -amd -intel -arm

using System.IO.Hashing;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

[MemoryDiagnoser]
public class Bench
{
    private byte[] _bytes;

    [Params(10, 100, 1000, 10_000, 100_000)]
    public int Size { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        _bytes = new byte[Size];
        for (int i = 0; i < Size; i++) _bytes[i] = (byte)('a' + (i % 26));
    }

    [Benchmark]
    public uint Adler() => Adler32.HashToUInt32(_bytes);
}

@stephentoub stephentoub marked this pull request as ready for review February 14, 2026 17:08
Copilot AI review requested due to automatic review settings February 14, 2026 17:08
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

@stephentoub
Copy link
Member Author

On my machine with AVX2...

Before:

| Method | Size  | Mean         |
|--------|-------|-------------:|
| Adler  | 1     |     2.685 ns |
| Adler  | 10    |     5.039 ns |
| Adler  | 1000  |   477.781 ns |
| Adler  | 10000 | 4,743.580 ns |

After:

| Method | Size  | Mean       |
|--------|-------|-----------:|
| Adler  | 1     |   2.924 ns |
| Adler  | 10    |   5.567 ns |
| Adler  | 1000  |  35.518 ns |
| Adler  | 10000 | 280.908 ns |

using System.IO.Hashing;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

[MemoryDiagnoser]
public class Bench
{
    private byte[] _bytes;

    [Params(1, 10, 1000, 10_000)]
    public int Size { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        _bytes = new byte[Size];
        for (int i = 0; i < Size; i++)
        {
            _bytes[i] = (byte)('a' + (i % 26));
        }
    }

    [Benchmark]
    public uint Adler() => Adler32.HashToUInt32(_bytes);
}

