Implementation of FlashAttention-2 for Nvidia Tesla V100

Updated Mar 22, 2026 · CUDA
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.
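The MHA/MQA/GQA variants mentioned above differ only in how many key/value heads the query heads share. As a rough illustration (not this repository's kernel, which runs on CUDA cores), here is a minimal NumPy sketch of one decode step, where a single new query token attends over the cached keys and values; `gqa_decode` and its parameter names are hypothetical:

```python
import numpy as np

def gqa_decode(q, K_cache, V_cache, n_kv_heads):
    """One decode step of grouped-query attention.
    q: (n_q_heads, d) -- query for the single new token, per head.
    K_cache, V_cache: (n_kv_heads, t, d) -- cached keys/values.
    n_kv_heads == n_q_heads -> MHA; n_kv_heads == 1 -> MQA; otherwise GQA.
    """
    n_q_heads, d = q.shape
    group = n_q_heads // n_kv_heads            # query heads per K/V head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        K = K_cache[h // group]                # shared K/V head for this group
        V = V_cache[h // group]
        s = K @ q[h] / np.sqrt(d)              # (t,) scores vs. all cached keys
        s -= s.max()                           # stabilize softmax
        w = np.exp(s)
        w /= w.sum()
        out[h] = w @ V                         # (d,) attended value
    return out
```

In the decoding stage the query length is 1, so the work is dominated by streaming the K/V cache, which is why decode-specialized kernels are worthwhile.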
⚡ Optimize attention mechanisms with FlashMLA, a library of advanced sparse and dense kernels for DeepSeek models, improving performance and efficiency.