A curated list of efficient on-device AI systems, including practical inference engines, benchmarks, and state-of-the-art research papers for mobile and edge devices.
This repository bridges the gap between Systems Research (academic papers) and Practical Deployment (engineering frameworks), focusing on optimizing ML models (e.g., LLMs, VLMs, and ViTs) on resource-constrained hardware.
- 🚀 Inference Engines
- 📝 Research Papers
## 🚀 Inference Engines

Frameworks and runtimes designed for deploying models on edge devices.
- LiteRT (formerly TensorFlow Lite) - Google's framework for on-device inference.
- ExecuTorch - PyTorch’s end-to-end solution for enabling on-device AI.
- ONNX Runtime - Cross-platform inference engine for ONNX models.
- MNN - Lightweight deep learning framework by Alibaba.
- NCNN - High-performance NN inference framework by Tencent.
- llama.cpp - LLM inference in C/C++ with minimal dependencies.
- MLC LLM - Universal solution for deploying LLMs on any hardware (based on TVM).
- mllm - A fast and lightweight LLM inference engine for mobile and edge devices.
- OmniInfer - High-performance, on-device VLM inference with hybrid NPU acceleration.
- RunAnywhere - Open-source SDK for running LLMs and multimodal models on-device across iOS, Android, and cross-platform apps.
- Qualcomm QNN - Qualcomm AI Stack for Snapdragon NPUs/DSPs.
- Apple Core ML - Framework to integrate ML models into iOS/macOS apps.
- FluidAudio - Local audio AI SDK for Apple platforms with ASR, speaker diarization, VAD, and TTS optimized for Apple Neural Engine.
- NVIDIA TensorRT - SDK for high-performance deep learning inference on NVIDIA GPUs (including Jetson).
- Intel OpenVINO - Toolkit for optimizing and deploying AI inference on Intel hardware (CPU/GPU/NPU).
- MediaTek NeuroPilot - AI ecosystem and SDK for MediaTek NPUs.
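Nearly all of these engines rely on low-bit weight quantization to fit models into mobile memory budgets. Below is a minimal sketch of symmetric per-tensor INT8 quantization to illustrate the core arithmetic; the function names are hypothetical, and production engines (and methods like AWQ) typically use per-channel or per-group scales plus activation-aware adjustments.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map max |w| to the INT8 range."""
    scale = np.abs(w).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale
```

The per-element reconstruction error is bounded by half the scale, which is why finer-grained (per-channel/per-group) scales are the norm in real deployments.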
## 📝 Research Papers

Note: Some of these works target inference acceleration on cloud/server infrastructure, which offers far more computational resources; they are included here when their techniques could plausibly generalize to on-device inference.
- [MLSys 2025] MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
- [MLSys 2025] TurboAttention: Efficient Attention Approximation for High Throughputs LLMs
- [ASPLOS 2023] FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks
- [NeurIPS 2022] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- [SenSys 2026] LLM as a System Service on Mobile Devices
- [EuroSys 2026] Scaling LLM Test-Time Compute with Mobile NPU on Smartphones
- [SOSP 2025] Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference
- [ASPLOS 2025] Neuralink: Fast On-Device LLM Inference with Neuron Co-Activation Linking
- [ASPLOS 2025] Fast On-device LLM Inference with NPUs
- [arXiv 2024] PowerInfer-2: Fast Large Language Model Inference on a Smartphone
- [MLSys 2025] TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
- [ASPLOS 2024] SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile
- [ASPLOS 2024] SoD2: Statically Optimizing Dynamic Deep Neural Network Execution
- [MICRO 2023] Improving Data Reuse in NPU On-chip Memory with Interleaved Gradient Order for DNN Training
- [MICRO 2022] GCD2: A Globally Optimizing Compiler for Mapping DNNs to Mobile DSPs
- [PLDI 2021] DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion
- [MLSys 2024] AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
- [ISCA 2023] OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization
- [MobiSys 2025] ARIA: Optimizing Vision Foundation Model Inference on Heterogeneous Mobile Processors for Augmented Reality
- [PPoPP 2024] Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous SoCs
- [MobiSys 2024] Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs
- [MobiCom 2024] Perceptual-Centric Image Super-Resolution using Heterogeneous Processors on Mobile Devices
- [SenSys 2023] Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU
- [MobiSys 2023] NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors
- [ATC 2023] Decentralized Application-Level Adaptive Scheduling for Multi-Instance DNNs on Open Mobile Devices
- [IPSN 2023] PointSplit: Towards On-device 3D Object Detection with Heterogeneous Low-power Accelerators
- [SenSys 2022] BlastNet: Exploiting Duo-Blocks for Cross-Processor Real-Time DNN Inference
- [MobiSys 2022] Band: Coordinated Multi-DNN Inference on Heterogeneous Mobile Processors
- [MobiSys 2022] CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices
- [ICCAD 2025] Mitigating Resource Contention for Responsive On-device Machine Learning Inferences
- [RTSS 2024] FLEX: Adaptive Task Batch Scheduling with Elastic Fusion in Multi-Modal Multi-View Machine Perception
- [MobiCom 2024] Panopticus: Omnidirectional 3D Object Detection on Resource-constrained Edge Devices
- [MobiSys 2023] OmniLive: Super-Resolution Enhanced 360° Video Live Streaming for Mobile Devices
- [MobiSys 2023] HarvNet: Resource-Optimized Operation of Multi-Exit Deep Neural Networks on Energy Harvesting Devices
- [MobiCom 2022] NeuLens: Spatial-based Dynamic Acceleration of Convolutional Neural Networks on Edge
- [MobiCom 2021] Flexible high-resolution object detection on edge devices with tunable latency
- [ASPLOS 2025] Nazar: Monitoring and Adapting ML Models on Mobile Devices
- [SenSys 2024] AdaShadow: Responsive Test-time Model Adaptation in Non-stationary Mobile Environments
- [SenSys 2023] EdgeFM: Leveraging Foundation Model for Open-set Learning on the Edge
- [MobiCom 2023] Cost-effective On-device Continual Learning over Memory Hierarchy with Miro
- [MobiCom 2023] AdaptiveNet: Post-deployment Neural Architecture Adaptation for Diverse Edge Environments
- [MobiSys 2023] ElasticTrainer: Speeding Up On-Device Training with Runtime Elastic Tensor Selection
- [SenSys 2023] On-NAS: On-Device Neural Architecture Search on Memory-Constrained Intelligent Embedded Systems
- [MobiCom 2022] Mandheling: Mixed-Precision On-device DNN Training with DSP Offloading
- [MobiSys 2022] Memory-Efficient DNN Training on Mobile Devices
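Several of the attention papers above (e.g., FlashAttention, MAS-Attention) build on tiled, IO-aware attention with an online softmax, so the full score matrix is never materialized in slow memory. The toy NumPy sketch below illustrates only the online-softmax recurrence, not any paper's actual kernel: scores are computed one key/value block at a time, and a running max and denominator rescale the partial output.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference: materializes the full (n_q, n_k) score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Online-softmax tiling: process K/V in blocks of `block` rows."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    m = np.full(n, -np.inf)   # running row-wise max of scores
    l = np.zeros(n)           # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                 # partial scores for this block
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])
        alpha = np.exp(m - m_new)              # rescale previous accumulators
        l = l * alpha + P.sum(axis=1)
        out = out * alpha[:, None] + P @ Vj
        m = m_new
    return out / l[:, None]
```

The tiled version is numerically identical to the naive one; the systems papers' contribution is mapping this recurrence onto SRAM/scratchpad hierarchies and edge accelerators efficiently.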