mt4g is a HIP‑based collection of microbenchmarks that explores the memory hierarchy of modern GPUs. It measures cache sizes, line sizes, latencies, resource‑sharing behaviour and more on both NVIDIA and AMD hardware, and emits automatically evaluated results as structured JSON.
See the sample_results folder for examples.
- Unified build system for NVIDIA (sm_XX) and AMD (gfxXXXX) targets
- Benchmarks for L1/L2/L3 caches, scalar caches, shared and main memory
- Optional NVIDIA‑specific constant, read‑only and texture cache tests
- Graph generation and raw timing export
- JSON output summarising all measured metrics
- HIP SDK with the `hipcc` compiler
- GPU drivers and runtime libraries
- `HIP_PATH` environment variable pointing to the HIP installation
- `GPU_TARGET_ARCH` set to the desired architecture (e.g. `sm_70`, `gfx90a`)
- Python 3 with the `matplotlib`, `pandas` and `numpy` packages for graph generation
The project has been verified with CUDA 12.8 and hipcc 6.3.3.
A suitable HIP environment can be obtained most easily via Spack:
spack install hip # for AMD targets
spack install hip+cuda # includes NVCC backend for NVIDIA targets
spack load hip       # sets HIP_PATH and exposes hipcc
Make sure to set HIP_PATH and CUDA_PATH when compiling for NVIDIA.
Choose the desired GPU architecture and invoke the build.
The following dependencies should be installed on the system:
- nlohmann-json
- cxxopts
To build and install mt4g, run
git clone https://github.com/caps-tum/mt4g.git
cd mt4g
mkdir build && cd build
cmake .. -DGPU_TARGET_ARCH=<sm_XX|gfxXXX>
# build options:
# -DCMAKE_BUILD_TYPE=<Release|Debug> -- to choose between release and debug builds
# -DCMAKE_INSTALL_PREFIX=<install_prefix> -- to set the install destination (default on UNIX platforms: /usr/local)
make all install -j $(nproc)

Afterwards, run the installed binary:

<install_prefix>/bin/mt4g [options]

Common options:
| Option | Description |
|---|---|
| `-d, --device-id <id>` | GPU device to use (default 0) |
| `-f, --file <name>` | Name of output files (default `<GPU_NAME>`) |
| `-g, --graphs` | Generate graphs for each benchmark |
| `-l, --location <path>` | Location of output files (default `.`) |
| `-o, --raw` | Write raw timing data |
| `-p, --report` | Create Markdown report in the output directory |
| `-r, --random` | Randomize P-Chase arrays |
| `-s, --stdout` | Dump final JSON result to stdout |
| `-q, --quiet` | Reduce console output |
| `--l1, --l2, --l3` | Run cache benchmarks for the selected levels |
| `--scalar, --shared, --memory` | Run scalar, shared and main memory tests |
| `--constant, --readonly, --texture` | NVIDIA-specific cache benchmarks |
| `--resourceshare` | Run resource-sharing benchmarks |
| `-v, --version` | Show version of mt4g |
| `-h, --help` | Show full help |
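For scripted use, `-s/--stdout` combines well with `-q/--quiet`, since the final JSON can be captured directly from stdout. A small Python sketch of such a wrapper (the helper functions, default flag choices and binary name `mt4g` on `PATH` are illustrative assumptions, not part of mt4g):

```python
import json
import subprocess

def mt4g_cmd(device=0, groups=("--l1", "--l2"), to_stdout=True, quiet=True):
    """Assemble an mt4g argument list; the defaults here are just examples."""
    cmd = ["mt4g", "-d", str(device), *groups]
    if to_stdout:
        cmd.append("-s")  # dump final JSON result to stdout
    if quiet:
        cmd.append("-q")  # reduce console output
    return cmd

def run_mt4g(**kwargs):
    """Run mt4g and parse the JSON it prints (requires the binary on PATH)."""
    out = subprocess.run(mt4g_cmd(**kwargs), check=True,
                         capture_output=True, text=True).stdout
    return json.loads(out)
```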
If no benchmark group is chosen, all available groups are executed. Unsupported groups are disabled automatically depending on the platform.
Make sure you have exclusive access to the GPU; otherwise the results are far less reliable.
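The latency benchmarks are based on pointer chasing (P-Chase), whose traversal order the `-r/--random` flag randomizes. A minimal CPU-side sketch of the access pattern in Python (illustrative only; the real benchmarks run this chain on the GPU and time each step):

```python
import random

def build_chase_array(n, randomize=True):
    """Build an index array where a[i] holds the next index to visit.

    With randomize=True the traversal is a random cycle over all n elements,
    which defeats hardware prefetchers; otherwise it is sequential.
    """
    order = list(range(n))
    if randomize:
        random.shuffle(order)
    a = [0] * n
    for i in range(n):
        a[order[i]] = order[(i + 1) % n]  # each element points to its successor
    return a

def chase(a, steps, start=0):
    """Follow the pointer chain; on a GPU, the latency of each step is timed."""
    idx = start
    for _ in range(steps):
        idx = a[idx]
    return idx
```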
By default, benchmark results are written as structured JSON to the file
<GPU_NAME>.json in the current working directory. The name and path of the
output file can be changed with the flags -f/--file and -l/--location
respectively. With -s/--stdout, the final JSON is dumped to stdout instead.
When graph, raw or report output is enabled, the files are written to a
directory named after the detected GPU. The --report flag writes a README.md
containing the JSON summary and embeds all generated graphs with links to the
raw data.
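Downstream tooling can consume the JSON output generically. The snippet below flattens nested results into dotted metric paths; the key names in the example document are made up for illustration (see the sample_results folder for the actual schema):

```python
import json

def flatten(node, prefix=""):
    """Flatten a nested mt4g JSON result into {dotted.path: value} pairs."""
    if isinstance(node, dict):
        items = {}
        for key, value in node.items():
            items.update(flatten(value, f"{prefix}{key}."))
        return items
    return {prefix[:-1]: node}

# Hypothetical snippet of a result file -- real key names may differ.
doc = json.loads('{"L1": {"Size": {"value": 131072, "unit": "B"}}}')
print(flatten(doc))
```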
include/ - Public headers and utilities
results/ - Available sample results
src/ - Benchmark implementation and CLI helpers
docs/ - Additional documentation
CMakeLists.txt - Build configuration
See docs/usage.md for a comprehensive description of the command line interface and docs/development.md for contribution guidelines.
Pre-measured results for selected GPUs live in the
results directory. If your hardware is not yet listed,
we would greatly appreciate additional reports: run the tool with
--raw --graphs --report and open a pull request to share your measurements.
Developed at the Chair for Computer Architecture and Parallel Systems (CAPS) at the Technical University of Munich. Originally authored by Dominik Größler, completely reworked by Manuel Walter Mußbacher and currently maintained by Stepan Vanecek.
- L2 segment size measurements on AMD GPUs are currently unreliable due to the platform's complex cache behaviour.
- Constant L1.5 cache size detection is capped at 64 KiB. Sizes above the cap are reported as 64 KiB + 1 with confidence = 0.
- Bandwidth results are not optimal because we do not yet determine an optimal number of blocks dynamically.
- Cache line size detection uses a heuristic approach and is therefore not guaranteed to be correct.
- Detecting whether the constant L1 cache is shared with L1 is not very reliable. As a hotfix, we repeat the measurement 10 times and report "not shared" if any single run is unsuccessful. We are working on a cleaner solution.
- Parallel builds fail if dependencies were not fetched beforehand.
- Runs only on Linux.
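The repeat-and-veto hotfix for the constant-L1 sharing test mentioned above can be sketched as follows (`measure` is a stand-in for the actual benchmark run, not an mt4g API):

```python
def sharing_hotfix(measure, repeats=10):
    """Report 'shared' only if all `repeats` runs agree.

    A single unsuccessful run is enough to report 'not shared' -- a
    deliberately conservative heuristic.
    """
    return all(measure() for _ in range(repeats))
```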
This project is licensed under the Apache License 2.0.
| Cache | L1 | L2 | RO | TXT | C1 | C1.5 | SM | M |
|---|---|---|---|---|---|---|---|---|
| Size | Yes | API, Seg. | Yes | Yes | Yes | Yes | API | API |
| Line Size | Yes | Yes | Yes | Yes | Yes | Yes | – | – |
| Fetch Gran. | Yes | Yes | Yes | Yes | Yes | Yes | – | – |
| Latency | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Count | Yes | Yes, Seg. | Yes | Yes | Yes | No | – | – |
| Miss Penalty | Yes | Yes | Yes | Yes | Yes | No | – | – |
| Bandwidth | No | R/W | No | No | No | No | No | R/W |
| Shared With | RO, C1, TXT | – | L1, TXT | L1, RO | – | – | – | – |
| Cache | vL1d | L2 | L3 | sL1d | SM | M |
|---|---|---|---|---|---|---|
| Size | Yes | API, Seg. | API | Yes | API | API |
| Line Size | Yes | API, FB | API | Yes | – | – |
| Fetch Gran. | Yes | Yes | No | Yes | – | – |
| Latency | Yes | Yes | No | Yes | Yes | Yes |
| Count | Yes | API | API | Uni. | – | – |
| Miss Penalty | Yes | Yes | No | Yes | – | – |
| Bandwidth | No | R/W | R/W | No | No | R/W |
| Shared With | CU | – | – | – | – | – |
- Seg. = Segment
- Uni. = Unique
- R/W = Read bandwidth and write bandwidth
- FB = Fallback benchmark implemented
- API = HIP device properties / HSA / AMDGPU KFD kernel module