Out of the cave.

We trained models in the dark. Loss numbers trickled through a Python interpreter, minutes late. The GPU spiked and stalled behind a wall we couldn't see through. We stared at nvidia-smi and hoped.

We learned to call this normal.

Then one day, we walked out.

Light — a live dashboard drawing loss curves as they happen, GPU and VRAM breathing in real time. Speed — the same CUDA kernels, but the GPU never starves. Nothing sits between you and the metal. Clarity — every epoch lands the instant it finishes. No buffering. No hoping.

You can't go back.

floDl is a native deep learning framework. Rust on libtorch.
It doesn't replace your GPU — it gets out of its way.


Terminal
$ curl -sL https://flodl.dev/init.sh | sh -s my-project
$ cd my-project
$ make run         # first build downloads libtorch (~5 min)

Prefer native? Install Rust, then: curl -sL https://raw.githubusercontent.com/fab2s/floDl/main/download-libtorch.sh | sh


What you get

One command scaffolds a complete project. Edit src/main.rs and train.

Annotated training template

A working model with graph builder, optimizer, scheduler, gradient clipping, and monitor. Every line has a PyTorch comment.

Docker setup included

CPU and CUDA Dockerfiles, docker-compose.yml, and a Makefile with make build, make test, make run.

Fast rebuild cycle

floDl is pre-optimized at opt-level = 3 in dev builds. After the first compile, your code rebuilds in ~2s.
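The usual Cargo mechanism behind this split is a per-package profile override: dependencies get full optimization once, while your own crate stays on a fast debug profile. A sketch of that configuration (the standard Cargo pattern, not necessarily floDl's exact manifest):

```toml
# Dependencies (framework, torch bindings, etc.) are compiled with full
# optimizations even in dev builds; they are built once and then cached.
[profile.dev.package."*"]
opt-level = 3

# Your own crate keeps the default dev settings, so edits to src/main.rs
# rebuild in seconds instead of recompiling the whole dependency tree.
[profile.dev]
opt-level = 0
```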


Build your model

The fluent graph builder reads as data flow. Residuals, parallel branches, loops, and routing compose cleanly.

src/main.rs
use flodl::*;

let model = FlowBuilder::from(Linear::new(4, 32)?)
    .through(GELU)                   // activation
    .through(LayerNorm::new(32)?)    // normalization
    .also(Linear::new(32, 32)?)      // residual: output = input + Linear(input)
    .through(Linear::new(32, 1)?)    // output projection
    .build()?;

// That's a trainable model. Train it like PyTorch:
let params = model.parameters();
let mut optimizer = Adam::new(&params, 0.001);
model.set_training(true);

for epoch in 0..num_epochs {
    optimizer.zero_grad();
    let pred = model.forward(&input)?;
    let loss = mse_loss(&pred, &target)?;
    loss.backward()?;
    clip_grad_norm(&params, 1.0)?;
    optimizer.step()?;
}

Why floDl

A Rust-native framework for researchers who care about what happens between the GPU kernels. Read the trajectory thesis.

Deterministic memory

Tensor memory freed by Drop the instant it leaves scope. No GC, no finalizers, no VRAM budget heuristics. Five phases of memory management replaced by impl Drop for Tensor.
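The mechanism here is ordinary Rust RAII. The sketch below illustrates it with a mock `Tensor` and an allocation counter standing in for VRAM — plain stdlib Rust, not floDl's actual types:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Mock "device memory" counter standing in for VRAM.
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

struct Tensor {
    bytes: usize,
}

impl Tensor {
    fn zeros(bytes: usize) -> Self {
        ALLOCATED.fetch_add(bytes, Ordering::SeqCst);
        Tensor { bytes }
    }
}

impl Drop for Tensor {
    // Runs deterministically at scope exit: no GC, no finalizer queue.
    fn drop(&mut self) {
        ALLOCATED.fetch_sub(self.bytes, Ordering::SeqCst);
    }
}

fn allocated() -> usize {
    ALLOCATED.load(Ordering::SeqCst)
}

fn main() {
    {
        let _activations = Tensor::zeros(1024);
        assert_eq!(allocated(), 1024); // alive inside the scope
    } // Drop fires here, the instant the binding leaves scope
    assert_eq!(allocated(), 0); // memory reclaimed immediately
    println!("freed at scope exit");
}
```

The same pattern applied to tensors wrapping device buffers is what makes VRAM usage trace the program's scopes exactly.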

Fluent graph builder

Describe architectures as readable data flow. from/through/also/split/merge/loop_body/gate/switch compose into anything — residual nets to mixture-of-experts.

Built-in training monitor

Live web dashboard with metrics charts, CPU/GPU/VRAM tracking, and ETA. Zero external dependencies. monitor.serve(3000) and open a browser.

PyTorch parity

Same libtorch GPU kernels. Same training loop pattern. Linear, Conv2d, LayerNorm, BatchNorm, GRU/LSTM cells, all standard losses and optimizers.

Wide GPU support

Links libtorch's stable C API, not a specific CUDA toolkit version. Older GPUs (Pascal, Maxwell) work out of the box — no version pinning required. If nvidia-smi runs, floDl can train on it.

Compile-time safety

Every fallible op returns Result<T>. The borrow checker prevents data races. No silent error propagation, no null-pointer crashes.
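The Result-based pattern can be sketched in plain Rust with a toy shape check — hypothetical names, not floDl's actual error types:

```rust
// Sketch: every fallible op returns Result, so errors surface at the
// call site instead of propagating silently.
#[derive(Debug)]
struct ShapeError(String);

// Returns the output shape of a matrix multiply, or an error if the
// inner dimensions do not match.
fn matmul(a: &[usize; 2], b: &[usize; 2]) -> Result<[usize; 2], ShapeError> {
    if a[1] != b[0] {
        return Err(ShapeError(format!("inner dims differ: {} vs {}", a[1], b[0])));
    }
    Ok([a[0], b[1]])
}

fn forward() -> Result<[usize; 2], ShapeError> {
    let h = matmul(&[8, 4], &[4, 32])?;  // ok: 4 == 4
    let out = matmul(&h, &[32, 1])?;     // ok: 32 == 32
    Ok(out)
}

fn main() {
    assert_eq!(forward().unwrap(), [8, 1]);
    // A mismatch is a value you must handle, not a crash:
    assert!(matmul(&[8, 4], &[3, 32]).is_err());
    println!("ok");
}
```

The `?` operator is what keeps the training loop readable while still forcing every failure path to be acknowledged.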

Easy setup

Docker workflow via init.sh for zero-install builds. Or go native: download-libtorch.sh auto-detects your GPU and sets up libtorch — just add Rust.


Up to 31% faster than PyTorch

10 models, 10 interleaved rounds, locked GPU clocks. floDl wins 8 of 10, ties 2, zero regressions.

Model            PyTorch     floDl       Delta
transformer      3183.0 ms   2199.8 ms   -31%
mlp              291.1 ms    207.0 ms    -29%
residual_tower   406.9 ms    309.7 ms    -24%
feedback_fixed   275.3 ms    231.3 ms    -16%
convnet          1298.0 ms   1298.2 ms    0%

RTX 5060 Ti, GPU at 3090 MHz. v0.2.2 vs PyTorch 2.10.0. The convnet tie shows both frameworks dispatch identical CUDA kernels — the speed gap elsewhere is pure framework overhead.

Live dashboard from an earlier FBRL letter model run (v0.1.1, GTX 1060) — 19% faster with the training monitor running.

Full benchmark report →