Introducing nanochat: The best ChatGPT that $100 can buy. #1
Replies: 64 comments · 19 replies

- What does the performance look like if, say, it was scaled down to FP4 on an RTX Blackwell card?
- Thank you for all your work ❤️
- I love this work
- I am teaching Rustaceans to use Rust in ML right now at the Ukrainian Rustcamp, and this code will be useful for them. Thanks for open-sourcing it!
- Can I make one with $10?????
- Wow, thank you!
- Incredible work! @karpathy you keep pushing the bar higher.
- Awesome contribution to the open-source community.
- Thank you for this. Your code & teaching are very much appreciated.
- I pray to the LLM god Karpathy (*>人<)
- Thank you for openly sharing your knowledge!
- Thank you a lot. It would be great if you could share the trained model, since not everyone will be able to spend $100 on this; that way people could at least test the inference.
- Awesome! Now I need to run it on my old PC.
- Incredible work; I've been in love with your from-scratch implementations forever. A great service to the open-source research community! 👏 A couple of clarifications that might help readers, and a few questions:
- OMG, I am going to love reading through this. Thank you so much!
- Can someone post a step-by-step procedure for setting up:
- Thankful to have you, and thanks for sharing; it works great.
- Amazing! I am already fine-tuning my own model. Thank you!
- Sure, sure, I'll give it a try. Thank you for the reminder. Your project is a great one. I'm a media person and an enthusiast in the field of large models, and I've decided to help you promote it. You deserve this promotion.
  > Replying to Valters: "For fun? Yes, use a cheap GPU provider/marketplace. I got a 5090 at ~$0.35/hr, batch size 16, depth 16 instead of 20, add weight checkpointing; 10-30 hours of training should get decent results (got 85% MFU)."
- Love this! Thank you so much!! Side note: "If yesterday was Friday, then tomorrow will be Saturday." It should be Sunday.
- Sure. I'm currently a freshman in college and a big fan of large models, and I really like your open-source project. Could I ask you for some advice on learning about large models? (replying to the comment above)
- Wowwwww, thank you.
- Wow, can I mix the Tinker fine-tuning API by Thinking Machines with nanochat to maximise their use?
- Cool
- I ran into a problem with two nearly identical machines, each with a single 4090, where only one was training correctly. On the working one the samples were sentence-ish after an hour and the loss approached 4.0. The other one's loss plateaued around 7.8 after many hours, and its samples never got better than repeating "the" and "," characters. The environments are identical as far as I can tell. I tried a shotgun of random fixes and eventually found that disabling AMP and fused attention fixed things on the "broken" machine. I'm not sure if there is something wrong with my system, if there is a subtle bug in the training code, or if this is just an unlucky quirk of numerical precision failing in exactly the right way to tank training on this one box. But if your training loss gets stuck, maybe try turning these off. (Disabling them lowers the speed by about 20 to 25 percent.) UPDATE: If I try models of different depths, some do stall out in the same way on the "working" system. All depth models do train correctly setting just …
- I love this
- Love it!
- Has anyone attempted this on an NVIDIA DGX Spark? It will be slower, but it should work similarly to single-GPU mode.
- A minor correction: each shard is a simple parquet file of about 0.25**B** characters ...
- There's gotta be a better way to model a conversation than tokens delimiting the speaker. Skip to a new rotary embedding cycle (in addition), or something?
---
Ok so we just booted up an 8xH100 box from e.g. Lambda GPU Cloud. This is costing us about ~$24/hr, so there is no time to lose.
Environment setup
Clone the project:
We wish to train the best ChatGPT that $100 can buy, which we call a "speedrun". Reference the script `speedrun.sh`, which is designed to run right away on a blank box, start to end. However, in this post I will step through it part by part so that I can comment in detail on all of its sections. We first have to make sure the new & hot uv project manager is installed: install uv, create a new virtual environment in `.venv`, get all the dependencies, and activate the environment so that when we type `python` we're using the virtual-env Python, not the system Python.

Next, we need to install Rust/Cargo so that we can compile our custom Rust tokenizer. I know, it's a bit much to have a new/custom tokenizer, but unfortunately the Python version in my earlier minbpe project is way too slow and the huggingface tokenizers library is too bloated and confusing. So we have our own new tokenizer for training (tested to match the Python implementation), but we will still use OpenAI's tiktoken for efficient inference. So here we go and compile our tokenizer.
Train the tokenizer
Next we need the pretraining data so that we can 1) train the tokenizer and 2) pretrain the model. The pretraining data is just the text of a lot of webpages, and for this part we will use the FineWeb-EDU dataset. Normally we'd be able to just use huggingface `datasets.load_dataset()`, but I didn't like how heavy and bloated it is and how it obscures some very simple logic, so I re-packaged the entire dataset into simple, fully shuffled shards that we can easily and efficiently access at will, and re-uploaded the sample-100B version of it as karpathy/fineweb-edu-100b-shuffle. On that page you can also preview example text from the dataset. Each shard is a simple parquet file of about 0.25B characters and takes up about 100MB on disk (gzip compressed). There are 1822 shards in total, but we only need 240 of them to train a depth=20 model (more on this later). So let's download all of the data now. This is a ~24GB download, but it's usually fairly zippy on a cloud box; everything goes into `~/.cache/nanochat` by default.

Once the download is done, let's train our tokenizer, which translates back and forth between strings and sequences of symbols from a codebook. By default we train a vocab size of 2**16 = 65,536 tokens (a nice number), of which a few are reserved as special tokens (to be used for the chat schema later). The training set is 2B characters, which only takes ~1 minute. The training algorithm is identical to the one used by OpenAI (regex splitting, byte-level BPE); see my video on tokenization for a lot more information. Right after, we can evaluate the tokenizer. The evaluation tells us that we're achieving a compression ratio of about 4.8, meaning 4.8 characters of original text become 1 token on average. We can also compare to the GPT-2 and GPT-4 tokenizers. Compared to GPT-2 (which has 50,257 tokens), ours is much better across the board at compressing text, except for math by a little bit:
We're not doing so hot compared to GPT-4, but you have to keep in mind that GPT-4 has a much larger vocab size (100,277). In particular, GPT-4 is a lot better at multilingual text (FineWeb has a very strong focus on English, so that makes sense!), and also at code and math. Still, we actually beat GPT-4 by a tiny bit on FineWeb itself even with our smaller vocab size, because that's the dataset we actually trained on, so our tokenizer matches that document distribution very well (e.g. we get an edge on compressing English).
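To make the compression ratio concrete, here's a tiny way to measure "characters per token" yourself. I'm using tiktoken's GPT-4 tokenizer as a stand-in since it's pip-installable; swap in the nanochat tokenizer to reproduce the numbers above.

```python
# Sketch: measure the compression ratio (characters per token) of a tokenizer
# on some sample text. Uses tiktoken as a stand-in for the nanochat tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 tokenizer

sample = (
    "The pretraining data is just the text of a lot of webpages, "
    "and for this part we will use the FineWeb-EDU dataset."
)
tokens = enc.encode(sample)
ratio = len(sample) / len(tokens)
print(f"{len(sample)} chars -> {len(tokens)} tokens, ~{ratio:.2f} chars/token")
```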
Pretraining
Before we kick off pretraining, we need to download one more file that I call the "eval bundle". During pretraining, the script will periodically evaluate the CORE metric. You can see the details in the DCLM paper, but essentially it is a nice, normalized, broad measure of how good the model is at autocompletion across a large number of datasets: HellaSwag, Jeopardy, BigBench QA Wikidata, ARC-Easy/Challenge, COPA, CommonsenseQA, PIQA, LAMBADA, Winograd, BoolQ, etc. (22 in total). Download, unzip, and place the eval bundle directory into the base directory as `~/.cache/nanochat/eval_bundle`:

```bash
curl -L -o eval_bundle.zip https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip
unzip -q eval_bundle.zip
rm eval_bundle.zip
mv eval_bundle "$HOME/.cache/nanochat"
```

One more setup step I'd advise (though it's optional) is to set up wandb so you get nice plots during training. uv already installed wandb for us up above, but you still have to create an account and log in with `wandb login`.
We can now kick off pretraining! This is the most computationally heavy part, where we are training the LLM to compress internet web text by predicting the next token in the sequence, and where the LLM gains a lot of knowledge about the world:
Here we are launching training on 8 GPUs via the `scripts/base_train.py` script. We're training a Transformer with 20 layers. By default, each GPU processes 32 rows of 2048 tokens per forward/backward, for a total of 32 * 2048 * 8 = 2**19 = 524,288 ~= 0.5M tokens per step of the optimization. If you have wandb set up, append `--run=speedrun` (all training scripts accept it) to set the run name and log to it. When you launch training, you'll see something like this (stripping a bunch of output for brevity):

We see that the Transformer has 1280 channels and 10 heads in attention, each of dim 128. It has ~560M parameters. To meet the Chinchilla scaling law recommendation, this means we want 560M x 20 ~= 11.2B tokens to train on. As each step of the optimization covers 524,288 tokens, this means 11.2B / 0.5M ~= 21,400 iterations. Taking the estimated number of FLOPs per token and multiplying by the total number of tokens tells us that this will be a ~4e19 FLOPs capability model. The learning rate is automatically scaled down as 1/sqrt(dim), since larger models prefer smaller learning rates. We're using Muon to optimize the matrices and AdamW to optimize the embedding and unembedding; there are no other trainable parameters (biases, rmsnorm params, etc.) in this model. Training periodically reports the "Validation bpb", which is bits per byte on the validation dataset. Bits per byte is a much better measure than the typical raw cross-entropy loss because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant: whether your tokenizer has a small vocab or a big one, this number is comparable.
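To sanity-check the arithmetic above, here's a small back-of-the-envelope script. The 6 * N * D FLOPs rule of thumb and the ~4.8 bytes/token figure are approximations I'm using for illustration; the training script may compute its estimates differently.

```python
import math

# Token budget per optimizer step: 8 GPUs x 32 rows x 2048 tokens.
tokens_per_step = 8 * 32 * 2048                 # = 524,288 ~= 0.5M

# Chinchilla-style budget: ~20 tokens per parameter.
n_params     = 560e6
total_tokens = 20 * n_params                    # ~= 11.2B
iterations   = total_tokens / tokens_per_step   # ~= 21,400

# Rough compute estimate with the common 6*N*D approximation.
flops = 6 * n_params * total_tokens             # ~= 3.8e19, i.e. "~4e19"

# Bits per byte from a cross-entropy loss (in nats), normalized by how many
# bytes the predicted tokens cover; ~4.8 bytes/token comes from the
# tokenizer's compression ratio measured earlier. For example, a loss of
# 2.70 nats/token works out to ~0.81 bpb at that ratio.
def bits_per_byte(ce_nats_per_token, bytes_per_token=4.8):
    return ce_nats_per_token / (math.log(2) * bytes_per_token)

print(f"tokens/step = {tokens_per_step:,}")
print(f"iterations  ~= {iterations:,.0f}")
print(f"FLOPs       ~= {flops:.1e}")
print(f"CE of 2.70 nats/token -> {bits_per_byte(2.70):.2f} bpb")
```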
Notice that each step takes about 0.5s, `lrm` is the learning rate decay multiplier (it linearly ramps down to 0 near the end of training), and the reported MFU (model flops utilization) looks good at almost half, meaning we are utilizing a lot of the bfloat16 compute available to us.

We now wait about 3 hours for 4e19 FLOPs to elapse... You should see something like this in your wandb plots:
bpb going down over time is good (the model is predicting the next token more accurately). In addition, the CORE score is going up. Instead of just approximated metrics, we can evaluate the model more fully as:
We see that we reach train/val bits per byte (bpb) of ~0.81 and the CORE metric goes up to 0.22. For comparison, the eval bundle contains the GPT-2 model CORE scores: a CORE of 0.22 is a little more than GPT-2 large (at 0.21) but a little less than GPT-2 xl (i.e. "the" GPT-2, at 0.26). The model at this point is a fancy autocomplete, so we can run a few prompts to get a sense of the knowledge stored in it. The file `base_loss.py` runs these prompts and prints their completions.
So the model knows about Paris (France), that Au is gold, that Saturday follows Friday, that "cold" is the opposite of "hot", and even the planets of the solar system. However, it's not so sure about the color of the sky yet, or how to do simple math. Still, not too bad for a model trained for $72. The inference uses a custom `Engine` class, which uses KV caching for efficient inference, along with a simple implementation of the two common inference stages: prefill and decode. Our Engine class also supports tool use (a Python interpreter), which will be useful when training on GSM8K (more on that later).

Midtraining
Next up is midtraining, which further finetunes the model on smol-SmolTalk. Everything is algorithmically identical to pretraining, but the dataset now becomes conversations, and the model adapts itself to the new special tokens that now structure multi-turn Conversation objects. Each conversation now looks something like this, loosely following the OpenAI Harmony chat format:
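For illustration, here's roughly what such a rendering looks like in code. Note the special-token names below (<|bos|>, <|user_start|>, etc.) are placeholders I chose for the sketch, not necessarily nanochat's exact token names.

```python
# Illustrative only: render a multi-turn conversation into a single token stream
# using <|...|>-style special tokens, in the spirit of the chat format above.
conversation = [
    {"role": "user",      "content": "What color is the sky?"},
    {"role": "assistant", "content": "Blue, on a clear day."},
    {"role": "user",      "content": "Why?"},
    {"role": "assistant", "content": "Rayleigh scattering."},
]

def render(conversation):
    parts = ["<|bos|>"]
    for turn in conversation:
        role = turn["role"]
        parts.append(f"<|{role}_start|>{turn['content']}<|{role}_end|>")
    return "".join(parts)

print(render(conversation))
```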
The tokens rendered in the <|example|> style are special tokens, following the format of OpenAI's special tokens. The midtraining stage is quite useful for a number of adaptations in the model:
Our midtraining mixture looks like this by default:
And we kick it off as follows:
This run only takes about 8 minutes, a lot shorter than pretraining at ~3 hours. Now that the model is a proper Chat model and it can take on the role of an Assistant answering User queries, we can evaluate it:
We get the following results for the model at this stage:
We see that:
I don't really have a nice graphic to illustrate this step, but here is an example of midtraining a different, bigger model earlier just to give you a sense of what it looks like for these metrics to go up during a finetuning run:
Supervised Finetuning (SFT)
Following midtraining is the Supervised Finetuning (SFT) stage. This is an additional round of finetuning on conversations, but ideally here you'd cherry-pick only the most beautiful/good data, and this is also where you'd do things like safety training (e.g. assistant refusals). Our model isn't even sure about the color of the sky, so we're probably safe on the biohazard side of things for now. One domain adaptation that happens here is that SFT stretches out rows of data and pads them, exactly mimicking the test-time format. In other words, examples are no longer randomly concatenated into long rows as they are in pre/mid-training (where that is done for training efficiency). Fixing this domain mismatch serves as another little "tightening the screws" boost. We can run SFT and re-evaluate:
This again only runs for about 7 minutes and you should notice a small bump in metrics:
Finally, we can take on the role of a User and talk to our model! We could have already done this after midtraining, but it's a bit nicer here. Talk to it either in your terminal window or via the web UI:
The `chat_web` script will serve the Engine using FastAPI. Make sure to access it correctly, e.g. on Lambda use the public IP of the node you're on followed by the port, for example http://209.20.xxx.xxx:8000/, etc.

That will gloriously look something like this:
It won't win any physics or poetry competitions anytime soon, but again, it seems cool how far we can go with so little budget, and the project is nowhere near fully tuned.
Reinforcement Learning (RL)
The final stage of the speedrun (though it is commented out by default) is Reinforcement Learning. RLHF is a nice way to gain a few percent of performance and mitigate a lot of model shortcomings that come from the sampling loop itself - e.g. hallucinations, infinite loops, etc. But at our scale these are not a major consideration. That said, of all the datasets we're working with so far, GSM8K is the one that has a clear/objective reward function (the correct answer to a math problem). So we can run the RL (/GRPO) script to hillclimb on the answers directly in a simple RL loop that interleaves sampling and training:
During RL, the model goes over all GSM8K problems in the training set, samples completions, then we reward them and train on the ones that got high rewards. We're using a highly simplified GRPO training loop: e.g. we don't use trust regions (we throw away the reference model and KL regularization), we are on-policy (we throw away the PPO ratios and clipping), we use GAPO-style normalization (token-level, not sequence-level), and the advantage is a simple shift of the reward by its mean (we throw away the z-score normalization, i.e. dividing by sigma). So we're left with something that looks a lot more like REINFORCE, but keeps the "GR" (group relative) part for calculating advantages from the rewards. It works OK given this scale and the simplicity of the task. See the script for more details.
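Here's a minimal sketch of that group-relative, REINFORCE-style objective. It's my own illustration in PyTorch with made-up shapes and names, not the repo's training loop:

```python
# Minimal sketch of the group-relative, REINFORCE-style objective described above.
# Shapes and names are illustrative; see the nanochat RL script for the real thing.
import torch

def grpo_like_loss(logprobs, mask, rewards):
    # logprobs: (G, T) log-probs of sampled completion tokens for G rollouts
    #           of the same prompt (the "group"); T = max completion length.
    # mask:     (G, T) 1.0 for real completion tokens, 0.0 for padding.
    # rewards:  (G,)   scalar reward per rollout (e.g. 1.0 if the answer is correct).

    # "Group relative" advantage: shift rewards by the group mean,
    # with no division by the std (no z-score normalization).
    advantages = rewards - rewards.mean()

    # On-policy REINFORCE: -advantage * logprob, normalized at the token level
    # (sum over all real tokens in the group, not per sequence).
    per_token = -advantages[:, None] * logprobs * mask
    return per_token.sum() / mask.sum()

# Toy numbers: 4 rollouts, up to 6 completion tokens each.
logprobs = -torch.rand(4, 6)                 # stand-in log-probs (<= 0)
mask     = torch.ones(4, 6)
rewards  = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_like_loss(logprobs, mask, rewards))
```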
RL is commented out by default right now because it's not super well-tuned, and we don't have full and general RLHF. We only have RL on GSM8K specifically, which is why we also restrict the evaluation to just gsm8k with the `-a` flag. It also runs for quite a while, because reinforcement learning is sucking supervision bits through a straw. E.g. the default settings run for about 1.5 hours and look like this:

We can see that the reward goes up (i.e. the model is learning), the accuracy (pass@1) is climbing, and so is pass@8 (i.e. when we're given 8 opportunities to get the right answer). It's also promising that pass@8 >> pass@1, indicating that there is still a gap to be claimed here with more RL and more epochs. The improvements are more prominent on larger models; e.g. I've run up to d30 so far. I'm not going to spend as much time on this part because, honestly, it is not super well-tuned, and it creates a GSM-specific model, not a general chat model.
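For reference, here's a minimal way to compute empirical pass@1 and pass@8 from a batch of graded samples, a sketch of the metric definitions rather than the repo's evaluation code:

```python
# pass@1: average per-sample success rate; pass@8: fraction of problems where
# at least one of the 8 sampled answers is correct. results[i][j] is True if
# sample j for problem i was graded correct (made-up data below).
results = [
    [True, False, False, True, False, False, False, True],   # problem 0
    [False] * 8,                                              # problem 1
    [True] * 8,                                               # problem 2
]

pass_at_1 = sum(sum(r) for r in results) / sum(len(r) for r in results)
pass_at_8 = sum(any(r) for r in results) / len(results)
print(f"pass@1 = {pass_at_1:.3f}, pass@8 = {pass_at_8:.3f}")
```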
Report card
The final thing I'd like to point out is the `report.md` file that appears in your project folder. It contains a lot of the details related to the run, as well as a nice summary table at the end:

Total wall clock time: 3h51m
Note that since the support for RL is a little bit mixed right now, I exclude it from the total wall-clock time calculation. Up to and including SFT, the whole thing ran in 3h51m, for a total cost of (3 + 51/60) * 24 ~= $92.4 (with RL this is a bit closer to 5 hours right now). We even have $8 left over for ice cream.

Your turn
With nanochat, you can tune anything. Change the tokenizer, change any of the data, tune the hyperparameters, improve the optimization... there are many ideas to try. You may also wish to train bigger models. The codebase is set up to do that quite easily: simply use `--depth` to change the number of layers, and everything else is derived from that as the single slider of complexity. For example, the number of channels will grow, the learning rates will adjust, etc. In principle, just by changing the depth you can sweep out an entire miniseries of nanochat, and you should see strictly better results by using a larger depth and waiting longer. The place to pass it in is the `base_train.py` pretraining stage. For example, to get a model of roughly GPT-2 capability (CORE about 0.25), d26 is a good number to try. But to train larger models we also have to lower the max device batch size, e.g. from 32 to 16. The code will notice and automatically compensate, calculating that it now needs a gradient accumulation loop of 2 iterations to meet the target batch size of 0.5M tokens. To train a d30, we have to decrease it further again.
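Here's a rough sketch of that compensation logic with the numbers from this post. The device batch size of 8 for d30 is my own illustrative guess, and the real calculation in `base_train.py` may differ in its details:

```python
# Gradient accumulation needed to keep ~0.5M tokens per optimizer step
# as the per-GPU batch size shrinks for deeper (bigger) models.
target_tokens_per_step = 524_288   # 2**19, the 0.5M-token target
seq_len, num_gpus = 2048, 8

for depth, device_batch_size in [(20, 32), (26, 16), (30, 8)]:  # d30's 8 is illustrative
    tokens_per_microstep = device_batch_size * seq_len * num_gpus
    grad_accum_steps = target_tokens_per_step // tokens_per_microstep
    print(f"d{depth}: device_batch_size={device_batch_size:>2} "
          f"-> grad_accum_steps={grad_accum_steps}")
```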
And so on. Feel free to also read the code; I tried very hard to keep it readable, commented, clean and accessible. And of course, feel free to package it all up and ask your favorite LLM about it, or, even simpler, use DeepWiki from Devin/Cognition to ask questions of this repo: just change the URL of the repo from github.com to deepwiki.com, i.e. nanochat DeepWiki.
That's it: tune any part of the entire pipeline, re-run, and have fun! Ask any questions here, in the Issues/Discussions, or on my Discord in the #nanochat channel.