
AutoResearch Architecture

Autonomous LLM pretraining research agent by Andrej Karpathy — an AI agent that modifies code, trains models, evaluates results, and iterates indefinitely while you sleep.

Diagram legend: Core Engine (GPT + Training) · Optimizer (Muon + AdamW) · Data Pipeline · AI Agent (LLM Researcher) · Program / Config · CLI / Runtime · External (HuggingFace, GPU)

System Layers

Agent Layer (the autonomous researcher)
🤖 Claude / Codex Agent: LLM-powered researcher
📝 program.md: agent skill / instructions
Git Workflow: branch, commit, revert
📊 results.tsv: experiment log
Training Engine (the single file the agent edits)
🧠 GPT Model: Transformer + RoPE + value embeddings
🔁 Training Loop: 5-min time budget, grad accum
MuonAdamW: Muon (matrices) + AdamW (rest)
📈 LR Schedules: warmup + warmdown + decay
Data & Evaluation Layer (fixed, read-only)
📦 Data Downloader: ClimbMix-400B parquet shards
🔢 BPE Tokenizer: rustbpe + tiktoken, 8192 vocab
📄 DataLoader: BOS-aligned, best-fit packing
evaluate_bpb: fixed BPB metric (ground truth)
Infrastructure
💻 Single NVIDIA GPU: H100 (tested), CUDA
📦 PyTorch 2.9: torch.compile, bf16 autocast
🌍 HuggingFace Hub: dataset hosting
📤 Flash Attention 3: kernels package (varunneal / community)
📦 uv: package manager + runner

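The "warmup + warmdown" LR schedule noted in the Training Engine layer can be sketched as a trapezoidal multiplier on the base learning rate. A minimal sketch; the fractions below are illustrative, not the repo's actual values:

```python
def lr_scale(step, total_steps, warmup_frac=0.05, warmdown_frac=0.3):
    """Trapezoidal schedule: linear warmup, flat middle, linear warmdown to 0.
    Fractions are hypothetical placeholders, not taken from train.py."""
    warmup = int(total_steps * warmup_frac)
    warmdown = int(total_steps * warmdown_frac)
    if step < warmup:
        return step / max(warmup, 1)           # ramp up from 0
    if step > total_steps - warmdown:
        return max(0.0, (total_steps - step) / max(warmdown, 1))  # ramp down
    return 1.0                                  # flat plateau
```

Multiply each parameter group's base LR by this scale at every step.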
Core Flow — Autonomous Research Loop

1. Setup: The agent reads program.md, creates a branch autoresearch/<tag>, reads all source files, and verifies data exists in ~/.cache/autoresearch/
2. Baseline Run: Execute uv run train.py > run.log 2>&1 with unmodified code. Record the initial val_bpb in results.tsv
3. Hypothesize: The agent proposes a change to train.py: an architecture tweak, hyperparameter adjustment, optimizer modification, or simplification
4. Edit & Commit: Modify train.py (the only mutable file), then git commit the change to the experiment branch
5. Train (5 min): Run training for a fixed TIME_BUDGET = 300s of wall clock. The model trains on ClimbMix-400B with gradient accumulation, bf16, and torch.compile
6. Evaluate: Compute val_bpb via evaluate_bpb() on the pinned validation shard. Extract metrics from the log with grep
7. Keep or Discard: If val_bpb improved, keep the commit and advance. If worse, git reset back. Log the result to results.tsv
8. Loop Forever: The agent runs autonomously and indefinitely (~12 experiments/hour, ~100 overnight). The human wakes up to a log of discoveries

Integration Model

AI Agent (the Researcher)
Any LLM agent: Claude, Codex, etc.
Reads program.md as its skill / instruction set
Edits only train.py — everything else is read-only
Uses git for version control (branch per run)
Logs results to results.tsv (untracked)
Runs indefinitely with no human intervention
Compute & Data Stack
Single NVIDIA GPU (H100 tested), CUDA backend
Flash Attention 3 via kernels package (Hopper-aware)
torch.compile with bf16 autocast for speed
ClimbMix-400B dataset from HuggingFace (parquet shards)
rustbpe + tiktoken for fast BPE tokenization
uv for dependency management and script execution

Key Subsystem — GPT Model + MuonAdamW Optimizer

train.py (the single mutable file)
├── GPTConfig             ← Dataclass: depth, heads, embed dim, window pattern
├── CausalSelfAttention   ← GQA + RoPE + Value Embeddings + FA3
├── MLP                   ← Linear → ReLU² → Linear (squared ReLU activation)
├── Block                 ← Pre-norm (RMSNorm) + Attn + MLP with residual
├── GPT                   ← Embedding + N Blocks + Logit Softcap + LM Head
├── MuonAdamW             ← Hybrid optimizer: Muon for 2D matrices, AdamW for rest
├── polar_express_coeffs  ← Precomputed coefficients for Newton-Schulz orthogonalization
├── Hyperparameters       ← DEPTH, ASPECT_RATIO, LRs, BATCH_SIZE, WEIGHT_DECAY
└── Training Loop         ← Time-budgeted loop with LR warmup/warmdown
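Muon's core step orthogonalizes each 2D weight update; the polar_express_coeffs entry above supplies coefficients for that Newton-Schulz-style iteration. A minimal NumPy sketch using the widely cited quintic Muon coefficients rather than the repo's precomputed Polar Express ones:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Drive the singular values of G toward 1 via the odd-polynomial
    iteration X <- aX + (bA + cA@A)@X with A = X@X.T (approximate
    orthogonalization). Coefficients are the common Muon quintic, an
    assumption here, not the repo's Polar Express values."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values start <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # keep the Gram matrix A small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

Because the polynomial is odd, it acts on each singular value independently, squashing them all toward 1 without touching the singular vectors.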
Value Embeddings
ResFormer-style: alternating layers get input-dependent gated value residuals added to V projections
Sliding Window Attention
SSSL pattern: 3 short-window + 1 long-window layers cycling, last layer always full context
Muon Optimizer
Nesterov momentum + Polar Express orthogonalization + NorMuon variance reduction for matrix params
Residual Scaling
Learnable per-layer resid_lambdas and x0_lambdas mixing current hidden state with initial embeddings
Logit Softcap
Tanh-based logit capping at 15 to prevent extreme values and stabilize training
BPB Evaluation
Bits-per-byte metric: vocab-size-independent, sums per-token cross-entropy weighted by UTF-8 byte lengths
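The bits-per-byte metric has a direct formulation: convert the summed per-token cross-entropy from nats to bits, then divide by the UTF-8 byte count of the evaluated text. A sketch (the function name is hypothetical; evaluate_bpb in prepare.py is the fixed reference):

```python
import math

def bits_per_byte(token_nats, token_bytes):
    """token_nats: per-token cross-entropy in nats; token_bytes: UTF-8 byte
    length each token decodes to. Lower is better, independent of vocab size."""
    total_bits = sum(token_nats) / math.log(2)  # nats -> bits
    return total_bits / sum(token_bytes)
```

A model with loss ln 2 nats per one-byte token scores exactly 1.0 bpb, which is why the metric stays comparable across tokenizers.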

Data & Output Model

ClimbMix-400B Dataset
6,542 parquet shards on HuggingFace, Column: text (string), Pinned val shard: shard_06542, Stored: ~/.cache/autoresearch/data/
BPE Tokenizer
rustbpe-trained, tiktoken-wrapped, vocab_size: 8192, 4 special tokens (reserved_0..3), GPT-4 style split pattern, Stored: ~/.cache/autoresearch/tokenizer/
results.tsv
commit (7-char hash), val_bpb (float, lower=better), memory_gb (peak VRAM), status: keep | discard | crash, description (text)
run.log Output
val_bpb, training_seconds, total_seconds, peak_vram_mb, mfu_percent, total_tokens_M, num_steps, num_params_M, depth
DataLoader Batch
inputs: [B, T] int64 (token IDs), targets: [B, T] int64 (shifted), BOS-aligned best-fit packing, zero padding (100% utilization), pin-memory + async GPU copy
Model Config (default)
depth: 8, n_embd: 512, n_head: 4, head_dim: 128, seq_len: 2048, vocab: 8192, aspect_ratio: 64, window: SSSL pattern
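The DataLoader's "BOS-aligned best-fit packing" can be illustrated with a toy packer: each BOS-prefixed document goes into the open row with the least leftover space that still fits, and a new row is opened otherwise. A sketch only; truncation of over-long documents and final padding are omitted:

```python
def best_fit_pack(docs, seq_len, bos=0):
    """Greedy best-fit packing of BOS-prefixed token lists into rows of
    length <= seq_len (toy version of the real loader's strategy)."""
    rows = []
    for doc in docs:
        toks = [bos] + list(doc)
        best = None
        for row in rows:
            free = seq_len - len(row)
            # pick the tightest row that still has room for this document
            if len(toks) <= free and (best is None or free < seq_len - len(best)):
                best = row
        if best is None:
            best = []          # no row fits: open a new one
            rows.append(best)
        best.extend(toks)
    return rows
```

Best-fit placement is what drives utilization toward 100% without splitting documents across row boundaries.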

Package / Directory Map

autoresearch/
├── train.py               THE mutable file: GPT model, MuonAdamW optimizer, training loop, hyperparams
├── prepare.py             Read-only: constants, data download, tokenizer training, dataloader, eval
├── program.md             Agent instructions: setup, experiment loop, logging rules, constraints
├── pyproject.toml         Dependencies: torch, kernels, rustbpe, tiktoken, numpy, pandas, matplotlib
├── analysis.ipynb         Jupyter notebook for plotting experiment progress from results.tsv
├── README.md              Project overview, quickstart, design choices, platform notes
├── .python-version        Python version pinning (3.10+)
├── .gitignore             Ignores cache, logs, results
├── uv.lock                Locked dependency graph
├── progress.png           Sample experiment progress chart
└── ~/.cache/autoresearch/
  ├── data/              Downloaded parquet shards (shard_00000..06542)
  └── tokenizer/         tokenizer.pkl + token_bytes.pt
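analysis.ipynb plots progress from results.tsv. The columns listed in the Data & Output Model section can be parsed with the stdlib csv module; the sample rows below are fabricated purely for illustration, and a header row is assumed:

```python
import csv
import io

# Hypothetical rows for illustration only; the real file is written by the agent.
sample = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "abc1234\t1.052\t12.3\tkeep\tbaseline run\n"
    "def5678\t1.081\t12.4\tdiscard\ttried wider MLP\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
kept = [float(r["val_bpb"]) for r in rows if r["status"] == "keep"]
best_bpb = min(kept)  # lower bits-per-byte is better
```

Swapping io.StringIO for open("results.tsv") gives the loop the notebook would use.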
The Key Insight

AutoResearch inverts the traditional ML research workflow: instead of a human writing code and running experiments, the human writes a program.md that programs an AI agent to be the researcher. The system is deliberately minimal, essentially three files, with one hard constraint: the agent may modify only train.py, and a fixed 5-minute time budget makes every experiment directly comparable. The result is an autonomous research loop that runs ~100 experiments overnight, evolving the training code through a keep/discard selection process analogous to natural selection.