✨ v0.2.0 · DSPy-native

Write a task spec.
Run evolution.
Get correct, optimized code.

verigen is a DSPy-native framework for verifiable code generation through evolutionary optimization. Define what you need — it evolves the implementation.

verigen run examples/palindrome/

$ verigen run examples/palindrome/ --max-iterations 10
[0] initial ✓ score=0.6312 8812ms ★
[1] mutate ✗ score=0.4210 5231ms
[2] mutate ✓ score=0.7895 14203ms ★
[3] mutate ✓ score=0.6344 5801ms
[4] mutate ✓ score=0.9241 36124ms ★
[5] mutate ✓ score=0.9528 46124ms ★
[6] mutate ✓ score=0.8935 21304ms
═ threshold_reached: 0.95 @ iter 5 ═
★ Best score: 0.9528 → output/best_program.py

How It Works

Three files → optimized code

Define your task with a seed program, an evaluator, and a description. verigen handles the rest.

initial.py

Seed code with # EVOLVE-BLOCK markers. The LLM generates the full program; markers guide where to mutate.

evaluate.py

Exports evaluate(code) → dict with score, passed, feedback. Hard constraints stop bad candidates; continuous metrics drive improvement.

program.md

Task description + rich context. The first heading is the short description; full content guides the LLM toward the right implementation.
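The evaluate.py contract described above can be sketched as an evaluator that reports hard-constraint failures explicitly instead of raising. This is a minimal sketch for a hypothetical `solve(n)` doubling task. Whether verigen treats an uncaught exception the same way as `passed=False` is not stated on this page, so the explicit return is an assumption, and the constant score is a placeholder for a real continuous metric.

```python
def evaluate(code: str) -> dict:
    """Evaluator sketch for a task whose solve(n) should return n * 2."""
    ns: dict = {}
    try:
        exec(code, ns)  # load the candidate program into a fresh namespace
        fn = ns["solve"]
        cases = [(5, 10), (0, 0), (-3, -6)]
        # Collect (input, expected, actual) for every failing case.
        failed = [(x, want, fn(x)) for x, want in cases if fn(x) != want]
    except Exception as exc:
        # Assumption: a crash is reported as a hard-constraint failure.
        return {"score": 0.0, "passed": False,
                "feedback": f"{type(exc).__name__}: {exc}"}

    if failed:
        x, want, got = failed[0]
        return {"score": 0.0, "passed": False,
                "feedback": f"solve({x}) returned {got}, expected {want}"}

    # Placeholder score; a real evaluator would compute a continuous metric here.
    return {"score": 1.0, "passed": True, "feedback": "all cases passed"}
```

Returning the first failing case in `feedback` gives the next mutation something concrete to act on.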

The Evolution Loop

initial.py → DSPy Generate → Evaluate → Passed? → Mutate → Evaluate → Score ↑ ? → best_program.py

✓ Keep: passed=True and the score improves · ✗ Reject: hard constraints fail · ⏹ Stop early: score threshold reached or plateau
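The keep, reject, and stop rules above amount to greedy hill climbing. The sketch below illustrates that control flow only and is not verigen's implementation: `evolve`, `mutate`, and the `patience` plateau window are hypothetical names, and `mutate` stands in for the DSPy-driven LLM call.

```python
from typing import Callable, Optional

def evolve(
    seed: str,
    mutate: Callable[[str, str], str],   # (best code, last feedback) -> candidate
    evaluate: Callable[[str], dict],     # returns {"score", "passed", "feedback"}
    max_iterations: int = 30,
    score_threshold: Optional[float] = None,
    patience: int = 5,                   # hypothetical: non-improving steps before "plateau"
) -> tuple[str, float, str]:
    result = evaluate(seed)
    if not result["passed"]:
        return seed, 0.0, "initial_failed"   # seed must pass before mutating

    best_code, best_score = seed, result["score"]
    feedback, stale = result["feedback"], 0

    for _ in range(max_iterations):
        candidate = mutate(best_code, feedback)
        result = evaluate(candidate)
        feedback = result["feedback"]
        if result["passed"] and result["score"] > best_score:
            best_code, best_score, stale = candidate, result["score"], 0  # keep
        else:
            stale += 1                                                    # reject
        if score_threshold is not None and best_score >= score_threshold:
            return best_code, best_score, "threshold_reached"
        if stale >= patience:
            return best_code, best_score, "plateau"
    return best_code, best_score, "completed"


# Toy run: candidates are labels with canned results instead of real programs.
table = {"v0": (True, 0.60), "v1": (False, 0.0), "v2": (True, 0.80),
         "v3": (True, 0.70), "v4": (True, 0.96)}
versions = iter(["v1", "v2", "v3", "v4"])

def toy_evaluate(code: str) -> dict:
    passed, score = table[code]
    return {"passed": passed, "score": score, "feedback": f"{code}: score={score}"}

def toy_mutate(code: str, feedback: str) -> str:
    return next(versions)

best, score, status = evolve("v0", toy_mutate, toy_evaluate, score_threshold=0.95)
print(best, score, status)  # v4 0.96 threshold_reached
```

In the toy run, `v1` is rejected for failing, `v3` is rejected for scoring below the current best, and `v4` ends the loop by crossing the threshold.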

Features

Built for the code evolution loop

Everything you need to evolve correct, performant Python code — without leaving the terminal.

🧬

Evolutionary Mutation

LLM-guided code improvement with feedback from each evaluation. Greedy hill-climbing that keeps only better candidates.

🧪

Hard Constraint Verification

Candidates that fail tests are automatically rejected. If the seed generation can't pass, the loop stops immediately — no wasted iterations.

📊

Continuous Metrics

Optimize latency, accuracy, throughput — whatever your evaluate() returns. Higher score wins, guided by sigmoid normalization.

🔗

DSPy Native

Leverages DSPy modules, signatures, and assertions. Works with OpenAI, Anthropic, Google, Ollama, llama.cpp, vLLM — any DSPy-compatible provider.

🔄

Live Progress

See per-iteration scores, timing, and best-so-far marks in real time. Statuses: completed, threshold_reached, plateau, initial_failed.

📦

pi Skill Ready

Ships as a pi agent Skill. Auto-discovered at the project root — your agent can scaffold tasks, run evolution, and return optimized code.

Code

How it looks in practice

A simple task to double an integer — seed code, evaluator, and the result after evolution.

📄 initial.py

def solve(n: int) -> int:
    """Return n * 2."""
    # EVOLVE-BLOCK-START
    raise NotImplementedError("Replace this!")
    # EVOLVE-BLOCK-END
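To make the role of the markers concrete, the following sketch locates the region between them in a seed file. It assumes the marker spelling shown in the seed above; `find_evolve_blocks` is a hypothetical helper, not part of verigen's API, and the page does not say how verigen itself reads the markers.

```python
# Marker spelling copied from the initial.py example.
START = "# EVOLVE-BLOCK-START"
END = "# EVOLVE-BLOCK-END"

def find_evolve_blocks(source: str) -> list[list[str]]:
    """Return the lines enclosed by each START/END marker pair."""
    blocks: list[list[str]] = []
    current = None  # lines of the block being collected, or None when outside one
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == START:
            current = []
        elif stripped == END and current is not None:
            blocks.append(current)
            current = None
        elif current is not None:
            current.append(line)
    return blocks

seed = '''def solve(n: int) -> int:
    """Return n * 2."""
    # EVOLVE-BLOCK-START
    raise NotImplementedError("Replace this!")
    # EVOLVE-BLOCK-END
'''

print(find_evolve_blocks(seed))
```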

📄 evaluate.py

import time

def evaluate(code: str) -> dict:
    ns = {}
    exec(code, ns)
    fn = ns["solve"]

    # Correctness
    assert fn(5) == 10
    assert fn(0) == 0

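    # Performance: average latency over 1000 calls, in microseconds
    t0 = time.perf_counter()
    for _ in range(1000):
        fn(100)
    avg_us = (time.perf_counter() - t0) / 1000 * 1e6

    return {
        # Higher is better: 0.5 at the 1 µs reference, approaching 1.0 as calls get faster
        "score": 1.0 / (1.0 + avg_us),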
        "passed": True,
        "feedback": f"Avg {avg_us:.1f}µs/call",
        "metrics": {"avg_us": avg_us},
    }
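For a task this small, the evolved output/best_program.py would plausibly replace the placeholder with a one-line implementation inside the markers. The listing below is a hand-written illustration of that outcome, not captured verigen output.

```python
# Illustrative only: a hand-written guess at an evolved program, not verigen output.
def solve(n: int) -> int:
    """Return n * 2."""
    # EVOLVE-BLOCK-START
    return n * 2
    # EVOLVE-BLOCK-END
```

This version satisfies both correctness checks in evaluate.py (`solve(5) == 10` and `solve(0) == 0`).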

Example Tasks

From strings to graphs

Task	Pattern	Difficulty	Notes
examples/palindrome/	String processing	Easy	Simple correctness + speed
tasks/game_of_life/	Matrix computation	Medium	Padding vs vectorized
tasks/levenshtein/	DP algorithm	Medium	2D → 1D row optimization
tasks/lru_cache/	Data structure	Medium	OrderedDict → linked list
tasks/topological_sort/	Graph algorithm	Medium	Kahn's vs DFS optimization
tasks/prefilter/	AST walking	Medium	Dogfooding: ast.walk vs Python recursion

CLI

verigen run <task-dir>

All options for the verigen run command.

Option	Description	Default
--max-iterations, -n	Maximum evolution iterations	30
--score-threshold, -t	Early stop when score ≥ threshold (e.g. 0.70 = ~2.3× reference)	(none)
--model	DSPy LM: openai/gpt-4o, ollama_chat/qwen3.6, etc.	auto
--api-base	API base URL (e.g. http://localhost:8080/v1)	(none)
--output, -o	Output directory for results	./output/
--timeout	Evaluation subprocess timeout (seconds)	30
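The threshold example above (0.70 ≈ 2.3× reference) is consistent with a score of the form reference / (reference + measured), which is a logistic (sigmoid) function of the log-speedup. That form is inferred from the numbers on this page rather than taken from verigen's documentation, and the function names below are hypothetical.

```python
import math

def latency_score(measured: float, reference: float = 1.0) -> float:
    """Map a latency to (0, 1): 0.5 at the reference, higher when faster."""
    return reference / (reference + measured)

def speedup_for_score(score: float) -> float:
    """Invert latency_score: how many times faster than the reference a score implies."""
    return score / (1.0 - score)

# A candidate 2.3x faster than the reference scores about 0.70.
print(round(latency_score(1.0 / 2.3), 2))    # 0.7
print(round(speedup_for_score(0.70), 2))     # 2.33

# The same score written as a sigmoid of the log-speedup.
speedup = 2.3
sigmoid = 1.0 / (1.0 + math.exp(-math.log(speedup)))
print(math.isclose(sigmoid, latency_score(1.0 / speedup)))  # True
```

Under this assumption, a threshold of 0.5 means "match the reference", and each further gain in score requires a disproportionately larger speedup, so thresholds close to 1.0 are very demanding.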

Python API

Use verigen programmatically

Import and integrate directly into your Python workflows.

🐍 Python · full control

from verigen import VerifiableCodeGen, load_task

# Load a task
task = load_task("examples/palindrome/")

# Configure evolution
gen = VerifiableCodeGen(
    max_iterations=50,
    score_threshold=0.95,
)

# Run
result = gen(task)

# Results
print(f"Score: {result.best_score:.4f}")
print(result.best_code)

# Full iteration history
for entry in result.trace.entries:
    badge = "✓" if entry.passed else "✗"
    print(f"[{entry.iteration:3d}] {entry.phase:7s} {badge} score={entry.score:.4f}")
