✨ v0.2.0 · DSPy-native

Write a task spec.
Run evolution.
Get correct, optimized code.

verigen is a DSPy-native framework for verifiable code generation through evolutionary optimization. Define what you need — it evolves the implementation.

$ verigen run examples/palindrome/ --max-iterations 10
[0] initial ✓ score=0.6312 8812ms ★
[1] mutate ✗ score=0.4210 5231ms
[2] mutate ✓ score=0.7895 14203ms
[3] mutate ✓ score=0.6344 5801ms
[4] mutate ✓ score=0.9241 36124ms
[5] mutate ✓ score=0.9528 46124ms ★
[6] mutate ✓ score=0.8935 21304ms
═ threshold_reached: 0.95 @ iter 5 ═
Best score: 0.9528   output/best_program.py

Three files → optimized code

Define your task with a seed program, an evaluator, and a description. verigen handles the rest.

01

initial.py

Seed code with # EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END markers. The LLM generates the full program; the markers indicate which region to mutate.

02

evaluate.py

Exports evaluate(code) → dict with score, passed, feedback. Hard constraints reject bad candidates; continuous metrics drive improvement.

03

program.md

Task description + rich context. The first heading is the short description; full content guides the LLM toward the right implementation.

initial.py → DSPy Generate → Evaluate → Passed? → Mutate → Evaluate → Score ↑? → best_program.py

✓ Keep: if passed=True and the score improves
✗ Reject: if hard constraints fail
⏹ Stop: early, at the score threshold or on a plateau
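
The keep / reject / stop rules above can be sketched as a small greedy loop. This is an illustration under stated assumptions, not verigen's actual implementation: `evolve`, `mutate`, and `patience` are hypothetical names, and the status strings are borrowed from the ones the progress output reports.

```python
from typing import Callable


def evolve(
    seed: str,
    evaluate: Callable[[str], dict],
    mutate: Callable[[str, str], str],  # hypothetical: (code, feedback) -> new code
    max_iterations: int = 30,
    score_threshold: float = 0.95,
    patience: int = 5,  # hypothetical plateau window
) -> tuple[str, float, str]:
    """Greedy hill-climbing over program text; returns (code, score, status)."""
    result = evaluate(seed)
    if not result["passed"]:
        # The seed violates a hard constraint: stop without iterating.
        return seed, 0.0, "initial_failed"

    best_code, best_score = seed, result["score"]
    feedback = result["feedback"]
    stale = 0  # consecutive iterations without improvement

    for _ in range(max_iterations):
        if best_score >= score_threshold:
            return best_code, best_score, "threshold_reached"
        if stale >= patience:
            return best_code, best_score, "plateau"

        candidate = mutate(best_code, feedback)
        result = evaluate(candidate)
        if result["passed"] and result["score"] > best_score:
            # Keep: constraints hold and the score improved.
            best_code, best_score = candidate, result["score"]
            stale = 0
        else:
            # Reject: failed a constraint or did not improve.
            stale += 1
        feedback = result["feedback"]

    status = "threshold_reached" if best_score >= score_threshold else "completed"
    return best_code, best_score, status
```

Because only strictly better candidates replace the current best, the best score never decreases; the trade-off is that the search can stall in a local optimum, which is what the plateau stop guards against.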

Built for the code evolution loop

Everything you need to evolve correct, performant Python code — without leaving the terminal.

🧬

Evolutionary Mutation

LLM-guided code improvement with feedback from each evaluation. Greedy hill-climbing that keeps only better candidates.

🧪

Hard Constraint Verification

Candidates that fail tests are automatically rejected. If the seed generation can't pass, the loop stops immediately — no wasted iterations.

📊

Continuous Metrics

Optimize latency, accuracy, or throughput: whatever your evaluate() returns. Higher score wins, and sigmoid normalization keeps scores on a 0 to 1 scale.

🔗

DSPy Native

Leverages DSPy modules, signatures, and assertions. Works with OpenAI, Anthropic, Google, Ollama, llama.cpp, vLLM — any DSPy-compatible provider.

🔄

Live Progress

See per-iteration scores, timing, and best-so-far marks in real time. Statuses: completed, threshold_reached, plateau, initial_failed.

📦

pi Skill Ready

Ships as a pi agent Skill. Auto-discovered at the project root — your agent can scaffold tasks, run evolution, and return optimized code.

How it looks in practice

A simple task, doubling an integer: the seed code and its evaluator.

📄 initial.py
def solve(n: int) -> int:
    """Return n * 2."""
    # EVOLVE-BLOCK-START
    raise NotImplementedError("Replace this!")
    # EVOLVE-BLOCK-END
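
Before starting a run, it can help to check that the markers in a seed file are balanced. The helper below is a hypothetical lint for your own seeds, not part of verigen; it assumes the marker spelling shown in the seed above.

```python
SEED = '''def solve(n: int) -> int:
    """Return n * 2."""
    # EVOLVE-BLOCK-START
    raise NotImplementedError("Replace this!")
    # EVOLVE-BLOCK-END
'''


def find_evolve_blocks(source: str) -> list[tuple[int, int]]:
    """Return (start, end) 1-based line numbers for each marker pair."""
    blocks: list[tuple[int, int]] = []
    start = None
    for lineno, line in enumerate(source.splitlines(), start=1):
        text = line.strip()
        if text == "# EVOLVE-BLOCK-START":
            if start is not None:
                raise ValueError(f"nested START marker on line {lineno}")
            start = lineno
        elif text == "# EVOLVE-BLOCK-END":
            if start is None:
                raise ValueError(f"END marker without START on line {lineno}")
            blocks.append((start, lineno))
            start = None
    if start is not None:
        raise ValueError(f"START marker on line {start} is never closed")
    return blocks


print(find_evolve_blocks(SEED))
```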
📄 evaluate.py
import time

def evaluate(code: str) -> dict:
    ns = {}
    exec(code, ns)
    fn = ns["solve"]

    # Correctness
    assert fn(5) == 10
    assert fn(0) == 0

    # Performance
    t0 = time.perf_counter()
    for _ in range(1000):
        fn(100)
    avg_us = (time.perf_counter() - t0) / 1000 * 1e6

    return {
        # Lower latency must give a higher score (1 µs per call scores 0.5)
        "score": 1.0 / (1.0 + avg_us),
        "passed": True,
        "feedback": f"Avg {avg_us:.1f}µs/call",
        "metrics": {"avg_us": avg_us},
    }
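
An evaluator can also report hard-constraint failures through passed and feedback instead of raising. The variant below is a sketch under assumptions: the extra test case (-3 → -6) and the fraction-of-cases score are illustrative choices, timing is left out to keep the focus on the failure path, and how verigen treats an exception raised inside evaluate() is not stated here.

```python
def evaluate(code: str) -> dict:
    """Check correctness and describe failures rather than raising."""
    cases = [(5, 10), (0, 0), (-3, -6)]  # (-3, -6) is an added, illustrative case
    ns: dict = {}
    try:
        exec(code, ns)
        fn = ns["solve"]
        failures = []
        for arg, want in cases:
            got = fn(arg)
            if got != want:
                failures.append(f"solve({arg}) returned {got!r}, expected {want!r}")
    except Exception as exc:
        # Syntax errors, a missing solve(), or a crash inside the candidate.
        return {
            "score": 0.0,
            "passed": False,
            "feedback": f"{type(exc).__name__}: {exc}",
        }

    return {
        "score": 1.0 - len(failures) / len(cases),
        "passed": not failures,
        "feedback": "; ".join(failures) or "all cases passed",
    }


# A hand-written candidate, used only to exercise the evaluator.
print(evaluate("def solve(n):\n    return n * 2\n"))
```

Specific feedback such as "solve(5) returned 11, expected 10" gives the next mutation something concrete to act on, which a bare assertion failure does not.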

From strings to graphs

| Task | Pattern | Difficulty | Notes |
| --- | --- | --- | --- |
| examples/palindrome/ | String processing | Easy | Simple correctness + speed |
| tasks/game_of_life/ | Matrix computation | Medium | Padding vs vectorized |
| tasks/levenshtein/ | DP algorithm | Medium | 2D → 1D row optimization |
| tasks/lru_cache/ | Data structure | Medium | OrderedDict → linked list |
| tasks/topological_sort/ | Graph algorithm | Medium | Kahn's vs DFS optimization |
| tasks/prefilter/ | AST walking | Medium | Dogfooding: ast.walk vs Python recursion |

verigen run <task-dir>

All options for the CLI.

| Option | Description | Default |
| --- | --- | --- |
| --max-iterations, -n | Maximum evolution iterations | 30 |
| --score-threshold, -t | Early stop when score ≥ threshold (e.g. 0.70 = ~2.3× reference) | |
| --model | DSPy LM: openai/gpt-4o, ollama_chat/qwen3.6, etc. | auto |
| --api-base | API base URL (e.g. http://localhost:8080/v1) | |
| --output, -o | Output directory for results | ./output/ |
| --timeout | Evaluation subprocess timeout (seconds) | 30 |
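
The hint "0.70 = ~2.3× reference" is consistent with a score of the form speedup / (speedup + 1), which equals the logistic sigmoid of ln(speedup). Treat that formula as an assumption inferred from the hint, not a confirmed description of verigen's scoring; the function names below are hypothetical. Under that assumption, a threshold converts to a required speedup like this:

```python
import math


def score_from_speedup(speedup: float) -> float:
    """Map a speedup over the reference (> 0) to a score in (0, 1)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return speedup / (speedup + 1.0)


def speedup_for_score(score: float) -> float:
    """Inverse mapping: the speedup needed to reach a given score."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return score / (1.0 - score)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Same curve, two spellings: sigmoid(ln s) == s / (s + 1).
assert math.isclose(sigmoid(math.log(2.5)), score_from_speedup(2.5))

for threshold in (0.50, 0.70, 0.95):
    print(f"{threshold:.2f} -> {speedup_for_score(threshold):.1f}x reference")
```

On this curve a candidate that merely matches the reference scores 0.5, and each further gain in score costs progressively more speedup, so thresholds close to 1.0 are much harder to reach than mid-range ones.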

Use verigen programmatically

Import and integrate directly into your Python workflows.

🐍 Python · full control
from verigen import VerifiableCodeGen, load_task

# Load a task
task = load_task("examples/palindrome/")

# Configure evolution
gen = VerifiableCodeGen(
    max_iterations=50,
    score_threshold=0.95,
)

# Run
result = gen(task)

# Results
print(f"Score: {result.best_score:.4f}")
print(result.best_code)

# Full iteration history
for entry in result.trace.entries:
    badge = "✓" if entry.passed else "✗"
    print(f"[{entry.iteration:3d}] {entry.phase:7s} {badge} score={entry.score:.4f}")
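
The iteration history can also be summarized after a run. The sketch below uses a stand-in `Entry` dataclass that mirrors only the four attributes read in the loop above (iteration, phase, passed, score); the real trace entry type may differ, and `best_entry` and `pass_rate` are hypothetical helpers. The sample values are copied from the example run shown at the top of the page.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Entry:
    """Stand-in for a trace entry; the field names are assumptions."""
    iteration: int
    phase: str
    passed: bool
    score: float


def best_entry(entries: Sequence[Entry]) -> Optional[Entry]:
    """Highest-scoring entry among those that passed, or None."""
    passing = [e for e in entries if e.passed]
    return max(passing, key=lambda e: e.score) if passing else None


def pass_rate(entries: Sequence[Entry]) -> float:
    """Fraction of iterations whose candidate passed the hard constraints."""
    return sum(e.passed for e in entries) / len(entries) if entries else 0.0


sample = [
    Entry(0, "initial", True, 0.6312),
    Entry(1, "mutate", False, 0.4210),
    Entry(2, "mutate", True, 0.7895),
    Entry(3, "mutate", True, 0.6344),
    Entry(4, "mutate", True, 0.9241),
    Entry(5, "mutate", True, 0.9528),
    Entry(6, "mutate", True, 0.8935),
]

best = best_entry(sample)
print(f"best: iteration {best.iteration}, score {best.score:.4f}")
print(f"pass rate: {pass_rate(sample):.0%}")
```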

Ready to evolve your code?

Install verigen and run your first evolution in minutes.

$ pip install verigen
# or: pip install -e /path/to/verigen
$ verigen run examples/palindrome/