Like cProfile — but for AI agents. Every LLM call, tool invocation, token count, and error recorded as a tree you can inspect. Two lines to add, no backend required.
pip install peekr
Call peekr.instrument() once. It patches LLM SDKs (OpenAI, Anthropic, Bedrock) and agent frameworks (LangChain, LlamaIndex, CrewAI) automatically — no code changes anywhere else.
Add @trace to your tool functions. They become child spans that nest under LLM calls automatically.
Run peekr view --io traces.jsonl to see the full tree — inputs, outputs, latency, token counts, and errors.
CPU profilers only see timing. Agents fail for reasons that have nothing to do with performance.
Not what you think was sent — what was actually sent. Peekr captures every message so you can catch malformed context before it reaches the model.
```
agent.run                               2100ms
└─ tool.fetch_user                        12ms
     in:  {"user_id": 42}
     out: null          ← returned null, agent didn't check
└─ openai.chat [gpt-4o]                 2088ms  4821tok
     in:  [{"role": "system", "content": "User profile: null..."}]
          ^ LLM received garbage
```
The Hallucination evaluator scores each LLM response against its context — 1.0 = every claim grounded, 0.0 = invented. Scores ride on the span and are queryable in SQL.
```
openai.chat [gpt-4o]                     843ms  312tok
  in:  "The Eiffel Tower was completed in 1889 in Paris..."
  out: "Built in 1923 by Frank Lloyd Wright."   ← invented facts
  eval_scores: {Hallucination: 0.0, Rubric: 0.5}
```
The trace shows exactly where time went. Most developers assume the LLM is the bottleneck and start swapping models. It almost never is.
```
agent.run                               4300ms
└─ tool.search_web                      3800ms  ← 88% of total time. Cache this.
└─ tool.rerank                            18ms
└─ openai.chat                           490ms  ← not your problem
```
Token counts across traces reveal patterns invisible in your code. A token count that grows on every run is the signature of unbounded conversation history.
```
Trace 1: 18,432 tokens · $0.018
Trace 2: 21,104 tokens · $0.021
Trace 3: 24,891 tokens · $0.025  ← unbounded growth
```
The trace shows what your tools actually returned — not what you assume they returned. Compare local and prod traces side by side.
```
# local
tool.fetch_inventory   8ms
  out: [{"id": 1, "qty": 42}]

# prod
tool.fetch_inventory   8ms
  out: []   ← empty. Data pipeline bug.
```
Two lines for LLM calls. One decorator for your tools. Nothing else.
```python
import openai
import peekr
from peekr import trace

peekr.instrument()

@trace
def search_web(query: str) -> list:
    return fetch_results(query)

# LLM calls traced automatically
openai.chat.completions.create(
    model="gpt-4o",
    messages=[...],
)
```
```
Trace a3f2b1c0                          1243ms  891tok
────────────────────────────────
agent.run                               1243ms
└─ tool.search_web                       210ms
     in:  "climate policy"
     out: ["result1", ...]
└─ openai.chat [gpt-4o]                 1033ms  891tok
     out: "Based on recent..."
```
One command. One self-contained HTML file. No server, no signup. Designed for RAG and memory/agent pipelines.
```bash
# one command
peekr dashboard traces.db -o report.html
open report.html
```
```
● Hallucination health: 66/100   needs attention
  30 of 134 calls flagged. ↓ 12 pts vs baseline 0.78.
  › Hallucination dropped 27 pts from baseline.
  › Worst channel: gpt-4o-mini · acme · /api/qa — 0.31 / 8 calls.
  › 4 of 12 citations were invented.
```
The "Likely causes & next steps" panel runs eight diagnostic rules over your data and surfaces concrete fixes — retrieval misses, chunking issues, missing refusal prompts, channel concentration, error spikes.
```
[HIGH] Model is inventing citations (33%)
Out of 12 patterns in outputs, 8 don't appear in source context.
Signature of a RAG flow where retrieval missed.

What to try:
1. Log retrieved chunks for a span
2. "Cite only sources in context"
3. Verify citations post-hoc
4. Hybrid retrieval (BM25 + dense)
```
Rows are your models, tenants, and endpoints. Columns are time buckets. Red cells mean hallucinating in that window. Click any cell to refilter the whole dashboard to that channel.
```
model             10:00   11:00   12:00   13:00   14:00
gpt-4o-mini        0.89    0.71    0.42    0.31    0.28
gpt-4o             0.91    0.88    0.76    0.81    0.79
claude-opus-4-5    0.93    0.92    0.94    0.91    0.92

green = grounded  →  red = hallucinating
```
Click a low point on the time-series and you jump to a worst-offender card showing the source context, the model's answer (with contradicted/unsupported claims highlighted), and a tailored "What to try for this call" panel.
```
#1  ⬤ 0.00   gpt-4o-mini · acme · /api/qa
Q: When was the Eiffel Tower built and by whom?

┌─SOURCE CONTEXT────────────┐  ┌─MODEL ANSWER──────────────┐
│ The Eiffel Tower was      │  │ Built in 1923 by          │
│ completed in 1889 for     │  │ Frank Lloyd Wright for    │
│ the Paris World's Fair... │  │ the London Olympics.      │
└───────────────────────────┘  └───────────────────────────┘

contradicted  "1923"
contradicted  "Frank Lloyd Wright"
unsupported   "London Olympics"

▌ What to try for this call:
  • Numeric contradiction (1923) — add "be exact about dates"
  • Proper noun substitution — instruct not to substitute names
  • Move retrieved context closer to the question
```
Start local. Graduate to SQLite when you need queries. Bring your own backend when you're ready.
Writes one span per line. Grep-able, diff-able, works everywhere. Perfect for local debugging.
peekr.instrument()
WAL mode for multi-process writes. Query across runs with SQL. Works inside Docker and CI.
peekr.instrument(storage="sqlite")
Implement one export(span) method to ship spans to Datadog, your own backend, or anywhere else.
add_exporter(MyExporter())
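A minimal sketch of what an exporter could look like, assuming spans expose a dict-like view via `vars()` and that `add_exporter` lives on the top-level `peekr` module; the endpoint is hypothetical:

```python
import json
import urllib.request

import peekr

class MyExporter:
    def export(self, span):
        # Assumption: vars(span) yields a JSON-serialisable view of the span.
        payload = json.dumps(vars(span), default=str).encode()
        req = urllib.request.Request(
            "https://spans.example.com/ingest",  # hypothetical endpoint
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

peekr.add_exporter(MyExporter())
```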
SQLite storage means every trace is queryable. No dashboard needed.
```sql
-- Slowest operations, on average
SELECT name, ROUND(AVG(duration_ms)) AS avg_ms
FROM spans
GROUP BY name
ORDER BY avg_ms DESC;
```
```sql
-- Token spend per model
SELECT json_extract(attributes, '$.model') AS model,
       SUM(json_extract(attributes, '$.tokens_total')) AS tokens
FROM spans
GROUP BY model;
```
```sql
-- Every failed span and its error message
SELECT name, trace_id,
       json_extract(attributes, '$.error') AS msg
FROM spans
WHERE status = 'error';
```
```sql
-- Tokens per trace over time: steady growth means unbounded history.
-- MIN(start_time) orders each trace by its first span.
SELECT trace_id,
       SUM(json_extract(attributes, '$.tokens_total')) AS total
FROM spans
GROUP BY trace_id
ORDER BY MIN(start_time);
```
```sql
-- The 20 least-grounded responses. The json_extract is repeated in
-- WHERE because SQLite doesn't allow column aliases there.
SELECT trace_id,
       json_extract(attributes, '$.eval_scores.Hallucination') AS h,
       json_extract(attributes, '$.output') AS out
FROM spans
WHERE json_extract(attributes, '$.eval_scores.Hallucination') < 0.5
ORDER BY h ASC
LIMIT 20;
```
```sql
-- Groundedness and rubric quality by model
SELECT json_extract(attributes, '$.model') AS model,
       AVG(json_extract(attributes, '$.eval_scores.Hallucination')) AS groundedness,
       AVG(json_extract(attributes, '$.eval_scores.Rubric')) AS quality
FROM spans
GROUP BY model;
```
Most alternatives either require an account, tie you to a framework, or don't capture LLM context at all.
| | Print statements | LangSmith | OpenTelemetry | peekr |
|---|---|---|---|---|
| Zero config | 〜 | ✗ | ✗ | ✓ |
| No account required | ✓ | ✗ | ✓ | ✓ |
| Works with any framework | ✓ | ✗ | ✓ | ✓ |
| Captures token counts | ✗ | ✓ | ✗ | ✓ |
| Captures inputs & outputs | 〜 | ✓ | ✗ | ✓ |
| Parent/child span tree | ✗ | ✓ | ✓ | ✓ |
| Queryable locally | ✗ | ✗ | 〜 | ✓ |
| Data stays on your machine | ✓ | ✗ | 〜 | ✓ |
From basic tracing to evals, experiments, and a data flywheel — all in one library.
One call patches OpenAI, Anthropic, and Bedrock. No wrappers, no env vars, no accounts.
Spans link to their parents via Python's ContextVar, so the tree survives async/await with no manual context passing.
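A sketch of the general pattern (not peekr's actual source): a ContextVar holds the current span, and because each asyncio task gets a copy of the context, a child span created inside an `await` still finds the right parent.

```python
import contextvars

_current = contextvars.ContextVar("current_span", default=None)

class Span:
    def __init__(self, name):
        self.name = name
        self.parent = _current.get()      # parent resolved at creation time

    def __enter__(self):
        self._token = _current.set(self)  # become the current span
        return self

    def __exit__(self, *exc):
        _current.reset(self._token)       # restore the previous span
```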
WAL mode for concurrent writes. Query traces across runs directly with SQL.
Wrap any sync or async function. Captures inputs, outputs, latency, and errors.
peekr.session(user_id="alice", tenant_id="acme") groups spans by user and customer org. Both first-class on every span.
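For example (the call signature comes from the line above; that `session()` works as a context manager is an assumption):

```python
import peekr

# Assumption: session() is a context manager scoping every span created inside.
with peekr.session(user_id="alice", tenant_id="acme"):
    ...  # run your agent here; all spans inherit user_id and tenant_id
```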
tenant_id + retention_class are first-class columns — indexed in SQLite, top-level in JSONL. Filter without json_extract.
Alerts fire when error rate, latency, token spend, or cost crosses your threshold.
Score every LLM response automatically with Rubric, NotEmpty, and NoError. Scores ride on each span as eval_scores.
Hallucination() scores how grounded each output is in its context (0.0–1.0). Plug in your retrieved docs for RAG flows.
Hallucination(detailed=True) splits outputs into atomic claims and gives each one a verdict: supported, contradicted, or unsupported.
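A sketch of how the evaluators might be wired together; the evaluator names come from the copy above, but the import path and the `evaluators=` parameter on `instrument()` are assumptions:

```python
import peekr
# Assumed import location for the evaluator classes named above.
from peekr import Hallucination, NotEmpty, Rubric

peekr.instrument(
    storage="sqlite",
    # Assumption: evaluators are registered at instrument() time.
    evaluators=[Rubric(), NotEmpty(), Hallucination(detailed=True)],
)
```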
peekr dashboard emits one self-contained HTML file — health hero, channel-and-time heatmap, AI-generated recommendations, and per-call action items.
Rate traces good/bad. Export labelled data as OpenAI fine-tuning JSONL in one command.
@peekr.experiment(variants=[...]) routes traffic and tags spans. Analyse results with SQL.
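A hypothetical sketch of what that could look like; how the active variant reaches your function (shown here as a `variant` keyword argument) is an assumption, not documented behaviour:

```python
import peekr

PROMPTS = {
    "terse": "Answer in one sentence.",
    "grounded": "Answer only from the provided context.",
}

# Hypothetical contract: assumes peekr injects the chosen variant
# as a keyword argument when routing traffic.
@peekr.experiment(variants=list(PROMPTS))
def answer(question: str, variant: str = "terse") -> str:
    return f"{PROMPTS[variant]}\n{question}"  # stand-in for a real LLM call
```

Each span then carries its variant tag, so results can be compared with a GROUP BY in SQL.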
Re-run any stored trace with the same inputs. Debug production issues locally.
One method to ship spans to Datadog, Grafana, or your own backend.
Nothing leaves your machine by default. Use capture_io=False for sensitive functions.
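A sketch of the privacy escape hatch; the flag name comes from the line above, but passing it per-function through `@trace` is an assumption:

```python
from peekr import trace

# Assumption: capture_io=False is accepted as a @trace argument.
# Timing, status, and errors are still recorded; payloads are not.
@trace(capture_io=False)
def lookup_patient(patient_id: str) -> dict:
    ...
```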
The OSS SDK is MIT licensed forever — that's not changing. When a single-process file isn't the right fit any more (multiple services, a team that needs shared dashboards, longer retention, audit-grade trace storage), Peekr Cloud is the optional managed backend.
The wire format is already shipping in v0.3: tenant_id and retention_class exist specifically so the spans you produce today work without modification when you connect:
```python
import peekr

peekr.instrument(
    tenant_id="acme",
    exporter=peekr.HTTPExporter(
        endpoint="https://ingest.peekr.cloud",
        api_key="pk_live_…",
    ),
)
```

Join the waitlist →
The OSS SDK is open source and free forever. Starring takes 3 seconds, helps other developers find the project, and keeps it growing.
⭐ Star on GitHub