Open Source · MIT · Free forever

Agents are black boxes.
Peekr makes them transparent.

Like cProfile — but for AI agents. Every LLM call, tool invocation, token count, and error recorded as a tree you can inspect. Two lines to add, no backend required.

Get started → · ⭐ Star on GitHub · Read the docs
pip install peekr

Python 3.9+ · SDKs + frameworks · Sync & async · No signup

Works with

OpenAI gpt-4o, o1, o3 · streaming ✓
Anthropic Claude 3 / 4 · streaming ✓
AWS Bedrock Converse API · streaming ✓
LangChain chains · tools · agents · retrievers
LlamaIndex query · retrieve · agent steps
CrewAI crew · task · agent spans
Step 1

Instrument

Call peekr.instrument() once. It patches LLM SDKs (OpenAI, Anthropic, Bedrock) and agent frameworks (LangChain, LlamaIndex, CrewAI) automatically — no code changes anywhere else.
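
In code, that's the whole of step 1:

import peekr
peekr.instrument()  # patches OpenAI, Anthropic, Bedrock, and installed frameworks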

Step 2

Decorate your tools

Add @trace to your tool functions. They become child spans that nest under LLM calls automatically.
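
For example:

from peekr import trace

@trace
def fetch_user(user_id: int) -> dict:
    # shows up as a tool.fetch_user child span under the active LLM call
    return {"user_id": user_id, "name": "Ada"}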

Step 3

Inspect

Run peekr view --io traces.jsonl to see the full tree — inputs, outputs, latency, token counts, and errors.

A profiler for every layer of your agent

CPU profilers only see timing. Agents fail for reasons that have nothing to do with performance.

🔍

"My agent gave the wrong answer"

Fix: see what the LLM actually received

Not what you think was sent — what was actually sent. Peekr captures every message so you can catch malformed context before it reaches the model.

agent.run  2100ms
   └─ tool.fetch_user  12ms
         in:  {"user_id": 42}
         out: null                     ← returned null, agent didn't check
   └─ openai.chat [gpt-4o]  2088ms  4821tok
         in:  [{"role": "system", "content": "User profile: null..."}]
                                                   ^ LLM received garbage
🌀

"My agent is hallucinating"

Fix: score every output for groundedness, catch the regression

The Hallucination evaluator scores each LLM response against its context — 1.0 = every claim grounded, 0.0 = invented. Scores ride on the span and are queryable in SQL.

openai.chat [gpt-4o]  843ms  312tok
   in:  "The Eiffel Tower was completed in 1889 in Paris..."
   out: "Built in 1923 by Frank Lloyd Wright."     ← invented facts
   eval_scores: {Hallucination: 0.0, Rubric: 0.5}
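A minimal sketch of turning this on. The import path and the evaluators= argument are assumptions for illustration; only the Hallucination() evaluator itself is documented on this page:

import peekr
from peekr import Hallucination  # assumed import location

# hypothetical wiring: attach the evaluator at instrument time so every
# LLM span gets eval_scores recorded automatically
peekr.instrument(evaluators=[Hallucination()])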
⏱️

"My agent is too slow"

Fix: cache or parallelize the right thing

The trace shows exactly where time went. Most developers assume the LLM is the bottleneck and start swapping models. It almost never is.

agent.run  4300ms
   └─ tool.search_web  3800ms   ← 88% of total time. Cache this.
   └─ tool.rerank          18ms
   └─ openai.chat         490ms   ← not your problem
💸

"My API bill is too high"

Fix: summarize history after 5 turns, cut costs 60–80%

Token counts across traces reveal patterns invisible in your code. Growing tokens every run is the signature of unbounded history.

Trace 1:  18,432 tokens   · $0.018
Trace 2:  21,104 tokens   · $0.021
Trace 3:  24,891 tokens   · $0.025   ← unbounded growth
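The fix itself is plain Python, independent of peekr. A sketch, where summarize() is any cheap LLM call you supply:

def compact_history(messages: list[dict], summarize, max_turns: int = 5) -> list[dict]:
    # keep the system prompt plus the last max_turns user/assistant pairs;
    # fold everything older into a single summary message
    keep = 2 * max_turns
    if len(messages) <= 1 + keep:
        return messages
    system, old, recent = messages[0], messages[1:-keep], messages[-keep:]
    note = {"role": "system", "content": "Summary of earlier turns: " + summarize(old)}
    return [system, note, *recent]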
🐛

"It works locally but fails in prod"

Fix: the bug is in your data pipeline, not your agent

The trace shows what your tools actually returned — not what you assume they returned. Compare local and prod traces side by side.

# local   tool.fetch_inventory  8ms   out: [{"id": 1, "qty": 42}]
# prod    tool.fetch_inventory  8ms   out: []  ← empty. Data pipeline bug.
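
With SQLite storage the comparison is one query per environment. A sketch using the span schema from the SQL section below; the database file names are hypothetical:

import sqlite3

def tool_outputs(db_path: str, tool: str) -> list:
    # pull what a given tool actually returned, across every trace in the file
    con = sqlite3.connect(db_path)
    return con.execute(
        "SELECT trace_id, json_extract(attributes, '$.output') "
        "FROM spans WHERE name = ?",
        (tool,),
    ).fetchall()

local = tool_outputs("local_traces.db", "tool.fetch_inventory")
prod = tool_outputs("prod_traces.db", "tool.fetch_inventory")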

Add it like a profiler. Remove it like a profiler.

Two lines for LLM calls. One decorator for your tools. Nothing else.

agent.py
import peekr
peekr.instrument()

from peekr import trace
import openai

@trace
def search_web(query: str) -> list:
    return fetch_results(query)

# LLM calls traced automatically
openai.chat.completions.create(
    model="gpt-4o",
    messages=[...]
)
peekr view --io traces.jsonl
Trace a3f2b1c0  1243ms  891tok
────────────────────────────────
agent.run  1243ms
  └─ tool.search_web  210ms
       in:  "climate policy"
       out: ["result1", ...]
  └─ openai.chat [gpt-4o]
       1033ms  891tok
       out: "Based on recent..."

An observability dashboard that tells you what to fix

One command. One self-contained HTML file. No server, no signup. Designed for RAG and memory/agent pipelines.

terminal
# one command
peekr dashboard traces.db -o report.html
open report.html
health hero · what's happening
 Hallucination health: 66/100
                needs attention
30 of 134 calls flagged. ↓ 12 pts
vs baseline 0.78.

› Hallucination dropped 27 pts
  from baseline.
› Worst channel: gpt-4o-mini · acme
  · /api/qa — 0.31 / 8 calls.
› 4 of 12 citations were invented.
🧭

Diagnoses common RAG failures

For each pattern detected: cause + what to try

The "Likely causes & next steps" panel runs eight diagnostic rules over your data and surfaces concrete fixes — retrieval misses, chunking issues, missing refusal prompts, channel concentration, error spikes.

[HIGH] Model is inventing citations (33%)
   Out of 12 citations in outputs, 4 don't
   appear in source context. Signature of
   a RAG flow where retrieval missed.

   What to try:
   1. Log retrieved chunks for a span
   2. "Cite only sources in context"
   3. Verify citations post-hoc
   4. Hybrid retrieval (BM25 + dense)
🔥

Localises regressions in one click

Channel × time heatmap shows where and when

Rows are your models, tenants, and endpoints. Columns are time buckets. Red cells mean that channel was hallucinating in that window. Click any cell to refilter the whole dashboard to that channel.

model              10:00 11:00 12:00 13:00 14:00
gpt-4o-mini        0.89  0.71  0.42  0.31  0.28
gpt-4o             0.91  0.88  0.76  0.81  0.79
claude-opus-4-5    0.93  0.92  0.94  0.91  0.92
                   green grounded → red hallucinating
🎯

Every flagged call ships with its own fix list

Per-span action box, computed from that call's failure pattern

Click a low point on the time-series and you jump to a worst-offender card showing the source context, the model's answer (with contradicted/unsupported claims highlighted), and a tailored "What to try for this call" panel.

#1 ⬤0.00  gpt-4o-mini · acme · /api/qa
Q: When was the Eiffel Tower built and by whom?

┌─SOURCE CONTEXT────────────┐ ┌─MODEL ANSWER──────────────┐
│ The Eiffel Tower was      │ │ Built in 1923 by          │
│ completed in 1889 for     │ │ Frank Lloyd Wright for    │
│ the Paris World's Fair... │ │ the London Olympics.      │
└───────────────────────────┘ └───────────────────────────┘

  contradicted  "1923"
  contradicted  "Frank Lloyd Wright"
  unsupported   "London Olympics"

  ▌ What to try for this call:
  • Numeric contradiction (1923) — add "be exact about dates"
  • Proper noun substitution — instruct not to substitute names
  • Move retrieved context closer to the question
Dashboard docs →

Scales with you

Start local. Graduate to SQLite when you need queries. Bring your own backend when you're ready.

Default

JSONL

Writes one span per line. Grep-able, diff-able, works everywhere. Perfect for local debugging.

peekr.instrument()
Queryable

SQLite

WAL mode for concurrent writes. Query every trace across runs directly with SQL, or feed it to peekr dashboard.

Advanced

Custom

Implement one export(span) method to ship spans to Datadog, your own backend, or anywhere else.

add_exporter(MyExporter())
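
A sketch of one. The span is treated as a JSON-serialisable dict and add_exporter() as a module-level function, both assumptions; only the export(span) contract comes from this page:

import json
import urllib.request

import peekr

class WebhookExporter:
    """Ships each finished span to your own HTTP endpoint (hypothetical URL)."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def export(self, span):
        # the one required method: serialise and POST the span
        req = urllib.request.Request(
            self.endpoint,
            data=json.dumps(span).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

peekr.add_exporter(WebhookExporter("https://logs.example.internal/spans"))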

Query your agent like a database

SQLite storage means every trace is queryable. No dashboard needed.

Find your slowest tool calls
SELECT name, ROUND(AVG(duration_ms)) avg_ms
FROM spans
WHERE name LIKE 'tool.%'
GROUP BY name
ORDER BY avg_ms DESC;
Track token spend by model
SELECT json_extract(attributes,'$.model') model,
       SUM(json_extract(attributes,'$.tokens_total')) tokens
FROM spans GROUP BY model;
Find all errors
SELECT name, trace_id,
       json_extract(attributes,'$.error') msg
FROM spans
WHERE status = 'error';
Trace cost growth over time
SELECT trace_id,
       SUM(json_extract(attributes,'$.tokens_total')) total
FROM spans
GROUP BY trace_id
ORDER BY MIN(start_time);
Find the worst hallucinations
SELECT trace_id,
       json_extract(attributes,'$.eval_scores.Hallucination') h,
       json_extract(attributes,'$.output') out
FROM spans
WHERE json_extract(attributes,'$.eval_scores.Hallucination') < 0.5
ORDER BY h ASC LIMIT 20;
Eval scores by model
SELECT json_extract(attributes,'$.model') model,
       AVG(json_extract(attributes,'$.eval_scores.Hallucination')) groundedness,
       AVG(json_extract(attributes,'$.eval_scores.Rubric')) quality
FROM spans GROUP BY model;

How it compares

Most alternatives either require an account, tie you to a framework, or don't capture LLM context at all.

Print statements · LangSmith · OpenTelemetry · peekr
Zero config
No account required
Works with any framework
Captures token counts
Captures inputs & outputs
Parent/child span tree
Queryable locally
Data stays on your machine

Everything you need. Nothing you don't.

From basic tracing to evals, experiments, and a data flywheel — all in one library.

⚡

Zero config

One call patches OpenAI, Anthropic, and Bedrock. No wrappers, no env vars, no accounts.

🔗

Automatic nesting

Spans link to parents via Python's ContextVar. Nesting works across async/await without manually passing a parent span around.
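
In miniature (not peekr's actual internals), the ContextVar trick looks like this: each async task inherits its own copy of the current-span pointer, so concurrent children still find the right parent:

import asyncio
from contextvars import ContextVar

current_span = ContextVar("current_span", default=None)

async def tool(name):
    parent = current_span.get()  # each task sees its own inherited value
    current_span.set(name)
    print(f"{name} nests under {parent}")

async def main():
    current_span.set("agent.run")
    await asyncio.gather(tool("tool.a"), tool("tool.b"))

asyncio.run(main())
# tool.a nests under agent.run
# tool.b nests under agent.run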

🗄️

SQLite storage

WAL mode for concurrent writes. Query traces across runs directly with SQL.

🎯

@trace decorator

Wrap any sync or async function. Captures inputs, outputs, latency, and errors.

👤

Session tracing

peekr.session(user_id="alice", tenant_id="acme") groups spans by user and customer org. Both first-class on every span.

🏢

Multi-tenant schema

tenant_id + retention_class are first-class columns — indexed in SQLite, top-level in JSONL. Filter without json_extract.

🚨

Alerts

Fires when error rate, latency, token spend, or cost crosses your threshold.

🧠

LLM-as-judge eval

Score every LLM response automatically with Rubric, NotEmpty, and NoError. Scores ride on each span as eval_scores.

🌀

Hallucination detection

Hallucination() scores how grounded each output is in its context (0.0–1.0). Plug in your retrieved docs for RAG flows.

🧬

RAGAS-style claim decomposition

Hallucination(detailed=True) splits outputs into atomic claims and gives each one a verdict: supported, contradicted, or unsupported.

📈

Observability dashboard

peekr dashboard emits one self-contained HTML file — health hero, channel-and-time heatmap, AI-generated recommendations, and per-call action items.

👍

Feedback + export

Rate traces good/bad. Export labelled data as OpenAI fine-tuning JSONL in one command.

🔬

A/B experiments

@peekr.experiment(variants=[...]) routes traffic and tags spans. Analyse results with SQL.
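
A sketch of the shape. How the chosen variant reaches your function isn't shown on this page, so the variant parameter here is an assumption:

import openai
import peekr

@peekr.experiment(variants=["gpt-4o", "gpt-4o-mini"])
def answer(question: str, variant: str) -> str:
    # assumed: the decorator picks a variant per call and passes it in;
    # spans from this call are tagged with that variant for later SQL analysis
    resp = openai.chat.completions.create(
        model=variant,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content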

⏮️

Trace replay

Re-run any stored trace with the same inputs. Debug production issues locally.

🔌

Custom exporters

One method to ship spans to Datadog, Grafana, or your own backend.

🔒

Privacy first

Nothing leaves your machine by default. Use capture_io=False for sensitive functions.
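
A sketch, assuming capture_io is a flag on the @trace decorator (the attachment point isn't spelled out above):

from peekr import trace

@trace(capture_io=False)
def charge_card(card_number: str, amount_cents: int) -> bool:
    # the span records name, latency, and status, but not the card number
    return True  # call your payment provider here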

COMING SOON

Peekr Cloud

The OSS SDK is MIT licensed forever — that's not changing. When a single-process trace file isn't the right fit any more (multiple services, a team that needs shared dashboards, longer retention, audit-grade trace storage), Peekr Cloud is the optional managed backend.

The wire format already ships in v0.3 — tenant_id and retention_class exist specifically so the spans you produce today work without modification when you connect:

import peekr

peekr.instrument(
    tenant_id="acme",
    exporter=peekr.HTTPExporter(
        endpoint="https://ingest.peekr.cloud",
        api_key="pk_live_…",
    ),
)
Join the waitlist →

If peekr saved you a debugging session,
give it a star ⭐

The SDK is MIT licensed and free forever; a star takes 3 seconds, helps other developers find peekr, and keeps the project growing.

⭐ Star on GitHub