Open Source · MIT · Free forever

Agents are black boxes.
Peekr opens them.

Like cProfile for your AI agent — with claim-level hallucination detection built in. Two lines. No account. No backend.

pip install peekr

Python 3.9+ · SDKs + frameworks · Sync & async · No signup

Works with

OpenAI: gpt-4o, o1, o3 · streaming ✓
Anthropic: Claude 3 / 4 · streaming ✓
AWS Bedrock: Converse API · streaming ✓
LangChain: chains · tools · agents · retrievers
LlamaIndex: query · retrieve · agent steps
CrewAI: crew · task · agent spans
Step 1

Instrument

Call peekr.instrument() once. It patches LLM SDKs (OpenAI, Anthropic, Bedrock) and agent frameworks (LangChain, LlamaIndex, CrewAI) automatically — no code changes anywhere else.

Step 2

Decorate your tools

Add @trace to your tool functions. They become child spans that nest under LLM calls automatically.

Step 3

Inspect

Run peekr view --io traces.jsonl to see the full tree — inputs, outputs, latency, token counts, and errors.

Four problems. One library.

Wrong answers, hallucinations, slow responses, and runaway costs — Peekr shows you exactly what to fix.

🔍

"My agent gave the wrong answer"

Fix: see what the LLM actually received

Malformed tool output is the silent killer. Peekr captures the full call tree — tool inputs, tool outputs, what the LLM got — so you find the mismatch in seconds, not hours.

agent.run  2100ms
   └─ tool.fetch_user  12ms
         in:  {"user_id": 42}
         out: null                     ← returned null, agent didn't check
   └─ openai.chat [gpt-4o]  2088ms  4821tok
         in:  [{"role": "system", "content": "User profile: null..."}]
                                                   ^ LLM received garbage
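
Once the trace has shown you the null, the fix usually lives in your own code rather than in Peekr. A minimal sketch of checking tool output before it reaches the prompt; `fetch_user` and `build_system_prompt` are hypothetical names used only for illustration:

```python
from typing import Optional


def fetch_user(user_id: int) -> Optional[dict]:
    """Hypothetical tool: returns None when the user does not exist."""
    users = {1: {"name": "Ada"}}
    return users.get(user_id)


def build_system_prompt(user_id: int) -> str:
    profile = fetch_user(user_id)
    if profile is None:
        # Fail loudly instead of interpolating "null" into the prompt.
        raise LookupError(f"no profile for user_id={user_id}")
    return f"User profile: {profile}"
```

Raising here turns a silent wrong answer into an error span you can see in the trace.
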
🌀

"My agent is hallucinating"

Fix: know which exact claim was wrong — not just a score

Peekr breaks every LLM response into individual claims and assigns each a verdict: supported, contradicted, or unsupported. You see exactly which claim failed, not just that something did.

openai.chat [gpt-4o]  843ms  312tok
   in:  "The Eiffel Tower was completed in 1889 in Paris..."
   out: "Built in 1923 by Frank Lloyd Wright."     ← invented facts
   eval_scores: {Hallucination: 0.0, Rubric: 0.5}
⏱️

"My agent is too slow"

Fix: the LLM is almost never the bottleneck

Most teams swap models first. The trace shows wall-clock time split across every tool and LLM call — 80% of the time the fix is a cache, not a different model.

agent.run  4308ms
   └─ tool.search_web  3800ms   ← 88% of total time. Cache this.
   └─ tool.rerank          18ms
   └─ openai.chat         490ms   ← not your problem
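
When the trace points at a slow, repeatable tool call, a cache is often the cheapest fix. A minimal sketch using only the standard library; `slow_search` is a hypothetical stand-in for your real network call:

```python
import time
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the slow path actually runs


def slow_search(query: str) -> tuple:
    """Hypothetical stand-in for a slow network call."""
    CALLS["count"] += 1
    time.sleep(0.05)
    return (f"result for {query}",)


@lru_cache(maxsize=256)
def search_web(query: str) -> tuple:
    # Return an immutable tuple so callers cannot mutate the cached value.
    return slow_search(query)
```

Note that `lru_cache` never expires entries on its own, so results that go stale need a time-based cache instead.
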
💸

"My API bill is too high"

Fix: token counts growing run-over-run means unbounded history

Growing tokens every run is the signature of unbounded conversation history. Peekr shows the slope, identifies the feature causing it, and tells you exactly where to cap it.

Trace 1:  18,432 tokens   · $0.018
Trace 2:  21,104 tokens   · $0.021
Trace 3:  24,891 tokens   · $0.025   ← unbounded growth
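
Capping the history is your code's job, not Peekr's. A minimal sketch of one way to do it, keeping the system message plus the newest turns that fit a token budget; `count_tokens` is a crude assumed heuristic, so swap in a real tokenizer for production use:

```python
def count_tokens(text: str) -> int:
    """Crude stand-in: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def cap_history(messages: list, budget: int) -> list:
    """Keep system messages plus the newest turns that fit in `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + list(reversed(kept))  # restore chronological order
```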

Add it like a profiler. Remove it like a profiler.

Two lines for LLM calls. One decorator for your tools. Nothing else.

agent.py
import peekr
peekr.instrument()

from peekr import trace
import openai

@trace
def search_web(query: str) -> list:
    # fetch_results is your own retrieval function, not part of peekr
    return fetch_results(query)

# LLM calls traced automatically
openai.chat.completions.create(
    model="gpt-4o",
    messages=[...]
)
peekr view --io traces.jsonl
Trace a3f2b1c0  1243ms  891tok
────────────────────────────────
agent.run  1243ms
  └─ tool.search_web  210ms
       in:  "climate policy"
       out: ["result1", ...]
  └─ openai.chat [gpt-4o]
       1033ms  891tok
       out: "Based on recent..."

An observability dashboard that tells you what to fix

One command. One self-contained HTML file. No server, no signup. Designed for RAG and memory/agent pipelines.

terminal
# one command
peekr dashboard traces.db -o report.html
open report.html
health hero · what's happening
 Hallucination health: 66/100
                needs attention
30 of 134 calls flagged. ↓ 12 pts
vs baseline 0.78.

› Hallucination health dropped 12 pts
  from baseline.
› Worst channel: gpt-4o-mini · acme
  · /api/qa — 0.31 / 8 calls.
› 4 of 12 citations were invented.
🧭

Diagnoses common RAG failures

For each pattern detected: cause + what to try

The "Likely causes & next steps" panel runs eight diagnostic rules over your data and surfaces concrete fixes — retrieval misses, chunking issues, missing refusal prompts, channel concentration, error spikes.

[HIGH] Model is inventing citations (33%)
   Of 12 citation patterns in outputs, 4
   don't appear in source context. Signature
   of a RAG flow where retrieval missed.

   What to try:
   1. Log retrieved chunks for a span
   2. "Cite only sources in context"
   3. Verify citations post-hoc
   4. Hybrid retrieval (BM25 + dense)
🔥

Localises regressions in one click

Channel × time heatmap shows where and when

Rows are your models, tenants, and endpoints. Columns are time buckets. Red cells mean hallucinating in that window. Click any cell to refilter the whole dashboard to that channel.

model              10:00 11:00 12:00 13:00 14:00
gpt-4o-mini        0.89  0.71  0.42  0.31  0.28
gpt-4o             0.91  0.88  0.76  0.81  0.79
claude-opus-4-5    0.93  0.92  0.94  0.91  0.92
                   green grounded → red hallucinating
🎯

Every flagged call ships with its own fix list

Per-span action box, computed from that call's failure pattern

Click a low point on the time-series and you jump to a worst-offender card showing the source context, the model's answer (with contradicted/unsupported claims highlighted), and a tailored "What to try for this call" panel.

#1 ⬤0.00  gpt-4o-mini · acme · /api/qa
Q: When was the Eiffel Tower built and by whom?

┌─SOURCE CONTEXT────────────┐ ┌─MODEL ANSWER──────────────┐
│ The Eiffel Tower was      │ │ Built in 1923 by          │
│ completed in 1889 for     │ │ Frank Lloyd Wright for    │
│ the Paris World's Fair... │ │ the London Olympics.      │
└───────────────────────────┘ └───────────────────────────┘

  contradicted  "1923"
  contradicted  "Frank Lloyd Wright"
  unsupported   "London Olympics"

  ▌ What to try for this call:
  • Numeric contradiction (1923) — add "be exact about dates"
  • Proper noun substitution — instruct not to substitute names
  • Move retrieved context closer to the question
Dashboard docs →

Scales with you

Start local. Graduate to SQLite when you need queries. Bring your own backend when you're ready.

Default

JSONL

Writes one span per line. Grep-able, diff-able, works everywhere. Perfect for local debugging.

peekr.instrument()
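
Because the file is one JSON object per line, a few lines of standard-library Python are enough to summarise it. A minimal sketch; the field names `name` and `duration_ms` are assumed to match the SQLite columns used in the queries on this page, so check them against your own traces.jsonl:

```python
import json
from collections import defaultdict


def slowest_spans(path: str, top: int = 3) -> list:
    """Return (span name, mean duration in ms) pairs, slowest first."""
    durations = defaultdict(list)
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            span = json.loads(line)
            # "name" and "duration_ms" are assumed field names.
            durations[span["name"]].append(span["duration_ms"])
    means = [(name, sum(v) / len(v)) for name, v in durations.items()]
    return sorted(means, key=lambda pair: pair[1], reverse=True)[:top]
```
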
Advanced

Custom

Implement one export(span) method to ship spans to Datadog, your own backend, or anywhere else.

add_exporter(MyExporter())
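
A minimal sketch of what such an exporter might look like. Only the `export(span)` method name comes from this page; `ListExporter` is a hypothetical class, and the span's concrete type is an assumption (a dict, or an object whose fields live in `__dict__`):

```python
import json


class ListExporter:
    """Hypothetical exporter: buffers spans in memory as JSON strings."""

    def __init__(self) -> None:
        self.sent = []

    def export(self, span) -> None:
        # Assumes the span is a dict or exposes its fields via __dict__;
        # check the real span type before relying on this.
        payload = span if isinstance(span, dict) else vars(span)
        self.sent.append(json.dumps(payload, default=str))
```

In a real exporter the `append` would be an HTTP call or a client-library write, and the instance would be registered with `add_exporter(...)` as shown above.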

Query your agent like a database

SQLite storage means every trace is queryable. No dashboard needed.

Find your slowest tool calls
SELECT name, ROUND(AVG(duration_ms)) avg_ms
FROM spans
GROUP BY name
ORDER BY avg_ms DESC;
Track token spend by model
SELECT json_extract(attributes,'$.model') model,
       SUM(json_extract(attributes,'$.tokens_total')) tokens
FROM spans GROUP BY model;
Find all errors
SELECT name, trace_id,
       json_extract(attributes,'$.error') msg
FROM spans
WHERE status = 'error';
Trace cost growth over time
SELECT trace_id,
       SUM(json_extract(attributes,'$.tokens_total')) total
FROM spans
GROUP BY trace_id
ORDER BY MIN(start_time);
Find the worst hallucinations
SELECT trace_id,
       json_extract(attributes,'$.eval_scores.Hallucination') h,
       json_extract(attributes,'$.output') out
FROM spans
WHERE h IS NOT NULL AND h < 0.5
ORDER BY h ASC LIMIT 20;
Eval scores by model
SELECT json_extract(attributes,'$.model') model,
       AVG(json_extract(attributes,'$.eval_scores.Hallucination')) groundedness,
       AVG(json_extract(attributes,'$.eval_scores.Rubric')) quality
FROM spans GROUP BY model;
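
The same queries run from Python with the standard `sqlite3` module. A minimal sketch against an in-memory database; the table layout here is an assumption inferred from the queries above, not Peekr's documented schema, and it needs a SQLite build with the JSON functions enabled:

```python
import json
import sqlite3

# Use sqlite3.connect("traces.db") to query real data instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spans (trace_id TEXT, name TEXT, duration_ms REAL, attributes TEXT)"
)
rows = [
    ("t1", "openai.chat", 490.0, json.dumps({"model": "gpt-4o", "tokens_total": 891})),
    ("t1", "tool.search_web", 3800.0, json.dumps({})),
    ("t2", "openai.chat", 510.0, json.dumps({"model": "gpt-4o", "tokens_total": 312})),
]
conn.executemany("INSERT INTO spans VALUES (?, ?, ?, ?)", rows)

# Token spend by model, skipping spans that carry no model attribute.
tokens_by_model = conn.execute(
    """
    SELECT json_extract(attributes, '$.model') AS model,
           SUM(json_extract(attributes, '$.tokens_total')) AS tokens
    FROM spans
    WHERE json_extract(attributes, '$.model') IS NOT NULL
    GROUP BY model
    """
).fetchall()
```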

How it compares

Most alternatives either require an account, tie you to a framework, or don't capture LLM context at all.

Compared: Helicone · LangSmith · OpenTelemetry · peekr, on these criteria:
Zero config
No account required
Works with any framework
Captures token counts
Captures inputs & outputs
Parent/child span tree
Queryable locally
Data stays on your machine

Everything you need. Nothing you don't.

From basic tracing to evals, experiments, and a data flywheel — all in one library.

Zero config

One call patches OpenAI, Anthropic, and Bedrock. No wrappers, no env vars, no accounts.

🔗

Automatic nesting

Spans link to parents via Python's ContextVar. Works across async/await without manual threading.

🗄️

SQLite storage

WAL mode for concurrent writes. Query traces across runs directly with SQL.

🎯

@trace decorator

Wrap any sync or async function. Captures inputs, outputs, latency, and errors.
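
To illustrate how one decorator can cover both kinds of function, here is a toy sketch. `toy_trace` and `RECORDS` are hypothetical and this is not Peekr's `@trace`; it only shows the sync/async branching:

```python
import asyncio
import functools
import inspect
import time

RECORDS = []


def toy_trace(fn):
    """Toy decorator: records inputs, output or error, and latency."""

    def record(args, kwargs, start, result=None, error=None):
        RECORDS.append({
            "name": fn.__name__,
            "in": {"args": args, "kwargs": kwargs},
            "out": result,
            "error": repr(error) if error is not None else None,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

    if inspect.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = await fn(*args, **kwargs)
            except Exception as exc:
                record(args, kwargs, start, error=exc)
                raise
            record(args, kwargs, start, result=result)
            return result
        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            record(args, kwargs, start, error=exc)
            raise
        record(args, kwargs, start, result=result)
        return result
    return sync_wrapper


@toy_trace
def add(a, b):
    return a + b


@toy_trace
async def double(x):
    await asyncio.sleep(0)
    return x * 2
```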

👤

Session tracing

peekr.session(user_id="alice", tenant_id="acme") groups spans by user and customer org. Both first-class on every span.

🏢

Multi-tenant schema

tenant_id + retention_class are first-class columns — indexed in SQLite, top-level in JSONL. Filter without json_extract.

🚨

Alerts

Fire when error rate, latency, token spend, or cost spikes cross your threshold.

🧠

LLM-as-judge eval

Score every LLM response automatically with Rubric, NotEmpty, and NoError. Scores ride on each span as eval_scores.

🌀

Hallucination detection

Hallucination() scores how grounded each output is in its context (0.0–1.0). Plug in your retrieved docs for RAG flows.

🧬

RAGAS-style claim decomposition

Hallucination(detailed=True) splits outputs into atomic claims and verdicts each one — supported, contradicted, or unsupported.
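
To make the per-claim output concrete, here is a sketch of how verdicts could roll up into a single score. The scoring rule (fraction of supported claims) is an assumption for illustration; this page does not state the formula Peekr uses:

```python
from collections import Counter

# Verdicts for the Eiffel Tower answer shown on this page.
claims = [
    ("1923", "contradicted"),
    ("Frank Lloyd Wright", "contradicted"),
    ("London Olympics", "unsupported"),
]


def groundedness(verdicts: list) -> float:
    """Assumed scoring rule: fraction of claims marked 'supported'."""
    if not verdicts:
        return 1.0  # nothing asserted, nothing to contradict
    counts = Counter(verdict for _, verdict in verdicts)
    return counts["supported"] / len(verdicts)


def failed_claims(verdicts: list) -> list:
    """Return the text of every claim that is not supported."""
    return [text for text, verdict in verdicts if verdict != "supported"]
```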

📈

Observability dashboard

peekr dashboard emits one self-contained HTML file — health hero, channel-and-time heatmap, AI-generated recommendations, and per-call action items.

👍

Feedback + export

Rate traces good/bad. Export labelled data as OpenAI fine-tuning JSONL in one command.
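
For reference, OpenAI's chat fine-tuning format is one JSON object per line with a `messages` list. A minimal sketch of the conversion; the trace fields `rating`, `input`, and `output` are assumed names, not Peekr's documented schema:

```python
import json


def to_finetune_jsonl(traces: list) -> str:
    """Serialise traces rated 'good' as OpenAI chat fine-tuning JSONL."""
    lines = []
    for trace in traces:
        if trace.get("rating") != "good":  # "rating" is an assumed field name
            continue
        record = {
            "messages": [
                {"role": "user", "content": trace["input"]},
                {"role": "assistant", "content": trace["output"]},
            ]
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```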

🔬

A/B experiments

@peekr.experiment(variants=[...]) routes traffic and tags spans. Analyse results with SQL.
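
How Peekr assigns variants is not described here. One common approach, sketched below as an assumption, is to hash a stable key so the same user always lands in the same variant; `pick_variant` is a hypothetical helper:

```python
import hashlib


def pick_variant(key: str, variants: list) -> str:
    """Deterministically map a key (for example a user id) to one variant."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[index]
```

Hashing rather than random choice keeps assignment sticky across processes and restarts without storing any state.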

⏮️

Trace replay

Re-run any stored trace with the same inputs. Debug production issues locally.

🔌

Custom exporters

One method to ship spans to Datadog, Grafana, or your own backend.

🔒

Privacy first

Nothing leaves your machine by default. Use capture_io=False for sensitive functions.

🛡️

Guardrails

PIIRedact, Blocklist, HallucinationBlock — enforce rules on inputs and outputs. Pre-call blocking, post-call violation recording.
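
As an illustration of what a redaction guardrail does, here is a small regex-based sketch. It is not Peekr's `PIIRedact`; the patterns are deliberately simple assumptions and will miss many real-world formats:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace e-mail addresses and US SSN-shaped numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_SSN.sub("[SSN]", text)
```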

⚖️

Compliance packs (Cloud)

17 regulatory packs — FDCPA, HIPAA, FINRA, GDPR, EU AI Act, UAE PDPL, KSA PDPL and more. Enforced on every LLM call. Rules update without SDK changes.

🌐

FastAPI middleware

app.add_middleware(peekr.FastAPIMiddleware) — one line gives every request a root span. All LLM calls nest under it automatically as children.

Peekr Cloud

The OSS SDK is MIT licensed forever — that's not changing. When a single-process file isn't the right fit any more (multiple services, a team that needs shared dashboards, longer retention, audit-grade trace storage), Peekr Cloud is the optional managed backend.

The wire format is the same — tenant_id and retention_class are first-class on every span, so the traces you produce locally ship to the Cloud dashboard unchanged:

import peekr

peekr.instrument(
    tenant_id="acme",
    exporter=peekr.HTTPExporter(
        endpoint="https://ingest.peekr.starkspherelabs.com",
        api_key="pk_live_…",
    ),
)
Sign up free →

If peekr saved you a debugging session,
give it a star ⭐

The SDK is open source and free forever, and starring it takes 3 seconds. A star helps other developers find it and keeps the project growing.
