Open Source · MIT · Free forever

Agents are black boxes.
Peekr makes them transparent.

Like cProfile — but for AI agents. Every LLM call, tool invocation, token count, and error recorded as a tree you can inspect. Two lines to add, no backend required.

Get started → · ⭐ Star on GitHub · Read the docs
pip install peekr

Python 3.9+ · SDKs + frameworks · Sync & async · No signup

Works with

OpenAI gpt-4o, o1, o3 · streaming ✓
Anthropic Claude 3 / 4 · streaming ✓
AWS Bedrock Converse API · streaming ✓
LangChain chains · tools · agents · retrievers
LlamaIndex query · retrieve · agent steps
CrewAI crew · task · agent spans
Step 1

Instrument

Call peekr.instrument() once. It patches LLM SDKs (OpenAI, Anthropic, Bedrock) and agent frameworks (LangChain, LlamaIndex, CrewAI) automatically — no code changes anywhere else.
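
In code, that's the whole of step 1:

import peekr
peekr.instrument()  # patches OpenAI, Anthropic, Bedrock, and installed frameworks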

Step 2

Decorate your tools

Add @trace to your tool functions. They become child spans that nest under LLM calls automatically.
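
For example:

from peekr import trace

@trace
def fetch_user(user_id: int) -> dict:
    # shows up as a tool.fetch_user child span under the active LLM call
    return {"user_id": user_id, "name": "Ada"}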

Step 3

Inspect

Run peekr view --io traces.jsonl to see the full tree — inputs, outputs, latency, token counts, and errors.

A profiler for every layer of your agent

CPU profilers only see timing. Agents fail for reasons that have nothing to do with performance.

🔍

"My agent gave the wrong answer"

Fix: see what the LLM actually received

Not what you think was sent — what was actually sent. Peekr captures every message so you can catch malformed context before it reaches the model.

agent.run  2100ms
   └─ tool.fetch_user  12ms
         in:  {"user_id": 42}
         out: null                     ← returned null, agent didn't check
   └─ openai.chat [gpt-4o]  2088ms  4821tok
         in:  [{"role": "system", "content": "User profile: null..."}]
                                                   ^ LLM received garbage
🌀

"My agent is hallucinating"

Fix: score every output for groundedness, catch the regression

The Hallucination evaluator scores each LLM response against its context — 1.0 = every claim grounded, 0.0 = invented. Scores ride on the span and are queryable in SQL.

openai.chat [gpt-4o]  843ms  312tok
   in:  "The Eiffel Tower was completed in 1889 in Paris..."
   out: "Built in 1923 by Frank Lloyd Wright."     ← invented facts
   eval_scores: {Hallucination: 0.0, Rubric: 0.5}
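A minimal sketch of turning this on. The import path and the evaluators= argument are assumptions for illustration; only the Hallucination() evaluator itself is documented on this page:

import peekr
from peekr import Hallucination  # assumed import location

# hypothetical wiring: attach the evaluator at instrument time so every
# LLM span gets eval_scores recorded automatically
peekr.instrument(evaluators=[Hallucination()])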
⏱️

"My agent is too slow"

Fix: cache or parallelize the right thing

The trace shows exactly where time went. Most developers assume the LLM is the bottleneck and start swapping models. It almost never is.

agent.run  4300ms
   └─ tool.search_web  3800ms   ← 88% of total time. Cache this.
   └─ tool.rerank          18ms
   └─ openai.chat         490ms   ← not your problem
💸

"My API bill is too high"

Fix: summarize history after 5 turns, cut costs 60–80%

Token counts across traces reveal patterns invisible in your code. Growing tokens every run is the signature of unbounded history.

Trace 1:  18,432 tokens   · $0.018
Trace 2:  21,104 tokens   · $0.021
Trace 3:  24,891 tokens   · $0.025   ← unbounded growth
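The fix itself is plain Python, independent of peekr. A sketch, where summarize() is any cheap LLM call you supply:

def compact_history(messages: list[dict], summarize, max_turns: int = 5) -> list[dict]:
    # keep the system prompt plus the last max_turns user/assistant pairs;
    # fold everything older into a single summary message
    keep = 2 * max_turns
    if len(messages) <= 1 + keep:
        return messages
    system, old, recent = messages[0], messages[1:-keep], messages[-keep:]
    note = {"role": "system", "content": "Summary of earlier turns: " + summarize(old)}
    return [system, note, *recent]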
🐛

"It works locally but fails in prod"

Fix: the bug is in your data pipeline, not your agent

The trace shows what your tools actually returned — not what you assume they returned. Compare local and prod traces side by side.

# local   tool.fetch_inventory  8ms   out: [{"id": 1, "qty": 42}]
# prod    tool.fetch_inventory  8ms   out: []  ← empty. Data pipeline bug.
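
With SQLite storage the comparison is one query per environment. A sketch using the span schema from the SQL section below; the database file names are hypothetical:

import sqlite3

def tool_outputs(db_path: str, tool: str) -> list:
    # pull what a given tool actually returned, across every trace in the file
    con = sqlite3.connect(db_path)
    return con.execute(
        "SELECT trace_id, json_extract(attributes, '$.output') "
        "FROM spans WHERE name = ?",
        (tool,),
    ).fetchall()

local = tool_outputs("local_traces.db", "tool.fetch_inventory")
prod = tool_outputs("prod_traces.db", "tool.fetch_inventory")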

Add it like a profiler. Remove it like a profiler.

Two lines for LLM calls. One decorator for your tools. Nothing else.

agent.py
import peekr
peekr.instrument()

from peekr import trace
import openai

@trace
def search_web(query: str) -> list:
    return fetch_results(query)

# LLM calls traced automatically
openai.chat.completions.create(
    model="gpt-4o",
    messages=[...]
)
peekr view --io traces.jsonl
Trace a3f2b1c0  1243ms  891tok
────────────────────────────────
agent.run  1243ms
  └─ tool.search_web  210ms
       in:  "climate policy"
       out: ["result1", ...]
  └─ openai.chat [gpt-4o]
       1033ms  891tok
       out: "Based on recent..."

An observability dashboard that tells you what to fix

One command. One self-contained HTML file. No server, no signup. Designed for RAG and memory/agent pipelines.

terminal
# one command
peekr dashboard traces.db -o report.html
open report.html
health hero · what's happening
 Hallucination health: 66/100
                needs attention
30 of 134 calls flagged. ↓ 12 pts
vs baseline 0.78.

› Hallucination dropped 27 pts
  from baseline.
› Worst channel: gpt-4o-mini · acme
  · /api/qa — 0.31 / 8 calls.
› 4 of 12 citations were invented.
🧭

Diagnoses common RAG failures

For each pattern detected: cause + what to try

The "Likely causes & next steps" panel runs eight diagnostic rules over your data and surfaces concrete fixes — retrieval misses, chunking issues, missing refusal prompts, channel concentration, error spikes.

[HIGH] Model is inventing citations (33%)
   Out of 12 citations in outputs, 4 don't
   appear in source context. Signature of
   a RAG flow where retrieval missed.

   What to try:
   1. Log retrieved chunks for a span
   2. "Cite only sources in context"
   3. Verify citations post-hoc
   4. Hybrid retrieval (BM25 + dense)
🔥

Localises regressions in one click

Channel × time heatmap shows where and when

Rows are your models, tenants, and endpoints. Columns are time buckets. Red cells mean that channel was hallucinating in that window. Click any cell to refilter the whole dashboard to that channel.

model              10:00 11:00 12:00 13:00 14:00
gpt-4o-mini        0.89  0.71  0.42  0.31  0.28
gpt-4o             0.91  0.88  0.76  0.81  0.79
claude-opus-4-5    0.93  0.92  0.94  0.91  0.92
                   green grounded → red hallucinating
🎯

Every flagged call ships with its own fix list

Per-span action box, computed from that call's failure pattern

Click a low point on the time-series and you jump to a worst-offender card showing the source context, the model's answer (with contradicted/unsupported claims highlighted), and a tailored "What to try for this call" panel.

#1 ⬤0.00  gpt-4o-mini · acme · /api/qa
Q: When was the Eiffel Tower built and by whom?

┌─SOURCE CONTEXT────────────┐ ┌─MODEL ANSWER──────────────┐
│ The Eiffel Tower was      │ │ Built in 1923 by          │
│ completed in 1889 for     │ │ Frank Lloyd Wright for    │
│ the Paris World's Fair... │ │ the London Olympics.      │
└───────────────────────────┘ └───────────────────────────┘

  contradicted  "1923"
  contradicted  "Frank Lloyd Wright"
  unsupported   "London Olympics"

  ▌ What to try for this call:
  • Numeric contradiction (1923) — add "be exact about dates"
  • Proper noun substitution — instruct not to substitute names
  • Move retrieved context closer to the question
Dashboard docs →

Scales with you

Start local. Graduate to SQLite when you need queries. Bring your own backend when you're ready.

Default

JSONL

Writes one span per line. Grep-able, diff-able, works everywhere. Perfect for local debugging.

peekr.instrument()
Queryable

SQLite

WAL mode for concurrent writes. Query every trace across runs directly with SQL, or feed it to peekr dashboard.

Advanced

Custom

Implement one export(span) method to ship spans to Datadog, your own backend, or anywhere else.

add_exporter(MyExporter())
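
A sketch of one. The span is treated as a JSON-serialisable dict and add_exporter() as a module-level function, both assumptions; only the export(span) contract comes from this page:

import json
import urllib.request

import peekr

class WebhookExporter:
    """Ships each finished span to your own HTTP endpoint (hypothetical URL)."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def export(self, span):
        # the one required method: serialise and POST the span
        req = urllib.request.Request(
            self.endpoint,
            data=json.dumps(span).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

peekr.add_exporter(WebhookExporter("https://logs.example.internal/spans"))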

Query your agent like a database

SQLite storage means every trace is queryable. No dashboard needed.

Find your slowest tool calls
SELECT name, ROUND(AVG(duration_ms)) avg_ms
FROM spans
WHERE name LIKE 'tool.%'
GROUP BY name
ORDER BY avg_ms DESC;
Track token spend by model
SELECT json_extract(attributes,'$.model') model,
       SUM(json_extract(attributes,'$.tokens_total')) tokens
FROM spans GROUP BY model;
Find all errors
SELECT name, trace_id,
       json_extract(attributes,'$.error') msg
FROM spans
WHERE status = 'error';
Trace cost growth over time
SELECT trace_id,
       SUM(json_extract(attributes,'$.tokens_total')) total
FROM spans
GROUP BY trace_id
ORDER BY MIN(start_time);
Find the worst hallucinations
SELECT trace_id,
       json_extract(attributes,'$.eval_scores.Hallucination') h,
       json_extract(attributes,'$.output') out
FROM spans
WHERE json_extract(attributes,'$.eval_scores.Hallucination') < 0.5
ORDER BY h ASC LIMIT 20;
Eval scores by model
SELECT json_extract(attributes,'$.model') model,
       AVG(json_extract(attributes,'$.eval_scores.Hallucination')) groundedness,
       AVG(json_extract(attributes,'$.eval_scores.Rubric')) quality
FROM spans GROUP BY model;

How it compares

Most alternatives either require an account, tie you to a framework, or don't capture LLM context at all.

Print statements · LangSmith · OpenTelemetry · peekr
Zero config
No account required
Works with any framework
Captures token counts
Captures inputs & outputs
Parent/child span tree
Queryable locally
Data stays on your machine

Everything you need. Nothing you don't.

From basic tracing to evals, experiments, and a data flywheel — all in one library.

⚡

Zero config

One call patches OpenAI, Anthropic, and Bedrock. No wrappers, no env vars, no accounts.

🔗

Automatic nesting

Spans link to parents via Python's ContextVar. Nesting works across async/await without manually passing a parent span around.
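
In miniature (not peekr's actual internals), the ContextVar trick looks like this: each async task inherits its own copy of the current-span pointer, so concurrent children still find the right parent:

import asyncio
from contextvars import ContextVar

current_span = ContextVar("current_span", default=None)

async def tool(name):
    parent = current_span.get()  # each task sees its own inherited value
    current_span.set(name)
    print(f"{name} nests under {parent}")

async def main():
    current_span.set("agent.run")
    await asyncio.gather(tool("tool.a"), tool("tool.b"))

asyncio.run(main())
# tool.a nests under agent.run
# tool.b nests under agent.run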

🗄️

SQLite storage

WAL mode for concurrent writes. Query traces across runs directly with SQL.

🎯

@trace decorator

Wrap any sync or async function. Captures inputs, outputs, latency, and errors.

👤

Session tracing

peekr.session(user_id="alice", tenant_id="acme") groups spans by user and customer org. Both first-class on every span.

🏢

Multi-tenant schema

tenant_id + retention_class are first-class columns — indexed in SQLite, top-level in JSONL. Filter without json_extract.

🚨

Alerts

Fires when error rate, latency, token spend, or cost crosses your threshold.

🧠

LLM-as-judge eval

Score every LLM response automatically with Rubric, NotEmpty, and NoError. Scores ride on each span as eval_scores.

🌀

Hallucination detection

Hallucination() scores how grounded each output is in its context (0.0–1.0). Plug in your retrieved docs for RAG flows.

🧬

RAGAS-style claim decomposition

Hallucination(detailed=True) splits outputs into atomic claims and gives each one a verdict: supported, contradicted, or unsupported.

📈

Observability dashboard

peekr dashboard emits one self-contained HTML file — health hero, channel-and-time heatmap, AI-generated recommendations, and per-call action items.

👍

Feedback + export

Rate traces good/bad. Export labelled data as OpenAI fine-tuning JSONL in one command.

🔬

A/B experiments

@peekr.experiment(variants=[...]) routes traffic and tags spans. Analyse results with SQL.
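
A sketch of the shape. How the chosen variant reaches your function isn't shown on this page, so the variant parameter here is an assumption:

import openai
import peekr

@peekr.experiment(variants=["gpt-4o", "gpt-4o-mini"])
def answer(question: str, variant: str) -> str:
    # assumed: the decorator picks a variant per call and passes it in;
    # spans from this call are tagged with that variant for later SQL analysis
    resp = openai.chat.completions.create(
        model=variant,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content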

⏮️

Trace replay

Re-run any stored trace with the same inputs. Debug production issues locally.

🔌

Custom exporters

One method to ship spans to Datadog, Grafana, or your own backend.

🔒

Privacy first

Nothing leaves your machine by default. Use capture_io=False for sensitive functions.
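
A sketch, assuming capture_io is a flag on the @trace decorator (the attachment point isn't spelled out above):

from peekr import trace

@trace(capture_io=False)
def charge_card(card_number: str, amount_cents: int) -> bool:
    # the span records name, latency, and status, but not the card number
    return True  # call your payment provider here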

COMING SOON

Peekr Cloud

The OSS SDK is MIT licensed forever — that's not changing. When a single-process trace file isn't the right fit any more (multiple services, a team that needs shared dashboards, longer retention, audit-grade trace storage), Peekr Cloud is the optional managed backend.

The wire format already ships in v0.3 — tenant_id and retention_class exist specifically so the spans you produce today work without modification when you connect:

import peekr

peekr.instrument(
    tenant_id="acme",
    exporter=peekr.HTTPExporter(
        endpoint="https://ingest.peekr.cloud",
        api_key="pk_live_…",
    ),
)
Join the waitlist →

If peekr saved you a debugging session,
give it a star ⭐

The SDK is MIT licensed and free forever; a star takes 3 seconds, helps other developers find peekr, and keeps the project growing.

⭐ Star on GitHub