Documentation

Peekr records every LLM call, tool invocation, token count, and error as a tree you can inspect. This page covers everything from installation to advanced usage.


Installation

terminal
pip install peekr                # base — no LLM SDK required
pip install "peekr[openai]"      # with OpenAI
pip install "peekr[anthropic]"   # with Anthropic
pip install "peekr[all]"         # both

Requires Python 3.9+. No accounts, no backend, no environment variables.

Quickstart

Add two lines before your agent runs:

agent.py
import peekr
peekr.instrument()

import openai

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

Every LLM call is now traced automatically. Peekr writes to traces.jsonl and prints to the console.

View your traces:

terminal
peekr view traces.jsonl        # tree view
peekr view --io traces.jsonl   # include inputs and outputs
output
Trace a3f2b1c0  843ms  312tok
────────────────────────────────────────────────
openai.chat.completions [gpt-4o]  843ms  312tok

Example: Debug a wrong answer

Your agent returns an incorrect response and you don't know why. Add @trace to your tool functions and run with --io:

agent.py
import peekr
peekr.instrument()

import openai
from peekr import trace

@trace
def fetch_user(user_id: int) -> dict:
    return db.get(user_id)  # returns None if not found

@trace(name="agent.run")
def run(user_id: int):
    user = fetch_user(user_id)  # bug: no null check before passing to LLM
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": f"User: {user}"}]
    )
peekr view --io traces.jsonl
agent.run  2100ms
└─ tool.fetch_user  12ms
   in:  {"args": [42], "kwargs": {}}
   out: null                                      ← found it
└─ openai.chat.completions [gpt-4o]  2088ms
   in: [{"role": "system", "content": "User: null..."}]

The LLM received null as the user object. The fix is a null check in run(), not a prompt change.

Example: Find slow steps

Wrap every step in your agent with @trace and look at the durations:

agent.py
import openai
from peekr import trace

@trace
def search_web(query: str) -> list: ...

@trace
def rerank_results(results: list) -> list: ...

@trace(name="agent.run")
def run(query: str):
    results = search_web(query)
    ranked = rerank_results(results)
    return openai.chat.completions.create(...)
peekr view traces.jsonl
agent.run  4300ms
└─ tool.search_web  3800ms     ← 88% of time
└─ tool.rerank_results  18ms
└─ openai.chat  490ms

Cache search_web results for repeated queries, or run it in parallel with other setup work. The LLM is not the bottleneck.
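For the caching route, the standard library is often enough. A minimal sketch using functools.lru_cache — the search function here is a hypothetical stand-in, and this assumes identical queries may safely reuse results:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def search_web(query: str) -> tuple:
    # Stand-in for the real (slow) web search. Returns a tuple so the
    # cached value is immutable; lru_cache keys on the hashable args.
    return (f"result for {query}",)

search_web("llm tracing")            # miss — does the slow work
search_web("llm tracing")            # hit — served from the cache
print(search_web.cache_info().hits)  # → 1
```

If you combine this with @trace, decorator order matters: put @trace outermost to record every call, or innermost to record only cache misses.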

Example: Reduce token costs

Run your agent a few times on the same task and compare token counts across traces:

terminal
peekr view traces.jsonl
output
Trace a3f2b1c0  18,432 tokens
Trace b2e4c8f1  21,104 tokens
Trace c5d9e2a7  24,891 tokens

A token count that grows on every run is the signature of unbounded history — the agent appends every message to the next call. Fix: summarize or truncate the conversation after a fixed number of turns.

agent.py
# Before: growing history
messages = conversation_history  # gets longer every turn

# After: summarize after 5 turns
if len(conversation_history) > 10:  # 5 exchanges = 10 messages
    summary = summarize(conversation_history)
    messages = [{"role": "system", "content": summary}]

Example: Prod vs local bugs

Your agent passes tests locally but fails in production. Capture traces in both environments and compare tool outputs:

agent.py
@trace
def fetch_inventory(sku: str) -> list:
    return inventory_api.get(sku)
local trace
tool.fetch_inventory  8ms
in:  {"sku": "ABC-123"}
out: [{"id": 1, "qty": 42}]   ← data present locally
prod trace
tool.fetch_inventory  8ms
in:  {"sku": "ABC-123"}
out: []                       ← empty in prod

The agent logic is identical. The inventory API returns different data in prod — likely a missing data migration or environment-specific config. Fix the data source, not the agent.

instrument()

Call once, before any LLM calls. Patches the OpenAI and Anthropic SDKs.

python
peekr.instrument(
    console=True,               # print spans live (default: True)
    storage="jsonl",            # "jsonl" | "sqlite" | "both"
    jsonl_path="traces.jsonl",  # JSONL output path
    db_path="traces.db",        # SQLite output path
)
Parameter    Type   Default          Description
console      bool   True             Print each span to stdout as it completes
storage      str    "jsonl"          "jsonl", "sqlite", or "both"
jsonl_path   str    "traces.jsonl"   Path for JSONL output
db_path      str    "traces.db"      Path for SQLite output

SQLite storage

SQLite uses WAL mode so multiple processes — Docker containers, CI workers, parallel agents — can write spans safely at the same time. And because it's a real database, you can query across all your runs without any extra tooling.

python
# Enable SQLite
peekr.instrument(storage="sqlite")

# Write to both JSONL and SQLite
peekr.instrument(storage="both")

View with the same CLI command:

terminal
peekr view traces.db
peekr view --io traces.db

Or query directly with any SQLite client:

terminal — useful queries
# Slowest tool calls
sqlite3 traces.db "SELECT name, ROUND(AVG(duration_ms)) avg_ms
                   FROM spans GROUP BY name ORDER BY avg_ms DESC"

# Token spend by model
sqlite3 traces.db "SELECT json_extract(attributes,'$.model') model,
                          SUM(json_extract(attributes,'$.tokens_total')) tokens
                   FROM spans GROUP BY model"

# All errors
sqlite3 traces.db "SELECT name, trace_id, json_extract(attributes,'$.error') msg
                   FROM spans WHERE status = 'error'"

# Token growth over time (detect unbounded history)
sqlite3 traces.db "SELECT trace_id, SUM(json_extract(attributes,'$.tokens_total')) total
                   FROM spans GROUP BY trace_id ORDER BY start_time"
SQLite is ideal for Docker and CI where multiple processes share a single file. JSONL is better for quick local debugging where you want to grep or tail -f.
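Plain Python works for quick JSONL filtering too. A minimal sketch, assuming each line is one span object with a top-level status field as described in the span-fields table below:

```python
import json

def iter_error_spans(path: str):
    """Yield spans with status 'error' from a JSONL trace file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:          # tolerate blank lines
                continue
            span = json.loads(line)
            if span.get("status") == "error":
                yield span

# Usage:
#   for span in iter_error_spans("traces.jsonl"):
#       print(span["name"])
```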

@trace decorator

Wraps a function as a span. Works on sync and async functions.

python
from peekr import trace

# Auto-names from module.function
@trace
def search_web(query: str) -> list: ...

# Custom name
@trace(name="tool.search")
def search(query: str) -> list: ...

# Opt out of capturing inputs/outputs (latency + status still recorded)
@trace(capture_io=False)
def fetch_api_key() -> str: ...

# Async
@trace
async def fetch_user(user_id: int) -> dict: ...
Parameter    Type         Default           Description
name         str | None   module.function   Custom span name
capture_io   bool         True              Record function args and return value
Inputs and outputs are serialized to JSON and truncated at 500 characters. Use capture_io=False for functions that handle secrets or large payloads.
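The capture step can be pictured roughly like this — a sketch of the documented truncation behaviour, not Peekr's actual internals:

```python
import json

LIMIT = 500  # documented truncation limit

def serialize_io(value, limit: int = LIMIT) -> str:
    """Serialize a captured value to JSON, falling back to str()
    for non-serializable objects, then truncate to the limit."""
    try:
        text = json.dumps(value, default=str)
    except (TypeError, ValueError):
        text = str(value)
    return text if len(text) <= limit else text[:limit]

print(serialize_io({"args": [42], "kwargs": {}}))  # short payloads pass through
print(len(serialize_io("x" * 10_000)))             # → 500
```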

Manual spans

For cases where a decorator doesn't fit — e.g. a loop, a context manager, or code you can't modify:

python
from peekr import start_span, end_span

span, token = start_span("my.operation")
span.attributes["custom_key"] = "value"
try:
    result = do_work()
    span.status = "ok"
except Exception as e:
    span.status = "error"
    span.attributes["error"] = str(e)
    raise
finally:
    end_span(span, token)  # always call — even on error
Any spans started inside do_work() will automatically nest as children of this span.

CLI viewer

terminal
peekr view traces.jsonl        # tree view
peekr view --io traces.jsonl   # + inputs and outputs

Each trace is shown as a tree grouped by trace_id. The --io flag prints up to 120 characters of the serialized input and output for each span.

Custom exporters

Any object with an export(span) method works as an exporter:

python
import requests

import peekr
from peekr.exporters import add_exporter

class HttpExporter:
    def export(self, span):
        requests.post(
            "https://your-backend.com/spans",
            json=span.to_dict()
        )

peekr.instrument()
add_exporter(HttpExporter())

Multiple exporters can be active at once. The built-in JSONLExporter and ConsoleExporter are added by instrument(). You can add your own on top.

Span fields

Every span written to traces.jsonl is a JSON object with these fields:

Field                          Type             Description
trace_id                       string           Groups all spans in one agent run
span_id                        string           Unique ID for this span
parent_id                      string | null    ID of the parent span, or null for a root span
name                           string           Span name
start_time                     float            Unix timestamp
end_time                       float            Unix timestamp
duration_ms                    float            Wall-clock duration in milliseconds
status                         "ok" | "error"   Whether the span succeeded
tenant_id                      string | null    Customer org (B2B). First-class — a top-level
                                                column in SQLite and a top-level key in JSONL.
                                                Set via peekr.session(tenant_id=...),
                                                instrument(tenant_id=...), or env PEEKR_TENANT_ID.
retention_class                string | null    Storage-tier hint (e.g. "default", "short",
                                                "long", "pii"). OSS stores it; your storage
                                                tier interprets it.
attributes.model               string           LLM model name (auto-captured)
attributes.tokens_input        int              Prompt tokens (auto-captured)
attributes.tokens_output       int              Completion tokens (auto-captured)
attributes.tokens_total        int              Total tokens (auto-captured)
attributes.input               string           Serialized function args (truncated)
attributes.output              string           Serialized return value (truncated)
attributes.error               string           Exception message if status is "error"
attributes.session_id          string           Set when the span runs inside a peekr.session()
attributes.user_id             string           Set when the span runs inside a
                                                peekr.session(user_id=...)
attributes.eval_scores         dict             Evaluator name → score (0.0–1.0) when
                                                evaluators are configured
attributes.experiment_variant  string           Variant name when inside a @peekr.experiment

Sessions

Group all spans for a user, tenant, or conversation by passing identifiers to peekr.session(). Uses ContextVar so it propagates correctly across async code.

python
import peekr

with peekr.session(
    user_id="user_123",       # end-user (B2C)
    tenant_id="acme",         # customer org (B2B)
    session_id="sess_abc",    # auto-generated if omitted
    retention_class="long",   # storage-tier hint
):
    run_agent()

tenant_id and retention_class are first-class columns on the span — see Multi-tenant traces.

Query by user in SQLite:

sql
SELECT trace_id,
       AVG(duration_ms),
       SUM(json_extract(attributes,'$.tokens_total'))
FROM spans
WHERE json_extract(attributes,'$.user_id') = 'user_123'
GROUP BY trace_id;

Multi-tenant traces

Every span carries two first-class fields — tenant_id (the customer org) and retention_class (a storage-tier hint) — separate from user_id (the end-user). A B2B agent can tag both without conflict.

python
import peekr

peekr.instrument(tenant_id="acme", retention_class="default")

with peekr.session(user_id="alice", tenant_id="acme", retention_class="long"):
    run_agent()

Resolution order, highest priority first:

  1. peekr.session(tenant_id=..., retention_class=...)
  2. peekr.instrument(tenant_id=..., retention_class=...)
  3. Env vars PEEKR_TENANT_ID / PEEKR_RETENTION_CLASS

Both fields are top-level columns in SQLite (with indices) and top-level keys in JSONL — no json_extract needed:

sql
SELECT tenant_id, COUNT(*) FROM spans GROUP BY tenant_id;

SELECT * FROM spans
WHERE retention_class = 'long' AND start_time > ?;

retention_class is a free-form string in the OSS SDK — recommended values are default, short, long, and pii. The meaning of each is enforced by your storage tier (or by Peekr Cloud).

Why first-class instead of attributes.tenant_id? So you can filter and index without JSON extraction — relevant the moment you have more than a handful of tenants or want to route ingestion. The SQLite exporter migrates pre-v0.3 databases automatically via PRAGMA user_version; legacy rows back-fill as NULL.

Alerts

Alerts fire after each complete trace (identified by the root span). Pass them to instrument():

python
peekr.instrument(alerts=[
    peekr.alert.ErrorRate(threshold=0.05, window=100),  # >5% errors in last 100 traces
    peekr.alert.CostSpike(multiplier=2.0),              # tokens 2× above rolling avg
    peekr.alert.LatencyP95(ms=5000),                    # p95 latency > 5s
    peekr.alert.TokenGrowth(runs=5),                    # growing 5 consecutive runs
])

Override on_trigger to send to Slack, PagerDuty, or anywhere:

python
import peekr

class SlackAlert(peekr.alert.ErrorRate):
    def on_trigger(self, message: str) -> None:
        slack.send(f"#alerts: {message}")

peekr.instrument(alerts=[SlackAlert(threshold=0.05)])
Alert        Triggers when                                      Key params
ErrorRate    Error % in last N traces > threshold               threshold, window=100
CostSpike    This trace's tokens > multiplier × rolling avg     multiplier, window=50
LatencyP95   p95 span latency in trace > ms                     ms
TokenGrowth  Token count strictly increasing for N runs         runs=5

Eval — LLM-as-judge

Evaluators run after each LLM span completes and write scores to span.attributes["eval_scores"]. A _in_eval guard prevents infinite recursion.

python
peekr.instrument(evaluators=[
    peekr.eval.Rubric("Be concise and factually accurate"),
    peekr.eval.Hallucination(),  # groundedness check (see below)
    peekr.eval.NotEmpty(),       # output must be non-empty
    peekr.eval.NoError(),        # span must have status=ok
])

Scores are written to span.attributes["eval_scores"] as a {evaluator_name: float} dict and shown inline by peekr view --io:

peekr view --io traces.jsonl
openai.chat.completions [gpt-4o]  843ms  312tok
  in:  "Summarise this doc..."
  out: "The doc argues that..."
  eval_scores: {Rubric: 0.92, Hallucination: 0.95, NotEmpty: 1.0, NoError: 1.0}

Query scores in SQLite:

sql
SELECT name,
       AVG(json_extract(attributes,'$.eval_scores.Rubric')) rubric_avg,
       AVG(json_extract(attributes,'$.eval_scores.Hallucination')) hallucination_avg
FROM spans
WHERE json_extract(attributes,'$.eval_scores') IS NOT NULL
GROUP BY name;

Write your own evaluator:

python
from peekr.eval import BaseEvaluator

class LengthCheck(BaseEvaluator):
    def evaluate(self, span) -> float:
        output = span.attributes.get("output", "")
        return 1.0 if len(output) < 500 else 0.0
Evaluator          What it checks                                               Requires
Rubric(criteria)   LLM scores output against your criteria (0.0–1.0)            openai or anthropic SDK
Hallucination()    Fraction of claims grounded in the input/context (0.0–1.0)   openai or anthropic SDK
NotEmpty()         Output attribute is a non-empty string                       Nothing
NoError()          Span status is "ok"                                          Nothing

Hallucination detection

The Hallucination evaluator scores how well an LLM output is supported by its input context. It uses an LLM-as-judge under the hood — the same fallback pattern as Rubric (OpenAI first, then Anthropic).

Score     Meaning
1.0       Every factual claim in the output is supported by the context
0.0       No claim is supported — the output is fully hallucinated
between   The fraction of claims grounded in the context

Plug it in like any other evaluator:

python
import peekr
peekr.instrument(evaluators=[peekr.eval.Hallucination()])

import openai
openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "The Eiffel Tower was completed in 1889 in Paris."},
        {"role": "user", "content": "When was the Eiffel Tower built and by whom?"},
    ],
)
peekr view --io traces.jsonl
openai.chat.completions [gpt-4o]  843ms
  in:  [{"role": "system", "content": "The Eiffel Tower was completed in 1889 in Paris."}, ...]
  out: "The Eiffel Tower was built in 1923 by Frank Lloyd Wright."
  eval_scores: {Hallucination: 0.0}   ← invented year and architect

RAG flows: point it at retrieved documents

By default the evaluator uses the span's input (the messages sent to the LLM) as the grounding context. For RAG flows where the source documents live elsewhere — say, on a parent tool span — pass a context_extractor:

python
peekr.instrument(evaluators=[
    peekr.eval.Hallucination(
        context_extractor=lambda span: span.attributes.get("retrieved_docs", ""),
        model="gpt-4o-mini",  # optional — defaults to gpt-4o-mini / claude-haiku
    ),
])
Spans with empty output or no available context return 1.0 (nothing to judge), so non-RAG spans don't drag down your average. The judge LLM call is cheap (max 10 output tokens) and runs after the original span completes — it never blocks the main request.

Find your worst hallucinations

sql
-- Lowest-scoring outputs across all runs
SELECT trace_id,
       json_extract(attributes,'$.eval_scores.Hallucination') AS hallucination,
       json_extract(attributes,'$.output') AS output
FROM spans
WHERE json_extract(attributes,'$.eval_scores.Hallucination') IS NOT NULL
  AND json_extract(attributes,'$.eval_scores.Hallucination') < 0.5
ORDER BY hallucination ASC, start_time DESC
LIMIT 20;

-- Hallucination rate by model
SELECT json_extract(attributes,'$.model') model,
       AVG(json_extract(attributes,'$.eval_scores.Hallucination')) avg_groundedness,
       COUNT(*) runs
FROM spans
WHERE json_extract(attributes,'$.eval_scores.Hallucination') IS NOT NULL
GROUP BY model
ORDER BY avg_groundedness ASC;
Hallucination scoring uses an LLM judge, which is itself imperfect and costs tokens. Treat it as a useful smoke alarm — a sudden drop in average groundedness is a strong signal — not as ground truth for any single trace.

Detailed mode — RAGAS-style claim decomposition

The default mode returns a single score. detailed=True switches to a RAGAS Faithfulness-style pipeline: the judge first decomposes the output into atomic factual claims, then assigns each claim one of three verdicts:

Verdict        Meaning
supported      Claim is directly entailed by CONTEXT
contradicted   Claim directly conflicts with CONTEXT
unsupported    CONTEXT is silent about the claim

The score becomes supported_count / total_claims and the full breakdown lands on the span at attributes.hallucination_details:

python
peekr.instrument(evaluators=[peekr.eval.Hallucination(detailed=True)])
span.attributes.hallucination_details
{
  "total": 3,
  "supported": 1,
  "contradicted": 2,
  "unsupported": 0,
  "score": 0.33,
  "claims": [
    {"text": "The Eiffel Tower is in Paris", "verdict": "supported"},
    {"text": "It was built in 1923", "verdict": "contradicted"},
    {"text": "It was designed by Frank Lloyd Wright", "verdict": "contradicted"}
  ]
}
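The score arithmetic is just supported / total. A tiny helper (hypothetical, but mirroring the documented fields) recomputes it from a details dict:

```python
def groundedness_score(details: dict) -> float:
    """score = supported claims / total claims.
    An empty breakdown scores 1.0 — nothing to judge, matching the
    documented behaviour for spans with no context."""
    total = details["total"]
    return round(details["supported"] / total, 2) if total else 1.0

details = {"total": 3, "supported": 1, "contradicted": 2, "unsupported": 0}
print(groundedness_score(details))  # → 0.33
```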

This is what powers the drift dashboard's drill-down — you can see exactly which claims the model invented, not just an average score.

Detailed mode uses one judge call per span (just with more output tokens — JSON, capped at 800). Use simple mode for cheap continuous monitoring across many traces, and detailed mode for the spans you want to investigate. You can switch by re-running with a different evaluators= list.

Observability dashboard

Generate a self-contained, tabbed HTML report from your traces. Designed as a drop-in observability layer for any RAG or memory/agent pipeline — open the file in any browser, no server, no build step.

terminal
peekr dashboard traces.db -o report.html   # SQLite
peekr dashboard traces.jsonl               # JSONL → ./dashboard.html
open report.html

Five tabs, one URL

The dashboard is organised so a non-technical observer can stay on the Overview tab and still get the gist, while an engineer can drill into Traces / Quality / Diagnose for specifics. Tab state is in the URL hash so links are shareable. A persistent filter bar at the top applies across every tab.

#overview — for a first impression / exec view
  Health hero (0–100), narrative bullets, 4 metric cards with sparklines, top 3 action items pulled from the diagnostic engine.

#traces — for "find me that call"
  Search box (trace ID, model, content, error), sortable table, click any row → side panel with full context vs answer, claim verdicts, citations, per-call action items.

#quality — for trend monitoring
  Rolling chart with warning (0.7) / critical (0.5) threshold lines, score distribution histogram, channel × time heatmap, claim-verdict doughnut, citation panel.

#diagnose — for incident response
  "Likely causes & next steps" with severity-tagged cards and numbered fix lists, plus the full worst-offenders panel with side-by-side highlighted context vs answer.

#help — for first-time setup
  Setup checklist (auto-ticks live), glossary, evaluator configuration snippets, troubleshooting, keyboard shortcuts.

Keyboard shortcuts

Key   Action
1–5   Switch tabs
/     Jump to Traces tab and focus the search box
R     Clear all filters
Esc   Close the trace detail panel

Filter bar

One persistent bar at the top of every tab. Click any chip to toggle that filter; every panel on every tab refilters immediately. The time-range chips include 5m, 15m, 30m, 1h, 24h, 7d, 30d presets plus a Custom… option with datetime-local from/to inputs. The "When = Custom" mode seeds itself to "last 1h up to the newest timestamp" so first activation isn't empty.

Panels at a glance

Health hero
  Shows: One 0–100 score with a coloured dot (green/yellow/orange/red), tier label, count of flagged calls, and Δ vs baseline.
  Act: Red → open the recommendations panel below.

What's happening
  Shows: 3–5 plain-English bullets summarising the situation: drift, worst channel, citation invention rate, error count.
  Act: Read top-to-bottom; the highest-priority finding is first.

Filter chips
  Shows: Tenant · Model · Endpoint · Time range. Stack to drill in.
  Act: Click chips to refilter every panel. Click again to clear.

Metric cards
  Shows: Hallucination · Rubric · Citations · Errors. Each with sparkline, Δ vs baseline, count of scored calls, and an action hint.
  Act: The hint at the bottom tells you the next step (e.g. "30 flagged — review worst offenders below").

Likely causes & next steps
  Shows: Diagnostic engine — runs eight pattern-detection rules and surfaces ranked recommendations with cause + numbered fix list.
  Act: Each card has a severity badge and a "what to try" list specific to that pattern.

Score over time
  Shows: Rolling 20-call mean of every evaluator, with dashed warning (0.7) and critical (0.5) threshold lines.
  Act: Hover for trace details; click a point to jump to its worst-offender card.

Failure breakdown heatmap
  Shows: Channel × time grid. Rows = your models/tenants/endpoints. Columns = time buckets. Colour = mean Hallucination.
  Act: Red rows tell you which channel is failing; rows that go green → red tell you when. Click a cell to filter.

Worst offenders
  Shows: 12 lowest-scoring calls. Side-by-side context vs answer with contradicted claims highlighted, claim verdicts, citation list.
  Act: Each card ends with a "What to try for this call" box prescribing fixes specific to that span's failure pattern.

Diagnostic rules

The recommendations panel inspects the filtered rows and emits cards from eight pattern-detection rules. Each card has a severity (high / medium / low / info / good), a plausible cause in plain English, and a numbered list of concrete fixes.

Invented citations
  Triggers: > 30% of detected citation patterns aren't in context
  Try: Tighten prompt; verify citations post-hoc; try hybrid retrieval

High contradiction rate
  Triggers: > 20% of judged claims directly contradict context
  Try: Strengthen system prompt; move context closer to question; reduce max_tokens

Out-of-context elaboration
  Triggers: > 25% unsupported claims with low contradiction
  Try: Add refusal instruction; check retrieval recall; coverage prompt

Channel concentration
  Triggers: > 50% of flagged calls share one model/tenant/endpoint
  Try: Diff deploys; compare prompts; verify index coverage for that channel

Hallucination drift
  Triggers: Δ vs baseline < −0.1
  Try: Use heatmap to localise; cross-reference deploys; use peekr replay

Error spikes
  Triggers: > 5% of calls have status="error"
  Try: Check rate limits; verify fallback model quality; add retries

Citations all grounded
  Triggers: ≥ 5 citations, 0 invented
  Try: Add an alert on citation invention rate to catch future regressions

Healthy
  Triggers: No patterns triggered
  Try: Set up peekr.alert.ScoreFloor; run the offline benchmark periodically

Per-span action items

Every worst-offender card ends with a tailored "What to try for this call" panel — separate from the aggregate recommendations. It inspects that one span's claims, citations, and context to suggest fixes targeted to its specific failure pattern:

Detected on this span                          What the action box suggests
Empty / short context but long answer          Retrieval miss — inspect what your retriever returned
Invented URLs / arXiv / DOIs / titles          Per-kind prompt fix + post-hoc citation verification
Contradicted numbers / dates                   "Copy numerics verbatim" instruction; temperature=0
Contradicted proper nouns                      Explicit "don't substitute names" instruction
Mostly unsupported claims, no contradictions   Add refusal: "say I don't know if not in context"
Mostly contradicted claims                     Move context closer to question; "context wins" instruction
Low score but no detailed claims               Enable Hallucination(detailed=True) to see what failed
Output much longer than context                Reduce max_tokens; long completions drift

Tag spans for the channel breakdown

The heatmap groups by attributes.model (set automatically by the patches), attributes.user_id (set via peekr.session(user_id=...)), and attributes.endpoint (you set this). Without an endpoint attribute, the endpoint row of the heatmap simply doesn't render — the others still do.

python
import peekr
from peekr import trace, get_current_span

@trace
def handle_request(req):
    get_current_span().attributes["endpoint"] = req.path
    return call_llm(...)

# Or in a FastAPI middleware — one place, every request tagged
@app.middleware("http")
async def tag_span(request, call_next):
    with peekr.session(tenant_id=request.headers.get("X-Tenant-Id")):
        span, token = peekr.start_span(f"http.{request.method}")
        span.attributes["endpoint"] = request.url.path
        try:
            return await call_next(request)
        finally:
            peekr.end_span(span, token)
The dashboard reads from JSONL or SQLite — whatever you configured in peekr.instrument(). It's a post-hoc tool: rerun it whenever you want a fresh snapshot. For a real-time view, use peekr view --io in the terminal.

Feedback + export

Label traces as good or bad. Export labelled data as a fine-tuning dataset.

python
import peekr

# Rate a trace
peekr.feedback(trace_id="a3f2b1c0...", rating="good", note="perfect answer")
peekr.feedback(trace_id="b2e4c8f1...", rating="bad", note="hallucinated")

# Export good traces as OpenAI fine-tuning data
peekr.export_feedback(
    db_path="traces.db",
    filter="good",
    output="training.jsonl",
    format="openai-ft",  # or "raw"
)

The openai-ft format produces one JSON object per trace:

training.jsonl
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

A/B experiments

Route traffic between variants and tag each span. Analyse results with SQL — no separate tracking tool needed.

python
from peekr import experiment

# List variants — equal split by default
@experiment(variants=["control", "test_v2"])
def run_agent(query: str, variant: str):
    model = "gpt-4o" if variant == "control" else "claude-opus-4-5"
    return call_llm(model, query)

# Dict variants — passes config too
@experiment(variants={
    "control": {"model": "gpt-4o"},
    "test": {"model": "claude-opus-4-5"},
})
def run_agent(query: str, variant: str, variant_config: dict):
    return call_llm(variant_config["model"], query)

Analyse in SQLite:

sql
SELECT json_extract(attributes,'$.experiment_variant') variant,
       COUNT(*) runs,
       AVG(CASE WHEN status='error' THEN 1.0 ELSE 0.0 END) error_rate,
       AVG(json_extract(attributes,'$.tokens_total')) avg_tokens,
       AVG(duration_ms) avg_ms
FROM spans
WHERE json_extract(attributes,'$.experiment_variant') IS NOT NULL
GROUP BY variant;

Trace replay

Re-run a stored trace with the same inputs. Useful for reproducing production bugs locally or verifying a fix against a real failure.

python
from peekr.replay import replay_trace

# Re-run from SQLite
new_trace_id = replay_trace(trace_id="a3f2b1c0...", db_path="traces.db")
print(f"New trace: {new_trace_id}")

# Re-run from JSONL
new_trace_id = replay_trace(trace_id="a3f2b1c0...", jsonl_path="traces.jsonl")

Or use the CLI:

terminal
peekr replay a3f2b1c0
peekr replay a3f2b1c0 --db traces.db
peekr replay a3f2b1c0 --jsonl traces.jsonl
Replay re-runs the stored LLM inputs through the live SDK. The agent itself is not re-invoked — only the LLM calls are replayed. This means tool calls are not replicated, but you get a new trace showing exactly what the model produces with those inputs today.

Peekr Cloud

The OSS SDK runs in your process, writes to local files, and is MIT licensed forever. When a single-process file isn't the right fit any more — multiple services, a team that needs shared dashboards, longer retention, audit-grade trace storage — Peekr Cloud is the optional managed backend.

The wire format is already in the SDK. tenant_id and retention_class exist in v0.3 specifically so the spans you're producing today work without modification when you connect.

python
import peekr

peekr.instrument(
    tenant_id="acme",
    exporter=peekr.HTTPExporter(
        endpoint="https://ingest.peekr.cloud",
        api_key="pk_live_…",
    ),
)

HTTPExporter ships as a stub in v0.3 — the constructor signature is stable so you can wire your call sites today, and they won't change when the implementation lands. Until then it raises NotImplementedError on .export() so a misconfigured pipeline fails loudly rather than silently dropping spans.

Get on the waitlist · GitHub Discussions