Documentation

Peekr records every LLM call, tool invocation, token count, and error as a tree you can inspect. This page covers everything from installation to advanced usage.


Installation

terminal
pip install peekr                # base — no LLM SDK required
pip install "peekr[openai]"      # with OpenAI
pip install "peekr[anthropic]"   # with Anthropic
pip install "peekr[all]"         # both

Requires Python 3.9+. No accounts, no backend, no environment variables.

Quickstart

Add two lines before your agent runs:

agent.py
import peekr
peekr.instrument()

import openai

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

Every LLM call is now traced automatically. Peekr writes to traces.jsonl and prints to the console.

View your traces:

terminal
peekr view traces.jsonl        # tree view
peekr view --io traces.jsonl   # include inputs and outputs
output
Trace a3f2b1c0  843ms  312tok
────────────────────────────────────────────────
openai.chat.completions [gpt-4o]  843ms  312tok

Example: Debug a wrong answer

Your agent returns an incorrect response and you don't know why. Add @trace to your tool functions and run with --io:

agent.py
import peekr
peekr.instrument()

import openai
from peekr import trace

@trace
def fetch_user(user_id: int) -> dict:
    return db.get(user_id)  # returns None if not found

@trace(name="agent.run")
def run(user_id: int):
    user = fetch_user(user_id)  # bug: no null check before passing to LLM
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": f"User: {user}"}]
    )
peekr view --io traces.jsonl
agent.run  2100ms
└─ tool.fetch_user  12ms
   in:  {"args": [42], "kwargs": {}}
   out: null                                      ← found it
└─ openai.chat.completions [gpt-4o]  2088ms
   in: [{"role": "system", "content": "User: null..."}]

The LLM received null as the user object. The fix is a null check in run(), not a prompt change.

Example: Find slow steps

Wrap every step in your agent with @trace and look at the durations:

agent.py
import openai
from peekr import trace

@trace
def search_web(query: str) -> list: ...

@trace
def rerank_results(results: list) -> list: ...

@trace(name="agent.run")
def run(query: str):
    results = search_web(query)
    ranked = rerank_results(results)
    return openai.chat.completions.create(...)
peekr view traces.jsonl
agent.run  4300ms
└─ tool.search_web  3800ms     ← 88% of time
└─ tool.rerank_results  18ms
└─ openai.chat  490ms

Cache search_web results for repeated queries, or run it in parallel with other setup work. The LLM is not the bottleneck.
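For the caching route, the standard library is often enough. A minimal sketch using functools.lru_cache — the search function here is a hypothetical stand-in, and this assumes identical queries may safely reuse results:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def search_web(query: str) -> tuple:
    # Stand-in for the real (slow) web search. Returns a tuple so the
    # cached value is immutable; lru_cache keys on the hashable args.
    return (f"result for {query}",)

search_web("llm tracing")            # miss — does the slow work
search_web("llm tracing")            # hit — served from the cache
print(search_web.cache_info().hits)  # → 1
```

If you combine this with @trace, decorator order matters: put @trace outermost to record every call, or innermost to record only cache misses.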

Example: Reduce token costs

Run your agent a few times on the same task and compare token counts across traces:

terminal
peekr view traces.jsonl
output
Trace a3f2b1c0  18,432 tokens
Trace b2e4c8f1  21,104 tokens
Trace c5d9e2a7  24,891 tokens

A token count that grows on every run is the signature of unbounded history — the agent appends every message to the next call. Fix: summarize or truncate the conversation after a fixed number of turns.

agent.py
# Before: growing history
messages = conversation_history  # gets longer every turn

# After: summarize after 5 turns
if len(conversation_history) > 10:  # 5 exchanges = 10 messages
    summary = summarize(conversation_history)
    messages = [{"role": "system", "content": summary}]

Example: Prod vs local bugs

Your agent passes tests locally but fails in production. Capture traces in both environments and compare tool outputs:

agent.py
@trace
def fetch_inventory(sku: str) -> list:
    return inventory_api.get(sku)
local trace
tool.fetch_inventory  8ms
in:  {"sku": "ABC-123"}
out: [{"id": 1, "qty": 42}]   ← data present locally
prod trace
tool.fetch_inventory  8ms
in:  {"sku": "ABC-123"}
out: []                       ← empty in prod

The agent logic is identical. The inventory API returns different data in prod — likely a missing data migration or environment-specific config. Fix the data source, not the agent.

instrument()

Call once, before any LLM calls. Patches the OpenAI and Anthropic SDKs.

python
peekr.instrument(
    console=True,               # print spans live (default: True)
    storage="jsonl",            # "jsonl" | "sqlite" | "both"
    jsonl_path="traces.jsonl",  # JSONL output path
    db_path="traces.db",        # SQLite output path
)
Parameter    Type   Default          Description
console      bool   True             Print each span to stdout as it completes
storage      str    "jsonl"          "jsonl", "sqlite", or "both"
jsonl_path   str    "traces.jsonl"   Path for JSONL output
db_path      str    "traces.db"      Path for SQLite output

SQLite storage

SQLite uses WAL mode so multiple processes — Docker containers, CI workers, parallel agents — can write spans safely at the same time. And because it's a real database, you can query across all your runs without any extra tooling.

python
# Enable SQLite
peekr.instrument(storage="sqlite")

# Write to both JSONL and SQLite
peekr.instrument(storage="both")

View with the same CLI command:

terminal
peekr view traces.db
peekr view --io traces.db

Or query directly with any SQLite client:

terminal — useful queries
# Slowest tool calls
sqlite3 traces.db "SELECT name, ROUND(AVG(duration_ms)) avg_ms
                   FROM spans GROUP BY name ORDER BY avg_ms DESC"

# Token spend by model
sqlite3 traces.db "SELECT json_extract(attributes,'$.model') model,
                          SUM(json_extract(attributes,'$.tokens_total')) tokens
                   FROM spans GROUP BY model"

# All errors
sqlite3 traces.db "SELECT name, trace_id, json_extract(attributes,'$.error') msg
                   FROM spans WHERE status = 'error'"

# Token growth over time (detect unbounded history)
sqlite3 traces.db "SELECT trace_id, SUM(json_extract(attributes,'$.tokens_total')) total
                   FROM spans GROUP BY trace_id ORDER BY start_time"
SQLite is ideal for Docker and CI where multiple processes share a single file. JSONL is better for quick local debugging where you want to grep or tail -f.
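Plain Python works for quick JSONL filtering too. A minimal sketch, assuming each line is one span object with a top-level status field as described in the span-fields table below:

```python
import json

def iter_error_spans(path: str):
    """Yield spans with status 'error' from a JSONL trace file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:          # tolerate blank lines
                continue
            span = json.loads(line)
            if span.get("status") == "error":
                yield span

# Usage:
#   for span in iter_error_spans("traces.jsonl"):
#       print(span["name"])
```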

@trace decorator

Wraps a function as a span. Works on sync and async functions.

python
from peekr import trace

# Auto-names from module.function
@trace
def search_web(query: str) -> list: ...

# Custom name
@trace(name="tool.search")
def search(query: str) -> list: ...

# Opt out of capturing inputs/outputs (latency + status still recorded)
@trace(capture_io=False)
def fetch_api_key() -> str: ...

# Async
@trace
async def fetch_user(user_id: int) -> dict: ...
Parameter    Type         Default           Description
name         str | None   module.function   Custom span name
capture_io   bool         True              Record function args and return value
Inputs and outputs are serialized to JSON and truncated at 500 characters. Use capture_io=False for functions that handle secrets or large payloads.
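The capture step can be pictured roughly like this — a sketch of the documented truncation behaviour, not Peekr's actual internals:

```python
import json

LIMIT = 500  # documented truncation limit

def serialize_io(value, limit: int = LIMIT) -> str:
    """Serialize a captured value to JSON, falling back to str()
    for non-serializable objects, then truncate to the limit."""
    try:
        text = json.dumps(value, default=str)
    except (TypeError, ValueError):
        text = str(value)
    return text if len(text) <= limit else text[:limit]

print(serialize_io({"args": [42], "kwargs": {}}))  # short payloads pass through
print(len(serialize_io("x" * 10_000)))             # → 500
```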

Manual spans

For cases where a decorator doesn't fit — e.g. a loop, a context manager, or code you can't modify:

python
from peekr import start_span, end_span

span, token = start_span("my.operation")
span.attributes["custom_key"] = "value"
try:
    result = do_work()
    span.status = "ok"
except Exception as e:
    span.status = "error"
    span.attributes["error"] = str(e)
    raise
finally:
    end_span(span, token)  # always call — even on error
Any spans started inside do_work() will automatically nest as children of this span.

CLI viewer

terminal
peekr view traces.jsonl        # tree view
peekr view --io traces.jsonl   # + inputs and outputs

Each trace is shown as a tree grouped by trace_id. The --io flag prints up to 120 characters of the serialized input and output for each span.

Custom exporters

Any object with an export(span) method works as an exporter:

python
import requests

import peekr
from peekr.exporters import add_exporter

class HttpExporter:
    def export(self, span):
        requests.post(
            "https://your-backend.com/spans",
            json=span.to_dict()
        )

peekr.instrument()
add_exporter(HttpExporter())

Multiple exporters can be active at once. The built-in JSONLExporter and ConsoleExporter are added by instrument(). You can add your own on top.

Span fields

Every span written to traces.jsonl is a JSON object with these fields:

Field                          Type             Description
trace_id                       string           Groups all spans in one agent run
span_id                        string           Unique ID for this span
parent_id                      string | null    ID of the parent span, or null for a root span
name                           string           Span name
start_time                     float            Unix timestamp
end_time                       float            Unix timestamp
duration_ms                    float            Wall-clock duration in milliseconds
status                         "ok" | "error"   Whether the span succeeded
tenant_id                      string | null    Customer org (B2B). First-class — a top-level
                                                column in SQLite and a top-level key in JSONL.
                                                Set via peekr.session(tenant_id=...),
                                                instrument(tenant_id=...), or env PEEKR_TENANT_ID.
retention_class                string | null    Storage-tier hint (e.g. "default", "short",
                                                "long", "pii"). OSS stores it; your storage
                                                tier interprets it.
attributes.model               string           LLM model name (auto-captured)
attributes.tokens_input        int              Prompt tokens (auto-captured)
attributes.tokens_output       int              Completion tokens (auto-captured)
attributes.tokens_total        int              Total tokens (auto-captured)
attributes.input               string           Serialized function args (truncated)
attributes.output              string           Serialized return value (truncated)
attributes.error               string           Exception message if status is "error"
attributes.session_id          string           Set when the span runs inside a peekr.session()
attributes.user_id             string           Set when the span runs inside a
                                                peekr.session(user_id=...)
attributes.eval_scores         dict             Evaluator name → score (0.0–1.0) when
                                                evaluators are configured
attributes.experiment_variant  string           Variant name when inside a @peekr.experiment

Sessions

Group all spans for a user, tenant, or conversation by passing identifiers to peekr.session(). Uses ContextVar so it propagates correctly across async code.

python
import peekr

with peekr.session(
    user_id="user_123",       # end-user (B2C)
    tenant_id="acme",         # customer org (B2B)
    session_id="sess_abc",    # auto-generated if omitted
    retention_class="long",   # storage-tier hint
):
    run_agent()

tenant_id and retention_class are first-class columns on the span — see Multi-tenant traces.

Query by user in SQLite:

sql
SELECT trace_id,
       AVG(duration_ms),
       SUM(json_extract(attributes,'$.tokens_total'))
FROM spans
WHERE json_extract(attributes,'$.user_id') = 'user_123'
GROUP BY trace_id;

Multi-tenant traces

Every span carries two first-class fields — tenant_id (the customer org) and retention_class (a storage-tier hint) — separate from user_id (the end-user). A B2B agent can tag both without conflict.

python
import peekr

peekr.instrument(tenant_id="acme", retention_class="default")

with peekr.session(user_id="alice", tenant_id="acme", retention_class="long"):
    run_agent()

Resolution order, highest priority first:

  1. peekr.session(tenant_id=..., retention_class=...)
  2. peekr.instrument(tenant_id=..., retention_class=...)
  3. Env vars PEEKR_TENANT_ID / PEEKR_RETENTION_CLASS

Both fields are top-level columns in SQLite (with indices) and top-level keys in JSONL — no json_extract needed:

sql
SELECT tenant_id, COUNT(*) FROM spans GROUP BY tenant_id;

SELECT * FROM spans
WHERE retention_class = 'long' AND start_time > ?;

retention_class is a free-form string in the OSS SDK — recommended values are default, short, long, and pii. The meaning of each is enforced by your storage tier (or by Peekr Cloud).

Why first-class instead of attributes.tenant_id? So you can filter and index without JSON extraction — relevant the moment you have more than a handful of tenants or want to route ingestion. The SQLite exporter migrates pre-v0.3 databases automatically via PRAGMA user_version; legacy rows back-fill as NULL.

Alerts

Alerts fire after each complete trace (identified by the root span). Pass them to instrument():

python
peekr.instrument(alerts=[
    peekr.alert.ErrorRate(threshold=0.05, window=100),  # >5% errors in last 100 traces
    peekr.alert.CostSpike(multiplier=2.0),              # tokens 2× above rolling avg
    peekr.alert.LatencyP95(ms=5000),                    # p95 latency > 5s
    peekr.alert.TokenGrowth(runs=5),                    # growing 5 consecutive runs
])

Override on_trigger to send to Slack, PagerDuty, or anywhere:

python
import peekr

class SlackAlert(peekr.alert.ErrorRate):
    def on_trigger(self, message: str) -> None:
        slack.send(f"#alerts: {message}")

peekr.instrument(alerts=[SlackAlert(threshold=0.05)])
Alert        Triggers when                                      Key params
ErrorRate    Error % in last N traces > threshold               threshold, window=100
CostSpike    This trace's tokens > multiplier × rolling avg     multiplier, window=50
LatencyP95   p95 span latency in trace > ms                     ms
TokenGrowth  Token count strictly increasing for N runs         runs=5

Eval — LLM-as-judge

Evaluators run after each LLM span completes and write scores to span.attributes["eval_scores"]. A _in_eval guard prevents infinite recursion.

python
peekr.instrument(evaluators=[
    peekr.eval.Rubric("Be concise and factually accurate"),
    peekr.eval.Hallucination(),  # groundedness check (see below)
    peekr.eval.NotEmpty(),       # output must be non-empty
    peekr.eval.NoError(),        # span must have status=ok
])

Scores are written to span.attributes["eval_scores"] as a {evaluator_name: float} dict and shown inline by peekr view --io:

peekr view --io traces.jsonl
openai.chat.completions [gpt-4o]  843ms  312tok
  in:  "Summarise this doc..."
  out: "The doc argues that..."
  eval_scores: {Rubric: 0.92, Hallucination: 0.95, NotEmpty: 1.0, NoError: 1.0}

Query scores in SQLite:

sql
SELECT name,
       AVG(json_extract(attributes,'$.eval_scores.Rubric')) rubric_avg,
       AVG(json_extract(attributes,'$.eval_scores.Hallucination')) hallucination_avg
FROM spans
WHERE json_extract(attributes,'$.eval_scores') IS NOT NULL
GROUP BY name;

Write your own evaluator:

python
from peekr.eval import BaseEvaluator

class LengthCheck(BaseEvaluator):
    def evaluate(self, span) -> float:
        output = span.attributes.get("output", "")
        return 1.0 if len(output) < 500 else 0.0
Evaluator          What it checks                                               Requires
Rubric(criteria)   LLM scores output against your criteria (0.0–1.0)            openai or anthropic SDK
Hallucination()    Fraction of claims grounded in the input/context (0.0–1.0)   openai or anthropic SDK
NotEmpty()         Output attribute is a non-empty string                       Nothing
NoError()          Span status is "ok"                                          Nothing

Hallucination detection

The Hallucination evaluator scores how well an LLM output is supported by its input context. It uses an LLM-as-judge under the hood — the same fallback pattern as Rubric (OpenAI first, then Anthropic).

Score     Meaning
1.0       Every factual claim in the output is supported by the context
0.0       No claim is supported — the output is fully hallucinated
between   The fraction of claims grounded in the context

Plug it in like any other evaluator:

python
import peekr
peekr.instrument(evaluators=[peekr.eval.Hallucination()])

import openai
openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "The Eiffel Tower was completed in 1889 in Paris."},
        {"role": "user", "content": "When was the Eiffel Tower built and by whom?"},
    ],
)
peekr view --io traces.jsonl
openai.chat.completions [gpt-4o]  843ms
  in:  [{"role": "system", "content": "The Eiffel Tower was completed in 1889 in Paris."}, ...]
  out: "The Eiffel Tower was built in 1923 by Frank Lloyd Wright."
  eval_scores: {Hallucination: 0.0}   ← invented year and architect

RAG flows: point it at retrieved documents

By default the evaluator uses the span's input (the messages sent to the LLM) as the grounding context. For RAG flows where the source documents live elsewhere — say, on a parent tool span — pass a context_extractor:

python
peekr.instrument(evaluators=[
    peekr.eval.Hallucination(
        context_extractor=lambda span: span.attributes.get("retrieved_docs", ""),
        model="gpt-4o-mini",  # optional — defaults to gpt-4o-mini / claude-haiku
    ),
])
Spans with empty output or no available context return 1.0 (nothing to judge), so non-RAG spans don't drag down your average. The judge LLM call is cheap (max 10 output tokens) and runs after the original span completes — it never blocks the main request.

Find your worst hallucinations

sql
-- Lowest-scoring outputs across all runs
SELECT trace_id,
       json_extract(attributes,'$.eval_scores.Hallucination') AS hallucination,
       json_extract(attributes,'$.output') AS output
FROM spans
WHERE json_extract(attributes,'$.eval_scores.Hallucination') IS NOT NULL
  AND json_extract(attributes,'$.eval_scores.Hallucination') < 0.5
ORDER BY hallucination ASC, start_time DESC
LIMIT 20;

-- Hallucination rate by model
SELECT json_extract(attributes,'$.model') model,
       AVG(json_extract(attributes,'$.eval_scores.Hallucination')) avg_groundedness,
       COUNT(*) runs
FROM spans
WHERE json_extract(attributes,'$.eval_scores.Hallucination') IS NOT NULL
GROUP BY model
ORDER BY avg_groundedness ASC;
Hallucination scoring uses an LLM judge, which is itself imperfect and costs tokens. Treat it as a useful smoke alarm — a sudden drop in average groundedness is a strong signal — not as ground truth for any single trace.

Detailed mode — RAGAS-style claim decomposition

The default mode returns a single score. detailed=True switches to a RAGAS Faithfulness-style pipeline: the judge first decomposes the output into atomic factual claims, then assigns each claim one of three verdicts:

Verdict        Meaning
supported      Claim is directly entailed by CONTEXT
contradicted   Claim directly conflicts with CONTEXT
unsupported    CONTEXT is silent about the claim

The score becomes supported_count / total_claims and the full breakdown lands on the span at attributes.hallucination_details:

python
peekr.instrument(evaluators=[peekr.eval.Hallucination(detailed=True)])
span.attributes.hallucination_details
{
  "total": 3,
  "supported": 1,
  "contradicted": 2,
  "unsupported": 0,
  "score": 0.33,
  "claims": [
    {"text": "The Eiffel Tower is in Paris", "verdict": "supported"},
    {"text": "It was built in 1923", "verdict": "contradicted"},
    {"text": "It was designed by Frank Lloyd Wright", "verdict": "contradicted"}
  ]
}
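The score arithmetic is just supported / total. A tiny helper (hypothetical, but mirroring the documented fields) recomputes it from a details dict:

```python
def groundedness_score(details: dict) -> float:
    """score = supported claims / total claims.
    An empty breakdown scores 1.0 — nothing to judge, matching the
    documented behaviour for spans with no context."""
    total = details["total"]
    return round(details["supported"] / total, 2) if total else 1.0

details = {"total": 3, "supported": 1, "contradicted": 2, "unsupported": 0}
print(groundedness_score(details))  # → 0.33
```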

This is what powers the drift dashboard's drill-down — you can see exactly which claims the model invented, not just an average score.

Detailed mode uses one judge call per span (just with more output tokens — JSON, capped at 800). Use simple mode for cheap continuous monitoring across many traces, and detailed mode for the spans you want to investigate. You can switch by re-running with a different evaluators= list.

Observability dashboard

Generate a self-contained, tabbed HTML report from your traces. Designed as a drop-in observability layer for any RAG or memory/agent pipeline — open the file in any browser, no server, no build step.

terminal
peekr dashboard traces.db -o report.html   # SQLite
peekr dashboard traces.jsonl               # JSONL → ./dashboard.html
open report.html

Five tabs, one URL

The dashboard is organised so a non-technical observer can stay on the Overview tab and still get the gist, while an engineer can drill into Traces / Quality / Diagnose for specifics. Tab state is in the URL hash so links are shareable. A persistent filter bar at the top applies across every tab.

#overview — for a first impression / exec view
  Health hero (0–100), narrative bullets, 4 metric cards with sparklines, top 3 action items pulled from the diagnostic engine.

#traces — for "find me that call"
  Search box (trace ID, model, content, error), sortable table, click any row → side panel with full context vs answer, claim verdicts, citations, per-call action items.

#quality — for trend monitoring
  Rolling chart with warning (0.7) / critical (0.5) threshold lines, score distribution histogram, channel × time heatmap, claim-verdict doughnut, citation panel.

#diagnose — for incident response
  "Likely causes & next steps" with severity-tagged cards and numbered fix lists, plus the full worst-offenders panel with side-by-side highlighted context vs answer.

#help — for first-time setup
  Setup checklist (auto-ticks live), glossary, evaluator configuration snippets, troubleshooting, keyboard shortcuts.

Keyboard shortcuts

Key   Action
1–5   Switch tabs
/     Jump to Traces tab and focus the search box
R     Clear all filters
Esc   Close the trace detail panel

Filter bar

One persistent bar at the top of every tab. Click any chip to toggle that filter; every panel on every tab refilters immediately. The time-range chips include 5m, 15m, 30m, 1h, 24h, 7d, 30d presets plus a Custom… option with datetime-local from/to inputs. The "When = Custom" mode seeds itself to "last 1h up to the newest timestamp" so first activation isn't empty.

Panels at a glance

Health hero
  Shows: One 0–100 score with a coloured dot (green/yellow/orange/red), tier label, count of flagged calls, and Δ vs baseline.
  Act: Red → open the recommendations panel below.

What's happening
  Shows: 3–5 plain-English bullets summarising the situation: drift, worst channel, citation invention rate, error count.
  Act: Read top-to-bottom; the highest-priority finding is first.

Filter chips
  Shows: Tenant · Model · Endpoint · Time range. Stack to drill in.
  Act: Click chips to refilter every panel. Click again to clear.

Metric cards
  Shows: Hallucination · Rubric · Citations · Errors. Each with sparkline, Δ vs baseline, count of scored calls, and an action hint.
  Act: The hint at the bottom tells you the next step (e.g. "30 flagged — review worst offenders below").

Likely causes & next steps
  Shows: Diagnostic engine — runs eight pattern-detection rules and surfaces ranked recommendations with cause + numbered fix list.
  Act: Each card has a severity badge and a "what to try" list specific to that pattern.

Score over time
  Shows: Rolling 20-call mean of every evaluator, with dashed warning (0.7) and critical (0.5) threshold lines.
  Act: Hover for trace details; click a point to jump to its worst-offender card.

Failure breakdown heatmap
  Shows: Channel × time grid. Rows = your models/tenants/endpoints. Columns = time buckets. Colour = mean Hallucination.
  Act: Red rows tell you which channel is failing; rows that go green → red tell you when. Click a cell to filter.

Worst offenders
  Shows: 12 lowest-scoring calls. Side-by-side context vs answer with contradicted claims highlighted, claim verdicts, citation list.
  Act: Each card ends with a "What to try for this call" box prescribing fixes specific to that span's failure pattern.

Diagnostic rules

The recommendations panel inspects the filtered rows and emits cards from eight pattern-detection rules. Each card has a severity (high / medium / low / info / good), a plausible cause in plain English, and a numbered list of concrete fixes.

Invented citations
  Triggers: > 30% of detected citation patterns aren't in context
  Try: Tighten prompt; verify citations post-hoc; try hybrid retrieval

High contradiction rate
  Triggers: > 20% of judged claims directly contradict context
  Try: Strengthen system prompt; move context closer to question; reduce max_tokens

Out-of-context elaboration
  Triggers: > 25% unsupported claims with low contradiction
  Try: Add refusal instruction; check retrieval recall; coverage prompt

Channel concentration
  Triggers: > 50% of flagged calls share one model/tenant/endpoint
  Try: Diff deploys; compare prompts; verify index coverage for that channel

Hallucination drift
  Triggers: Δ vs baseline < −0.1
  Try: Use heatmap to localise; cross-reference deploys; use peekr replay

Error spikes
  Triggers: > 5% of calls have status="error"
  Try: Check rate limits; verify fallback model quality; add retries

Citations all grounded
  Triggers: ≥ 5 citations, 0 invented
  Try: Add an alert on citation invention rate to catch future regressions

Healthy
  Triggers: No patterns triggered
  Try: Set up peekr.alert.ScoreFloor; run the offline benchmark periodically

Per-span action items

Every worst-offender card ends with a tailored "What to try for this call" panel — separate from the aggregate recommendations. It inspects that one span's claims, citations, and context to suggest fixes targeted to its specific failure pattern:

Detected on this span                          What the action box suggests
Empty / short context but long answer          Retrieval miss — inspect what your retriever returned
Invented URLs / arXiv / DOIs / titles          Per-kind prompt fix + post-hoc citation verification
Contradicted numbers / dates                   "Copy numerics verbatim" instruction; temperature=0
Contradicted proper nouns                      Explicit "don't substitute names" instruction
Mostly unsupported claims, no contradictions   Add refusal: "say I don't know if not in context"
Mostly contradicted claims                     Move context closer to question; "context wins" instruction
Low score but no detailed claims               Enable Hallucination(detailed=True) to see what failed
Output much longer than context                Reduce max_tokens; long completions drift

Tag spans for the channel breakdown

The heatmap groups by attributes.model (set automatically by the patches), attributes.user_id (set via peekr.session(user_id=...)), and attributes.endpoint (you set this). Without an endpoint attribute, the endpoint row of the heatmap simply doesn't render — the others still do.

python
import peekr
from peekr import trace, get_current_span

@trace
def handle_request(req):
    get_current_span().attributes["endpoint"] = req.path
    return call_llm(...)

# Or in a FastAPI middleware — one place, every request tagged
@app.middleware("http")
async def tag_span(request, call_next):
    with peekr.session(tenant_id=request.headers.get("X-Tenant-Id")):
        span, token = peekr.start_span(f"http.{request.method}")
        span.attributes["endpoint"] = request.url.path
        try:
            return await call_next(request)
        finally:
            peekr.end_span(span, token)
The dashboard reads from JSONL or SQLite — whatever you configured in peekr.instrument(). It's a post-hoc tool: rerun it whenever you want a fresh snapshot. For a real-time view, use peekr view --io in the terminal.

Feedback + export

Label traces as good or bad. Export labelled data as a fine-tuning dataset.

python
import peekr

# Rate a trace
peekr.feedback(trace_id="a3f2b1c0...", rating="good", note="perfect answer")
peekr.feedback(trace_id="b2e4c8f1...", rating="bad", note="hallucinated")

# Export good traces as OpenAI fine-tuning data
peekr.export_feedback(
    db_path="traces.db",
    filter="good",
    output="training.jsonl",
    format="openai-ft",  # or "raw"
)

The openai-ft format produces one JSON object per trace:

training.jsonl
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

A/B experiments

Route traffic between variants and tag each span. Analyse results with SQL — no separate tracking tool needed.

python
from peekr import experiment

# List variants — equal split by default
@experiment(variants=["control", "test_v2"])
def run_agent(query: str, variant: str):
    model = "gpt-4o" if variant == "control" else "claude-opus-4-5"
    return call_llm(model, query)

# Dict variants — passes config too
@experiment(variants={
    "control": {"model": "gpt-4o"},
    "test": {"model": "claude-opus-4-5"},
})
def run_agent(query: str, variant: str, variant_config: dict):
    return call_llm(variant_config["model"], query)

Analyse in SQLite:

sql
SELECT json_extract(attributes,'$.experiment_variant') variant,
       COUNT(*) runs,
       AVG(CASE WHEN status='error' THEN 1.0 ELSE 0.0 END) error_rate,
       AVG(json_extract(attributes,'$.tokens_total')) avg_tokens,
       AVG(duration_ms) avg_ms
FROM spans
WHERE json_extract(attributes,'$.experiment_variant') IS NOT NULL
GROUP BY variant;

Trace replay

Re-run a stored trace with the same inputs. Useful for reproducing production bugs locally or verifying a fix against a real failure.

python
from peekr.replay import replay_trace

# Re-run from SQLite
new_trace_id = replay_trace(trace_id="a3f2b1c0...", db_path="traces.db")
print(f"New trace: {new_trace_id}")

# Re-run from JSONL
new_trace_id = replay_trace(trace_id="a3f2b1c0...", jsonl_path="traces.jsonl")

Or use the CLI:

terminal
peekr replay a3f2b1c0
peekr replay a3f2b1c0 --db traces.db
peekr replay a3f2b1c0 --jsonl traces.jsonl
Replay re-runs the stored LLM inputs through the live SDK. The agent itself is not re-invoked — only the LLM calls are replayed. This means tool calls are not replicated, but you get a new trace showing exactly what the model produces with those inputs today.

Peekr Cloud

The OSS SDK runs in your process, writes to local files, and is MIT licensed forever. When a single-process file isn't the right fit any more — multiple services, a team that needs shared dashboards, longer retention, audit-grade trace storage — Peekr Cloud is the optional managed backend.

The wire format is already in the SDK. tenant_id and retention_class exist in v0.3 specifically so the spans you're producing today work without modification when you connect.

python
import peekr

peekr.instrument(
    tenant_id="acme",
    exporter=peekr.HTTPExporter(
        endpoint="https://ingest.peekr.cloud",
        api_key="pk_live_…",
    ),
)

HTTPExporter ships as a stub in v0.3 — the constructor signature is stable so you can wire your call sites today, and they won't change when the implementation lands. Until then it raises NotImplementedError on .export() so a misconfigured pipeline fails loudly rather than silently dropping spans.

Get on the waitlist · GitHub Discussions