Documentation
Peekr records every LLM call, tool invocation, token count, and error as a tree you can inspect. This page covers everything from installation to advanced usage.
Installation
```bash
pip install peekr               # base — no LLM SDK required
pip install "peekr[openai]"     # with OpenAI
pip install "peekr[anthropic]"  # with Anthropic
pip install "peekr[all]"        # both
```
Requires Python 3.9+. No accounts, no backend, no environment variables.
Quickstart
Add two lines before your agent runs:
`agent.py`
```python
import peekr
peekr.instrument()

import openai

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```
Every LLM call is now traced automatically. Peekr writes to traces.jsonl and prints to the console.
View your traces:
```bash
peekr view traces.jsonl        # tree view
peekr view --io traces.jsonl   # include inputs and outputs
```
```text
Trace a3f2b1c0                              843ms   312tok
────────────────────────────────────────────────
openai.chat.completions [gpt-4o]            843ms   312tok
```
Example: Debug a wrong answer
Your agent returns an incorrect response and you don't know why. Add @trace to your tool functions and run with --io:
`agent.py`
```python
import peekr
peekr.instrument()

from peekr import trace

@trace
def fetch_user(user_id: int) -> dict:
    return db.get(user_id)  # returns None if not found

@trace(name="agent.run")
def run(user_id: int):
    user = fetch_user(user_id)
    # bug: no null check before passing to LLM
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": f"User: {user}"}],
    )
```
`peekr view --io traces.jsonl`
```text
agent.run                               2100ms
└─ tool.fetch_user                        12ms
     in:  {"args": [42], "kwargs": {}}
     out: null                    ← found it
└─ openai.chat.completions [gpt-4o]     2088ms
     in:  [{"role": "system", "content": "User: null..."}]
```
The LLM received null as the user object. The fix is a null check in run(), not a prompt change.
Example: Find slow steps
Wrap every step in your agent with @trace and look at the durations:
`agent.py`
```python
from peekr import trace

@trace
def search_web(query: str) -> list: ...

@trace
def rerank_results(results: list) -> list: ...

@trace(name="agent.run")
def run(query: str):
    results = search_web(query)
    ranked = rerank_results(results)
    return openai.chat.completions.create(...)
```
`peekr view traces.jsonl`
```text
agent.run                    4300ms
└─ tool.search_web           3800ms   ← 88% of time
└─ tool.rerank_results         18ms
└─ openai.chat                490ms
```
Cache search_web results for repeated queries, or run it in parallel with other setup work. The LLM is not the bottleneck.
Example: Reduce token costs
Run your agent a few times on the same task and compare token counts across traces:
```bash
peekr view traces.jsonl
```
```text
Trace a3f2b1c0    18,432 tokens
Trace b2e4c8f1    21,104 tokens
Trace c5d9e2a7    24,891 tokens
```
Token count growing each run is the signature of unbounded history — the agent appends every message to the next call. Fix: summarize or truncate the conversation after a fixed number of turns.
`agent.py`
```python
# Before: growing history
messages = conversation_history  # gets longer every turn

# After: summarize after 5 turns
if len(conversation_history) > 10:  # 5 exchanges = 10 messages
    summary = summarize(conversation_history)
    messages = [{"role": "system", "content": summary}]
else:
    messages = conversation_history
```
Example: Prod vs local bugs
Your agent passes tests locally but fails in production. Capture traces in both environments and compare tool outputs:
`agent.py`
```python
@trace
def fetch_inventory(sku: str) -> list:
    return inventory_api.get(sku)
```
local trace:
```text
tool.fetch_inventory    8ms
  in:  {"sku": "ABC-123"}
  out: [{"id": 1, "qty": 42}]   ← data present locally
```
prod trace:
```text
tool.fetch_inventory    8ms
  in:  {"sku": "ABC-123"}
  out: []                       ← empty in prod
```
The agent logic is identical. The inventory API returns different data in prod — likely a missing data migration or environment-specific config. Fix the data source, not the agent.
instrument()
Call once, before any LLM calls. Patches the OpenAI and Anthropic SDKs.
```python
peekr.instrument(
    console=True,               # print spans live (default: True)
    storage="jsonl",            # "jsonl" | "sqlite" | "both"
    jsonl_path="traces.jsonl",  # JSONL output path
    db_path="traces.db",        # SQLite output path
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `console` | bool | `True` | Print each span to stdout as it completes |
| `storage` | str | `"jsonl"` | `"jsonl"`, `"sqlite"`, or `"both"` |
| `jsonl_path` | str | `"traces.jsonl"` | Path for JSONL output |
| `db_path` | str | `"traces.db"` | Path for SQLite output |
SQLite storage
SQLite uses WAL mode so multiple processes — Docker containers, CI workers, parallel agents — can write spans safely at the same time. And because it's a real database, you can query across all your runs without any extra tooling.
```python
# Enable SQLite
peekr.instrument(storage="sqlite")

# Write to both JSONL and SQLite
peekr.instrument(storage="both")
```
View with the same CLI command:
```bash
peekr view traces.db
peekr view --io traces.db
```
Or query directly with any SQLite client:
Useful queries:
```bash
# Slowest tool calls
sqlite3 traces.db "SELECT name, ROUND(AVG(duration_ms)) avg_ms FROM spans GROUP BY name ORDER BY avg_ms DESC"

# Token spend by model
sqlite3 traces.db "SELECT json_extract(attributes,'$.model') model, SUM(json_extract(attributes,'$.tokens_total')) tokens FROM spans GROUP BY model"

# All errors
sqlite3 traces.db "SELECT name, trace_id, json_extract(attributes,'$.error') msg FROM spans WHERE status = 'error'"

# Token growth over time (detect unbounded history)
sqlite3 traces.db "SELECT trace_id, SUM(json_extract(attributes,'$.tokens_total')) total FROM spans GROUP BY trace_id ORDER BY start_time"
```
If you prefer plain text, the JSONL output still works with grep or tail -f.

@trace decorator
Wraps a function as a span. Works on sync and async functions.
```python
from peekr import trace

# Auto-names from module.function
@trace
def search_web(query: str) -> list: ...

# Custom name
@trace(name="tool.search")
def search(query: str) -> list: ...

# Opt out of capturing inputs/outputs (latency + status still recorded)
@trace(capture_io=False)
def fetch_api_key() -> str: ...

# Async
@trace
async def fetch_user(user_id: int) -> dict: ...
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | str \| None | module.function | Custom span name |
| `capture_io` | bool | `True` | Record function args and return value |
Use capture_io=False for functions that handle secrets or large payloads.

Manual spans
For cases where a decorator doesn't fit — e.g. a loop, a context manager, or code you can't modify:
```python
from peekr import start_span, end_span

span, token = start_span("my.operation")
span.attributes["custom_key"] = "value"
try:
    result = do_work()
    span.status = "ok"
except Exception as e:
    span.status = "error"
    span.attributes["error"] = str(e)
    raise
finally:
    end_span(span, token)  # always call — even on error
```
Spans created inside do_work() will automatically nest as children of this span.

CLI viewer
```bash
peekr view traces.jsonl        # tree view
peekr view --io traces.jsonl   # + inputs and outputs
```
Each trace is shown as a tree grouped by trace_id. The --io flag prints up to 120 characters of the serialized input and output for each span.
Custom exporters
Any object with an export(span) method works as an exporter:
```python
import requests

import peekr
from peekr.exporters import add_exporter

class HttpExporter:
    def export(self, span):
        requests.post(
            "https://your-backend.com/spans",
            json=span.to_dict(),
        )

peekr.instrument()
add_exporter(HttpExporter())
```
Multiple exporters can be active at once. The built-in JSONLExporter and ConsoleExporter are added by instrument(). You can add your own on top.
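For instance, you might add a second exporter that only forwards failed spans to an external service while the built-ins keep writing everything locally. A minimal sketch — the class name and URL are hypothetical; it relies only on the export(span) protocol, span.status, and span.to_dict() shown on this page:

```python
import requests
from peekr.exporters import add_exporter

class ErrorOnlyExporter:
    """Hypothetical exporter: forward only failed spans to an external endpoint."""
    def export(self, span):
        if span.status == "error":
            requests.post("https://example.com/peekr-errors", json=span.to_dict())

add_exporter(ErrorOnlyExporter())  # runs alongside the built-in JSONL/console exporters
```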
Span fields
Every span written to traces.jsonl is a JSON object with these fields:
| Field | Type | Description |
|---|---|---|
| `trace_id` | string | Groups all spans in one agent run |
| `span_id` | string | Unique ID for this span |
| `parent_id` | string \| null | ID of the parent span, or null for root |
| `name` | string | Span name |
| `start_time` | float | Unix timestamp |
| `end_time` | float | Unix timestamp |
| `duration_ms` | float | Wall-clock duration in milliseconds |
| `status` | `"ok"` \| `"error"` | Whether the span succeeded |
| `tenant_id` | string \| null | Customer org (B2B). First-class — top-level column in SQLite, top-level key in JSONL. Set via `peekr.session(tenant_id=...)`, `instrument(tenant_id=...)`, or env `PEEKR_TENANT_ID`. |
| `retention_class` | string \| null | Storage-tier hint (e.g. `"default"`, `"short"`, `"long"`, `"pii"`). OSS stores it; storage tier interprets it. |
| `attributes.model` | string | LLM model name (auto-captured) |
| `attributes.tokens_input` | int | Prompt tokens (auto-captured) |
| `attributes.tokens_output` | int | Completion tokens (auto-captured) |
| `attributes.tokens_total` | int | Total tokens (auto-captured) |
| `attributes.input` | string | Serialized function args (truncated) |
| `attributes.output` | string | Serialized return value (truncated) |
| `attributes.error` | string | Exception message if status is `"error"` |
| `attributes.session_id` | string | Set when span is inside a `peekr.session()` |
| `attributes.user_id` | string | Set when span is inside a `peekr.session(user_id=...)` |
| `attributes.eval_scores` | dict | Evaluator name → score (0.0–1.0) when evaluators are configured |
| `attributes.experiment_variant` | string | Variant name when inside a `@peekr.experiment` |
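Because each line of traces.jsonl is one of these JSON objects, post-processing needs nothing beyond the standard library. A small sketch that totals tokens per trace — it assumes the attributes.* fields are nested under an "attributes" key in the JSONL output, mirroring the json_extract(attributes, ...) queries used for SQLite:

```python
import json
from collections import defaultdict

totals: dict[str, int] = defaultdict(int)
with open("traces.jsonl") as f:
    for line in f:
        span = json.loads(line)
        # tool spans carry no token counts, so default to 0
        totals[span["trace_id"]] += span.get("attributes", {}).get("tokens_total") or 0

for trace_id, tokens in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(trace_id, tokens)
```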
Sessions
Group all spans for a user, tenant, or conversation by passing identifiers to peekr.session(). Uses ContextVar so it propagates correctly across async code.
```python
import peekr

with peekr.session(
    user_id="user_123",        # end-user (B2C)
    tenant_id="acme",          # customer org (B2B)
    session_id="sess_abc",     # auto-generated if omitted
    retention_class="long",    # storage-tier hint
):
    run_agent()
```
tenant_id and retention_class are first-class columns on the span — see Multi-tenant traces.
Query by user in SQLite:
```sql
SELECT trace_id,
       AVG(duration_ms),
       SUM(json_extract(attributes,'$.tokens_total'))
FROM spans
WHERE json_extract(attributes,'$.user_id') = 'user_123'
GROUP BY trace_id;
```
Multi-tenant traces
Every span carries two first-class fields — tenant_id (the customer org) and retention_class (a storage-tier hint) — separate from user_id (the end-user). A B2B agent can tag both without conflict.
```python
import peekr

peekr.instrument(tenant_id="acme", retention_class="default")

with peekr.session(user_id="alice", tenant_id="acme", retention_class="long"):
    run_agent()
```
Resolution order, highest priority first (a precedence sketch follows the list):

1. `peekr.session(tenant_id=..., retention_class=...)`
2. `peekr.instrument(tenant_id=..., retention_class=...)`
3. Env vars `PEEKR_TENANT_ID` / `PEEKR_RETENTION_CLASS`
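A minimal sketch of that precedence in practice — run_agent() stands in for your own entry point, and the env var is set before instrument() so the fallback is in place when spans are created:

```python
import os
import peekr

os.environ["PEEKR_RETENTION_CLASS"] = "short"   # 3. env var — lowest priority
peekr.instrument(tenant_id="acme")              # 2. default applied to every span

with peekr.session(tenant_id="globex", retention_class="long"):
    run_agent()   # 1. session wins: tenant_id="globex", retention_class="long"

run_agent()       # outside the session: tenant_id="acme", retention_class="short"
```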
Both fields are top-level columns in SQLite (with indices) and top-level keys in JSONL — no json_extract needed:
```sql
SELECT tenant_id, COUNT(*) FROM spans GROUP BY tenant_id;

SELECT * FROM spans WHERE retention_class = 'long' AND start_time > ?;
```
retention_class is a free-form string in the OSS SDK — recommended values are default, short, long, and pii. The meaning of each is enforced by your storage tier (or by Peekr Cloud).
Why top-level columns rather than attributes.tenant_id? So you can filter and index without JSON extraction — relevant the moment you have more than a handful of tenants or want to route ingestion. The SQLite exporter migrates pre-v0.3 databases automatically via PRAGMA user_version; legacy rows back-fill as NULL.

Alerts
Alerts fire after each complete trace (identified by the root span). Pass them to instrument():
```python
peekr.instrument(alerts=[
    peekr.alert.ErrorRate(threshold=0.05, window=100),  # >5% errors in last 100 traces
    peekr.alert.CostSpike(multiplier=2.0),               # tokens 2× above rolling avg
    peekr.alert.LatencyP95(ms=5000),                     # p95 latency > 5s
    peekr.alert.TokenGrowth(runs=5),                     # growing 5 consecutive runs
])
```
Override on_trigger to send to Slack, PagerDuty, or anywhere:
```python
class SlackAlert(peekr.alert.ErrorRate):
    def on_trigger(self, message: str) -> None:
        slack.send(f"#alerts: {message}")

peekr.instrument(alerts=[SlackAlert(threshold=0.05)])
```
| Alert | Triggers when | Key params |
|---|---|---|
| `ErrorRate` | Error % in last N traces > threshold | `threshold`, `window=100` |
| `CostSpike` | This trace's tokens > multiplier × rolling avg | `multiplier`, `window=50` |
| `LatencyP95` | p95 span latency in trace > ms | `ms` |
| `TokenGrowth` | Token count strictly increasing for N runs | `runs=5` |
Eval — LLM-as-judge
Evaluators run after each LLM span completes and write scores to span.attributes["eval_scores"]. An _in_eval guard prevents infinite recursion, so the judge's own LLM calls are not themselves evaluated.
```python
peekr.instrument(evaluators=[
    peekr.eval.Rubric("Be concise and factually accurate"),
    peekr.eval.Hallucination(),  # groundedness check (see below)
    peekr.eval.NotEmpty(),       # output must be non-empty
    peekr.eval.NoError(),        # span must have status=ok
])
```
Scores are written to span.attributes["eval_scores"] as a {evaluator_name: float} dict and shown inline by peekr view --io:
`peekr view --io traces.jsonl`
```text
openai.chat.completions [gpt-4o]    843ms   312tok
  in:  "Summarise this doc..."
  out: "The doc argues that..."
  eval_scores: {Rubric: 0.92, Hallucination: 0.95, NotEmpty: 1.0, NoError: 1.0}
```
Query scores in SQLite:
```sql
SELECT name,
       AVG(json_extract(attributes,'$.eval_scores.Rubric')) rubric_avg,
       AVG(json_extract(attributes,'$.eval_scores.Hallucination')) hallucination_avg
FROM spans
WHERE json_extract(attributes,'$.eval_scores') IS NOT NULL
GROUP BY name;
```
Write your own evaluator:
```python
from peekr.eval import BaseEvaluator

class LengthCheck(BaseEvaluator):
    def evaluate(self, span) -> float:
        output = span.attributes.get("output", "")
        return 1.0 if len(output) < 500 else 0.0
```
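Presumably a custom evaluator is registered the same way as the built-ins — pass an instance in the evaluators= list:

```python
import peekr

peekr.instrument(evaluators=[
    LengthCheck(),          # the custom evaluator defined above
    peekr.eval.NoError(),   # built-ins can be mixed in freely
])
```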
| Evaluator | What it checks | Requires |
|---|---|---|
| `Rubric(criteria)` | LLM scores output against your criteria (0.0–1.0) | openai or anthropic SDK |
| `Hallucination()` | Fraction of claims grounded in the input/context (0.0–1.0) | openai or anthropic SDK |
| `NotEmpty()` | Output attribute is a non-empty string | Nothing |
| `NoError()` | Span status is `"ok"` | Nothing |
Hallucination detection
The Hallucination evaluator scores how well an LLM output is supported by its input context. It uses an LLM-as-judge under the hood — the same fallback pattern as Rubric (OpenAI first, then Anthropic).
| Score | Meaning |
|---|---|
| `1.0` | Every factual claim in the output is supported by the context |
| `0.0` | No claim is supported — the output is fully hallucinated |
| between | The fraction of claims grounded in the context |
Plug it in like any other evaluator:
```python
import peekr
peekr.instrument(evaluators=[peekr.eval.Hallucination()])

import openai

openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "The Eiffel Tower was completed in 1889 in Paris."},
        {"role": "user", "content": "When was the Eiffel Tower built and by whom?"},
    ],
)
```
`peekr view --io traces.jsonl`
```text
openai.chat.completions [gpt-4o]    843ms
  in:  [{"role": "system", "content": "The Eiffel Tower was completed in 1889 in Paris."}, ...]
  out: "The Eiffel Tower was built in 1923 by Frank Lloyd Wright."
  eval_scores: {Hallucination: 0.0}   ← invented year and architect
```
RAG flows: point it at retrieved documents
By default the evaluator uses the span's input (the messages sent to the LLM) as the grounding context. For RAG flows where the source documents live elsewhere — say, on a parent tool span — pass a context_extractor:
```python
peekr.instrument(evaluators=[
    peekr.eval.Hallucination(
        context_extractor=lambda span: span.attributes.get("retrieved_docs", ""),
        model="gpt-4o-mini",  # optional — defaults to gpt-4o-mini / claude-haiku
    ),
])
```
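One way to make retrieved documents reachable from the extractor is to stash them as a span attribute at retrieval time. This is only a sketch under that assumption — retriever and call_llm are your own helpers, and which span the extractor receives depends on how your pipeline is structured:

```python
from peekr import trace, get_current_span

@trace(name="rag.answer")
def answer(question: str) -> str:
    docs = retriever.search(question)  # hypothetical retriever
    # stash the grounding text where a context_extractor can find it
    get_current_span().attributes["retrieved_docs"] = "\n\n".join(d.text for d in docs)
    return call_llm(question, docs)    # hypothetical LLM call
```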
If the extracted context is empty, the span scores 1.0 (nothing to judge), so non-RAG spans don't drag down your average. The judge LLM call is cheap (max 10 output tokens) and runs after the original span completes — it never blocks the main request.

Find your worst hallucinations
```sql
-- Lowest-scoring outputs across all runs
SELECT trace_id,
       json_extract(attributes,'$.eval_scores.Hallucination') AS hallucination,
       json_extract(attributes,'$.output') AS output
FROM spans
WHERE hallucination IS NOT NULL AND hallucination < 0.5
ORDER BY hallucination ASC, start_time DESC
LIMIT 20;

-- Hallucination rate by model
SELECT json_extract(attributes,'$.model') model,
       AVG(json_extract(attributes,'$.eval_scores.Hallucination')) avg_groundedness,
       COUNT(*) runs
FROM spans
WHERE json_extract(attributes,'$.eval_scores.Hallucination') IS NOT NULL
GROUP BY model
ORDER BY avg_groundedness ASC;
```
Detailed mode — RAGAS-style claim decomposition
The default mode returns a single score. detailed=True switches to a RAGAS Faithfulness-style pipeline: the judge first decomposes the output into atomic factual claims, then assigns each claim one of three verdicts:
| Verdict | Meaning |
|---|---|
| `supported` | Claim is directly entailed by CONTEXT |
| `contradicted` | Claim directly conflicts with CONTEXT |
| `unsupported` | CONTEXT is silent about the claim |
The score becomes supported_count / total_claims and the full breakdown lands on the span at attributes.hallucination_details:
```python
peekr.instrument(evaluators=[peekr.eval.Hallucination(detailed=True)])
```
`span.attributes.hallucination_details`
```json
{
  "total": 3,
  "supported": 1,
  "contradicted": 2,
  "unsupported": 0,
  "score": 0.33,
  "claims": [
    {"text": "The Eiffel Tower is in Paris", "verdict": "supported"},
    {"text": "It was built in 1923", "verdict": "contradicted"},
    {"text": "It was designed by Frank Lloyd Wright", "verdict": "contradicted"}
  ]
}
```
This is what powers the drift dashboard's drill-down — you can see exactly which claims the model invented, not just an average score.
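You can also pull the breakdown out of stored traces yourself. A sketch that lists every contradicted claim in a JSONL file, assuming attributes is a nested key as in the SQLite json_extract queries above:

```python
import json

with open("traces.jsonl") as f:
    for line in f:
        span = json.loads(line)
        details = span.get("attributes", {}).get("hallucination_details")
        if not details:
            continue
        for claim in details["claims"]:
            if claim["verdict"] == "contradicted":
                print(span["trace_id"], "→", claim["text"])
```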
Detailed mode plugs in like any other evaluator — it is still just an entry in the evaluators= list.

Observability dashboard
Generate a self-contained, tabbed HTML report from your traces. Designed as a drop-in observability layer for any RAG or memory/agent pipeline — open the file in any browser, no server, no build step.
```bash
peekr dashboard traces.db -o report.html   # SQLite
peekr dashboard traces.jsonl               # JSONL → ./dashboard.html
open report.html
```
Five tabs, one URL
The dashboard is organised so a non-technical observer can stay on the Overview tab and still get the gist, while an engineer can drill into Traces / Quality / Diagnose for specifics. Tab state is in the URL hash so links are shareable. A persistent filter bar at the top applies across every tab.
| Tab | For | Contents |
|---|---|---|
| `#overview` | First-impression / exec | Health hero (0–100), narrative bullets, 4 metric cards with sparklines, top 3 action items pulled from the diagnostic engine. |
| `#traces` | "Find me that call" | Search box (trace ID, model, content, error), sortable table, click any row → side panel with full context vs answer, claim verdicts, citations, per-call action items. |
| `#quality` | Trend monitoring | Rolling chart with warning (0.7) / critical (0.5) threshold lines, score distribution histogram, channel × time heatmap, claim-verdict doughnut, citation panel. |
| `#diagnose` | Incident response | "Likely causes & next steps" with severity-tagged cards and numbered fix lists, plus the full worst-offenders panel with side-by-side highlighted context vs answer. |
| `#help` | First-time setup | Setup checklist (auto-ticks live), glossary, evaluator configuration snippets, troubleshooting, keyboard shortcuts. |
Keyboard shortcuts
| Key | Action |
|---|---|
| `1`–`5` | Switch tabs |
| `/` | Jump to Traces tab and focus the search box |
| `R` | Clear all filters |
| `Esc` | Close the trace detail panel |
Filter bar
One persistent bar at the top of every tab. Click any chip to toggle that filter; every panel on every tab refilters immediately. The time-range chips include 5m, 15m, 30m, 1h, 24h, 7d, 30d presets plus a Custom… option with datetime-local from/to inputs. The "When = Custom" mode seeds itself to "last 1h up to the newest timestamp" so first activation isn't empty.
Panels at a glance
| Panel | What it shows | How to act on it |
|---|---|---|
| Health hero | One 0–100 score with a coloured dot (green/yellow/orange/red), tier label, count of flagged calls, and Δ vs baseline. | Red → open the recommendations panel below. |
| What's happening | 3–5 plain-English bullets summarising the situation: drift, worst channel, citation invention rate, error count. | Read top-to-bottom; the highest-priority finding is first. |
| Filter chips | Tenant · Model · Endpoint · Time range. Stack to drill in. | Click chips to refilter every panel. Click again to clear. |
| Metric cards | Hallucination · Rubric · Citations · Errors. Each with sparkline, Δ vs baseline, count of scored calls, and an action hint. | The hint at the bottom tells you the next step (e.g. "30 flagged — review worst offenders below"). |
| Likely causes & next steps | Diagnostic engine — runs eight pattern-detection rules and surfaces ranked recommendations with cause + numbered fix list. | Each card has a severity badge and a "what to try" list specific to that pattern. |
| Score over time | Rolling 20-call mean of every evaluator, with dashed warning (0.7) and critical (0.5) threshold lines. | Hover for trace details; click a point to jump to its worst-offender card. |
| Failure breakdown heatmap | Channel × time grid. Rows = your models/tenants/endpoints. Columns = time buckets. Colour = mean Hallucination. | Red rows tell you which channel is failing; rows that go green → red tell you when. Click a cell to filter. |
| Worst offenders | 12 lowest-scoring calls. Side-by-side context vs answer with contradicted claims highlighted, claim verdicts, citation list. | Each card ends with a "What to try for this call" box prescribing fixes specific to that span's failure pattern. |
Diagnostic rules
The recommendations panel inspects the filtered rows and emits cards from eight pattern-detection rules. Each card has a severity (high / medium / low / info / good), a plausible cause in plain English, and a numbered list of concrete fixes.
| Pattern | Triggers when | Sample recommendation |
|---|---|---|
| Invented citations | > 30% of detected citation patterns aren't in context | Tighten prompt; verify citations post-hoc; try hybrid retrieval |
| High contradiction rate | > 20% of judged claims directly contradict context | Strengthen system prompt; move context closer to question; reduce max_tokens |
| Out-of-context elaboration | > 25% unsupported claims with low contradiction | Add refusal instruction; check retrieval recall; coverage prompt |
| Channel concentration | > 50% of flagged calls share one model/tenant/endpoint | Diff deploys; compare prompts; verify index coverage for that channel |
| Hallucination drift | Δ vs baseline < −0.1 | Use heatmap to localise; cross-reference deploys; use peekr replay |
| Error spikes | > 5% of calls have status="error" | Check rate limits; verify fallback model quality; add retries |
| Citations all grounded | ≥ 5 citations, 0 invented | Add an alert on citation invention rate to catch future regressions |
| Healthy | No patterns triggered | Set up peekr.alert.ScoreFloor; run the offline benchmark periodically |
Per-span action items
Every worst-offender card ends with a tailored "What to try for this call" panel — separate from the aggregate recommendations. It inspects that one span's claims, citations, and context to suggest fixes targeted to its specific failure pattern:
| Detected on this span | What the action box suggests |
|---|---|
| Empty / short context but long answer | Retrieval miss — inspect what your retriever returned |
| Invented URLs / arXiv / DOIs / titles | Per-kind prompt fix + post-hoc citation verification |
| Contradicted numbers / dates | "Copy numerics verbatim" instruction; temperature=0 |
| Contradicted proper nouns | Explicit "don't substitute names" instruction |
| Mostly unsupported claims, no contradictions | Add refusal: "say I don't know if not in context" |
| Mostly contradicted claims | Move context closer to question; "context wins" instruction |
| Low score but no detailed claims | Enable Hallucination(detailed=True) to see what failed |
| Output much longer than context | Reduce max_tokens; long completions drift |
Tag spans for the channel breakdown
The heatmap groups by attributes.model (set automatically by the patches), attributes.user_id (set via peekr.session(user_id=...)), and attributes.endpoint (you set this). Without an endpoint attribute, the endpoint row of the heatmap simply doesn't render — the others still do.
```python
from peekr import trace, get_current_span

@trace
def handle_request(req):
    get_current_span().attributes["endpoint"] = req.path
    return call_llm(...)

# Or in a FastAPI middleware — one place, every request tagged
@app.middleware("http")
async def tag_span(request, call_next):
    with peekr.session(user_id=request.headers.get("X-Tenant-Id")):
        span, token = peekr.start_span(f"http.{request.method}")
        span.attributes["endpoint"] = request.url.path
        try:
            return await call_next(request)
        finally:
            peekr.end_span(span, token)
```
The dashboard reads stored traces; it does not hook into peekr.instrument(). It's a post-hoc tool: rerun it whenever you want a fresh snapshot. For a real-time view, use peekr view --io in the terminal.

Feedback + export
Label traces as good or bad. Export labelled data as a fine-tuning dataset.
```python
import peekr

# Rate a trace
peekr.feedback(trace_id="a3f2b1c0...", rating="good", note="perfect answer")
peekr.feedback(trace_id="b2e4c8f1...", rating="bad", note="hallucinated")

# Export good traces as OpenAI fine-tuning data
peekr.export_feedback(
    db_path="traces.db",
    filter="good",
    output="training.jsonl",
    format="openai-ft",  # or "raw"
)
```
The openai-ft format produces one JSON object per trace:
`training.jsonl`
```json
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
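Before uploading, it can be worth a quick sanity pass over the exported file — a sketch that just checks each line parses and contains a non-empty conversation:

```python
import json

with open("training.jsonl") as f:
    examples = [json.loads(line) for line in f]

assert all(ex.get("messages") for ex in examples), "found an example with no messages"
print(f"{len(examples)} training examples look well-formed")
```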
A/B experiments
Route traffic between variants and tag each span. Analyse results with SQL — no separate tracking tool needed.
```python
from peekr import experiment

# List variants — equal split by default
@experiment(variants=["control", "test_v2"])
def run_agent(query: str, variant: str):
    model = "gpt-4o" if variant == "control" else "claude-opus-4-5"
    return call_llm(model, query)

# Dict variants — passes config too
@experiment(variants={
    "control": {"model": "gpt-4o"},
    "test": {"model": "claude-opus-4-5"},
})
def run_agent(query: str, variant: str, variant_config: dict):
    return call_llm(variant_config["model"], query)
```
Analyse in SQLite:
```sql
SELECT json_extract(attributes,'$.experiment_variant') variant,
       COUNT(*) runs,
       AVG(CASE WHEN status='error' THEN 1.0 ELSE 0.0 END) error_rate,
       AVG(json_extract(attributes,'$.tokens_total')) avg_tokens,
       AVG(duration_ms) avg_ms
FROM spans
WHERE json_extract(attributes,'$.experiment_variant') IS NOT NULL
GROUP BY variant;
```
Trace replay
Re-run a stored trace with the same inputs. Useful for reproducing production bugs locally or verifying a fix against a real failure.
```python
from peekr.replay import replay_trace

# Re-run from SQLite
new_trace_id = replay_trace(trace_id="a3f2b1c0...", db_path="traces.db")
print(f"New trace: {new_trace_id}")

# Re-run from JSONL
new_trace_id = replay_trace(trace_id="a3f2b1c0...", jsonl_path="traces.jsonl")
```
Or use the CLI:
```bash
peekr replay a3f2b1c0
peekr replay a3f2b1c0 --db traces.db
peekr replay a3f2b1c0 --jsonl traces.jsonl
```
Peekr Cloud
The OSS SDK runs in your process, writes to local files, and is MIT licensed forever. When a single-process file isn't the right fit any more — multiple services, a team that needs shared dashboards, longer retention, audit-grade trace storage — Peekr Cloud is the optional managed backend.
The wire format is already in the SDK. tenant_id and retention_class exist in v0.3 specifically so the spans you're producing today work without modification when you connect.
```python
import peekr

peekr.instrument(
    tenant_id="acme",
    exporter=peekr.HTTPExporter(
        endpoint="https://ingest.peekr.cloud",
        api_key="pk_live_…",
    ),
)
```
HTTPExporter ships as a stub in v0.3 — the constructor signature is stable so you can wire your call sites today, and they won't change when the implementation lands. Until then it raises NotImplementedError on .export() so a misconfigured pipeline fails loudly rather than silently dropping spans.
Get on the waitlist → GitHub Discussions.