Documentation

CLI reference, fixture development, and Registry API for AgentCarousel.

Installation

AgentCarousel ships as a single binary. The CLI is available as both agentcarousel and the short alias agc.

Linux / macOS (shell installer)

curl -fsSL https://install.agentcarousel.com | sh

macOS (Homebrew)

brew tap agentcarousel/agentcarousel
brew install agentcarousel

Rust / Cargo

cargo install agentcarousel

Windows

Download the .zip archive from the GitHub Releases page, extract, and add the binary to your PATH.

Run agc update to check for and install a newer version in-place. Use agc update --check to print availability without installing.

Quickstart

The fastest path to a passing fixture. No API keys required for mock runs.

# 1. Scaffold a new fixture directory
agc init --skill my-agent

# 2. Validate structure and schema
agc validate fixtures/my-agent/cases.yaml

# 3. Run tests (mock mode is the default — no API keys needed)
agc test fixtures/my-agent/cases.yaml

# 4. Evaluate with a rubric
agc eval fixtures/my-agent/cases.yaml

# 5. Export an evidence bundle
agc export <RUN-ID>

Key Concepts

Understanding these four building blocks covers 90% of everyday usage.

Fixture

A YAML file (cases.yaml) describing a skill or agent under test. Contains metadata and a list of cases with inputs and assertions.

Case

A single test scenario within a fixture. Has an id, input messages, and expected assertions.

Evidence Bundle

A .tar.gz artifact produced by a run. Contains per-case results, a determination, and an optional minisign attestation from a human domain expert.

Trust State

The registry-assigned status of a bundle: Experimental → CarouselCandidate → Stable → Trusted.

validate

Checks fixture YAML structure and schema without executing any cases. Exits 0 on success, 2 on violations. With no paths provided, scans the current directory (respecting .agentcarousel-ignore).

# Validate a single fixture
agc validate fixtures/my-agent/cases.yaml

# Strict mode — warnings become errors (recommended in CI)
agc validate fixtures/my-agent/cases.yaml --strict

# Emit SARIF 2.1.0 for GitHub code scanning
agc validate --format sarif > results.sarif

# Validate all fixtures in the project
agc validate

Flag	Short	Description
--strict	-x	Treat warnings as errors
--format <FORMAT>	-f	human (default), json, or sarif
--schema <FILE>	-s	Override the JSON Schema file

test

Runs cases with mock generation. No live LLM calls and no API keys required. Results are written to the local history database. Mock mode is the default.

# Run every case in a fixture (mock mode, default)
agc test fixtures/my-agent/cases.yaml

# Filter to smoke cases only
agc test fixtures/my-agent/cases.yaml --filter-tags smoke

# Use a custom mock directory
agc test fixtures/my-agent/cases.yaml --mock-dir mocks/

# Stop on the first case failure
agc test fixtures/my-agent/cases.yaml --fail-fast

# Machine-readable output for CI
agc test fixtures/my-agent/cases.yaml --format json

Flag	Short	Description
--offline <bool>	-o	true | false; explicitly toggle mock responses (default: true)
--filter-tags <TAG>	-g	Comma-separated tags; only matching cases run
--filter <GLOB>	-f	Glob matched against full case ids (e.g. my-agent/*)
--mock-dir <PATH>	-m	Directory containing mock response files
--fail-fast	-F	Stop on the first case failure
--format <FORMAT>	-p	human (default), json, or junit
--concurrency <N>	-c	Number of cases to run in parallel
--timeout <SEC>	-t	Per-case timeout in seconds
--timeout-run <SEC>		Cancel the entire run after N seconds

eval

Runs cases with optional judge scoring. Defaults to mock mode (no API keys needed). Switch to --execution-mode live to call a real LLM. Multiple runs (--runs N) let you sample variance and detect flakiness.

# Mock eval (default — no API keys needed)
agc eval fixtures/my-agent/cases.yaml

# Live eval with LLM-as-judge
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
agc eval fixtures/my-agent/cases.yaml \
  --execution-mode live \
  --model gpt-4o \
  --judge \
  --judge-model claude-haiku-4-5-20251001 \
  --evaluator all \
  --runs 3

# Filter to judge-tagged cases only
agc eval fixtures/my-agent/cases.yaml \
  --evaluator judge --judge --filter-tags judge

# Custom HTTP endpoint (non-standard model providers)
agc eval fixtures/my-agent/cases.yaml \
  --execution-mode live --model custom \
  --generator-endpoint http://localhost:8080/v1/chat

Flag	Short	Description
--execution-mode <MODE>	-x	mock (default) or live
--model <ID>	-m	Generator model for live mode (e.g. gpt-4o, gemini-2.5-flash, custom)
--generator-endpoint <URL>		Base URL for a custom agent endpoint (required when model is 'custom')
--judge	-j	Enable LLM-as-judge scoring for judge-scored cases
--judge-model <ID>	-J	Model used as judge
--evaluator <ID>	-e	rules (default) | golden | process | judge | all
--runs <N>	-n	Number of independent runs (default: 1)
--filter-tags <TAG>		Comma-separated tags; only matching cases run
--filter <GLOB>	-F	Glob matched against full case ids
--seed <N>	-s	RNG seed (default: 0)
--format <FORMAT>	-f	human (default), json, or junit
--timeout <SEC>	-t	Per-case timeout in seconds
--timeout-run <SEC>		Cancel the entire run after N seconds
--concurrency <N>	-c	Number of cases to run in parallel

Evaluator selection: Use --evaluator rules for deterministic assertions. Use --evaluator all with --judge for mixed fixtures where some cases use judge scoring and others use rules. Use --evaluator judge only when every case should be judge-scored.

Live runs are non-deterministic. Use --runs 3 or higher and track the effectiveness score across runs rather than relying on a single pass.
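
As a rough illustration of tracking the effectiveness score across runs, the sketch below summarizes a list of per-run scores. The summarizeRuns helper and the shape of its input are hypothetical; it assumes you have already collected one score between 0 and 1 per run (for example from --format json output) and is not part of the CLI.

```javascript
// Hypothetical helper: summarize effectiveness scores collected from N runs.
// Assumes one number in [0, 1] per run; not an AgentCarousel API.
function summarizeRuns(scores) {
  if (scores.length === 0) {
    throw new Error("at least one run is required");
  }
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / scores.length;
  return {
    runs: scores.length,
    mean,
    min: Math.min(...scores),
    max: Math.max(...scores),
    stddev: Math.sqrt(variance), // population standard deviation
  };
}

// Example: three live runs of the same fixture.
console.log(summarizeRuns([0.9, 0.8, 1.0]));
```

A wide gap between min and max, or a large stddev, suggests the fixture is flaky under live execution even when the mean looks healthy.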

lint

Checks fixture quality beyond schema validation: smoke-tag coverage, judge-case descriptions, rubric weight sums, and bundle compliance fields. Exits 0 when no issues are found; 2 on errors.

# Check all fixtures
agc lint fixtures/

# Check a single file
agc lint fixtures/my-agent/cases.yaml

# Exit non-zero on warnings too (useful in CI)
agc lint fixtures/ --error-on-warn

Flag	Short	Description
--error-on-warn	-x	Exit with non-zero on warnings (default: only fail on errors)
--format <FORMAT>	-f	human (default) or json

init

Scaffolds a new skill or agent fixture directory under fixtures/<name>/. Always scaffold from agc init rather than writing from scratch.

# Scaffold a skill fixture
agc init --skill my-agent

# Scaffold an agent fixture
agc init --agent my-agent

Creates cases.yaml, prompt.md, bundle.manifest.json, and an empty golden/ directory.

CLI Reference — Results

report

Inspect persisted runs from the local history database: list runs, show case-level details, or diff two runs side-by-side.

# List recent runs
agc report list

# List with a limit
agc report list --limit 10

# Show details for a specific run
agc report show <RUN-ID>

# Render a local evidence directory or run.json file
agc report show ./reports/evidence-packs/my-agent/

# Diff two runs
agc report diff <RUN-ID-A> <RUN-ID-B>

stats

Shows historical pass-rate trends, per-case flakiness, and latency percentiles from the local run history database. Useful for tracking quality over time and spotting unreliable cases.

# Overview across all skills (last 50 runs)
agc stats

# Filter to a specific skill
agc stats --skill my-agent

# Analyse more history, machine-readable output
agc stats --limit 100 --format json

Flag	Description
--skill <NAME>	Filter to a specific skill or agent name
--limit <N>	Maximum number of runs to analyse, newest first (default: 50)
--format <FORMAT>	human (default) or json
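
For readers unfamiliar with the metrics named above, here is a minimal sketch of a pass rate and a nearest-rank latency percentile. The record fields (passed, latencyMs) and both helper functions are assumptions for illustration; the CLI may compute its statistics differently.

```javascript
// Hypothetical run records; field names are assumptions for illustration.
const records = [
  { passed: true, latencyMs: 120 },
  { passed: true, latencyMs: 80 },
  { passed: false, latencyMs: 200 },
  { passed: true, latencyMs: 150 },
];

// Fraction of records that passed (0 when there are no records).
function passRate(rows) {
  if (rows.length === 0) return 0;
  return rows.filter((r) => r.passed).length / rows.length;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

const latencies = records.map((r) => r.latencyMs);
console.log(passRate(records), percentile(latencies, 50), percentile(latencies, 95));
```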

export

Packages a completed run into an evidence bundle: a .tar.gz archive with an optional minisign attestation signed by a human domain expert.

# Export a specific run by id
agc export <RUN-ID>

# Export to a specific path
agc export <RUN-ID> --out ./my-bundle.tar.gz

# Export the 5 most recent runs into a directory
agc export --last 5 --out-dir ./evidence/

# List recent runs to find the id
agc report list --limit 5

Flag	Short	Description
--last <N>	-l	Export the N most recent runs (newest first)
--out <PATH>	-o	Output path for a single run export
--out-dir <DIR>	-d	Output directory when using --last

compliance

Scores run history against bundled OSCAL control catalogs and renders per-control attestation reports. Cases are mapped to controls via tags (for example fda-samd:fda-samd-medical-device-reporting). A control is reported satisfied only with three or more cases and effectiveness ≥ 0.80; anything less is partial evidence or a gap. Available frameworks: nist-ai-rmf, eu-ai-act, iso-42001, hipaa, fda-samd, nist-800-171, nist-800-172, nist-800-207.

# Per-control attestation report (Markdown)
agc compliance report --framework fda-samd --skill my-agent

# OSCAL Assessment Results JSON (machine-readable)
agc compliance report --framework hipaa --oscal > hipaa.oscal.json

# All frameworks, one file each
agc compliance report --framework all --out ./reports/

# Controls with no coverage, plus remediation advisories
agc compliance gaps --framework eu-ai-act

# Generate fixture cases pre-tagged with control IDs
agc compliance generate --skill my-agent \
  --tag nist-ai-rmf:measure-1.1 --count 3

Sub-command	Description
report	Per-control pass/fail table with effectiveness scores; --oscal emits OSCAL Assessment Results JSON
gaps	Lists NotSatisfied / PartialEvidence controls with suggested case improvements
generate	Generates fixture cases pre-tagged with control IDs for a framework

The OSCAL Assessment Results artifact is also written into every agc export evidence tarball, and agc metrics --framework <id> scopes the compliance metrics table to a single framework.
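
The satisfaction rule described in this section (three or more cases and effectiveness of at least 0.80) can be sketched as a small classifier. The status names follow the ones used above, but the split between PartialEvidence and NotSatisfied is an assumption: this sketch treats zero mapped cases as a gap and any other shortfall as partial evidence, which may not match the CLI exactly.

```javascript
// Thresholds stated in the compliance description.
const MIN_CASES = 3;
const MIN_EFFECTIVENESS = 0.8;

// Assumption: zero mapped cases is a gap ("NotSatisfied"); any other
// shortfall is "PartialEvidence". The real boundary is not documented here.
function classifyControl(caseCount, effectiveness) {
  if (caseCount >= MIN_CASES && effectiveness >= MIN_EFFECTIVENESS) {
    return "Satisfied";
  }
  return caseCount === 0 ? "NotSatisfied" : "PartialEvidence";
}

console.log(classifyControl(3, 0.8));  // meets both thresholds
console.log(classifyControl(2, 0.95)); // too few cases
console.log(classifyControl(0, 0));    // no coverage at all
```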

CLI Reference — Bundles & Registry

bundle

Manages fixture bundles: distributable archives combining a manifest, fixture files, and mocks.

# Pack a bundle (updates sha256s in manifest, writes .tar.gz)
agc bundle pack fixtures/my-agent

# Pack to an explicit output path
agc bundle pack fixtures/my-agent --out my-bundle.tar.gz

# Verify bundle integrity (dir, manifest file, or .tar.gz)
agc bundle verify my-bundle.tar.gz

# Pull bundle manifest and artifacts from the registry
agc bundle pull my-agent-1.0.0 \
  --url https://api.agentcarousel.com

# Pull and immediately verify
agc bundle pull my-agent-1.0.0 \
  --url https://api.agentcarousel.com --verify

Sub-command	Description
pack [DIR]	Update manifest sha256s and write a .tar.gz (default dir: current directory)
verify [PATH]	Validate structure, manifest, and file checksums (dir, manifest, or .tar.gz)
pull <BUNDLE-ID>	Download manifest and artifacts from the registry into pulled-bundles/<id>/; --verify runs verify after download

publish

Registers a bundle and ingests evidence runs in one step. Reads the registry write token from config or the AGENTCAROUSEL_API_TOKEN environment variable. If a prompt.md file exists in the bundle directory, it is sent to the registry and stored as the agent's system prompt.

# Publish bundle + most recent matching run
agc publish fixtures/my-agent \
  --url https://api.agentcarousel.com

# Dry run — resolve values without making API writes
agc publish fixtures/my-agent \
  --url https://api.agentcarousel.com --dry-run

# Submit all matching local runs (newest first)
agc publish fixtures/my-agent \
  --url https://api.agentcarousel.com --all-runs

# Limit number of runs when using --all-runs
agc publish fixtures/my-agent \
  --url https://api.agentcarousel.com --all-runs --limit 5

The registry write token must be available in config or as the AGENTCAROUSEL_API_TOKEN environment variable. Never commit tokens to fixture files or env_overrides.

trust-check

Queries the registry for the current trust state of a bundle. By default requires trusted state; use --min-trust to lower the threshold. Optionally verifies a local attestation against a minisign public key.

# Query trust state (exits non-zero if below --min-trust, default: trusted)
agc trust-check my-agent-1.0.0 \
  --url https://api.agentcarousel.com

# Accept stable or higher
agc trust-check my-agent-1.0.0 \
  --url https://api.agentcarousel.com \
  --min-trust stable

# Verify with a local attestation file + minisign public key
agc trust-check my-agent-1.0.0 \
  --url https://api.agentcarousel.com \
  --attestation ./attestation.json \
  --minisign-pubkey ./minisign.pub

Flag	Description
--url <URL>	Registry API URL (falls back to config or AGENTCAROUSEL_REGISTRY_URL env)
--min-trust <LEVEL>	experimental | carousel-candidate | stable | trusted (default: trusted)
--attestation <FILE>	Local attestation JSON to verify offline with minisign
--minisign-pubkey <PATH>	minisign public key path (local file or URL)
--minisign-bin <PATH>	minisign binary name/path (default: minisign)

CLI Reference — Tooling

doctor

Checks your environment, configuration, and fixture setup for common issues in one pass: API keys, config file, history DB, fixtures directory, and JSON schema.

# Full environment check
agc doctor

# Machine-readable output
agc doctor --json

update

Checks GitHub for a newer release of the CLI and installs it in-place using an atomic rename. Uses the same install path as the running binary.

# Check and install if available
agc update

# Check availability without installing
agc update --check

completions

Prints a shell completion script to stdout for bash, zsh, or fish.

# Zsh
agc completions zsh > ~/.zsh/completions/_agc

# Bash (writing to /etc usually requires root)
agc completions bash > /etc/bash_completion.d/agc

# Fish
agc completions fish > ~/.config/fish/completions/agc.fish

Configuration

The CLI resolves configuration in this order, with earlier sources taking precedence:

--config <path>: explicit path flag
./agentcarousel.toml: project-level config
~/.config/agentcarousel/config.toml: user-level config

Run history is stored in a local SQLite database:

Platform	Default path
macOS	~/Library/Application Support/agentcarousel/history.db
Linux	~/.local/share/agentcarousel/history.db

Override with AGENTCAROUSEL_HISTORY_DB=/path/to/history.db.

Fixture Development

Fixture Format

Fixtures are YAML files named cases.yaml inside their skill directory. The authoritative schema is schemas/skill-definition.schema.json in the repository. Always scaffold from agc init rather than writing from scratch.

Top-level fields

Field	Required	Description
schema_version	Yes	Integer. Current version is 1.
skill_or_agent	Yes	Kebab-case identifier for the subject under test.
bundle_id	No	Bundle identifier for certification tracking.
bundle_version	No	Semver. Major bumps reset the carousel iteration counter.
certification_track	No	none | candidate | stable | trusted
risk_tier	No	low | medium | high
data_handling	No	synthetic-only | no-pii | pii-reviewed
defaults	No	Default timeout_secs, tags, and evaluator applied to all cases.
cases	Yes	Array of one or more case definitions.

Case fields

Field	Required	Description
id	Yes	Must start with <skill_or_agent>/. Enforced by validate.
description	No*	Human-readable intent. Required by the review checklist.
tags	No	Array of tag strings for filtering.
input	Yes	messages array and optional context / env_overrides.
expected	Yes	tool_sequence, output assertions, and rubric items.
evaluator_config	No	Per-case evaluator override; accepts effectiveness_threshold.
timeout_secs	No	Per-case timeout. Recommended: 1.5× expected latency.
seed	No	RNG seed for reproducible eval runs.

Minimal example

schema_version: 1
skill_or_agent: hello-skill
cases:
  - id: hello-skill/happy-path
    description: Responds to a greeting
    tags: [smoke, happy-path]
    input:
      messages:
        - role: user
          content: "Say hello"
    expected:
      tool_sequence: []
      output:
        - kind: contains
          value: "hello"

Output assertion kinds

kind	Description
contains	Output must contain this string (case-sensitive)
not_contains	Output must not contain this string
equals	Exact match; use for structured/tool outputs, not free text
regex	PCRE regular expression match
json_path	JSONPath expression applied to the output
golden_diff	Diff against a known-good golden output file in the golden/ directory
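
The string-based kinds can be sketched as follows. This is an illustration rather than the CLI's evaluator: the checkAssertion function is hypothetical, it assumes every kind carries its operand in a value field (as contains does in the minimal example), and it uses JavaScript's RegExp, whose syntax differs from PCRE in places. json_path and golden_diff are left out.

```javascript
// Hypothetical evaluator for one output assertion against the output text.
// Assumes each assertion has the shape { kind, value }.
function checkAssertion(assertion, output) {
  switch (assertion.kind) {
    case "contains":
      return output.includes(assertion.value); // case-sensitive
    case "not_contains":
      return !output.includes(assertion.value);
    case "equals":
      return output === assertion.value;
    case "regex":
      return new RegExp(assertion.value).test(output); // JS regex, not PCRE
    default:
      throw new Error(`unsupported kind in this sketch: ${assertion.kind}`);
  }
}

// Case sensitivity matters: "Hello" does not satisfy contains "hello".
console.log(checkAssertion({ kind: "contains", value: "hello" }, "Hello, world"));
console.log(checkAssertion({ kind: "regex", value: "^[Hh]ello" }, "Hello, world"));
```

Because contains is case-sensitive, an assertion such as the one in the minimal example fails if the mock response capitalizes the word, so a regex assertion is a safer choice when capitalization may vary.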

Case Tags

Tags let you filter which cases run. Use --filter-tags smoke on PRs for fast feedback and run all tags on main or nightly.

Tag	When to use
smoke	Fast PR gate. Every fixture should have at least one smoke case.
happy-path	Core success scenario; the most important thing the skill does when everything works.
error-handling	Graceful failure behavior for invalid or missing inputs.
edge-case	Boundary or unusual-but-valid input that is still in scope.
certification	Included in certification-focused carousel runs.
deferred	Tracked placeholder for a blocked or not-yet-implemented integration.
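
A minimal sketch of tag filtering, assuming a case is selected when it carries at least one of the requested tags. That any-match rule is an assumption; the documentation states only that matching cases run. Both helpers are hypothetical.

```javascript
// Parse a comma-separated --filter-tags value into a list of tags.
function parseTags(flagValue) {
  return flagValue
    .split(",")
    .map((t) => t.trim())
    .filter((t) => t.length > 0);
}

// Assumption: a case matches when it has at least one requested tag.
function filterByTags(cases, flagValue) {
  const wanted = new Set(parseTags(flagValue));
  return cases.filter((c) => (c.tags || []).some((t) => wanted.has(t)));
}

const cases = [
  { id: "my-agent/happy-path", tags: ["smoke", "happy-path"] },
  { id: "my-agent/bad-input", tags: ["error-handling"] },
  { id: "my-agent/untagged" },
];

console.log(filterByTags(cases, "smoke").map((c) => c.id));
```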

Evaluators

Select the evaluator that requires the least complexity. Escalate only when a simpler one cannot express the assertion.

rules

Exact match, regex, JSON path, and tool sequence assertions. Deterministic, free, fast. Use this first.

golden

Diff against a known-good output file in golden/. Use when output format is stable and you have a reference. Use --update-golden to write golden files in place.

process

External evaluator script (Python, JS) via stdin/stdout JSON contract. Use for custom logic that rules cannot express.

judge

LLM-as-judge scoring for rubric items requiring language understanding. Adds cost and variance; use only when rules, golden, and process are insufficient.

Development Process

Standard fixtures should go from blank intake to passing CI in under two hours. Follow these phases in order.

Intake

Answer four questions before writing any YAML: What is the skill_or_agent id? What user goal does this test? Are tool calls required? Is input data synthetic?

Do not proceed until you confirm: no PII in inputs, scope is bounded, mocks can be written without live network calls.

Scenario Design

Design one primary (happy-path) use case per fixture file. Add edge cases as separate cases entries, not separate files, unless they cover a substantially different workflow.

Pair the happy-path with at least one failure-mode case. Failure cases tend to expose gaps in the mocks earlier than happy-path cases do.

Author

Scaffold from agc init. Never start from a blank file. Write mocks before assertions: draft the mock response first, then write output assertions against what the mock actually returns.

Self-Check

Run validate --strict, then test, then eval. All three must succeed before requesting review. Run lint to catch quality issues beyond schema.

Peer Review

Open a PR or share the file with a reviewer. Use the review checklist (correctness, completeness, safety & data).

Certification-track fixtures require a domain reviewer in addition to a standard peer reviewer.

Bundle Version Bump

After review is approved: patch bump for description changes, minor bump for new cases, major bump for removed or renamed cases (resets the carousel iteration counter).
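
The bump rules above can be sketched as a small function. The change-type labels and the bumpVersion helper are hypothetical names chosen for this example, and the sketch handles only plain MAJOR.MINOR.PATCH versions.

```javascript
// Map a kind of fixture change to the semver component to bump.
// The change-type keys are illustrative labels, not CLI values.
const BUMP_FOR_CHANGE = {
  "description-change": "patch",
  "new-case": "minor",
  "removed-case": "major",
  "renamed-case": "major",
};

function bumpVersion(version, change) {
  const bump = BUMP_FOR_CHANGE[change];
  if (!bump) throw new Error(`unknown change type: ${change}`);
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) throw new Error(`not a plain MAJOR.MINOR.PATCH version: ${version}`);
  const [major, minor, patch] = match.slice(1).map(Number);
  if (bump === "major") return `${major + 1}.0.0`;
  if (bump === "minor") return `${major}.${minor + 1}.0`;
  return `${major}.${minor}.${patch + 1}`;
}

console.log(bumpVersion("1.2.3", "new-case"));
```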

Quick self-check commands

# 1. Schema + rules validation (strict mode)
agc validate fixtures/my-agent/cases.yaml --strict

# 2. Fixture quality check
agc lint fixtures/my-agent/cases.yaml

# 3. Mock test run
agc test fixtures/my-agent/cases.yaml

# 4. Eval with rubric (if rubric items exist)
agc eval fixtures/my-agent/cases.yaml

Contributing

Contributions are welcome. The process is designed to keep fixtures aligned with released CLI behavior and to prevent scope creep.

Open a GitHub Issue first: before writing any YAML, open an issue titled Fixture: <skill-or-agent-id> and fill in the intake checklist from CONTRIBUTING.md.
Follow the development process: complete all six phases, including passing validate --strict, lint, and test.
Security: never include real API keys, credentials, or PII in fixture inputs, mock responses, or expected outputs. See SECURITY.md.

Registry API

Authentication

Public read endpoints require no credentials. Write endpoints require a bearer token.

Endpoint	Auth required
GET /health	No
GET /v1/bundles	No
GET /v1/bundles/{bundleId}/trust-state	No
GET /v1/bundles/{bundleId}/manifest	No
GET /v1/bundles/{bundleId}/file?path=...	No
POST /v1/bundles	Yes — Authorization: Bearer <token>
POST /v1/runs	Yes — Authorization: Bearer <token>

# Set the token before publishing
export AGENTCAROUSEL_API_TOKEN=<your-registry-write-token>

agc publish fixtures/my-agent \
  --url https://api.agentcarousel.com
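
For clients that call the API directly, the auth rule in the table (GET endpoints are public, POST endpoints need a bearer token) might be wrapped as below. The buildRegistryRequest helper is hypothetical and only prepares arguments for fetch; it performs no network call.

```javascript
// Hypothetical helper: build the URL and fetch options for a registry call.
// Follows the table above: GET is public, anything else needs a bearer token.
function buildRegistryRequest(baseUrl, method, path, token) {
  const headers = {};
  if (method !== "GET") {
    if (!token) {
      throw new Error(`${method} ${path} requires a registry write token`);
    }
    headers["Authorization"] = `Bearer ${token}`;
  }
  return { url: `${baseUrl}${path}`, init: { method, headers } };
}

const read = buildRegistryRequest("https://api.agentcarousel.com", "GET", "/v1/bundles");
console.log(read.url, read.init);
```

The result can be passed straight to fetch as fetch(read.url, read.init).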

Endpoints

Base URLs

Environment	Base URL
Production	https://api.agentcarousel.com
Local development	http://127.0.0.1:3001

GET /health

Liveness check. Returns { "ok": true }.

GET /v1/bundles

Lists all registered bundles. Returns bundle_id, bundle_version, trust_state, description, domain, created_at, and last_run_at.

GET /v1/bundles/{bundleId}/trust-state

Returns the public trust state for a bundle. No authentication required.

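const baseUrl = "https://api.agentcarousel.com";
const bundleId = "my-agent-1.0.0";
const res = await fetch(
  `${baseUrl}/v1/bundles/${encodeURIComponent(bundleId)}/trust-state`
);
// Fail loudly on HTTP errors instead of parsing an error body as a trust state
if (!res.ok) throw new Error(`trust-state request failed: ${res.status}`);
const trust = await res.json();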

GET /v1/bundles/{bundleId}/file

Fetches a bundle artifact by path. The special path prompt.md serves the agent's system prompt as text/plain (stored as a DB column, not object storage). All other paths serve fixture/mock bytes from object storage.

# Fetch the agent's system prompt
GET /v1/bundles/my-agent-1.0.0/file?path=prompt.md

# Fetch a fixture file
GET /v1/bundles/my-agent-1.0.0/file?path=cases.yaml

POST /v1/bundles

Register or resolve a bundle. Idempotent. Accepts application/json or multipart/form-data. Pass a prompt field in multipart to store the agent's system prompt.

// token is the registry write token, for example read from AGENTCAROUSEL_API_TOKEN
const res = await fetch(`${baseUrl}/v1/bundles`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${token}`,
  },
  body: JSON.stringify({
    bundle_id: "org/my-agent",
    bundle_version: "1.0.0",
    risk_tier: "medium",
  }),
});

POST /v1/runs

Ingest a run evidence tarball produced by agc export. Accepts multipart/form-data with an evidence field containing the .tar.gz.

const form = new FormData();
form.append("evidence", evidenceFile); // .tar.gz blob
form.append("registry_bundle_id", "my-agent-1.0.0");

const res = await fetch(`${baseUrl}/v1/runs`, {
  method: "POST",
  headers: { "Authorization": `Bearer ${token}` },
  body: form,
});

Trust State

Every registered bundle has a trust state. New bundles start at Experimental and advance as qualifying runs accumulate.

Experimental

Initial state for all newly registered bundles.

CarouselCandidate

Bundle has passed enough qualifying runs to be considered for stable status.

Stable

Bundle meets the passing threshold across the required number of qualifying carousel runs.

Trusted

Highest state. Requires qualifying run history, domain expert review, and a signed attestation.

Trust state is visible in the public Agent Registry. Use agc trust-check to query it programmatically.
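
A client that consumes the trust state can apply the same kind of threshold that --min-trust applies. The sketch below is hypothetical: it orders the four states using the CLI's --min-trust spellings and assumes the value you compare has already been normalized to one of them, since the exact string returned by the API is not specified here.

```javascript
// Trust states in ascending order, using the CLI's --min-trust spellings.
const TRUST_LEVELS = ["experimental", "carousel-candidate", "stable", "trusted"];

// Hypothetical helper: true when `state` is at or above `minTrust`.
// Defaults to "trusted", mirroring the trust-check default.
function meetsMinTrust(state, minTrust = "trusted") {
  const have = TRUST_LEVELS.indexOf(state);
  const need = TRUST_LEVELS.indexOf(minTrust);
  if (have === -1 || need === -1) {
    throw new Error(`unknown trust level: ${have === -1 ? state : minTrust}`);
  }
  return have >= need;
}

console.log(meetsMinTrust("stable", "stable"));
console.log(meetsMinTrust("stable"));
```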

For Rust crate documentation and deeper type references, see docs.rs/agentcarousel.

Found an issue? Open a GitHub issue.