Swarm agent operating spec

This document is a spec in the spec-driven-delivery sense: it describes the behaviors any agent connected to Swarm is expected to exhibit, without prescribing how to implement them. Adopt it as-is, or override it when your team has a reason to. The shape of the protocol matters more than the exact wording: as long as multiple agents follow the same conventions, the user's timeline stays coherent across agents, machines, branches and environments.

Reading this as a human? Paste this URL into your agent and tell it to follow the spec. Reading this as an agent? Treat each numbered practice as a behavioral contract; failure modes for the user are described alongside.

Identity

Swarm is an HTTP MCP server + PWA that turns the user's phone into a coordination channel between the user and one or more autonomous coding agents. Agents stream progress, request decisions, and read back user-supplied artifacts through this single endpoint.

Endpoint: https://swarm.enge.io/mcp
Transport: Streamable HTTP
Auth: bearer token (Sanctum personal access token, ability-scoped)

Connect

The user mints a token at https://swarm.enge.io/settings/api-tokens and pastes it into the agent's MCP client config. Concrete shape varies by runtime; the canonical JSON is:

{
  "mcpServers": {
    "swarm": {
      "type": "http",
      "url": "https://swarm.enge.io/mcp",
      "headers": { "Authorization": "Bearer USER_TOKEN_HERE" }
    }
  }
}

For runtimes with a CLI (e.g. Claude Code), the equivalent one-liner:

claude mcp add \
  --scope user \
  --transport http \
  swarm https://swarm.enge.io/mcp \
  --header "Authorization: Bearer USER_TOKEN_HERE"

Token-scoped abilities: mcp:send-message, mcp:read, mcp:create-upload-url, mcp:manage-tags, mcp:ask-questions, mcp:answer-questions, mcp:uploads:read. Full agentic operation typically wants all seven; mint a narrower set for read-only or push-only agents.

The protocol

Ten practices. Each describes what the behavior is and why it exists; implementation is up to the agent. They're intentionally minimal — once any team adds an eleventh, document it next to these so other agents can adopt it.

1. Tag every push with a repo / branch / project trio

Behavior. Every send-message-tool, ask-question-tool and create-upload-url-tool call includes the three context tags whenever the work has the relevant context:

repo:<git-repo-name> — the repository the work lives in (the actual repo name, not a path).
branch:<branch-name> — the specific branch the agent is operating on.
project:<short-name> — the higher-level coordination unit. Multiple agents on multiple branches / machines / environments working on the same overall effort share one project: tag.

Why. The user runs many agents in parallel. Without consistent tagging, the timeline becomes a soup the user can't filter. With consistent tagging, the user filters to one project tag and sees every agent's progress side by side, regardless of which repo/branch/machine each is operating on.

Situational tags layer on top: release:<version>, incident:<id>, bug:<id>, ci:<status>, anything the user defines. The first tag prefixes the push lock-screen title, so order tags from most to least informative.

Tags are find-or-create, so agents reuse the exact same string verbatim across every related call. Max 10 per message, each ≤ 512 chars. When uncertain which vocabulary the user already has, agents call list-tags-tool first and reuse before inventing.

send-message-tool({
  body: "Migration applied, all 12 tests green.",
  tags: [
    "repo:acme-api",
    "branch:feat/auth-refresh",
    "project:auth-refresh",
    "release:v1.4.0"
  ]
})
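The tag limits above (at most 10 tags per message, each at most 512 characters, led by the repo / branch / project trio) can be checked locally before a call goes out. The following is a minimal Python sketch under stated assumptions: the helper name build_tags and the choice to raise ValueError are illustrative and not part of Swarm's API.

```python
MAX_TAGS = 10        # per-message limit stated in the spec
MAX_TAG_CHARS = 512  # per-tag limit stated in the spec


def build_tags(repo, branch, project, extra=()):
    """Return the context trio followed by situational tags, validated.

    Hypothetical helper: the name and the error handling are assumptions.
    """
    tags = [f"repo:{repo}", f"branch:{branch}", f"project:{project}", *extra]
    if len(tags) > MAX_TAGS:
        raise ValueError(f"{len(tags)} tags exceeds the limit of {MAX_TAGS}")
    for tag in tags:
        if len(tag) > MAX_TAG_CHARS:
            raise ValueError(f"tag longer than {MAX_TAG_CHARS} characters: {tag[:40]}...")
    return tags


tags = build_tags("acme-api", "feat/auth-refresh", "auth-refresh", ["release:v1.4.0"])
print(tags)
```

Building the list in one place also makes it easier to reuse the exact same strings across every related call.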

2. Send attachments via signed URLs by default

Behavior. The default attachment flow is presign, then upload: call create-upload-url-tool, HTTP PUT the raw bytes to the returned presigned URL, then pass the upload_key to send-message-tool / ask-question-tool. data_base64 is reserved as a last resort, only for very small files where the round trip of presigning plus PUT genuinely costs more than inlining.

Why. Token cost. Inline base64 stays in the agent's conversation transcript and is re-tokenized on every subsequent turn that re-includes the message. Base64 encodes every 3 bytes as 4 characters, so a single 1 MB image becomes ~1.33 MB of base64; the cost stacks across screenshots, diagrams and short videos. Presigned uploads keep bytes server-side and entirely out of the agent's context window.

# default flow — keeps base64 out of the conversation
url = create-upload-url-tool({
  filename: "ui.png", mime: "image/png"
})
# PUT raw bytes to url.upload_url with url.headers
send-message-tool({
  body: "New hero — review on phone",
  tags: ["repo:marketing-site", "branch:feat/hero-v3", "project:hero-redesign"],
  attachments: [{
    filename: "ui.png",
    mime: "image/png",
    upload_key: url.upload_key
  }]
})
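The PUT step in the flow above is left as a comment. One way it might look in Python, using only the standard library, is sketched here. It assumes the create-upload-url-tool result exposes upload_url and headers fields as that comment suggests; the function names and the placeholder URL are hypothetical.

```python
import urllib.request


def build_put_request(upload_url, headers, data):
    """Build (but do not send) the HTTP PUT that uploads the raw bytes."""
    return urllib.request.Request(
        upload_url, data=data, headers=dict(headers), method="PUT"
    )


def upload_bytes(upload_url, headers, data, timeout=30):
    """Send the PUT and return the HTTP status code (raises on HTTP errors)."""
    request = build_put_request(upload_url, headers, data)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Placeholder values standing in for a create-upload-url-tool result.
demo = build_put_request(
    "https://uploads.example.invalid/ui.png",
    {"Content-Type": "image/png"},
    b"\x89PNG...",
)
print(demo.get_method(), demo.full_url)
```

The demo only constructs the request; upload_bytes would be called with the real presigned URL and headers returned by the tool.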

Supported MIME types: image/png, image/jpeg, image/webp, image/gif, video/mp4, video/webm, video/quicktime, text/plain, text/markdown, application/json. The hard inline ceiling is ~4 MB; in practice the token budget hits the wall well before the byte limit does.
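The size argument can be checked with a little arithmetic: padded base64 emits 4 characters for every 3 input bytes. The sketch below computes that length and checks a MIME type against the list above. The helper names are illustrative assumptions, not Swarm functions.

```python
import base64

SUPPORTED_MIME_TYPES = {
    "image/png", "image/jpeg", "image/webp", "image/gif",
    "video/mp4", "video/webm", "video/quicktime",
    "text/plain", "text/markdown", "application/json",
}


def base64_length(raw_size):
    """Length in characters of padded base64 for raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def is_supported(mime):
    """True when the MIME type appears in the spec's supported list."""
    return mime in SUPPORTED_MIME_TYPES


# Cross-check the formula against the standard library on a small payload.
assert base64_length(1000) == len(base64.b64encode(b"x" * 1000))

print(base64_length(1_000_000))       # 1333336 characters for a 10^6-byte file
print(is_supported("image/png"))      # True
print(is_supported("image/svg+xml"))  # False
```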

3. Bundle related questions onto one card

Behavior. When the agent needs structured user input, it calls ask-question-tool. Questions that belong to the same decision (e.g. "approve the plan", "pick an approach", "any extra notes?") are passed as a single questions: [...] array (1–10 entries). Independent, unrelated questions get separate calls. The legacy single-question shape (top-level prompt + options) is kept for backwards compatibility.

Why. One card / one push / one decision moment respects the user's attention. The push title is prefixed with [?] for one question and [? N] for N>1 so the lock-screen tells the user up-front how much input is being requested.

// N questions on one card, preferred when they belong together
ask-question-tool({
  body: "Plan ready — three things to confirm before I start",
  tags: ["repo:acme-api", "branch:feat/auth-refresh", "project:auth-refresh"],
  questions: [
    {
      prompt: "Approve the migration approach?",
      options: [
        { kind: "button", key: "approve", label: "Approve",  variant: "success" },
        { kind: "button", key: "revise",  label: "Revise",   variant: "danger"  }
      ]
    },
    {
      prompt: "Roll out to staging or prod?",
      options: [
        { kind: "button", key: "staging", label: "Staging" },
        { kind: "button", key: "prod",    label: "Prod",    variant: "success" }
      ]
    },
    {
      prompt: "Anything to add?",
      options: [
        { kind: "text", key: "notes", label: "Notes (optional)", multiline: true }
      ]
    }
  ]
})  // → { message_id, question_ids: [...] }

Per question: any number of buttons + text inputs (≤ 20 total per question). Buttons in a question form an exclusive group, so the user picks one. Text inputs are independent and can each be marked required: true. Variants: standard | success | danger. Questions inherit the same tag trio as messages, and any screenshot the user will want to see should be attached to the question card itself, following Practice 2.
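The limits on a question card (1 to 10 questions, at most 20 options per question, three variants) lend themselves to a client-side check before calling ask-question-tool. This is a sketch under assumptions: the function name is hypothetical, "button" and "text" are taken to be the only option kinds because they are the only ones shown here, and an omitted variant is treated as "standard".

```python
ALLOWED_KINDS = {"button", "text"}                    # kinds shown in the examples
ALLOWED_VARIANTS = {"standard", "success", "danger"}  # variants listed in the spec


def validate_questions(questions):
    """Raise ValueError if the card breaks a limit stated in the spec."""
    if not 1 <= len(questions) <= 10:
        raise ValueError("a card carries 1 to 10 questions")
    for question in questions:
        options = question["options"]
        if len(options) > 20:
            raise ValueError("at most 20 buttons + text inputs per question")
        for option in options:
            if option["kind"] not in ALLOWED_KINDS:
                raise ValueError(f"unknown option kind: {option['kind']}")
            # Assumption: a missing variant behaves like "standard".
            variant = option.get("variant", "standard")
            if variant not in ALLOWED_VARIANTS:
                raise ValueError(f"unknown variant: {variant}")
    return questions


card = validate_questions([
    {
        "prompt": "Approve the migration approach?",
        "options": [
            {"kind": "button", "key": "approve", "label": "Approve", "variant": "success"},
            {"kind": "button", "key": "revise", "label": "Revise", "variant": "danger"},
        ],
    },
    {
        "prompt": "Anything to add?",
        "options": [{"kind": "text", "key": "notes", "label": "Notes (optional)"}],
    },
])
print(len(card))
```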

4. Ask in both channels — Swarm AND the agent's native interface

Behavior. Every time the agent emits a question via ask-question-tool, it also surfaces the same question in its native channel — chat reply for chat-based agents (Claude Code, Cursor, Codex), terminal prompt for CLI-based agents, etc. The Swarm question and the native question must reference the same set of options so the user's choice maps cleanly to either side.

Why. The user picks where to answer. If they're at the phone, the PWA is fastest — answer via Swarm and the long-poll picks it up immediately. If they're at the keyboard, typing back in the existing chat / terminal is faster than reaching for the phone. Either path completes the question; the agent doesn't need to know in advance which channel the user will use.

The dual-channel ask works because Practice 6 (below) keeps both sides in sync — whichever channel the user chooses, the canonical answer ends up in Swarm with the original answer text preserved for future review.

# Swarm side — same as Practice 3
{ message_id, question_id } = ask-question-tool({
  body: "Plan ready — pick a deployment path",
  tags: ["repo:acme-api", "branch:main", "project:auth-refresh"],
  prompt: "Where should I roll this out first?",
  options: [
    { kind: "button", key: "staging", label: "Staging" },
    { kind: "button", key: "prod",    label: "Prod", variant: "success" },
    { kind: "text",   key: "notes",   label: "Notes (optional)", multiline: true }
  ]
})

# Native side — same question, same options, in the agent's chat reply
"Plan ready — pick a deployment path. Where should I roll this out first?
 [staging] [prod] (or send notes)
 (you can also tap the question on your phone)"
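Keeping the two channels aligned is easier when the native text is rendered from the same options list that was sent to Swarm. A minimal Python sketch of that idea follows; the function name and the exact wording of the rendered prompt are assumptions, and any format works as long as the option keys match.

```python
def render_native_prompt(body, prompt, options):
    """Render a Swarm question as plain text for a chat reply or terminal."""
    buttons = " ".join(f"[{o['key']}]" for o in options if o["kind"] == "button")
    text_labels = [o["label"] for o in options if o["kind"] == "text"]
    lines = [f"{body}. {prompt}"]
    if buttons:
        lines.append(buttons)
    for label in text_labels:
        lines.append(f"(free text: {label})")
    lines.append("(you can also tap the question on your phone)")
    return "\n".join(lines)


options = [
    {"kind": "button", "key": "staging", "label": "Staging"},
    {"kind": "button", "key": "prod", "label": "Prod", "variant": "success"},
    {"kind": "text", "key": "notes", "label": "Notes (optional)", "multiline": True},
]
native = render_native_prompt(
    "Plan ready, pick a deployment path",
    "Where should I roll this out first?",
    options,
)
print(native)
```

Because both sides derive from one list, a reply of "staging" in chat maps to the same key the phone would have sent.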

5. Long-poll for answers; chain when the wait exceeds 10 minutes

Behavior. While the question is open in both channels, the agent long-polls Swarm via wait-for-answer-tool (id + max_wait_seconds up to 600). For waits beyond 10 minutes, agents chain calls back-to-back, OR fall back to short-poll with get-question-tool on a tiered back-off (e.g. 5s for the first 10 min → 30s until +1 h → 60s until a hard deadline; surface "no answer in window" to the user when the deadline expires).

Why. Long-poll round-trips less data, returns instantly on answer, and keeps the agent in the foreground of the user's attention. The server streams MCP notifications/progress events every ~25s during the wait, which keeps the HTTP connection alive past upstream proxy idle timeouts so the full 10-minute window is reliably usable in a single call. If the user instead answers natively, the long-poll is interrupted by the agent's own answer-question-tool call (see Practice 6) — same end state, just a different trigger.

// long-poll for an immediate decision (up to 10 min per call, chain for longer)
result = wait-for-answer-tool({ id: question_id, max_wait_seconds: 600 })
if (result.status == "answered") {
  // result.answer = { selected_button: "approve", inputs: { notes: "lgtm" } }
}
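Chaining and the tiered back-off can be sketched as two small helpers. The code below is illustrative only: wait_for_answer stands in for however the agent runtime invokes wait-for-answer-tool, the "pending" status returned by the fake is an assumption (only "answered" appears in the spec), and the back-off tiers follow the example values given above.

```python
MAX_WAIT_PER_CALL = 600  # seconds; the per-call ceiling stated in the spec


def wait_chained(wait_for_answer, question_id, total_seconds):
    """Chain long-poll calls until answered or total_seconds is used up."""
    remaining = total_seconds
    while remaining > 0:
        window = min(remaining, MAX_WAIT_PER_CALL)
        result = wait_for_answer(id=question_id, max_wait_seconds=window)
        if result.get("status") == "answered":
            return result
        # An unanswered call is assumed to have blocked for the full window.
        remaining -= window
    return {"status": "no answer in window"}


def short_poll_interval(elapsed_seconds):
    """Tiered back-off from the example: 5s, then 30s, then 60s."""
    if elapsed_seconds < 600:
        return 5
    if elapsed_seconds < 3600:
        return 30
    return 60


# Fake transport for demonstration: answers on the third call.
calls = []


def fake_wait(id, max_wait_seconds):
    calls.append(max_wait_seconds)
    if len(calls) == 3:
        return {"status": "answered", "answer": {"selected_button": "approve"}}
    return {"status": "pending"}  # hypothetical non-answer status


result = wait_chained(fake_wait, "q-1", total_seconds=1500)
print(calls, result["status"])
```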

6. Capture native answers and mirror them back into Swarm

Behavior. When the user answers natively (chat reply, terminal input, or any out-of-band channel), the agent:

Captures the answer payload locally — selected button + any free-text input verbatim, exactly as the user typed it.
Immediately calls answer-question-tool with that payload, marking the Swarm question resolved before doing any further work.
Continues the task with the captured answer in hand.

Why. Without this step, a question answered natively stays open in Swarm forever — the timeline drifts out of sync with reality, the questions tab fills up with phantom open questions, and any other agent on the same project may re-ask. Mirroring closes the loop: Swarm stamps the answer with answered_via: "agent", the answer text is preserved alongside the original question for future review, and the canonical state lives in one place regardless of which channel the user used.

# user said "approve, ship it" in chat — mirror before continuing
answer-question-tool({
  id: question_id,
  answer: {
    selected_button: "approve",
    inputs: { notes: "approve, ship it" }
  }
})
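Before mirroring, it can help to check the captured answer against the options that were originally offered, so the Swarm record never contains a key the question did not define. The sketch below assumes the answer shape shown in the examples (selected_button plus an inputs map); the helper name and its validation rules are hypothetical.

```python
def build_mirror_answer(question_id, options, selected_button=None, inputs=None):
    """Build answer-question-tool arguments from a natively given answer."""
    inputs = dict(inputs or {})
    button_keys = {o["key"] for o in options if o["kind"] == "button"}
    text_keys = {o["key"] for o in options if o["kind"] == "text"}
    if selected_button is not None and selected_button not in button_keys:
        raise ValueError(f"unknown button key: {selected_button}")
    unknown = set(inputs) - text_keys
    if unknown:
        raise ValueError(f"unknown text input keys: {sorted(unknown)}")
    answer = {"inputs": inputs}  # free text is passed through verbatim
    if selected_button is not None:
        answer["selected_button"] = selected_button
    return {"id": question_id, "answer": answer}


options = [
    {"kind": "button", "key": "approve", "label": "Approve"},
    {"kind": "button", "key": "revise", "label": "Revise"},
    {"kind": "text", "key": "notes", "label": "Notes (optional)"},
]
mirror = build_mirror_answer(
    "q-123", options, selected_button="approve", inputs={"notes": "approve, ship it"}
)
print(mirror)
```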

7. Pull user-uploaded files on demand, not speculatively

Behavior. Agents call list-uploads-tool only when the user references a previously-uploaded artifact ("see the latest mockup", "the diagram I sent earlier") or when picking up a long-running task and verifying the most recent assets. Vision-capable agents fetch the actual pixels via get-upload-urls-tool({ id }) — every file in the bundle returns with a fresh 30-min presigned URL.

Why. Speculative listing on every turn is cheap server-side but noisy and wasteful in context. Pulling on-demand keeps the agent's working memory tight. URLs expire in 30 minutes — agents that need to keep the bytes download immediately and cache locally; otherwise they re-call to refresh.

// User: "see the latest auth-redesign mockup I uploaded"
list = list-uploads-tool({
  tags: ["project:auth-redesign"],
  limit: 5
})
bundle = list.uploads[0]
files = get-upload-urls-tool({ id: bundle.id })
// files.files[*].url is anonymous-fetchable for 30 min — pipe into your HTTP client
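Since the presigned URLs expire after 30 minutes, an agent that revisits the same bundle may want to track when it last fetched them and refresh shortly before expiry. A possible sketch follows; the class, the fetch_urls callable (standing in for a get-upload-urls-tool call) and the 60-second safety margin are all assumptions.

```python
import time

URL_TTL_SECONDS = 30 * 60  # presigned URL lifetime stated in the spec


class UploadUrlCache:
    """Remember presigned URLs per bundle and refresh them before expiry."""

    def __init__(self, fetch_urls, clock=time.monotonic, safety_margin=60):
        self._fetch = fetch_urls      # hypothetical wrapper around the tool call
        self._clock = clock
        self._margin = safety_margin  # assumption: refresh one minute early
        self._entries = {}

    def get(self, bundle_id):
        now = self._clock()
        entry = self._entries.get(bundle_id)
        if entry is None or now - entry[0] >= URL_TTL_SECONDS - self._margin:
            files = self._fetch(bundle_id)
            self._entries[bundle_id] = (now, files)
            return files
        return entry[1]


# Demonstration with a fake clock and a fake fetcher.
now = [0.0]
fetch_count = [0]


def fake_fetch(bundle_id):
    fetch_count[0] += 1
    return [{"url": f"https://files.example.invalid/{bundle_id}/{fetch_count[0]}"}]


cache = UploadUrlCache(fake_fetch, clock=lambda: now[0])
cache.get("bundle-1")   # first call fetches
now[0] = 100.0
cache.get("bundle-1")   # still fresh, served from the cache
now[0] = 1740.0
cache.get("bundle-1")   # within the safety margin of expiry, fetched again
print(fetch_count[0])
```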

8. Read back the timeline when context is missing

Behavior. When an agent needs to recall what's already happened (resuming a thread, answering "did you ship that yet?", building on a prior artifact, picking up another agent's work in the same project:), it queries the user's timeline via:

list-messages-tool — newest-first, AND-filtered by tags, paginated via before.
get-message-tool — single message by UUID with signed download URLs (30-min TTL) for its attachments.
list-tags-tool — the user's tag vocabulary, ordered by recent activity.
list-questions-tool / get-question-tool — discover and inspect questions still open (the agent's own or another agent's).

Why. Multi-agent coordination depends on each agent being able to read the user's recent history filtered to its slice. The repo / branch / project trio from Practice 1 is what makes this useful in practice.

list-tags-tool({ prefix: "project:" })
list-messages-tool({
  tags: ["repo:acme-api", "branch:feat/auth-refresh", "project:auth-refresh"],
  limit: 5
})
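Reading further back than one page means repeating the call with before set. The sketch below shows one way to do that, with loud assumptions: the response is taken to be an object with a messages list (by analogy with list.uploads in Practice 7), and before is taken to accept the id of the oldest message already seen. Neither detail is confirmed by this spec, so check the tool's schema before relying on them.

```python
def read_timeline(list_messages, tags, page_size=5, max_pages=3):
    """Collect up to max_pages pages of messages, newest first."""
    collected = []
    before = None
    for _ in range(max_pages):
        args = {"tags": tags, "limit": page_size}
        if before is not None:
            args["before"] = before
        page = list_messages(**args).get("messages", [])  # assumed response shape
        collected.extend(page)
        if len(page) < page_size:
            break  # a short page means the filtered timeline is exhausted
        before = page[-1]["id"]  # assumed cursor: id of the oldest message seen
    return collected


# Fake timeline for demonstration, newest first: m7 down to m1.
TIMELINE = [{"id": f"m{n}", "tags": ["project:auth-refresh"]} for n in range(7, 0, -1)]


def fake_list_messages(tags, limit, before=None):
    # AND filter: a message matches only if it carries every requested tag.
    matching = [m for m in TIMELINE if all(t in m["tags"] for t in tags)]
    start = 0
    if before is not None:
        start = [m["id"] for m in matching].index(before) + 1
    return {"messages": matching[start:start + limit]}


messages = read_timeline(fake_list_messages, ["project:auth-refresh"], page_size=3)
print([m["id"] for m in messages])
```

Capping max_pages keeps the read-back bounded, in line with Practice 7's concern about spending context on speculative reads.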

9. When work wraps up, ask what to do next via Swarm

Behavior. When a unit of work completes (PR merged, feature shipped, bug fixed, refactor landed) and there's no further user instruction queued, the agent pushes a follow-up question via ask-question-tool asking what to do next. A single open-ended question with informed-guess buttons (when the agent has them) plus a free-text fallback is the canonical shape. The question carries the usual repo / branch / project trio so it slots into the right slice of the timeline.

Why. The user is often away from the keyboard while agents work. Sitting idle in chat wastes the round-trip; the phone is the faster channel. This practice also creates a clean handoff artifact in the timeline — the answer the user gives becomes the seed for the next unit of work, captured on the same card.

ask-question-tool({
  body: "v1.4.0 shipped, CI green, deploy verified. Ready for the next task.",
  tags: ["repo:acme-api", "branch:main", "project:auth-refresh"],
  prompt: "What should I work on next?",
  options: [
    { kind: "button", key: "open_pr_2",      label: "Open PR #2 from the backlog" },
    { kind: "button", key: "address_review", label: "Address review comments on the design doc" },
    { kind: "button", key: "wait",           label: "Wait for direction", variant: "standard" },
    { kind: "text",   key: "other",          label: "Or describe something else", multiline: true }
  ]
})

10. Push selectively — push for events, not chatter

Behavior. Pushes are reserved for events the user actually wants on their phone:

A long-running task finishes (deploy done, test suite green/red, migration complete, incident resolved).
A visual artifact the user benefits from seeing right now (UI screenshot, generated image, chart, diagram) — attached following Practice 2.
A human decision is required to unblock further automation — preferred via ask-question-tool over a plain push so the answer comes back through MCP (Practices 3–6).
A unit of work completes and the agent needs direction (Practice 9).
The user said "let me know when…" or "ping me when…".

Skip the push for:

Status chatter the user can see in their terminal.
Errors already surfaced in the next chat reply.
Every tool call — one push per coherent event.

Why. The phone is a high-attention surface. Spamming it makes the user disable notifications; under-using it makes the multi-agent setup feel disconnected. Practices 1–9 are tuned to land in the Goldilocks zone.
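The two lists above amount to a small allow-list, which an agent could encode as a gate in front of every push. The sketch below is one possible encoding; the event names are invented labels for the bullet points and are not identifiers defined by Swarm.

```python
# Hypothetical labels for the events the spec says deserve a push.
PUSH_WORTHY = {
    "task_finished",        # deploy done, test suite green/red, migration complete
    "visual_artifact",      # screenshot, chart or diagram worth seeing now
    "decision_required",    # prefer ask-question-tool over a plain push
    "work_unit_complete",   # agent needs direction (Practice 9)
    "user_requested_ping",  # "let me know when..."
}


def should_push(event_kind):
    """Return True only for events on the allow-list; everything else is skipped."""
    return event_kind in PUSH_WORTHY


print(should_push("task_finished"))   # True
print(should_push("status_chatter"))  # False
```

Defaulting unknown events to "no push" errs on the side of protecting the user's attention.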

User setup checklist

The agent's contract is above. The user's side is two steps:

Mint an API token with the abilities the agent needs at https://swarm.enge.io/settings/api-tokens and paste it into the agent's MCP config.
Open the dashboard on the target phone, Add to Home Screen (needed on iOS only, where Web Push requires an installed PWA), and tap Enable Notifications. Without an active subscription, send-message-tool succeeds silently server-side but never reaches the phone.

Troubleshooting

Message in timeline but no push? The device has no registered push subscription — revisit the user setup checklist.
401 unauthorized? Token is wrong, rotated, or missing one of the required abilities.
iOS never rings? iOS fires Web Push only for PWAs installed to the home screen.
Long-poll comes back with a network error? Should be rare — the tool streams progress notifications every ~25s to keep the connection alive. If it still happens (flaky network), drop max_wait_seconds and chain calls.