Swarm Integration Guide

A comprehensive reference for adding hard runtime enforcement, device limits, and remote stop control to Swarm-style multi-agent systems.

Companion to Implementation Guide, Operational Guarantees, and External Control Plane.
Focus: run entry, tools, handoffs, side effects
Primary controls: revoke/restore + org-wide disable
Goal: one page an LLM can implement from
What this is

This is a deep integration manual for integrating MachineID into Swarm-style agent orchestration. It is intentionally extensive so it can serve as a single reference for: (1) designing enforceable validation boundaries in multi-agent handoff flows, (2) modeling devices correctly across execution surfaces, and (3) applying remote stop control across tool-heavy and side-effect-heavy runs.

What you can implement from this page
  • Hard gating at run entry, tool execution, and handoff boundaries
  • Device identity schemes that scale across agents and replicas
  • Remote stop: revoke/restore, bulk controls, and org-wide disable
  • Operational semantics: fail-closed, short timeouts, predictable stop points
What this guide assumes
  • You want enforcement, not “best effort”
  • You will define explicit stop points in the runtime
  • You want authority to live outside the process
MachineID externalizes authority: if validation fails, work does not begin.
Note on Swarm

Swarm is commonly used as a learning/prototyping reference for agent handoffs and tool calling. The enforcement patterns here generalize to production agent runtimes that share the same primitives: run entry, tool execution, and handoffs.

Core invariant

Everything reduces to one invariant:

Register. Validate. Work.

If validation fails, work does not begin.

In Swarm, “work begins” at recognizable boundaries: starting a run, executing a tool function, performing a side effect, and transferring control during a handoff.

Fastest path (verify control end-to-end)

The fastest way to verify the control plane is to add a single hard gate to your runner, then revoke that device in the dashboard and confirm execution stops on the next validate.

Prerequisites
  • Generate a free org key (supports up to 3 devices): machineid.io
  • Confirm you can revoke/restore from Dashboard
  • Use deterministic device IDs (readable and stable)
After you prove run-entry control, the next step is to add boundaries that occur during real work: tool calls, handoffs, and side effects.
What Swarm is (and what matters for enforcement)

Swarm-style systems center on a small set of primitives:

  • Agents that produce messages and call tools
  • Tools (functions) where external requests and side effects occur
  • Handoffs where control transfers between agents
  • Loops / re-entry where execution continues across steps
Enforcement becomes operational only when you validate at the boundaries that actually occur during runs: tools, handoffs, and side effects.
Execution boundaries (what counts as “work begins”)
High-value boundaries in Swarm-style runtimes
  • Run entry: immediately before the orchestration run starts
  • Tool execution: immediately before each tool function is executed
  • Handoff boundary: immediately before transferring control to a different agent
  • Side-effect boundary: immediately before irreversible actions (writes, sends, purchases)
  • Loop re-entry: at the top of high-cost cycles (fan-out, recursion, repeated planning)
MachineID does not introspect your orchestration. If a run can loop indefinitely, you must place enforcement boundaries inside the loop.
Handoffs (control transfer is a boundary)

Handoffs are where responsibility moves from one agent to another. Treat that transfer as an enforcement point.

Recommended handoff rule
  • Validate immediately before performing the handoff
  • If denied, do not transfer; stop the run (fail closed)
  • If you model one device per agent role, validate using the target agent’s device ID before the transfer
This prevents delayed or unintended execution under a new agent role after a device was revoked or the org was disabled.
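The handoff rule above can be sketched as a small gate that validates the target agent's device ID before control moves. This is a minimal sketch under stated assumptions: `validate` is any callable that returns a decision dict with an `allowed` key (for example `m.validate` from the SDK path), and `HandoffDenied`, `ROLE_DEVICE_IDS`, and `gated_handoff` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict


class HandoffDenied(RuntimeError):
    """Raised when validation denies a control transfer (hypothetical name)."""


# Hypothetical mapping: one device ID per agent role.
ROLE_DEVICE_IDS = {
    "planner": "swarm:prod:agent-planner:01",
    "tool-runner": "swarm:prod:tool-runner:03",
}


def gated_handoff(validate: Callable[[str], Dict], target_role: str) -> str:
    """Validate the target role's device ID, then return the role to hand off to."""
    device_id = ROLE_DEVICE_IDS[target_role]
    decision = validate(device_id)
    if not decision.get("allowed", False):
        # Fail closed: no transfer happens under a denied identity.
        raise HandoffDenied(
            f"handoff to {target_role} denied: code={decision.get('code')}"
        )
    return target_role
```

The caller performs the actual transfer only after `gated_handoff` returns, so a revoked target role never receives control.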
Tool calls (highest leverage stop point)

Tools are where costs and external effects begin. If you add only one in-run boundary beyond run entry, make it a tool-call gate.

Tool-call gate strategy
  • Validate before every tool call
  • Add an additional validate immediately before high-risk tools (payments, email, writes)
  • Fail closed on timeout/network errors
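One way to make the tool-call gate hard to forget is a decorator applied where each tool is defined. The sketch below is illustrative rather than part of the SDK: `validate` is assumed to be a callable returning a decision dict with an `allowed` key, and `gated_tool` and `ToolCallDenied` are hypothetical names.

```python
import functools
from typing import Callable, Dict


class ToolCallDenied(RuntimeError):
    """Raised when a tool call is blocked (hypothetical name)."""


def gated_tool(validate: Callable[[str], Dict], device_id: str):
    """Decorator factory: validate immediately before the wrapped tool runs."""

    def decorator(tool_fn):
        @functools.wraps(tool_fn)
        def wrapper(*args, **kwargs):
            try:
                decision = validate(device_id)
            except Exception:
                # Timeout or network error: permission was not confirmed, so deny.
                decision = {"allowed": False}
            if not decision.get("allowed", False):
                raise ToolCallDenied(f"{tool_fn.__name__} denied for {device_id}")
            return tool_fn(*args, **kwargs)

        return wrapper

    return decorator
```

For a high-risk tool, the same decorator can be applied to the function that commits the side effect, in addition to whatever gate the dispatcher already runs.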
Loops and re-entry (stop latency control)

Many Swarm-style runs iterate: plan → call tools → update state → handoff → continue. If you validate only once at startup, a run can continue deep into a loop even after revoke/disable.

Recommended loop rule
  • Validate at the start of each step/iteration
  • Validate before every tool call
  • Validate before every side effect
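The loop rule can be sketched as a step runner that validates at the top of every iteration. This is a simplified illustration, not Swarm's actual run loop: `run_steps` is a hypothetical helper, `steps` stands in for whatever units of work the orchestrator iterates over, and `validate` is assumed to return a decision dict with an `allowed` key.

```python
from typing import Callable, Dict, Iterable, List


def run_steps(
    validate: Callable[[str], Dict],
    device_id: str,
    steps: Iterable[Callable[[], str]],
) -> List[str]:
    """Run steps in order, validating at the top of each iteration."""
    results: List[str] = []
    for step in steps:
        decision = validate(device_id)
        if not decision.get("allowed", False):
            # A revoke or org-wide disable took effect: stop before more work.
            break
        results.append(step())
    return results
```

With this shape, a revoke issued mid-run is seen at the next iteration instead of at the next process start.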
Streaming (don’t confuse UX with control)

Streaming improves responsiveness. It is not an enforcement boundary. A streamed run can still incur tool costs and side effects. Put validation boundaries where costs and effects occur — tool execution, handoffs, and side effects — not where output is displayed.

Path A: Python SDK (recommended)

The Python SDK is the simplest integration surface: machineid-io/python-sdk.

Install
pip install machineid-io
Minimal hard gate (run entry)
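import os
from machineid import MachineID

# Build the client from environment variables.
m = MachineID.from_env()

# Deterministic device ID; the default is for local development.
device_id = os.getenv("MACHINEID_DEVICE_ID", "swarm:dev:runner:01")

# Register the device before validating it.
m.register(device_id)

# Hard gate: if validation is denied, exit before any work begins.
decision = m.validate(device_id)
if not decision["allowed"]:
    print("Execution denied:", decision.get("code"), decision.get("request_id"))
    raise SystemExit(1)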
SDK surface area (what exists)
  • register(device_id), validate(device_id)
  • list_devices(), usage()
  • revoke(device_id), unrevoke(device_id) (alias: restore)
  • remove(device_id)
The SDK returns parsed JSON and does not raise on API-level non-2xx responses. Treat allowed:false as the stop condition.
Path B: Direct HTTP (canonical endpoints)

If you want minimal dependencies, call the canonical endpoints directly. Send your org key via x-org-key and a deterministic deviceId.

Register
POST https://machineid.io/api/v1/devices/register
Headers:
  x-org-key: org_...
Body:
  {"deviceId":"swarm:prod:runner:01"}
Validate (canonical)
POST https://machineid.io/api/v1/devices/validate
Headers:
  x-org-key: org_...
Body:
  {"deviceId":"swarm:prod:runner:01"}
Fail closed: if validate cannot be confirmed (timeout/network), treat it as not allowed and stop.
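The validate call and the fail-closed rule above can be sketched with only the Python standard library. Several details here are assumptions rather than documented behavior: the `Content-Type` header, the helper names `build_validate_request` and `validate_fail_closed`, the injectable `opener` parameter (present only to make the function testable), and the expectation that the response body is a JSON object carrying the same `allowed` field the SDK exposes.

```python
import json
import urllib.error
import urllib.request

VALIDATE_URL = "https://machineid.io/api/v1/devices/validate"
DENIED = {"allowed": False, "code": None, "request_id": None}


def build_validate_request(org_key: str, device_id: str) -> urllib.request.Request:
    """Build the POST request for the canonical validate endpoint."""
    body = json.dumps({"deviceId": device_id}).encode("utf-8")
    # Content-Type is an assumption; the guide only specifies x-org-key.
    headers = {"x-org-key": org_key, "Content-Type": "application/json"}
    return urllib.request.Request(VALIDATE_URL, data=body, headers=headers, method="POST")


def validate_fail_closed(org_key, device_id, timeout=2.0, opener=urllib.request.urlopen):
    """Return the parsed decision, or a synthetic denial if it cannot be confirmed."""
    request = build_validate_request(org_key, device_id)
    try:
        with opener(request, timeout=timeout) as response:
            decision = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Timeout, network failure, non-2xx status, or unparseable body.
        return dict(DENIED)
    if not isinstance(decision, dict):
        return dict(DENIED)
    # Anything other than an explicit allowed == True counts as denied.
    decision["allowed"] = decision.get("allowed") is True
    return decision
```

Note that `urlopen` raises on non-2xx statuses, so this sketch discards any denial details in such a response body; a fuller client might read them from the error object for logging.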
Wrapper patterns (minimal refactor, maximal control)

Wrappers let you add enforcement without rewriting your orchestration logic. The goal is consistent: validate immediately before the boundary where work begins or side effects commit.

Boundary gate helper
def must_be_allowed(m, device_id, boundary):
    """Validate device_id at the named boundary; exit the process if denied."""
    d = m.validate(device_id)
    if not d["allowed"]:
        print(f"Denied at {boundary}:", d.get("code"), d.get("request_id"))
        raise SystemExit(1)
    return d
Tool-call gate wrapper
def gated_tool_call(tool_fn, *args, **kwargs):
    must_be_allowed(m, device_id, "before_tool_call")
    return tool_fn(*args, **kwargs)
Handoff gate (control transfer boundary)
def gated_handoff(next_agent_name):
    must_be_allowed(m, device_id, f"before_handoff:{next_agent_name}")
    # perform handoff here
Side-effect gate (highest risk)
def commit_side_effect():
    must_be_allowed(m, device_id, "before_side_effect")
    perform_irreversible_action()
The best “stop point” is the one that occurs during real execution. Tools and side effects are usually the most operationally meaningful.
Device ID strategy (Swarm-friendly)

Device IDs should be stable enough to audit, but specific enough to revoke. A practical pattern:

swarm:{env}:{role}:{instance}

Examples that map cleanly to Swarm execution surfaces:

  • swarm:dev:runner:01 — your orchestrator process
  • swarm:prod:agent-planner:01 — planner role (if you model per-agent roles)
  • swarm:prod:tool-runner:03 — tool-heavy surface
  • swarm:prod:event-consumer:06 — event-triggered runs
  • swarm:prod:cron-nightly:01 — scheduled run identity
Important
  • Do not embed secrets in device IDs (IDs are identifiers, not credentials)
  • Avoid “one device for the whole fleet” unless you intentionally want coarse control
  • If you want surgical stop control, model devices at the boundary you want to stop (tool-runner and side-effect surfaces)
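Building IDs through one helper keeps them deterministic and easy to audit. The sketch below follows the swarm:{env}:{role}:{instance} pattern; the two-digit zero padding matches the examples above, while the allowed-character rule and the function names `make_device_id` and `parse_device_id` are assumptions made for illustration.

```python
import re

# Assumed rule: lowercase letters, digits, and hyphens in each segment.
_SEGMENT = re.compile(r"^[a-z0-9-]+$")


def make_device_id(env: str, role: str, instance: int) -> str:
    """Build swarm:{env}:{role}:{instance} with a zero-padded instance number."""
    for name, value in (("env", env), ("role", role)):
        if not _SEGMENT.match(value):
            raise ValueError(f"invalid {name} segment: {value!r}")
    if instance < 1:
        raise ValueError("instance must be >= 1")
    return f"swarm:{env}:{role}:{instance:02d}"


def parse_device_id(device_id: str) -> dict:
    """Split a device ID back into its parts, for logs and audits."""
    parts = device_id.split(":")
    if len(parts) != 4 or parts[0] != "swarm":
        raise ValueError(f"not a swarm device ID: {device_id!r}")
    return {"env": parts[1], "role": parts[2], "instance": int(parts[3])}
```

Rejecting colons and other free-form characters inside segments keeps the ID parseable, which matters when a denial log has to be traced back to a specific execution surface.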
Timeouts and failures (fail closed)

Treat validation like a safety-critical call. Use short timeouts and fail closed. If permission cannot be confirmed, work should not proceed.

Recommended policy
  • Client timeout: short (for example, 1–3 seconds)
  • Timeout/network failure: treat as allowed:false
  • Stop the run/worker loop and surface it via logs/alerts
“Proceed anyway” creates a second authority path inside the runtime. That is explicitly outside the guarantees.
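If the validate callable in use offers no timeout of its own, one option is to enforce a deadline around it. The following is a sketch of that idea, not a feature of the SDK: `validate_with_deadline` and `DENIED` are hypothetical names, and `validate` is assumed to return a decision dict with an `allowed` key.

```python
import concurrent.futures
from typing import Callable, Dict

DENIED = {"allowed": False, "code": None, "request_id": None}


def validate_with_deadline(
    validate: Callable[[str], Dict], device_id: str, timeout_s: float = 2.0
) -> Dict:
    """Run validate in a worker thread; deny if it errors or misses the deadline."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(validate, device_id)
        decision = future.result(timeout=timeout_s)
    except Exception:
        # Covers the deadline being missed and any error raised by validate.
        return dict(DENIED)
    finally:
        # Do not wait for a hung call; the caller gets its answer on time.
        executor.shutdown(wait=False)
    if not isinstance(decision, dict):
        return dict(DENIED)
    # Only an explicit allowed == True permits work to proceed.
    return {**decision, "allowed": decision.get("allowed") is True}
```

The abandoned worker thread still runs to completion in the background, so a client-level timeout on the underlying request remains preferable where one is available.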
Example: up to 3 devices

The free tier is enough to prove control end-to-end. The objective is to confirm that revoke/disable stops execution at predictable checkpoints.

Suggested 3-device model
  • swarm:dev:runner:01 — orchestrator run entry
  • swarm:dev:tool-runner:01 — tool-heavy surface (best stop point)
  • swarm:dev:cron-nightly:01 — scheduled execution identity
Prove remote control
  • Run a workflow that executes at least one tool call
  • Revoke swarm:dev:tool-runner:01 in the Dashboard
  • Observe: stop occurs at the next tool-call validate boundary
Control quality is primarily boundary placement. Put validate where you want the stop point.
Example: up to 25 devices

This tier supports multiple concurrent runners, consumers, and tool-heavy surfaces while keeping identity manageable.

Example topology (25-ish)
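  • swarm:prod:runner:01 through swarm:prod:runner:08 (8 runners)
  • swarm:prod:tool-runner:01 through swarm:prod:tool-runner:08 (8 tool-heavy surfaces)
  • swarm:prod:event-consumer:01 through swarm:prod:event-consumer:06 (6 triggers)
  • swarm:prod:cron-nightly:01 through swarm:prod:cron-nightly:03 (3 schedules)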
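The topology above can be expanded into a concrete device list, which is useful for registering devices up front or checking the total against a tier cap. This sketch assumes each entry denotes instances numbered 01 up to the stated count; `TOPOLOGY` and `expand_topology` are hypothetical names.

```python
from typing import Dict, List

# Role name -> number of instances, taken from the example topology.
TOPOLOGY: Dict[str, int] = {
    "runner": 8,
    "tool-runner": 8,
    "event-consumer": 6,
    "cron-nightly": 3,
}


def expand_topology(env: str, topology: Dict[str, int]) -> List[str]:
    """Return one device ID per instance, numbered from 01 within each role."""
    return [
        f"swarm:{env}:{role}:{instance:02d}"
        for role, count in topology.items()
        for instance in range(1, count + 1)
    ]


DEVICE_IDS = expand_topology("prod", TOPOLOGY)
```

Here `len(DEVICE_IDS)` is 8 + 8 + 6 + 3 = 25, which fills the tier exactly and leaves no headroom for additional unique IDs.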
Boundary plan
  • Run entry: validate
  • Before every tool call: validate
  • Before every handoff: validate
  • Before irreversible side effects: validate
This is where revoke/restore becomes operationally meaningful across multiple simultaneous execution surfaces.
Example: up to 250 devices

At this scale, execution surfaces multiply under load: autoscaling, event storms, retries, and fan-out workflows. The dominant requirement is consistent enforcement across replicas and over time.

Patterns that drive device count
  • Autoscaling runners / consumers
  • Per-workflow or per-tenant identities
  • Tool fan-out (one decision triggers many external calls)
  • Long-running loops that execute tools repeatedly
Tool-call and side-effect gates are typically the difference between control and best-effort at this scale.
Example: up to 1000 devices

Near the upper end of standard caps, the dominant failure mode is systemic multiplication: retries, recursive workflows, and many simultaneous execution surfaces.

Scale guidance
  • Prefer per-replica identities (avoid “one device for the whole fleet”)
  • Validate at boundaries that occur frequently during real work (tools + side effects + handoffs)
  • Avoid fallback authority paths (no degraded enforcement)
  • Keep device IDs deterministic and auditable
Custom device limits

If you need device limits beyond standard tiers, MachineID supports custom device limits. Your implementation pattern does not change: identity per execution surface and explicit validation boundaries.

Dashboard controls (device + org control)

MachineID provides a console at machineid.io/dashboard. The console exists outside your runtime so control does not depend on the process cooperating.

Common operations
  • Revoke / restore devices (including bulk)
  • Remove devices
  • Register devices
  • Rotate keys
  • Org-wide disable (hard stop across devices)
Dashboard actions become effective at the next validate. Validation placement determines stop behavior.
Org-wide disable (emergency stop)

In addition to revoking individual devices, MachineID supports an org-wide disable control. This changes validate outcomes across the org (allowed becomes false).

Operational semantics
  • Org-wide disable does not change device revoked/restored state
  • It takes effect at the next validation boundary you defined
  • To make it operationally useful, validate during runs (tools/handoffs/side effects)
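For local tests of these semantics, an in-memory stand-in can be handy. The class below is a toy model and not MachineID's implementation: it encodes only the behavior described above (org-wide disable denies every validate without touching per-device revoked state), and it additionally assumes that an unregistered device is denied. `FakeControlPlane` is a hypothetical name.

```python
class FakeControlPlane:
    """In-memory test double that models revoke/restore and org-wide disable."""

    def __init__(self):
        self.registered = set()
        self.revoked = set()
        self.org_disabled = False

    def register(self, device_id: str) -> None:
        self.registered.add(device_id)

    def revoke(self, device_id: str) -> None:
        self.revoked.add(device_id)

    def restore(self, device_id: str) -> None:
        self.revoked.discard(device_id)

    def validate(self, device_id: str) -> dict:
        # Org-wide disable overrides everything but leaves revoked state alone.
        allowed = (
            not self.org_disabled
            and device_id in self.registered
            and device_id not in self.revoked
        )
        return {"allowed": allowed}
```

Because its `validate` has the same shape as the real decision (a dict with `allowed`), the double can be passed to gate helpers in unit tests to rehearse a revoke or disable without calling the service.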
Stop latency (what actually stops, and when)

Remote controls become effective at the next validate. Stop latency is determined by your boundary placement: if you only validate at run entry, revocation will not stop a long run already deep in tool loops.

Make stop control operationally useful
  • Validate before tool calls
  • Validate before side effects
  • Validate before handoffs
  • Validate at loop re-entry points
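The effect of boundary placement on stop latency can be illustrated with a small simulation. It is a deliberately simplified model with assumed names: a run makes a fixed number of sequential tool calls, the operator revokes the device right after one of them completes, and the function counts how many calls still execute afterwards.

```python
def tool_calls_after_revoke(
    total_calls: int, revoke_after: int, validate_each_call: bool
) -> int:
    """Count tool calls that still execute after the device is revoked."""
    revoked = False
    executed_after_revoke = 0
    # Run entry is assumed to have validated successfully before this loop.
    for call_index in range(1, total_calls + 1):
        if validate_each_call and revoked:
            break  # the in-run validate sees the revoke and stops the run
        if revoked:
            executed_after_revoke += 1  # work continues under a revoked device
        # (the tool call itself would execute here)
        if call_index == revoke_after:
            revoked = True  # operator revokes from the dashboard
    return executed_after_revoke


entry_only = tool_calls_after_revoke(10, 3, validate_each_call=False)
per_call = tool_calls_after_revoke(10, 3, validate_each_call=True)
```

In this model `entry_only` is 7 and `per_call` is 0: with a run-entry gate alone, every remaining call executes after the revoke. A call already in flight when the revoke lands would still finish in a real system, which is why long-running tools benefit from a gate before they begin.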
What not to do

These patterns defeat the purpose of external enforcement:

  • Proceed anyway on validation timeout or error
  • Continue for a fixed grace window while enforcement is unavailable
  • Fall back to internal flags as an alternate authority path
  • Validate only at run entry for long-running, tool-heavy runs
If your runtime can execute without external permission, permission is best-effort. MachineID is designed to avoid that.
Troubleshooting
Revocation “doesn’t stop immediately”
  • This almost always means validate boundaries are too far apart
  • Add validate before tool calls and before side effects
  • Add validate before handoffs (control transfer)
  • If one tool call runs for minutes, gate it before it begins
Validate returns denied
  • Check code and request_id for the decision
  • Confirm the device is not revoked in the dashboard
  • Confirm org-wide disable is not enabled
  • Confirm you have not exceeded your device cap (new unique IDs)
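Recording code and request_id at the moment of denial makes the checks above much faster. A minimal logging sketch follows; the logger name and the `log_denial` helper are hypothetical, and the decision is assumed to be the dict returned by validate.

```python
import logging

logger = logging.getLogger("machineid.enforcement")  # hypothetical logger name


def log_denial(decision: dict, device_id: str, boundary: str) -> str:
    """Log a denial as an operational event and return the logged message."""
    message = (
        f"validate denied device_id={device_id} boundary={boundary} "
        f"code={decision.get('code')} request_id={decision.get('request_id')}"
    )
    logger.warning(message)
    return message
```

Including the boundary name shows where the run stopped, which also helps when diagnosing stop latency.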
Timeouts / network failures
  • Use short client timeouts (1–3s) and fail closed
  • Treat inability to validate as not allowed and stop
  • Surface denial via logs and stop the run
LLM implementation prompts (step-by-step plans)

The prompts below are designed to produce practical integration plans with minimal guesswork. Replace bracketed placeholders and paste into your LLM of choice.

Prompt 1 — Integrate MachineID into my Swarm runner (SDK path)
I have a Python Swarm-style orchestrator (agents + tools + handoffs). I want hard enforcement using MachineID.

Context:
- My org key: [PASTE ORG KEY]
- Base URL: https://machineid.io
- Device ID pattern: swarm:{env}:{role}:{instance}
- Fail-closed policy, short timeout (1–3s)
- Validation boundaries required:
  1) Run entry (register + validate, fail closed)
  2) Before every tool execution (highest leverage)
  3) Before every handoff (control transfer)
  4) Before irreversible side effects (writes, sends, payments)
  5) At loop re-entry points (high-cost cycles)

Please provide:
1) Exact files/locations to change
2) Copy/paste code blocks using the MachineID Python SDK (pip install machineid-io)
3) A test plan:
   - revoke a device from dashboard
   - restore it
   - use org-wide disable
   - verify stops occur at the next validate boundary
Prompt 2 — Design my device model by tier
Help me model MachineID devices for my Swarm-style system.

Inputs:
- Agents: [list agent roles]
- Tools: [list tools, note which ones are high-risk side effects]
- Triggers: [API requests / cron / queue / event consumer]
- Expected scale: [3 / 25 / 250 / 1000]
- I need readable device IDs and surgical revoke control.

Output:
- A proposed device ID scheme
- A concrete list of device IDs for the target tier
- Where to validate (run entry / tools / handoffs / side effects / loop re-entry)
- A minimal runbook for revoke + org-wide disable
Prompt 3 — Add stop points inside a tool loop
I have a Swarm-style run that loops and can call tools repeatedly.

Goal:
- Add MachineID validate boundaries inside the loop so I can stop it remotely.
- Fail closed on validation timeout or error.

Please provide:
- Exactly where to place validation calls
- A wrapper pattern that is hard to forget
- A test plan using dashboard revoke and org-wide disable
LLM checklist (what a correct integration includes)
A correct implementation should have all of the following
  • Stable device identity per execution surface
  • Run-entry gating (register + validate, fail closed)
  • At least one in-run stop point (tool-call or side-effect boundary)
  • Handoff gating (control transfer boundary)
  • Short timeout and consistent failure policy
  • Denials logged as operational events (include request_id)
  • A runbook to revoke/restore and use org-wide disable