Swarm Integration Guide

What this is

This is an in-depth manual for integrating MachineID into Swarm-style agent orchestration. It is intentionally extensive so that it can serve as a single reference for: (1) designing enforceable validation boundaries in multi-agent handoff flows, (2) modeling devices correctly across execution surfaces, and (3) applying remote stop control across tool-heavy and side-effect-heavy runs.

What you can implement from this page

Hard gating at run entry, tool execution, and handoff boundaries
Device identity schemes that scale across agents and replicas
Remote stop: revoke/restore, bulk controls, and org-wide disable
Operational semantics: fail-closed, short timeouts, predictable stop points

What this guide assumes

You want enforcement, not “best effort”
You will define explicit stop points in the runtime
You want authority to live outside the process

MachineID externalizes authority: if validation fails, work does not begin.

Note on Swarm

Swarm is commonly used as a learning/prototyping reference for agent handoffs and tool calling. The enforcement patterns here generalize to production agent runtimes that share the same primitives: run entry, tool execution, and handoffs.

Core invariant

Everything reduces to one invariant:

Register. Validate. Work.

If validation fails, work does not begin.

In Swarm, “work begins” at recognizable boundaries: starting a run, executing a tool function, performing a side effect, and transferring control during a handoff.

Fastest path (verify control end-to-end)

The fastest way to verify the control plane is to add a single hard gate to your runner, then revoke that device in the dashboard and confirm execution stops on the next validate.

Prerequisites

Generate a free org key (supports up to 3 devices): machineid.io
Confirm you can revoke/restore from the Dashboard
Use deterministic device IDs (readable and stable)

After you prove run-entry control, the next step is to add boundaries that occur during real work: tool calls, handoffs, and side effects.

What Swarm is (and what matters for enforcement)

Swarm-style systems center on a small set of primitives:

Agents that produce messages and call tools
Tools (functions) where external requests and side effects occur
Handoffs where control transfers between agents
Loops / re-entry where execution continues across steps

Enforcement becomes operational only when you validate at the boundaries that actually occur during runs: tools, handoffs, and side effects.

Execution boundaries (what counts as “work begins”)

High-value boundaries in Swarm-style runtimes

Run entry: immediately before the orchestration run starts
Tool execution: immediately before each tool function is executed
Handoff boundary: immediately before transferring control to a different agent
Side-effect boundary: immediately before irreversible actions (writes, sends, purchases)
Loop re-entry: at the top of high-cost cycles (fan-out, recursion, repeated planning)

MachineID does not introspect your orchestration. If a run can loop indefinitely, you must place enforcement boundaries inside the loop.

Handoffs (control transfer is a boundary)

Handoffs are where responsibility moves from one agent to another. Treat that transfer as an enforcement point.

Recommended handoff rule

Validate immediately before performing the handoff
If denied, do not transfer; stop the run (fail closed)
If you model one device per agent role, validate using the target agent’s device ID before the transfer

This prevents delayed or unintended execution under a new agent role after a device was revoked or the org was disabled.
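The handoff rule above can be sketched as a small gate. This is a minimal illustration, not the SDK's own API: `ROLE_DEVICE_IDS`, `HandoffDenied`, the `executor` role, and the `do_handoff` callback are hypothetical names, and `client` is assumed to be any object whose `validate(device_id)` returns a dict with an `allowed` field.

```python
class HandoffDenied(RuntimeError):
    """Raised when validation fails before a control transfer."""


# Hypothetical mapping from agent role to device ID (one device per role).
ROLE_DEVICE_IDS = {
    "planner": "swarm:prod:agent-planner:01",
    "executor": "swarm:prod:agent-executor:01",
}


def gated_handoff(client, target_role, do_handoff):
    """Validate the target role's device, then transfer control."""
    device_id = ROLE_DEVICE_IDS[target_role]
    # If validate itself raises (timeout, network), the exception propagates
    # and the handoff below never runs, so the gate still fails closed.
    decision = client.validate(device_id)
    if not decision.get("allowed"):
        raise HandoffDenied(f"denied for {device_id}: {decision.get('code')}")
    return do_handoff(target_role)
```

Validating with the target role's ID means a revoked planner cannot receive control even while the current agent's device is still allowed.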

Tool calls (highest leverage stop point)

Tools are where costs and external effects begin. If you add only one in-run boundary beyond run entry, make it a tool-call gate.

Tool-call gate strategy

Validate before every tool call
Add an additional validate immediately before high-risk tools (payments, email, writes)
Fail closed on timeout/network errors
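One way to make the tool-call gate hard to forget is a decorator. The sketch below assumes a client object with a `validate(device_id)` method returning a dict with `allowed`; `make_check`, `gated_tool`, and `ExecutionDenied` are illustrative names, not part of the MachineID SDK.

```python
import functools


class ExecutionDenied(RuntimeError):
    """Raised when a validate check does not come back allowed."""


def make_check(client, device_id):
    """Return a function that validates and raises unless allowed."""
    def check(boundary):
        # An exception from validate propagates, so the tool never runs.
        decision = client.validate(device_id)
        if not decision.get("allowed"):
            raise ExecutionDenied(f"{boundary}: {decision.get('code')}")
    return check


def gated_tool(check):
    """Decorator: run the check before every call to the wrapped tool."""
    def decorate(tool_fn):
        @functools.wraps(tool_fn)
        def wrapper(*args, **kwargs):
            check(f"before_tool_call:{tool_fn.__name__}")
            return tool_fn(*args, **kwargs)
        return wrapper
    return decorate
```

A high-risk tool can call the same `check("before_side_effect")` again inside its body, immediately before the irreversible step, to get the additional validate described above.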

Loops and re-entry (stop latency control)

Many Swarm-style runs iterate: plan → call tools → update state → handoff → continue. If you validate only once at startup, a run can continue deep into a loop even after revoke/disable.

Recommended loop rule

Validate at the start of each step/iteration
Validate before every tool call
Validate before every side effect
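To see why per-iteration validation bounds stop latency, here is a small simulation. `FakeClient` is a local stand-in that starts denying after a fixed number of validates (imitating a revoke that lands mid-run); it is not the real SDK, and `run_loop` is a hypothetical helper.

```python
class FakeClient:
    """Stand-in for a MachineID client; denies after `allow_count` validates."""

    def __init__(self, allow_count):
        self.remaining = allow_count

    def validate(self, device_id):
        self.remaining -= 1
        return {"allowed": self.remaining >= 0}


def run_loop(client, device_id, steps):
    """Run steps in order, validating at the top of each iteration."""
    completed = []
    for step in steps:
        if not client.validate(device_id).get("allowed"):
            break  # fail closed: stop before this step's work begins
        completed.append(step())
    return completed


steps = [lambda: "plan", lambda: "call_tool", lambda: "handoff"]
print(run_loop(FakeClient(allow_count=2), "swarm:dev:runner:01", steps))
# ['plan', 'call_tool']
```

With a single validate at startup, all three steps would have run; with one per iteration, the run stops at the first boundary after the revoke.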

Streaming (don’t confuse UX with control)

Streaming improves responsiveness. It is not an enforcement boundary. A streamed run can still incur tool costs and side effects. Put validation boundaries where costs and effects occur — tool execution, handoffs, and side effects — not where output is displayed.

Path A: Python SDK (recommended)

The Python SDK is the simplest integration surface: machineid-io/python-sdk.

Install

pip install machineid-io

Minimal hard gate (run entry)

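import os
from machineid import MachineID

# Build the client from environment configuration.
m = MachineID.from_env()

# Deterministic device ID, overridable per process.
device_id = os.getenv("MACHINEID_DEVICE_ID", "swarm:dev:runner:01")

m.register(device_id)

# Hard gate: stop before any work begins unless validation allows it.
decision = m.validate(device_id)
if not decision["allowed"]:
    print("Execution denied:", decision.get("code"), decision.get("request_id"))
    raise SystemExit(1)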

SDK surface area (what exists)

register(device_id), validate(device_id)
list_devices(), usage()
revoke(device_id), unrevoke(device_id) (alias: restore)
remove(device_id)

The SDK returns parsed JSON and does not raise for API-level non-2xx responses. Treat allowed:false as the stop condition.

Path B: Direct HTTP (canonical endpoints)

If you want minimal dependencies, call the canonical endpoints directly. Send your org key via x-org-key and a deterministic deviceId.

Register

POST https://machineid.io/api/v1/devices/register
Headers:
  x-org-key: org_...
Body:
  {"deviceId":"swarm:prod:runner:01"}

Validate (canonical)

POST https://machineid.io/api/v1/devices/validate
Headers:
  x-org-key: org_...
Body:
  {"deviceId":"swarm:prod:runner:01"}

Fail closed: if validate cannot be confirmed (timeout/network), treat it as not allowed and stop.
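A dependency-free version of the validate call might look like the following, using only the standard library. Several details are assumptions rather than documented behavior: that the endpoint accepts a JSON body with a `Content-Type: application/json` header, and that the response is a JSON object containing `allowed`. The `CLIENT_FAIL_CLOSED` code is a local, hypothetical marker for denials produced by the client itself, and the `opener` parameter exists only so the call can be tested without network access.

```python
import json
import urllib.request

VALIDATE_URL = "https://machineid.io/api/v1/devices/validate"


def validate_http(org_key, device_id, timeout=2.0, opener=urllib.request.urlopen):
    """POST to the validate endpoint; return a decision dict, failing closed."""
    request = urllib.request.Request(
        VALIDATE_URL,
        data=json.dumps({"deviceId": device_id}).encode("utf-8"),
        headers={"x-org-key": org_key, "Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout) as response:
            decision = json.loads(response.read().decode("utf-8"))
    except Exception as exc:  # timeout, DNS, HTTP error, bad JSON: fail closed
        return {"allowed": False, "code": "CLIENT_FAIL_CLOSED", "error": str(exc)}
    if not isinstance(decision, dict) or "allowed" not in decision:
        return {"allowed": False, "code": "CLIENT_FAIL_CLOSED",
                "error": "unexpected response shape"}
    return decision
```

Note that `urllib` raises on non-2xx statuses, so in this sketch an API-level denial delivered with an error status is reported as a local fail-closed denial rather than with the server's own code.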

Wrapper patterns (minimal refactor, maximal control)

Wrappers let you add enforcement without rewriting your orchestration logic. The goal is consistent: validate immediately before the boundary where work begins or side effects commit.

Boundary gate helper

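def must_be_allowed(m, device_id, boundary):
    """Validate device_id and exit the process unless the decision is allowed."""
    d = m.validate(device_id)
    if not d["allowed"]:
        # Log the decision code and request ID so the denial can be traced.
        print(f"Denied at {boundary}:", d.get("code"), d.get("request_id"))
        raise SystemExit(1)
    return d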

Tool-call gate wrapper

def gated_tool_call(tool_fn, *args, **kwargs):
    # m and device_id are the client and device ID created at run entry.
    must_be_allowed(m, device_id, "before_tool_call")
    return tool_fn(*args, **kwargs)

Handoff gate (control transfer boundary)

def gated_handoff(next_agent_name):
    must_be_allowed(m, device_id, f"before_handoff:{next_agent_name}")
    # perform handoff here

Side-effect gate (highest risk)

def commit_side_effect():
    must_be_allowed(m, device_id, "before_side_effect")
    # Placeholder: replace with your own write, send, or purchase.
    perform_irreversible_action()

The best “stop point” is the one that occurs during real execution. Tools and side effects are usually the most operationally meaningful.

Device ID strategy (Swarm-friendly)

Device IDs should be stable enough to audit, but specific enough to revoke. A practical pattern:

swarm:{env}:{role}:{instance}

Examples that map cleanly to Swarm execution surfaces:

swarm:dev:runner:01 — your orchestrator process
swarm:prod:agent-planner:01 — planner role (if you model per-agent roles)
swarm:prod:tool-runner:03 — tool-heavy surface
swarm:prod:event-consumer:06 — event-triggered runs
swarm:prod:cron-nightly:01 — scheduled run identity

Important

Do not embed secrets in device IDs (IDs are identifiers, not credentials)
Avoid “one device for the whole fleet” unless you intentionally want coarse control
If you want surgical stop control, model devices at the boundary you want to stop (tool-runner and side-effect surfaces)
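Building IDs through one helper keeps them deterministic and consistent across processes. The sketch below follows the swarm:{env}:{role}:{instance} pattern; the character restriction (lowercase letters, digits, hyphens) and the two-digit instance padding are assumptions chosen to match the examples above, not documented MachineID requirements.

```python
import re

# Assumed constraint for this sketch: lowercase letters, digits, hyphens.
_SEGMENT = re.compile(r"^[a-z0-9-]+$")


def device_id(env, role, instance, prefix="swarm"):
    """Build a deterministic ID in the form prefix:env:role:NN."""
    for name, value in (("env", env), ("role", role)):
        if not _SEGMENT.match(value):
            raise ValueError(f"invalid {name} segment: {value!r}")
    if instance < 1:
        raise ValueError("instance must be >= 1")
    return f"{prefix}:{env}:{role}:{instance:02d}"


print(device_id("prod", "tool-runner", 3))  # swarm:prod:tool-runner:03
```

Rejecting unexpected characters up front also makes it harder to accidentally embed a token or other secret in an ID.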

Timeouts and failures (fail closed)

Treat validation like a safety-critical call. Use short timeouts and fail closed. If permission cannot be confirmed, work should not proceed.

Recommended policy

Client timeout: short (for example, 1–3 seconds)
Timeout/network failure: treat as allowed:false
Stop the run/worker loop and surface it via logs/alerts
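The policy above can be wrapped around any client so that callers only ever see a decision dict. This is a sketch under assumptions: `client.validate` may raise on timeout or network failure (the SDK's behavior in that case is not specified here), the timeout itself is configured wherever the client makes its request, and `CLIENT_FAIL_CLOSED` and the logger name are hypothetical.

```python
import logging

log = logging.getLogger("machineid.gate")  # hypothetical logger name


def safe_validate(client, device_id):
    """Return a decision dict; never raise, and deny whenever in doubt."""
    try:
        decision = client.validate(device_id)
    except Exception as exc:  # timeout or network failure: fail closed
        log.error("validate unavailable for %s: %s", device_id, exc)
        return {"allowed": False, "code": "CLIENT_FAIL_CLOSED"}
    if not isinstance(decision, dict):
        log.error("unexpected validate result for %s", device_id)
        return {"allowed": False, "code": "CLIENT_FAIL_CLOSED"}
    if decision.get("allowed") is not True:
        # Log code and request_id so the denial is traceable as an event.
        log.warning("denied %s code=%s request_id=%s", device_id,
                    decision.get("code"), decision.get("request_id"))
        return {**decision, "allowed": False}
    return decision
```

Only an explicit `allowed` of `True` lets work proceed; every other outcome, including a malformed response, is treated as a denial.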

“Proceed anyway” creates a second authority path inside the runtime, which is explicitly outside the operational guarantees.

Example: up to 3 devices

The free tier is enough to prove control end-to-end. The objective is to confirm that revoke/disable stops execution at predictable checkpoints.

Suggested 3-device model

swarm:dev:runner:01 — orchestrator run entry
swarm:dev:tool-runner:01 — tool-heavy surface (best stop point)
swarm:dev:cron-nightly:01 — scheduled execution identity

Prove remote control

Run a workflow that executes at least one tool call
Revoke swarm:dev:tool-runner:01 in the Dashboard
Observe: stop occurs at the next tool-call validate boundary

Control quality is primarily boundary placement. Put validate where you want the stop point.

Example: up to 25 devices

This tier supports multiple concurrent runners, consumers, and tool-heavy surfaces while keeping identity manageable.

Example topology (25 devices)

swarm:prod:runner:01 … :08 (8 runners)
swarm:prod:tool-runner:01 … :08 (8 tool-heavy surfaces)
swarm:prod:event-consumer:01 … :06 (6 triggers)
swarm:prod:cron-nightly:01 … :03 (3 schedules)
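The topology above can be expanded into a concrete ID list and checked against the tier cap before anything is registered. `TOPOLOGY` and `expand_topology` are illustrative names for this sketch.

```python
# Role -> replica count, taken from the example topology above.
TOPOLOGY = {
    "runner": 8,
    "tool-runner": 8,
    "event-consumer": 6,
    "cron-nightly": 3,
}


def expand_topology(topology, env="prod"):
    """Return every device ID implied by a role -> replica-count mapping."""
    return [
        f"swarm:{env}:{role}:{n:02d}"
        for role, count in topology.items()
        for n in range(1, count + 1)
    ]


device_ids = expand_topology(TOPOLOGY)
print(len(device_ids))  # 25
```

Generating the list from counts makes it easy to see how a change in replica numbers moves you toward or past a device cap.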

Boundary plan

Run entry: validate
Before every tool call: validate
Before every handoff: validate
Before irreversible side effects: validate

This is where revoke/restore becomes operationally meaningful across multiple simultaneous execution surfaces.

Example: up to 250 devices

At this scale, execution surfaces multiply under load: autoscaling, event storms, retries, and fan-out workflows. The dominant requirement is consistent enforcement across replicas and over time.

Patterns that drive device count

Autoscaling runners / consumers
Per-workflow or per-tenant identities
Tool fan-out (one decision triggers many external calls)
Long-running loops that execute tools repeatedly

Tool-call and side-effect gates are typically the difference between control and best-effort at this scale.

Example: up to 1000 devices

Near the upper end of standard caps, the dominant failure mode is systemic multiplication: retries, recursive workflows, and many simultaneous execution surfaces.

Scale guidance

Prefer per-replica identities (avoid “one device for the whole fleet”)
Validate at boundaries that occur frequently during real work (tools + side effects + handoffs)
Avoid fallback authority paths (no degraded enforcement)
Keep device IDs deterministic and auditable

Custom device limits

If you need device limits beyond standard tiers, MachineID supports custom device limits. Your implementation pattern does not change: identity per execution surface and explicit validation boundaries.

Dashboard controls (device + org control)

MachineID provides a console at machineid.io/dashboard. The console exists outside your runtime so control does not depend on the process cooperating.

Common operations

Revoke / restore devices (including bulk)
Remove devices
Register devices
Rotate keys
Org-wide disable (hard stop across devices)

Dashboard actions become effective at the next validate. Validation placement determines stop behavior.

Org-wide disable (emergency stop)

In addition to revoking individual devices, MachineID supports an org-wide disable control. This changes validate outcomes across the org (allowed becomes false).

Operational semantics

Org-wide disable does not change device revoked/restored state
It takes effect at the next validation boundary you defined
To make it operationally useful, validate during runs (tools/handoffs/side effects)
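For local tests of a runbook, the semantics listed above can be imitated with a small test double. `FakeControlPlane` is purely hypothetical: it models only the two facts stated here (org-wide disable denies every validate, and it leaves per-device revoked state untouched) and says nothing about how the real service is implemented.

```python
class FakeControlPlane:
    """Local test double; models only the semantics described above."""

    def __init__(self):
        self.revoked = set()
        self.org_disabled = False

    def revoke(self, device_id):
        self.revoked.add(device_id)

    def restore(self, device_id):
        self.revoked.discard(device_id)

    def validate(self, device_id):
        # Org-wide disable overrides everything; revoked state is separate.
        allowed = not self.org_disabled and device_id not in self.revoked
        return {"allowed": allowed}


cp = FakeControlPlane()
cp.revoke("swarm:dev:tool-runner:01")
cp.org_disabled = True   # emergency stop: every validate is denied
cp.org_disabled = False  # re-enable: the earlier revoke is still in effect
print(cp.validate("swarm:dev:tool-runner:01"))  # {'allowed': False}
print(cp.validate("swarm:dev:runner:01"))       # {'allowed': True}
```

Passing an object like this wherever your gates expect a client lets you rehearse revoke, restore, and org-wide disable without touching a real org.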

Stop latency (what actually stops, and when)

Remote controls become effective at the next validate. Stop latency is determined by your boundary placement: if you only validate at run entry, revocation will not stop a long run already deep in tool loops.

Make stop control operationally useful

Validate before tool calls
Validate before side effects
Validate before handoffs
Validate at loop re-entry points

What not to do

These patterns defeat the purpose of external enforcement:

Proceed anyway on validation timeout or error
Continue for a fixed grace window while enforcement is unavailable
Fall back to internal flags as an alternate authority path
Validate only at run entry for long-running, tool-heavy runs

If your runtime can execute without external permission, permission is best-effort. MachineID is designed to avoid that.

Troubleshooting

Revocation “doesn’t stop immediately”

This almost always means validate boundaries are too far apart
Add validate before tool calls and before side effects
Add validate before handoffs (control transfer)
If one tool call runs for minutes, gate it before it begins

Validate returns denied

Check code and request_id for the decision
Confirm the device is not revoked in the dashboard
Confirm org-wide disable is not enabled
Confirm you have not exceeded your device cap (new unique IDs)

Timeouts / network failures

Use short client timeouts (1–3s) and fail closed
Treat inability to validate as not allowed and stop
Surface denial via logs and stop the run

LLM implementation prompts (step-by-step plans)

The prompts below are designed to produce practical integration plans with minimal guesswork. Replace bracketed placeholders and paste into your LLM of choice.

Prompt 1 — Integrate MachineID into my Swarm runner (SDK path)

I have a Python Swarm-style orchestrator (agents + tools + handoffs). I want hard enforcement using MachineID.

Context:
- My org key: [PASTE ORG KEY]
- Base URL: https://machineid.io
- Device ID pattern: swarm:{env}:{role}:{instance}
- Fail-closed policy, short timeout (1–3s)
- Validation boundaries required:
  1) Run entry (register + validate, fail closed)
  2) Before every tool execution (highest leverage)
  3) Before every handoff (control transfer)
  4) Before irreversible side effects (writes, sends, payments)
  5) At loop re-entry points (high-cost cycles)

Please provide:
1) Exact files/locations to change
2) Copy/paste code blocks using the MachineID Python SDK (pip install machineid-io)
3) A test plan:
   - revoke a device from dashboard
   - restore it
   - use org-wide disable
   - verify stops occur at the next validate boundary

Prompt 2 — Design my device model by tier

Help me model MachineID devices for my Swarm-style system.

Inputs:
- Agents: [list agent roles]
- Tools: [list tools, note which ones are high-risk side effects]
- Triggers: [API requests / cron / queue / event consumer]
- Expected scale: [3 / 25 / 250 / 1000]
- I need readable device IDs and surgical revoke control.

Output:
- A proposed device ID scheme
- A concrete list of device IDs for the target tier
- Where to validate (run entry / tools / handoffs / side effects / loop re-entry)
- A minimal runbook for revoke + org-wide disable

Prompt 3 — Add stop points inside a tool loop

I have a Swarm-style run that loops and can call tools repeatedly.

Goal:
- Add MachineID validate boundaries inside the loop so I can stop it remotely.
- Fail closed on validation timeout or error.

Please provide:
- Exactly where to place validation calls
- A wrapper pattern that is hard to forget
- A test plan using dashboard revoke and org-wide disable

LLM checklist (what a correct integration includes)

A correct implementation should have all of the following:

Stable device identity per execution surface
Run-entry gating (register + validate, fail closed)
At least one in-run stop point (tool-call or side-effect boundary)
Handoff gating (control transfer boundary)
Short timeout and consistent failure policy
Denials logged as operational events (include request_id)
A runbook to revoke/restore and use org-wide disable

References

Python SDK: github.com/machineid-io/python-sdk
MachineID GitHub org: github.com/machineid-io
Dashboard: machineid.io/dashboard
Core enforcement guidance: Implementation Guide and Operational Guarantees
