⚙ AutoOps — Technical Architecture
System design reference for developers and technical evaluators
Section 1 — System Overview
AutoOps — What It Is
AutoOps is a modular automation layer that sits between Bitcoin ATM hardware and the humans operating it. It eliminates the manual monitoring loop — operators no longer need to log into 5 separate dashboards, check email for Brinks replies, or manually watch RAM counters. AutoOps watches everything and acts.
Core Design Principles
Event-driven — every machine state change triggers a logged event, a Slack alert, and optionally an automated action
Circuit-breaker pattern — prevents runaway reboot loops; after N reboots in a time window, auto-reboot halts and Guardian flags for human review
Audit-first — every action is logged with timestamp and trigger source before it executes; the outcome is appended to the log once the action completes
Human override always available — Operator Console accepts plaintext commands at any time; HALT command stops all automation instantly
Demo vs. Production
Demo: Python Flask backend with mock in-memory state and a simulated poll loop
Production: Node.js or Python service connecting to real APIs (Lamassu, WTI, Airtable, Brinks, Slack)
Demo poll: 1-second in-memory loop
Production poll: configurable interval per machine (default 60s), webhook-augmented where APIs support it
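The per-machine interval could be tracked with a small scheduler like the one below. This is a minimal sketch, not the AutoOps implementation: `PollSchedule`, its fields, and the machine IDs in the usage note are hypothetical names; only the 60-second default comes from the text above.

```python
from dataclasses import dataclass, field

DEFAULT_POLL_INTERVAL_S = 60  # production default stated above


@dataclass
class PollSchedule:
    """Hypothetical helper: decides which machines are due for a poll."""
    intervals: dict = field(default_factory=dict)    # machine_id -> seconds (overrides)
    last_polled: dict = field(default_factory=dict)  # machine_id -> last poll timestamp

    def interval_for(self, machine_id: str) -> int:
        # Fall back to the default when no per-machine override is configured.
        return self.intervals.get(machine_id, DEFAULT_POLL_INTERVAL_S)

    def due(self, machine_ids, now: float) -> list:
        """Return the machines whose interval has elapsed at time `now`."""
        ready = []
        for machine_id in machine_ids:
            last = self.last_polled.get(machine_id)
            if last is None or now - last >= self.interval_for(machine_id):
                ready.append(machine_id)
        return ready

    def mark_polled(self, machine_id: str, now: float) -> None:
        self.last_polled[machine_id] = now
```

A poll loop would call `due(...)` on each tick, poll only the returned machines, then call `mark_polled(...)`. Webhook-delivered updates, where an API supports them, can call `mark_polled` as well so the next scheduled poll is pushed back.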
Section 2 — Backend Architecture
Stack
Runtime: Python 3.11 / Flask (demo) → Node.js 22 + Express (production)
State engine: In-memory dict + threading.Lock (demo) → PostgreSQL/Redis (production)
API layer: REST JSON, all endpoints under /api/
Background: threading.Thread daemon (demo) → Node.js worker threads / cron (production)
Logging: JSONL append-only audit log (demo) → Postgres events table (production)
Auth: None on demo → JWT + API key per integration (production)
Deployment: Systemd service, nginx reverse proxy, port 18860
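The demo's state engine and audit log (in-memory dict, threading.Lock, JSONL append-only file) might look roughly like this. It is a sketch under assumptions: the class name `StateStore`, the record fields, and the file path are illustrative and not taken from the AutoOps codebase.

```python
import json
import threading
from datetime import datetime, timezone


class StateStore:
    """In-memory machine state guarded by a lock, with a JSONL audit trail."""

    def __init__(self, audit_path: str):
        self._lock = threading.Lock()
        self._state = {}
        self._audit_path = audit_path

    def _audit(self, event: dict) -> None:
        # One JSON object per line; the file is only ever opened in append mode.
        record = {"ts": datetime.now(timezone.utc).isoformat(), **event}
        with open(self._audit_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

    def update(self, machine_id: str, source: str, **fields) -> dict:
        """Write the audit entry first, then apply the change, under one lock."""
        with self._lock:
            self._audit({"machine": machine_id, "source": source, "fields": fields})
            self._state.setdefault(machine_id, {}).update(fields)
            return dict(self._state[machine_id])

    def get(self, machine_id: str) -> dict:
        # Return a copy so callers cannot mutate shared state outside the lock.
        with self._lock:
            return dict(self._state.get(machine_id, {}))
```

Holding the lock across both the log write and the state change keeps log order identical to the order in which changes were applied, at the cost of doing file I/O under the lock. That trade-off is reasonable for a demo; the Postgres events table named above would replace it in production.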
State Machine Per Machine
Each machine runs a finite state machine (FSM) with these states:
FSM transitions are triggered by: RAM threshold breach (configurable, default 160MB), manual operator console command, WTI API reboot call, or circuit breaker reset.
Circuit Breaker Logic
REBOOT_LIMIT = 4 # max reboots before circuit breaks
REBOOT_WINDOW = 180 # minutes — rolling window
RAM_THRESHOLD = 160 # MB — triggers preventive reboot
When a machine hits REBOOT_LIMIT reboots within REBOOT_WINDOW minutes:
Auto-reboot halts
Machine status → error
Guardian flags the event at CRITICAL severity
Slack alert fires to #guardian-alerts
Audit log entry written
Only a clear <MACHINE_ID> command from a human resets it
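The rolling-window rule above can be sketched as follows. The constants are the ones listed; the class and method names are hypothetical, and the choice to trip on the reboot that reaches the limit (rather than on the next attempt) is one reading of "hits REBOOT_LIMIT reboots".

```python
from collections import deque

REBOOT_LIMIT = 4     # max reboots before circuit breaks
REBOOT_WINDOW = 180  # minutes, rolling window


class CircuitBreaker:
    """Per-machine reboot counter over a rolling time window."""

    def __init__(self, limit: int = REBOOT_LIMIT, window_min: int = REBOOT_WINDOW):
        self.limit = limit
        self.window_s = window_min * 60
        self.reboots = deque()  # timestamps (seconds) of recent reboots
        self.tripped = False

    def record_reboot(self, now: float) -> bool:
        """Record a reboot at `now`; return True if the breaker is now tripped."""
        # Drop reboots that have aged out of the rolling window.
        while self.reboots and now - self.reboots[0] >= self.window_s:
            self.reboots.popleft()
        self.reboots.append(now)
        if len(self.reboots) >= self.limit:
            self.tripped = True
        return self.tripped

    def can_auto_reboot(self) -> bool:
        return not self.tripped

    def clear(self) -> None:
        """Reset after a human issues `clear <MACHINE_ID>`."""
        self.tripped = False
        self.reboots.clear()
```

Once `record_reboot` returns True, the caller would set the machine status to error, raise the CRITICAL Guardian flag, post to #guardian-alerts, and write the audit entry. The breaker never resets itself; only `clear()` does.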
Section 3 — Integration Architecture — 5 Modules
3a. Lamassu Admin
What: Bitcoin ATM OS dashboard — machine health, transaction logs, FSM state, cassette levels
Real API: Lamassu Admin REST API (self-hosted, port 8070 by default)
Auth: Session cookie / API token
Poll: GET /api/status every 60s → parse machine states → compare to last known state
Events: Any state change → write to audit_log, post to #autoops-alerts
Demo: Frontend styled to match Lamassu Admin UI (dark navy, sidebar nav, machine table)
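The "compare to last known state" step can be illustrated with a plain dictionary diff. This sketch assumes the poll response has already been reduced to a `machine_id -> state` mapping; the function name, event shape, and state labels in the example are hypothetical.

```python
def diff_states(previous: dict, current: dict) -> list:
    """Return one event per machine whose state differs from the last poll."""
    events = []
    for machine_id, new_state in sorted(current.items()):
        old_state = previous.get(machine_id)
        if old_state != new_state:
            events.append({"machine": machine_id, "from": old_state, "to": new_state})
    # Machines missing from the latest response are reported as well.
    for machine_id in sorted(previous.keys() - current.keys()):
        events.append({"machine": machine_id, "from": previous[machine_id], "to": None})
    return events


if __name__ == "__main__":
    last = {"ATM-1": "online", "ATM-2": "online"}
    now = {"ATM-1": "online", "ATM-2": "offline"}
    for event in diff_states(last, now):
        print(event)
```

Each returned event would then be written to audit_log and posted to #autoops-alerts. An empty list means the poll found nothing to report.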
3b. WTI Wireless (Remote Power)
What: Remote power control — outlet A/B per machine, force reboot via power cycle
Real API: WTI RPS/NPS REST API (port 8080 typically)
Auth: HTTP Basic or digest auth
Reboot: POST /api/outlet/{outlet_id}/action with {action: "REBOOT"}
Poll: GET /api/outlet — check outlet states every 60s
Demo: Frontend styled to match WTI Wireless Portal (blue/teal header, outlet grid)
POST /api/wti/reboot/:outlet triggers 45-second reboot animation
Production: Real WTI reboot takes 30–60s; AutoOps waits for Lamassu heartbeat to confirm recovery
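A sketch of the reboot-then-confirm flow, using only the standard library. The endpoint path and body are the ones listed above; the base URL, outlet ID, timeout, and poll spacing are assumptions, and authentication (HTTP Basic or digest) is left out. The request is built but not sent, and the heartbeat check is passed in as a callable so the waiting logic can be exercised without hardware.

```python
import json
import time
import urllib.request


def build_reboot_request(base_url: str, outlet_id: str) -> urllib.request.Request:
    """Build (but do not send) the outlet reboot call described above."""
    body = json.dumps({"action": "REBOOT"}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/outlet/{outlet_id}/action",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def wait_for_heartbeat(check, timeout_s: float = 120.0, poll_s: float = 5.0,
                       clock=time.monotonic, sleep=time.sleep) -> bool:
    """Call `check()` until it returns True or `timeout_s` has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

In production, `check` would query Lamassu for the machine's heartbeat. A False return means the machine did not recover in time, which is the point to escalate rather than retry blindly.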
3c. Airtable (Machine Database)
What: Master record for every machine (location, host contact, wifi, cash level, Brinks branch, insurance, notes)
Real API: Airtable REST API v0 — https://api.airtable.com/v0/{base_id}/{table_name}
Auth: Bearer token (AIRTABLE_API_KEY env var)
AutoOps writes: cassette_pct, last_reboot, reboot_count_24h, status, daily_volume, last_alert
Read: GET /api/machine_database — returns all records
Write: PATCH /api/machine_database/{record_id} — update AutoOps-managed fields only
Demo: Grid styled to match Airtable UI; orange cells = AutoOps-written fields
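The "AutoOps-managed fields only" rule can be enforced before any request leaves the service. The sketch below builds the upstream Airtable call (a PATCH on the record URL with a `{"fields": ...}` body) without sending it; the function name and the base, table, and record IDs in the usage are placeholders.

```python
import json
import urllib.parse
import urllib.request

# Fields listed above as written by AutoOps.
AUTOOPS_FIELDS = {"cassette_pct", "last_reboot", "reboot_count_24h",
                  "status", "daily_volume", "last_alert"}


def build_airtable_patch(base_id: str, table_name: str, record_id: str,
                         fields: dict, api_key: str) -> urllib.request.Request:
    """Build a PATCH request that touches AutoOps-managed fields only."""
    rejected = set(fields) - AUTOOPS_FIELDS
    if rejected:
        raise ValueError(f"not AutoOps-managed: {sorted(rejected)}")
    # Table names may contain spaces, so they are percent-encoded.
    url = "https://api.airtable.com/v0/{}/{}/{}".format(
        base_id, urllib.parse.quote(table_name, safe=""), record_id)
    return urllib.request.Request(
        url=url,
        data=json.dumps({"fields": fields}).encode("utf-8"),
        method="PATCH",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
```

Rejecting unknown fields with an error, instead of silently dropping them, keeps human-maintained columns such as host contact or insurance safe from accidental writes and makes the mistake visible in logs.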
3d. Brinks (Cash Logistics)
What: Automated email dispatch to Brinks branch when cassette drops below threshold
Real API: Email (SMTP or SendGrid) — Brinks does not have a REST API; all communication is email
Trigger: cassette_pct < 20% → compose pickup request email → send to branch contact
Email: Machine ID, location, address, cassette %, urgency, host contact
Reply: Incoming Brinks reply parsed by email webhook (Resend inbound or similar)
Parser extracts: ref number, pickup date, ETA window, driver name
Updates Airtable record, posts to #brinks-dispatch Slack channel
Demo: Email log in Brinks tab; ~60s after page load, a mock reply from Marcus Webb arrives (BRK-2026-03-4471, J. Calloway, Thu Mar 28, 10:00–12:00 MST)
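The trigger and email composition steps might be sketched like this with the standard library's email package. The 20% threshold and the list of email contents come from the text above; the dictionary keys, subject line, and `urgency` parameter are assumptions, and sending (SMTP or SendGrid) is left to the caller.

```python
from email.message import EmailMessage

CASSETTE_THRESHOLD_PCT = 20  # trigger level stated above


def compose_pickup_request(machine: dict, sender: str, urgency: str = "normal"):
    """Return an EmailMessage if the cassette is below threshold, else None."""
    if machine["cassette_pct"] >= CASSETTE_THRESHOLD_PCT:
        return None
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = machine["branch_email"]
    msg["Subject"] = f"Cash pickup request: {machine['machine_id']}"
    # Body carries the fields listed above, one per line.
    msg.set_content("\n".join([
        f"Machine ID: {machine['machine_id']}",
        f"Location: {machine['location']}",
        f"Address: {machine['address']}",
        f"Cassette: {machine['cassette_pct']}%",
        f"Urgency: {urgency}",
        f"Host contact: {machine['host_contact']}",
    ]))
    return msg
```

A stable `Label: value` body also makes the reply parser's job easier if Brinks quotes the original request. The reply format itself is not specified here, so no parser is sketched.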
3e. Slack (Ops Centre)
What: Real-time ops notifications and command interface
Real API: Slack Web API — https://slack.com/api/chat.postMessage
Auth: Bot OAuth token (SLACK_BOT_TOKEN env var)
Channels:
#autoops-alerts → all machine state changes, reboots, Guardian flags
#autoops-commands → operator ↔ AutoOps conversation interface
#guardian-alerts → CRITICAL events only (circuit breaker, extended offline)
#brinks-dispatch → cash logistics — requests and confirmations
#daily-summary → morning digest at 6am MST — fleet status, overnight events
Commands: Operators type in #autoops-commands → Slack sends to webhook → AutoOps parses and executes (reboot, status, cash, halt, resume, clear)
Demo: Purple Slack-styled sidebar, 5 channels, seeded with realistic operator conversation
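Posting to the channels above goes through `chat.postMessage` with a bearer token. The sketch below builds that request without sending it and routes by event kind; the channel names are the ones listed, while the event-kind strings and function names are hypothetical.

```python
import json
import urllib.request

SLACK_POST_URL = "https://slack.com/api/chat.postMessage"

# Hypothetical event kinds mapped onto the channels listed above.
CHANNEL_BY_EVENT = {
    "state_change": "#autoops-alerts",
    "reboot": "#autoops-alerts",
    "guardian_critical": "#guardian-alerts",
    "brinks_request": "#brinks-dispatch",
    "brinks_confirmation": "#brinks-dispatch",
    "daily_digest": "#daily-summary",
}


def channel_for(event_kind: str) -> str:
    # Unknown kinds fall back to the general alerts channel.
    return CHANNEL_BY_EVENT.get(event_kind, "#autoops-alerts")


def build_slack_post(token: str, channel: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a chat.postMessage request."""
    payload = {"channel": channel, "text": text}
    return urllib.request.Request(
        url=SLACK_POST_URL,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json; charset=utf-8"},
    )
```

The token would come from the SLACK_BOT_TOKEN environment variable. Slack returns HTTP 200 even for application-level failures, so a sender should check the `ok` field of the JSON response, not just the status code.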
Section 4 — Guardian System
Guardian — Autonomous Oversight Agent
Guardian is AutoOps's internal watchdog. It monitors the automation layer itself — not just the machines.
What Guardian Watches
Circuit breaker events (machine rebooted too many times)
Extended offline machines (no heartbeat > 3 hours)
Cascading failures (multiple machines in error state simultaneously)
Audit log anomalies (gap in log entries — suggests AutoOps process died)
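Three of these checks reduce to small pure functions over timestamps and statuses. The sketch below is illustrative only: the 3-hour offline limit and the error status come from this document, while `max_gap_s`, `min_machines`, and the function names are assumptions.

```python
OFFLINE_LIMIT_S = 3 * 60 * 60  # no heartbeat for more than 3 hours


def extended_offline(last_heartbeat: dict, now: float) -> list:
    """Machines whose last heartbeat is older than the offline limit."""
    return sorted(m for m, ts in last_heartbeat.items()
                  if now - ts > OFFLINE_LIMIT_S)


def cascading_failure(statuses: dict, min_machines: int = 2) -> bool:
    """True when at least `min_machines` machines are in error at once."""
    return sum(1 for s in statuses.values() if s == "error") >= min_machines


def find_audit_gaps(timestamps, max_gap_s: float) -> list:
    """Return (start, end) pairs where consecutive log entries are too far apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap_s]
```

The gap check only works if the log is expected to receive entries regularly. With a 60-second poll that writes a heartbeat entry, a `max_gap_s` of a few minutes would flag a dead process without reacting to a single slow poll.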
Severity Levels
LOW → informational, logged only
MEDIUM → logged + #guardian-alerts Slack post
HIGH → logged + #guardian-alerts + Airtable flag
CRITICAL → logged + #guardian-alerts + Slack DM to operator + HALT option
Guardian does not auto-HALT. It escalates to humans. Only a human can issue HALT.
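The severity ladder maps naturally onto a lookup table. In this sketch the action strings are hypothetical labels for the steps listed above. The table contains no action that performs a HALT, only one that offers it to the operator.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Escalation steps per level, mirroring the list above.
ESCALATION = {
    Severity.LOW: ["log"],
    Severity.MEDIUM: ["log", "slack:#guardian-alerts"],
    Severity.HIGH: ["log", "slack:#guardian-alerts", "airtable_flag"],
    Severity.CRITICAL: ["log", "slack:#guardian-alerts",
                        "slack_dm_operator", "offer_halt"],
}


def escalate(severity: Severity) -> list:
    """Return the escalation steps for a severity level (as a fresh list)."""
    return list(ESCALATION[severity])
```

Keeping the mapping as data rather than branching logic makes it easy to review against the policy and to confirm that no level can halt automation without a human.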
Production Extension
In production, Guardian would also watch API response latency and reachability (WTI, Lamassu, Airtable)
The Operator Console is available from any tab, in the bottom-right corner. It accepts plaintext commands:
reboot <MACHINE_ID>   Force reboot via WTI outlet (respects circuit breaker)
status <MACHINE_ID>   Current state, RAM, cassette %, FSM state
cash <MACHINE_ID>     Cassette % + assigned Brinks branch
clear <MACHINE_ID>    Reset circuit breaker (human confirmation required)
halt                  Stop ALL auto-reboot and automation immediately
resume                Resume automation after halt
The console is intentionally minimal. It's not a chat interface — it's a command line. Operators who know the system can act instantly without navigating tabs.
Production extension: Commands available via Slack (#autoops-commands) — operators manage the fleet from their phone without opening a browser.
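Because the same six commands arrive from both the console and Slack, a single parser can serve both entry points. The sketch below is one possible shape: the `Command` type and error messages are hypothetical, and treating command names as case-insensitive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Commands that take a machine ID versus fleet-wide commands.
MACHINE_COMMANDS = {"reboot", "status", "cash", "clear"}
GLOBAL_COMMANDS = {"halt", "resume"}


@dataclass(frozen=True)
class Command:
    name: str
    machine_id: Optional[str] = None


def parse_command(line: str) -> Command:
    """Parse one plaintext console line into a Command, or raise ValueError."""
    parts = line.strip().split()
    if not parts:
        raise ValueError("empty command")
    name, args = parts[0].lower(), parts[1:]
    if name in MACHINE_COMMANDS:
        if len(args) != 1:
            raise ValueError(f"usage: {name} <MACHINE_ID>")
        return Command(name, args[0])
    if name in GLOBAL_COMMANDS:
        if args:
            raise ValueError(f"{name} takes no arguments")
        return Command(name)
    raise ValueError(f"unknown command: {name}")
```

Parsing is kept separate from execution so that every accepted command can be written to the audit log before anything runs, and so that rejected input never reaches the automation layer.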
Runtime: Node.js 22 or Python 3.11 — VPS (existing) or dedicated instance
Database: PostgreSQL (events, machine state, Brinks log)
Cache: Redis (real-time machine state, outlet status)
Queue: Bull/BullMQ or Celery (async actions — reboot, email, Airtable write)
Secrets: Environment variables (.env, never in code)
Monitoring: Self-monitoring via Guardian + daily digest to #autoops-alerts
Deployment: Systemd service, nginx reverse proxy — same pattern as demo
Estimated Build Time (one developer)
Lamassu integration: 3–5 days
WTI integration: 2–3 days
Airtable integration: 2–3 days
Brinks email parser: 3–4 days
Slack bot: 2–3 days
Guardian system: 2–3 days
Testing + hardening: 5–7 days
─────────────────────────────
Total: 19–28 days (~3–4 weeks)