Building a 52GB unified memory AI cluster with Apple Silicon — zero-cost local inference.
Cloud AI APIs are expensive. Claude Opus costs $15/M input tokens. Even "cheap" models add up fast when you're running agents 24/7. We needed a way to run capable models locally without sacrificing quality.
Exo is a distributed inference engine that shards models across multiple machines. We connected two Apple Silicon Macs via Thunderbolt 5 and created a 52GB unified memory pool.
| Machine | Chip | RAM | Role |
|---|---|---|---|
| Mac Studio | M4 Max | 36GB | Cluster master |
| MacBook Pro (Bolt) | M5 | 16GB | Worker node |
| **Total** | | **52GB** | |
(Enabling RDMA over the Thunderbolt link requires running `rdma_ctl enable` in Recovery Mode.)

We tested the same prompt across three model configurations:
| Model | Latency | Cost/Query | Notes |
|---|---|---|---|
| Qwen3-30B (Local) | 3.0s | $0 | 255 reasoning tokens |
| GLM-5 (Z.AI Max) | 8.2s | ~$0.001 | Cloud API |
| GLM-4.7-Flash (Free) | Slow/timeout | $0 | Free tier throttled |
Result: Local inference is 2.7x faster than the cloud API (3.0s vs 8.2s) and costs nothing per query. The hardware pays for itself after ~50K queries.
Exo auto-discovers and downloads models from HuggingFace on demand. Our cluster has access to 54+ models.
Exo shards models across nodes by splitting layers: each machine holds a contiguous slice of the model's layers in RAM, and during inference activations flow from one node's slice to the next over the Thunderbolt link.
RDMA enables direct memory access between machines — no CPU overhead for data transfer.
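Conceptually, the layer split is a RAM-proportional partition of the model across nodes. The sketch below is illustrative only — exo's actual partitioning strategy may weight devices differently, and the 48-layer count and node names are made-up examples, not our exact configuration:

```python
def shard_layers(num_layers: int, node_ram: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Assign each node a contiguous [start, end) range of layers,
    sized proportionally to its available RAM."""
    total_ram = sum(node_ram.values())
    shards, start = {}, 0
    nodes = list(node_ram.items())
    for i, (name, ram) in enumerate(nodes):
        if i < len(nodes) - 1:
            count = round(num_layers * ram / total_ram)
        else:
            count = num_layers - start  # last node absorbs rounding remainder
        shards[name] = (start, start + count)
        start += count
    return shards

# A 48-layer model across a 36GB node and a 16GB node:
print(shard_layers(48, {"studio": 36, "bolt": 16}))
```

With a 36GB/16GB split, the larger node ends up holding roughly two thirds of the layers, which matches the intuition that the 52GB pool is dominated by the Mac Studio.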
To make the cluster production-ready, we rebuilt our agent bridges with three critical features:
Before processing any message, the bridge saves it to a local SQLite database with status='pending'. After successful processing, it updates to status='done'. On crash/restart, the bridge replays all pending messages.
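The pending/done pattern can be sketched with Python's built-in `sqlite3` — a minimal illustration of the idea, not the actual bridge schema (table and column names are hypothetical):

```python
import sqlite3

class DurableQueue:
    """Persist every message before processing; replay pending ones on restart."""

    def __init__(self, path: str = "bridge.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY, payload TEXT, status TEXT DEFAULT 'pending')"
        )
        self.db.commit()

    def enqueue(self, payload: str) -> int:
        # Written as 'pending' BEFORE any processing happens.
        cur = self.db.execute("INSERT INTO messages (payload) VALUES (?)", (payload,))
        self.db.commit()
        return cur.lastrowid

    def mark_done(self, msg_id: int) -> None:
        # Only flipped to 'done' after processing succeeds.
        self.db.execute("UPDATE messages SET status='done' WHERE id=?", (msg_id,))
        self.db.commit()

    def pending(self) -> list[tuple[int, str]]:
        # On crash/restart, everything still 'pending' gets replayed.
        return self.db.execute(
            "SELECT id, payload FROM messages WHERE status='pending' ORDER BY id"
        ).fetchall()
```

The key ordering guarantee: the insert commits before processing starts, so a crash between the two leaves the message in `pending` rather than losing it.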
Each bridge publishes an `online` status on connect and registers an MQTT Last Will and Testament (LWT) of `offline`, which the broker publishes on its behalf if the bridge dies unexpectedly. Other agents can subscribe to status topics and alert on failures.
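A sketch of the LWT pattern using paho-mqtt (the topic convention and JSON payload format here are illustrative assumptions, not our actual bridge protocol; the paho 1.x `Client` constructor is assumed):

```python
import json

STATUS_TOPIC = "bridges/{name}/status"  # hypothetical topic convention

def status_payload(online: bool) -> str:
    return json.dumps({"status": "online" if online else "offline"})

def connect_with_lwt(name: str, host: str = "localhost"):
    """Connect so that the broker itself announces 'offline' on unclean disconnect.
    paho-mqtt import is deferred so the pure helpers above have no dependencies."""
    import paho.mqtt.client as mqtt
    client = mqtt.Client(client_id=name)
    topic = STATUS_TOPIC.format(name=name)
    # The will must be set BEFORE connect(); the broker publishes this retained
    # payload only if the client vanishes without a clean DISCONNECT.
    client.will_set(topic, status_payload(False), qos=1, retain=True)
    client.connect(host)
    # Announce liveness explicitly once connected.
    client.publish(topic, status_payload(True), qos=1, retain=True)
    return client
```

Retained status messages mean a late subscriber immediately sees each bridge's last known state rather than waiting for the next publish.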
Headless Macs don't have GUI login sessions, so LaunchAgents won't start. We migrated all services to LaunchDaemons which run system-wide. All services now auto-start on boot and restart on crash via KeepAlive.
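A minimal LaunchDaemon plist for a bridge might look like the following — a sketch only, with a hypothetical label and install path (the real daemon's paths differ):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.bridge-v3</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/bin/python3</string>
    <string>/Users/admin/ftws-ops/bridges/bridge-v3.py</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```

Placed in `/Library/LaunchDaemons/` and loaded with `sudo launchctl bootstrap system <plist>`, the service starts at boot with no login session, and `KeepAlive` restarts it whenever it exits.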
| Test | Result |
|---|---|
| 50 concurrent MQTT messages | ✅ 0 drops, 0 duplicates |
| 30 concurrent database writes | ✅ 30/30 persisted |
| Kill bridge mid-task (v2) | ❌ Message lost |
| Kill bridge mid-task (v3) | ✅ Replayed on restart |
| Full machine reboot | ✅ All services auto-start, pending replayed |
We also built a pipeline engine for sequential task workflows. Each step passes its output to the next step automatically. Example:
Step 1: Research → research.json
Step 2: Create content → draft.md (receives research.json)
Step 3: Review → final.pdf (receives draft.md)
Step 4: Deliver → sent to recipient
The pipeline tracks status, handles failures, and logs everything to an audit trail with cost tracking.
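The steps above can be sketched as a minimal sequential runner — an illustrative sketch, not the actual pipeline engine; the step-function signature and audit format are hypothetical:

```python
import json
import pathlib
import time

def run_pipeline(steps, workdir: str = "pipeline_run"):
    """Run (name, fn) steps in order, feeding each step the previous output.
    Records per-step status and timing, then writes an audit trail to disk."""
    out = pathlib.Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    audit, payload = [], None
    for name, fn in steps:
        t0 = time.time()
        try:
            payload = fn(payload)  # previous step's output flows in here
            status = "done"
        except Exception as exc:
            status, payload = f"failed: {exc}", None
        audit.append({"step": name, "status": status,
                      "secs": round(time.time() - t0, 3)})
        if status != "done":
            break  # stop the chain on first failure
    (out / "audit.json").write_text(json.dumps(audit, indent=2))
    return payload, audit

# Usage: a two-step chain where the draft step receives the research output.
result, audit = run_pipeline([
    ("research", lambda _: {"topic": "exo clusters"}),
    ("draft", lambda research: f"Draft about {research['topic']}"),
])
```

Stopping on the first failure keeps downstream steps from running against missing input, while the audit file preserves what happened for later inspection.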
Comparing annual costs for a high-volume agent workload (~500K queries/year):
| Option | Annual Cost |
|---|---|
| Claude Opus API | ~$7,500 |
| GLM-5 (Z.AI Max) | $960 subscription + ~$500 in tokens |
| Local Exo Cluster | $0 (hardware sunk) |
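As a quick arithmetic check, the figures above work out to the following per-query costs at the stated ~500K queries/year (local electricity and hardware depreciation are excluded, as in the table):

```python
QUERIES_PER_YEAR = 500_000

annual_cost = {
    "Claude Opus API": 7_500,
    "GLM-5 (Z.AI Max)": 960 + 500,   # subscription + token spend
    "Local Exo Cluster": 0,          # hardware treated as sunk
}

per_query = {name: cost / QUERIES_PER_YEAR for name, cost in annual_cost.items()}
for name, cost in per_query.items():
    print(f"{name}: ${cost:.4f}/query")
```

Opus lands around $0.015 per query against roughly $0.003 for GLM-5, which is the gap the local cluster eliminates entirely.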
The cluster paid for itself in the first week. Hardware is an investment; API calls are rent.
Source files:
- `~/ftws-ops/bridges/bridge-v3.py`
- `~/.openclaw/workspace/ftws-mcp-server/pipeline.js`