Distributed AI Inference Cluster

Building a 52GB unified-memory AI cluster with Apple Silicon — local inference at zero marginal cost.

March 17, 2026 · Infrastructure · Apple Silicon · Exo

The Problem

Cloud AI APIs are expensive. Claude Opus costs $15/M input tokens. Even "cheap" models add up fast when you're running agents 24/7. We needed a way to run capable models locally without sacrificing quality.

The Solution: Exo Cluster

Exo is a distributed inference engine that shards models across multiple machines. We connected two Apple Silicon Macs via Thunderbolt 5 and created a 52GB unified memory pool.

Hardware

Machine              Chip     RAM    Role
Mac Studio           M4 Max   36GB   Cluster master
MacBook Pro (Bolt)   M5       16GB   Worker node
Total                         52GB

Network

Speed Comparison

We tested the same prompt across three model configurations:

Model                  Latency        Cost/Query   Notes
Qwen3-30B (Local)      3.0s           $0           255 reasoning tokens
GLM-5 (Z.AI Max)       8.2s           ~$0.001      Cloud API
GLM-4.7-Flash (Free)   Slow/timeout   $0           Free tier throttled

Result: Local inference is 2.7x faster than cloud and costs nothing per query. The hardware pays for itself after ~50K queries.

Models Available

Exo auto-discovers and downloads models from HuggingFace; our cluster has access to 54+ models.

Architecture

Exo shards models across nodes by splitting layers. Each machine holds a portion of the model in RAM. During inference:

  1. Input tokens are processed on the master node
  2. Activations flow through Thunderbolt to the worker node
  3. Worker processes its layers and returns results
  4. Final output assembled on master

RDMA enables direct memory access between machines — no CPU overhead for data transfer.
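The four steps above can be sketched as a toy pipeline-parallel pass. This is an illustration of the layer-sharding idea, not Exo's actual code: each "node" holds a contiguous slice of the model's layers, and an activation value stands in for the tensors that would cross the Thunderbolt link.

```python
# Toy sketch of two-node layer sharding. In the real cluster each layer
# is a transformer block and the hop between nodes is Thunderbolt/RDMA;
# here a layer is just a function on an activation value.

def make_nodes(layers, split):
    """Split the model: master holds layers[:split], worker the rest."""
    return layers[:split], layers[split:]

def infer(tokens, master, worker):
    acts = tokens
    for layer in master:          # 1. input processed on the master node
        acts = layer(acts)
    # 2. activations cross the interconnect to the worker here
    for layer in worker:          # 3. worker runs its layers
        acts = layer(acts)
    return acts                   # 4. final output assembled on master
```

With four identity-plus-one "layers" split two and two, an input of 0 comes back as 4 after traversing both nodes.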

Bridge v3: Resilience Architecture

To make the cluster production-ready, we rebuilt our agent bridges with three critical features:

1. Inbox Persistence

Before processing any message, the bridge saves it to a local SQLite database with status='pending'. After successful processing, it updates to status='done'. On crash/restart, the bridge replays all pending messages.
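A minimal sketch of that persistence pattern, assuming SQLite via Python's standard library; the table layout and function names here are illustrative, not the bridge's actual schema:

```python
import sqlite3

def init_inbox(path="inbox.db"):
    """Open (or create) the inbox database."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS inbox (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending')""")
    db.commit()
    return db

def receive(db, payload):
    # Persist BEFORE processing, so a crash cannot lose the message.
    cur = db.execute("INSERT INTO inbox (payload) VALUES (?)", (payload,))
    db.commit()
    return cur.lastrowid

def mark_done(db, msg_id):
    # Only flipped after processing succeeds.
    db.execute("UPDATE inbox SET status = 'done' WHERE id = ?", (msg_id,))
    db.commit()

def replay_pending(db):
    # On restart: everything that never reached 'done' gets reprocessed.
    return [row[0] for row in
            db.execute("SELECT payload FROM inbox WHERE status = 'pending'")]
```

The key ordering is insert-then-process-then-update: the window where a message exists only in memory is eliminated.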

2. MQTT Last Will & Testament (LWT)

Each bridge publishes an online status on connect and registers a Last Will message of offline, which the broker publishes automatically if the bridge disconnects uncleanly. Other agents can subscribe to the status topics and alert on failures.

3. LaunchDaemons

Headless Macs don't have GUI login sessions, so LaunchAgents never start. We migrated all services to LaunchDaemons, which run system-wide without a login session. All services now auto-start on boot and restart on crash via KeepAlive.
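A minimal LaunchDaemon plist illustrating the pattern; the label and binary path are placeholders, not our actual service names:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.bridge</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/bridge</string>
    </array>
    <!-- Start at boot, no login session required -->
    <key>RunAtLoad</key>
    <true/>
    <!-- launchd relaunches the process if it exits -->
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
```

Installed under /Library/LaunchDaemons and loaded with launchctl, this runs as root at boot, which is exactly what a headless node needs.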

Stress Test Results

Test                              Result
50 concurrent MQTT messages       ✅ 0 drops, 0 duplicates
30 concurrent database writes     ✅ 30/30 persisted
Kill bridge mid-task (v2)         ❌ Message lost
Kill bridge mid-task (v3)         ✅ Replayed on restart
Full machine reboot               ✅ All services auto-start, pending replayed

Pipeline Engine

We also built a pipeline engine for sequential task workflows. Each step passes its output to the next step automatically. Example:

Step 1: Research → research.json
Step 2: Create content → draft.md (receives research.json)
Step 3: Review → final.pdf (receives draft.md)
Step 4: Deliver → sent to recipient

The pipeline tracks status, handles failures, and logs everything to an audit trail with cost tracking.
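The core of such an engine fits in a few lines. This is a hedged sketch of the pattern, not our engine's actual API; step names and the audit format are illustrative:

```python
# Minimal sequential pipeline: each step receives the previous step's
# output, failures abort the run, and every step is recorded in an
# audit trail.

def run_pipeline(steps, initial_input=None):
    """steps is a list of (name, fn) pairs; returns (result, audit)."""
    audit = []
    data = initial_input
    for name, fn in steps:
        try:
            data = fn(data)
            audit.append({"step": name, "status": "done"})
        except Exception as exc:
            # Record the failure and stop; later steps never run.
            audit.append({"step": name, "status": "failed",
                          "error": str(exc)})
            break
    return data, audit
```

A run like research → draft becomes `run_pipeline([("research", do_research), ("draft", write_draft)])`, with the audit list doubling as the status log.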

Cost Analysis

Comparing annual costs for a high-volume agent workload (~500K queries/year):

Option              Annual Cost
Claude Opus API     ~$7,500
GLM-5 (Z.AI Max)    $960 + ~$500 tokens
Local Exo Cluster   $0 (hardware sunk)
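The break-even arithmetic behind the table is simple enough to sketch. The article doesn't state what the two Macs cost, so the hardware figure below is a parameter you supply, not a claim:

```python
# Back-of-envelope break-even helper for the cost table above.

def api_cost_per_query(annual_cost, queries_per_year):
    """Average API cost per query, e.g. 7500 / 500_000 ~= $0.015."""
    return annual_cost / queries_per_year

def break_even_queries(hardware_cost, cost_per_query):
    """Queries at which avoided API spend covers the hardware outlay."""
    return hardware_cost / cost_per_query
```

At the table's Claude Opus rate (~$0.015/query), every local query avoids that much API spend, so break-even scales linearly with whatever the hardware actually cost.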

The cluster paid for itself in the first week. Hardware is an investment; API calls are rent.

What's Next

Files & References
