Distributed AI Inference Cluster

Building a 52GB unified-memory AI cluster with Apple Silicon — local inference at zero marginal cost.

March 17, 2026 · Infrastructure · Apple Silicon · Exo

The Problem

Cloud AI APIs are expensive. Claude Opus costs $15/M input tokens. Even "cheap" models add up fast when you're running agents 24/7. We needed a way to run capable models locally without sacrificing quality.

The Solution: Exo Cluster

Exo is a distributed inference engine that shards models across multiple machines. We connected two Apple Silicon Macs via Thunderbolt 5 and created a 52GB unified memory pool.

Hardware

Machine              Chip     RAM    Role
Mac Studio           M4 Max   36GB   Cluster master
MacBook Pro (Bolt)   M5       16GB   Worker node
Total                         52GB

Network

Speed Comparison

We tested the same prompt across three model configurations:

Model                  Latency        Cost/Query   Notes
Qwen3-30B (Local)      3.0s           $0           255 reasoning tokens
GLM-5 (Z.AI Max)       8.2s           ~$0.001      Cloud API
GLM-4.7-Flash (Free)   Slow/timeout   $0           Free tier throttled

Result: Local inference is 2.7x faster than cloud and costs nothing per query. The hardware pays for itself after ~50K queries.

Models Available

Exo auto-discovers and downloads models from HuggingFace; our cluster has access to 54+ models.

Architecture

Exo shards models across nodes by splitting layers. Each machine holds a portion of the model in RAM. During inference:

  1. Input tokens are processed on the master node
  2. Activations flow through Thunderbolt to the worker node
  3. Worker processes its layers and returns results
  4. Final output assembled on master

RDMA enables direct memory access between machines — no CPU overhead for data transfer.
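The four steps above can be sketched as a toy pipeline-parallel pass. This is an illustration of the layer-sharding idea, not Exo's actual code: each "node" holds a contiguous slice of the model's layers, and an activation value stands in for the tensors that would cross the Thunderbolt link.

```python
# Toy sketch of two-node layer sharding. In the real cluster each layer
# is a transformer block and the hop between nodes is Thunderbolt/RDMA;
# here a layer is just a function on an activation value.

def make_nodes(layers, split):
    """Split the model: master holds layers[:split], worker the rest."""
    return layers[:split], layers[split:]

def infer(tokens, master, worker):
    acts = tokens
    for layer in master:          # 1. input processed on the master node
        acts = layer(acts)
    # 2. activations cross the interconnect to the worker here
    for layer in worker:          # 3. worker runs its layers
        acts = layer(acts)
    return acts                   # 4. final output assembled on master
```

With four identity-plus-one "layers" split two and two, an input of 0 comes back as 4 after traversing both nodes.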

Bridge v3: Resilience Architecture

To make the cluster production-ready, we rebuilt our agent bridges with three critical features:

1. Inbox Persistence

Before processing any message, the bridge saves it to a local SQLite database with status='pending'. After successful processing, it updates to status='done'. On crash/restart, the bridge replays all pending messages.
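A minimal sketch of that persistence pattern, assuming SQLite via Python's standard library; the table layout and function names here are illustrative, not the bridge's actual schema:

```python
import sqlite3

def init_inbox(path="inbox.db"):
    """Open (or create) the inbox database."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS inbox (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending')""")
    db.commit()
    return db

def receive(db, payload):
    # Persist BEFORE processing, so a crash cannot lose the message.
    cur = db.execute("INSERT INTO inbox (payload) VALUES (?)", (payload,))
    db.commit()
    return cur.lastrowid

def mark_done(db, msg_id):
    # Only flipped after processing succeeds.
    db.execute("UPDATE inbox SET status = 'done' WHERE id = ?", (msg_id,))
    db.commit()

def replay_pending(db):
    # On restart: everything that never reached 'done' gets reprocessed.
    return [row[0] for row in
            db.execute("SELECT payload FROM inbox WHERE status = 'pending'")]
```

The key ordering is insert-then-process-then-update: the window where a message exists only in memory is eliminated.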

2. MQTT Last Will & Testament (LWT)

Each bridge publishes an online status on connect and registers a Last Will message of offline, which the broker publishes automatically if the bridge disconnects uncleanly. Other agents can subscribe to the status topics and alert on failures.

3. LaunchDaemons

Headless Macs don't have GUI login sessions, so LaunchAgents never start. We migrated all services to LaunchDaemons, which run system-wide without a login session. All services now auto-start on boot and restart on crash via KeepAlive.
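A minimal LaunchDaemon plist illustrating the pattern; the label and binary path are placeholders, not our actual service names:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.bridge</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/bridge</string>
    </array>
    <!-- Start at boot, no login session required -->
    <key>RunAtLoad</key>
    <true/>
    <!-- launchd relaunches the process if it exits -->
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
```

Installed under /Library/LaunchDaemons and loaded with launchctl, this runs as root at boot, which is exactly what a headless node needs.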

Stress Test Results

Test                              Result
50 concurrent MQTT messages       ✅ 0 drops, 0 duplicates
30 concurrent database writes     ✅ 30/30 persisted
Kill bridge mid-task (v2)         ❌ Message lost
Kill bridge mid-task (v3)         ✅ Replayed on restart
Full machine reboot               ✅ All services auto-start, pending replayed

Pipeline Engine

We also built a pipeline engine for sequential task workflows. Each step passes its output to the next step automatically. Example:

Step 1: Research → research.json
Step 2: Create content → draft.md (receives research.json)
Step 3: Review → final.pdf (receives draft.md)
Step 4: Deliver → sent to recipient

The pipeline tracks status, handles failures, and logs everything to an audit trail with cost tracking.
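The core of such an engine fits in a few lines. This is a hedged sketch of the pattern, not our engine's actual API; step names and the audit format are illustrative:

```python
# Minimal sequential pipeline: each step receives the previous step's
# output, failures abort the run, and every step is recorded in an
# audit trail.

def run_pipeline(steps, initial_input=None):
    """steps is a list of (name, fn) pairs; returns (result, audit)."""
    audit = []
    data = initial_input
    for name, fn in steps:
        try:
            data = fn(data)
            audit.append({"step": name, "status": "done"})
        except Exception as exc:
            # Record the failure and stop; later steps never run.
            audit.append({"step": name, "status": "failed",
                          "error": str(exc)})
            break
    return data, audit
```

A run like research → draft becomes `run_pipeline([("research", do_research), ("draft", write_draft)])`, with the audit list doubling as the status log.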

Cost Analysis

Comparing annual costs for a high-volume agent workload (~500K queries/year):

Option              Annual Cost
Claude Opus API     ~$7,500
GLM-5 (Z.AI Max)    $960 + ~$500 tokens
Local Exo Cluster   $0 (hardware sunk)
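The break-even arithmetic behind the table is simple enough to sketch. The article doesn't state what the two Macs cost, so the hardware figure below is a parameter you supply, not a claim:

```python
# Back-of-envelope break-even helper for the cost table above.

def api_cost_per_query(annual_cost, queries_per_year):
    """Average API cost per query, e.g. 7500 / 500_000 ~= $0.015."""
    return annual_cost / queries_per_year

def break_even_queries(hardware_cost, cost_per_query):
    """Queries at which avoided API spend covers the hardware outlay."""
    return hardware_cost / cost_per_query
```

At the table's Claude Opus rate (~$0.015/query), every local query avoids that much API spend, so break-even scales linearly with whatever the hardware actually cost.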

The cluster paid for itself in the first week. Hardware is an investment; API calls are rent.

What's Next

Files & References
