
Deploy Llama 3.2 on a Mac Mini M4 Pro in Under 5 Minutes

April 12, 2026 · 8 min read · by Macyou Team

Deploying Llama on a Mac mini takes about 5 minutes end to end: pick a chip in the constructor, choose a Llama stack from the catalog, and you get a running model with an OpenAI-compatible endpoint. This guide walks through the exact steps and shows the speeds we actually measure on this hardware — no estimates.

Step 1: Pick the right Mac for your Llama

Llama 3.2 3B — any config; we measured 46.7 tok/s on a base M4 16 GB.
Llama 3.1 8B — M4 16 GB is enough: 21.2 tok/s measured (~45 on an M4 Pro).
Llama 3.3 70B — M4 Pro 64 GB at Q4; see the 70B hardware guide.

All measured numbers, with methodology and raw JSON, live on our benchmarks page.
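The pairings in the list above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not our provisioning logic. It assumes roughly 4.8 bits per weight for a 4-bit quantization, and that about 75% of unified memory is usable for weights. Both figures are illustrative assumptions, and the KV cache is ignored.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size of the quantized weights in GB.

    bits_per_weight=4.8 is an assumed effective rate for a 4-bit
    quantization, not a measured value.
    """
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, memory_gb: float, usable_fraction: float = 0.75) -> bool:
    """True if the weights fit in the assumed usable share of unified memory.

    usable_fraction=0.75 is an assumption; KV cache and context length
    are not accounted for here.
    """
    return weights_gb(params_billion) <= memory_gb * usable_fraction


for name, params, memory in [
    ("Llama 3.2 3B", 3, 16),
    ("Llama 3.1 8B", 8, 16),
    ("Llama 3.3 70B", 70, 16),
    ("Llama 3.3 70B", 70, 64),
]:
    verdict = "fits" if fits(params, memory) else "does not fit"
    print(f"{name} on {memory} GB: ~{weights_gb(params):.1f} GB of weights, {verdict}")
```

Under these assumptions the 70B weights come to about 42 GB, which explains why that model needs the 64 GB configuration while the 3B and 8B models fit comfortably in 16 GB.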

Step 2: Build and deploy

In the Build a Mac constructor, pick your chip and memory, then select a Llama stack from the catalog. Provisioning installs Ollama, pre-loads the weights, and wires up the API endpoint — typically under 5 minutes for 8B-class models.

Step 3: Call your endpoint

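from openai import OpenAI

# Only these two values differ from a stock OpenAI setup.
client = OpenAI(
    api_key="mcy_live_...",
    base_url="https://dep-xxxx.macyou.cloud/v1",
)
resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

# stream=True returns an iterator of chunks; print the text as it arrives.
for chunk in resp:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)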

The endpoint speaks the OpenAI format (chat completions, streaming, embeddings), so existing SDK code works after changing two lines: api_key and base_url. You also get SSH for the classic ollama run workflow and a browser desktop if you prefer a GUI.
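If you are not using the SDK, a streamed response in the OpenAI format arrives as server-sent event lines: each one starts with "data:" and carries a JSON chunk, and the stream ends with "data: [DONE]". Here is a minimal sketch of reassembling the text from those lines. The helper name iter_stream_text is hypothetical, and the sample lines are hand-written in the OpenAI chunk shape, not captured from a real deployment.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample in the OpenAI chunk shape (illustrative only).
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

The same generator can be fed the line iterator of any HTTP client, which keeps the parsing logic independent of the transport.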

Why this is fast: bandwidth, not magic

LLM generation is memory-bandwidth-bound: every generated token requires reading the full set of model weights. Apple Silicon's unified memory feeds the GPU (via Metal) at 120 GB/s on the M4 and 273 GB/s on the M4 Pro, a ratio of about 2.3x, which is consistent with our measured 8B speed roughly doubling between those chips (21.2 to ~45 tok/s). No CUDA drivers, no separate VRAM pool to fit into, no host-to-device copies.
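The bandwidth argument can be turned into a rough ceiling: if each token streams all the weights once, tokens per second cannot exceed bandwidth divided by model size. The sketch below assumes the 4-bit 8B weights are about 4.8 GB, which is an illustrative figure rather than a measurement. The bandwidth and measured speeds are the ones quoted in this post.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB_8B_Q4 = 4.8  # assumption: approximate size of a 4-bit 8B model

for chip, bandwidth, measured in [("M4", 120.0, 21.2), ("M4 Pro", 273.0, 45.0)]:
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_GB_8B_Q4)
    print(
        f"{chip}: ceiling {ceiling:.1f} tok/s, measured {measured} tok/s "
        f"({measured / ceiling:.0%} of ceiling)"
    )
```

With that assumed size, the ceilings come out near 25 tok/s on the M4 and 57 tok/s on the M4 Pro, so the measured speeds sit at roughly 80 to 85% of the theoretical limit on both chips.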

Common questions

Which Llama should I start with?

The 8B. It is the best quality-per-dollar in the family, fits the cheapest config, and if it proves too weak, moving to 70B is a redeploy — your API code does not change.

Do I pay per token?

No — fixed price per machine (M4 from $99/mo), unlimited inference. That is the point of dedicated hardware over per-token APIs.
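To compare a fixed monthly price with per-token pricing, it helps to convert it into an effective cost per million tokens. This rough sketch uses the $99/mo price and the 21.2 tok/s 8B measurement from this post. It assumes a single sequential stream, a 30-day month, and a utilization figure you supply, all of which are simplifications.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def cost_per_million_tokens(
    price_per_month: float, tok_per_s: float, utilization: float = 1.0
) -> float:
    """Effective USD per million generated tokens on a fixed-price machine.

    utilization is the fraction of the month the machine spends generating.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens = tok_per_s * SECONDS_PER_MONTH * utilization
    return price_per_month / (tokens / 1_000_000)


for utilization in (1.0, 0.5, 0.1):
    cost = cost_per_million_tokens(99.0, 21.2, utilization)
    print(f"{utilization:.0%} utilization: ${cost:.2f} per million tokens")
```

At full utilization this works out to about $1.80 per million tokens, and the figure scales inversely with how busy you keep the machine.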
