Deploy Llama 3.2 on a Mac Mini M4 Pro in Under 2 Minutes
Running LLMs locally on Apple Silicon is incredibly efficient thanks to the unified memory architecture. In this guide, we'll deploy Llama 3.2 on a Macyou M4 Pro server using Ollama — from zero to running in under 2 minutes.
Prerequisites
- A Macyou account (sign up at macyou.co)
- macyou CLI installed (or use the web dashboard)
Step 1: Deploy a Server
$ macyou deploy --chip m4-pro --ram 48 --stack ollama
Creating server... done
Server ready in 42s
IP: 185.234.xx.xx
SSH: ssh [email protected]
Ollama: http://185.234.xx.xx:11434The --stack ollama flag pre-installs Ollama and configures it to listen on all interfaces.
Step 2: Pull and Run the Model
$ ssh [email protected]
$ ollama run llama3.2
pulling manifest... done
>>> Ready. 47 tok/s on M4 Pro Neural EngineThat's it. Llama 3.2 (8B) runs at ~47 tokens/second on the M4 Pro's Neural Engine with 48 GB unified memory — no GPU rental, no CUDA drivers, no configuration hell.
Step 3: Use the API
curl http://185.234.xx.xx:11434/api/generate \
-d '{"model": "llama3.2", "prompt": "Explain quantum computing"}'Ollama exposes a REST API compatible with the OpenAI format. Point your app at it and you have a private, fast LLM endpoint.
Performance Notes
The M4 Pro's 273 GB/s memory bandwidth is key — LLM inference is memory-bandwidth bound, and unified memory means zero data copying between CPU and Neural Engine. Expect 40–50 tok/s for 7B–8B models and 15–20 tok/s for 70B models (with quantization).