Back to blog
TutorialLLMOllamaLlamaTutorial

Deploy Llama 3.2 on a Mac Mini M4 Pro in Under 2 Minutes

April 12, 20268 min readby Macyou Team

Running LLMs locally on Apple Silicon is incredibly efficient thanks to the unified memory architecture. In this guide, we'll deploy Llama 3.2 on a Macyou M4 Pro server using Ollama — from zero to running in under 2 minutes.

Prerequisites

  • A Macyou account (sign up at macyou.co)
  • macyou CLI installed (or use the web dashboard)

Step 1: Deploy a Server

$ macyou deploy --chip m4-pro --ram 48 --stack ollama
Creating server... done
Server ready in 42s

  IP:    185.234.xx.xx
  SSH:   ssh [email protected]
  Ollama: http://185.234.xx.xx:11434

The --stack ollama flag pre-installs Ollama and configures it to listen on all interfaces.

Step 2: Pull and Run the Model

$ ssh [email protected]
$ ollama run llama3.2
pulling manifest... done
>>> Ready. 47 tok/s on M4 Pro Neural Engine

That's it. Llama 3.2 (8B) runs at ~47 tokens/second on the M4 Pro's Neural Engine with 48 GB unified memory — no GPU rental, no CUDA drivers, no configuration hell.

Step 3: Use the API

curl http://185.234.xx.xx:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Explain quantum computing"}'

Ollama exposes a REST API compatible with the OpenAI format. Point your app at it and you have a private, fast LLM endpoint.

Performance Notes

The M4 Pro's 273 GB/s memory bandwidth is key — LLM inference is memory-bandwidth bound, and unified memory means zero data copying between CPU and Neural Engine. Expect 40–50 tok/s for 7B–8B models and 15–20 tok/s for 70B models (with quantization).