Local LLM Deployments

Llama 3.1 405B (Q4)

Meta's Llama 3.1 405B is one of the largest openly available language models. With 405 billion parameters, served here with Q4 quantization, it approaches the quality of top proprietary models. It requires the full 128 GB Max tier and is best suited to tasks where model quality is the top priority.

Max+ required
From $1999/mo
15 min provisioning
OpenAI-compatible API
Made by Meta
License: Llama 3.1 Community License

Technical Specifications

Model size: 405B parameters
Memory required: 128 GB
Speed (M4 Pro): ~3 tok/s
Quantization: Q4_K_M
Context window: 8K tokens
Disk space: 230 GB
Runtime: Ollama + MLX
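
To put the speed and context figures in practical terms, the sketch below estimates how long a reply takes at the listed ~3 tok/s and checks whether a prompt plus the requested output fits in the 8K context window. The function names are illustrative, the 8K window is assumed to mean 8,192 tokens, and real throughput will vary with prompt length and load.

```python
# Figures taken from the specifications above.
# Assumption: "8K tokens" means 8,192 tokens.
TOKENS_PER_SECOND = 3.0
CONTEXT_WINDOW_TOKENS = 8192


def estimate_generation_seconds(
    output_tokens: int, tokens_per_second: float = TOKENS_PER_SECOND
) -> float:
    """Approximate wall-clock time to generate `output_tokens` tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def fits_context(
    prompt_tokens: int,
    max_output_tokens: int,
    window: int = CONTEXT_WINDOW_TOKENS,
) -> bool:
    """Return True if prompt plus requested output stays within the window."""
    return prompt_tokens + max_output_tokens <= window


if __name__ == "__main__":
    print(f"{estimate_generation_seconds(300):.0f} s for a 300-token reply")
    print(fits_context(prompt_tokens=7000, max_output_tokens=1500))
```

At this rate a 300-token reply takes on the order of 100 seconds, so streaming the response and capping the output length are worth considering for interactive use.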

Use Cases

  • Frontier-quality text generation
  • Complex multi-step reasoning
  • Advanced code synthesis
  • Academic and scientific research
  • Benchmark and evaluation

What you get

  • Ollama runtime with Llama 3.1 405B (Q4 quantized)
  • MLX backend for optimized inference
  • OpenAI-compatible API endpoint
  • Prometheus metrics

Start using it

curl
curl https://dep-<id>.macyou.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mcy_live_<your-key>" \
  -d '{
    "model": "llama-405b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
Python (OpenAI SDK)
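from openai import OpenAI

# Point the OpenAI client at the deployment's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="mcy_live_<your-key>",
    base_url="https://dep-<id>.macyou.cloud/v1",
)

# Request a streamed completion so tokens print as they are generated.
response = client.chat.completions.create(
    model="llama-405b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in response:
    # Some chunks may carry no choices or an empty delta; skip those.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()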
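
If you need the complete reply as a single string rather than printing tokens as they arrive, a small helper can assemble the streamed chunks. This is a minimal sketch: `collect_stream` is a hypothetical name, and it assumes each chunk has the shape used in the example above (`choices[0].delta.content`).

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any]) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices (assumed possible on some servers).
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        # The delta content can be None or empty, for example on the final chunk.
        if piece:
            parts.append(piece)
    return "".join(parts)
```

With the client from the Python example, `text = collect_stream(response)` would consume the stream and return the full reply.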

Tags

LLM, Frontier, Max

Ready to deploy Llama 3.1 405B (Q4)?

Up and running in 15 minutes on dedicated Apple Silicon.