Llama 3.1 405B (Q4)

Meta's Llama 3.1 405B is one of the largest openly available language models. At 405 billion parameters (Q4 quantized), it approaches the quality of top proprietary models. It requires the full 128 GB Max tier and is best suited to tasks where model quality is the top priority.

Max+ required, from $1999/mo

15 min provisioning

OpenAI-compatible API

Made by Meta

License: Llama 3.1 Community License

Technical Specifications

Model size: 405B parameters

Memory required: 128 GB

Speed (M4 Pro): ~3 tok/s

Quantization: Q4_K_M

Context window: 8K tokens

Disk space: 230 GB

Runtime: Ollama + MLX
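
To make the figures above more concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers listed in the table; the helper names are hypothetical, "230 GB" is assumed to mean 230 x 10^9 bytes, "8K" is read as roughly 8,000 tokens, and the context window is assumed to be shared between prompt and reply. Treat the results as estimates, not measured values.

```python
# Rough estimates derived from the listed specifications (assumptions noted inline).
PARAMS = 405e9            # model size: 405B parameters
DISK_BYTES = 230e9        # disk space: 230 GB, assumed to be 10^9 bytes per GB
TOKENS_PER_SECOND = 3.0   # listed speed: ~3 tok/s
CONTEXT_TOKENS = 8_000    # "8K tokens", read here as roughly 8,000


def bits_per_weight(disk_bytes=DISK_BYTES, params=PARAMS):
    """Average storage per parameter implied by the file size."""
    return disk_bytes * 8 / params


def generation_seconds(output_tokens, tokens_per_second=TOKENS_PER_SECOND):
    """Approximate time to generate a reply of the given length."""
    return output_tokens / tokens_per_second


def max_output_tokens(prompt_tokens, context_tokens=CONTEXT_TOKENS):
    """Tokens left for the reply, assuming prompt and reply share the window."""
    return max(context_tokens - prompt_tokens, 0)


if __name__ == "__main__":
    print(f"{bits_per_weight():.2f} bits per weight")
    print(f"{generation_seconds(500):.0f} s for a 500-token reply")
    print(f"{max_output_tokens(6000)} tokens left after a 6,000-token prompt")
```

At the listed speed, a 500-token reply takes on the order of a few minutes, so streaming responses and generous client timeouts are worth considering.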

Use Cases

Frontier-quality text generation
Complex multi-step reasoning
Advanced code synthesis
Academic and scientific research
Benchmark and evaluation

What you get

Ollama runtime with Llama 3.1 405B (Q4 quantized)
MLX backend for optimized inference
OpenAI-compatible API endpoint
Prometheus metrics

Start using it

curl

curl -N https://dep-<id>.macyou.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mcy_live_<your-key>" \
  -d '{
    "model": "llama-405b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
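
With "stream": true, OpenAI-style endpoints usually answer with server-sent events: a series of "data: {...}" lines, each carrying a small JSON chunk, ending with "data: [DONE]". The sketch below shows one way to reassemble the reply text from such lines. It assumes this endpoint follows that convention; the function names are hypothetical and the sample lines are hand-written for illustration, not captured output.

```python
import json


def delta_text(line):
    """Return the text fragment carried by one SSE line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None                          # blank separators or other fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None                          # end-of-stream marker
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None                          # some chunks carry no choices
    return (choices[0].get("delta") or {}).get("content")


def join_stream(lines):
    """Concatenate all text fragments found in an iterable of SSE lines."""
    return "".join(text for text in map(delta_text, lines) if text)


# Hand-written sample in the assumed format (illustrative only).
sample = [
    'data: {"choices":[{"delta":{"role":"assistant","content":""}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" there"}}]}',
    'data: [DONE]',
]

if __name__ == "__main__":
    print(join_stream(sample))
```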

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="mcy_live_<your-key>",
    base_url="https://dep-<id>.macyou.cloud/v1"
)

response = client.chat.completions.create(
    model="llama-405b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

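# Print each streamed fragment as it arrives; skip chunks that carry no choices.
for chunk in response:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()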

Ready to deploy Llama 3.1 405B (Q4)?

Up and running in 15 minutes on dedicated Apple Silicon.