🦙
Local LLM Deployments
Llama 3.1 405B (Q4)
Meta's Llama 3.1 405B is one of the largest openly available language models. With 405 billion parameters, served here with Q4 quantization, it approaches the quality of top proprietary models. It requires the full 128 GB Max tier and is best suited to tasks where model quality is the top priority.
Max+ required
From $1999/mo
15 min provisioning
OpenAI-compatible API
Made by Meta
License: Llama 3.1 Community License
Technical Specifications
Model size: 405B parameters
Memory required: 128 GB
Speed (M4 Pro): ~3 tok/s
Quantization: Q4_K_M
Context window: 8K tokens
Disk space: 230 GB
Runtime: Ollama + MLX
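To make the speed and context figures concrete, here is a rough budgeting sketch. It assumes "8K tokens" means 8192 and that the ~3 tok/s figure applies to generation only. The helper names are illustrative and not part of any API.

```python
# Figures copied from the specification list above.
TOKENS_PER_SECOND = 3.0      # approximate generation speed on M4 Pro
CONTEXT_WINDOW = 8 * 1024    # "8K tokens", assumed here to mean 8192


def estimated_seconds(output_tokens: int,
                      tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Rough wall-clock time to generate output_tokens (ignores prompt processing)."""
    return output_tokens / tokens_per_second


def max_output_tokens(prompt_tokens: int,
                      context_window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the reply once the prompt occupies part of the window."""
    return max(context_window - prompt_tokens, 0)


if __name__ == "__main__":
    print(f"500-token reply: ~{estimated_seconds(500):.0f} s")
    print(f"Room left after a 6000-token prompt: {max_output_tokens(6000)} tokens")
```

At this speed a few-hundred-token answer takes on the order of minutes, so short prompts and bounded reply lengths are worth planning for.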
Use Cases
- Frontier-quality text generation
- Complex multi-step reasoning
- Advanced code synthesis
- Academic and scientific research
- Benchmark and evaluation
What you get
- Ollama runtime with Llama 3.1 405B (Q4 quantized)
- MLX backend for optimized inference
- OpenAI-compatible API endpoint
- Prometheus metrics
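As a sketch of how the Prometheus metrics might be consumed, the snippet below parses text in the Prometheus exposition format. The metric names in the sample are hypothetical, the URL of the metrics endpoint is not stated on this page, and the parser assumes lines without trailing timestamps.

```python
# Hypothetical sample; real metric names for this deployment may differ.
SAMPLE = """\
# HELP llm_tokens_generated_total Hypothetical counter, not a documented metric name.
# TYPE llm_tokens_generated_total counter
llm_tokens_generated_total{model="llama-405b"} 1234
llm_requests_in_flight 2
"""


def parse_metrics(text: str) -> dict:
    """Parse simple 'name{labels} value' lines from Prometheus text exposition.

    Assumes no trailing timestamps; the full series string (name plus labels)
    is used as the dictionary key.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE comments
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```

In practice a Prometheus server would scrape the endpoint directly. A hand-rolled parser like this is mainly useful for quick checks.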
Start using it
curl
curl https://dep-<id>.macyou.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mcy_live_<your-key>" \
  -d '{
    "model": "llama-405b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
Python (OpenAI SDK)
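from openai import OpenAI

# Point the SDK at the deployment's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="mcy_live_<your-key>",
    base_url="https://dep-<id>.macyou.cloud/v1"
)

# Request a streamed chat completion.
response = client.chat.completions.create(
    model="llama-405b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

# Print each text fragment as it arrives; content can be None on some chunks.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
Tags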
LLM, Frontier, Max