Llama 3.3 70B

Meta's Llama 3.3 70B is a top-tier open-weight model for complex reasoning and production workloads. It has 70 billion parameters and ships here with Q4 quantization so that it fits in 64 GB of memory. It competes with proprietary models on benchmarks while running entirely on your own hardware, so no data leaves your machine.

Advanced+ required, from $599/mo

10 min provisioning

OpenAI-compatible API

Made by Meta

License: Llama 3.3 Community License

Technical Specifications


Model size: 70B parameters

Memory required: 64 GB

Speed (M4 Pro): ~12 tok/s

Quantization: Q4_K_M

Context window: 8K tokens

Disk space: 40 GB

Runtime: Ollama + MLX
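
As a rough sanity check on the figures above, the disk footprint and response latency can be estimated with simple arithmetic. This is only a sketch: the 4.5 bits-per-weight average is an assumption chosen for illustration (the true average for Q4_K_M depends on the tensor mix and may be higher), and the helper names are hypothetical.

```python
# Back-of-the-envelope sizing from the specification values above.
PARAMS = 70e9            # 70B parameters (from the spec list)
BITS_PER_WEIGHT = 4.5    # ASSUMPTION: rough average for a 4-bit quantization
TOKENS_PER_SECOND = 12   # ~12 tok/s on M4 Pro (from the spec list)


def weights_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Approximate time to generate `tokens` output tokens."""
    return tokens / tokens_per_second


print(f"weights: ~{weights_size_gb(PARAMS, BITS_PER_WEIGHT):.0f} GB")
print(f"500-token reply: ~{generation_seconds(500, TOKENS_PER_SECOND):.0f} s")
```

Under these assumptions the weights come to roughly 39 GB, in line with the 40 GB disk figure, and a 500-token reply takes on the order of 40 seconds. The gap between the weights and the 64 GB memory requirement leaves room for the context cache and the operating system.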

Use Cases

Complex reasoning tasks
Production API serving
Legal and medical document analysis
Advanced code generation
Research and analysis

What you get

Ollama runtime with Llama 3.3 70B (Q4 quantized)
MLX backend for optimized inference
OpenAI-compatible API endpoint
Prometheus metrics

Start using it

curl

curl https://dep-<id>.macyou.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mcy_live_<your-key>" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Python (OpenAI SDK)

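from openai import OpenAI

# Point the SDK at your deployment instead of api.openai.com.
client = OpenAI(
    api_key="mcy_live_<your-key>",
    base_url="https://dep-<id>.macyou.cloud/v1"
)

# stream=True returns an iterator of incremental chunks.
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

# Print each text fragment as it arrives; some chunks carry no content.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)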
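
If you need the full reply as a single string rather than printing fragments as they arrive, the streamed chunks can be accumulated. The helper below is a minimal sketch, not part of the SDK: `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks only mimic the `choices[0].delta.content` shape used in the example above so the snippet runs without a network call.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text fragments of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        content = getattr(chunk.choices[0].delta, "content", None)
        # Skip chunks whose delta has no text (for example a final chunk).
        if content:
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in for a streamed chunk, for offline testing only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo!"),
        SimpleNamespace(choices=[])]
print(collect_stream(demo))
```

With a real deployment, pass the streaming `response` object from the example above, as in `text = collect_stream(response)`.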
