Local LLM Deployments
Llama 3.3 70B
Meta's Llama 3.3 70B is a top-tier open-weight model for complex reasoning and production workloads. With 70 billion parameters, Q4-quantized to fit in 64 GB of memory, it competes with proprietary models on benchmarks while running entirely on your own hardware. No data ever leaves your machine.
Advanced+ required, from $599/mo
10 min provisioning
OpenAI-compatible API
Made by Meta
License: Llama 3.3 Community License
Technical Specifications
Model size: 70B parameters
Memory required: 64 GB
Speed (M4 Pro): ~12 tok/s
Quantization: Q4_K_M
Context window: 8K tokens
Disk space: 40 GB
Runtime: Ollama + MLX
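As a rough sanity check on the figures above, the sketch below estimates the size of the quantized weights and the time to generate a reply at the listed speed. The 4.5 bits-per-weight average is an assumption chosen for illustration (Q4_K_M mixes precisions, so real files differ somewhat), and the function names are hypothetical.

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a quantized model."""
    # billions of parameters * bits per parameter / 8 bits per byte = billions of bytes
    return params_billion * bits_per_weight / 8


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Approximate wall-clock time to generate `tokens` at a steady rate."""
    return tokens / tokens_per_second


# Assumed average of 4.5 bits per weight (illustrative, not a measured value).
size_gb = quantized_size_gb(70, 4.5)
# 500 output tokens at the listed ~12 tok/s.
seconds = generation_seconds(500, 12)

print(f"Estimated weights: {size_gb:.1f} GB")
print(f"Estimated time for 500 tokens: {seconds:.0f} s")
```

Under that assumption the weights come to a little under 40 GB, in line with the listed disk space. The 64 GB memory figure is larger, which leaves room for more than the weights alone.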
Use Cases
- Complex reasoning tasks
- Production API serving
- Legal and medical document analysis
- Advanced code generation
- Research and analysis
What you get
- Ollama runtime with Llama 3.3 70B (Q4 quantized)
- MLX backend for optimized inference
- OpenAI-compatible API endpoint
- Prometheus metrics
Start using it
curl
curl https://dep-<id>.macyou.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mcy_live_<your-key>" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    api_key="mcy_live_<your-key>",
    base_url="https://dep-<id>.macyou.cloud/v1",
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

# Print each streamed text fragment as it arrives.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
Tags
LLM, Production, Reasoning
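When the streamed reply is needed as a single string instead of printed output, the chunks from the Python example can be joined with a small helper. The following is a minimal sketch, not part of the service or the SDK: `collect_stream` and `fake_chunk` are hypothetical names, and the stand-in chunks only imitate the `choices[0].delta.content` shape used in the example so the code runs without a network call.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        # A delta may have no text content (for example a role-only or final chunk).
        text = getattr(chunk.choices[0].delta, "content", None)
        if text:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk (offline demo only)."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


demo = [fake_chunk("Hel"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("lo!")]
print(collect_stream(demo))  # prints "Hello!"
```

With a real deployment, passing the `response` object from the Python example to `collect_stream` would return the full reply text once the stream ends.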