Local LLM Deployments
Qwen 2.5 72B (FP16)
Qwen 2.5 72B at full 16-bit precision delivers the highest-quality multilingual output available among open-source models. Running without quantization avoids quantization-related quality loss, so every nuance is preserved. Requires 96 GB unified memory, available on the Pro tier.
Pro+ required
from $1199/mo
10 min provisioning
OpenAI-compatible API
Made by Alibaba Cloud
License: Apache 2.0
Technical Specifications
Model size: 72B parameters
Memory required: 96 GB
Speed (M4 Pro): ~8 tok/s
Quantization: FP16
Context window: 33K tokens
Disk space: 145 GB
Runtime: Ollama + MLX
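As a rough sanity check on the disk figure, the size of unquantized weights can be estimated as parameter count times bytes per parameter. The sketch below assumes 2 bytes per FP16 parameter and decimal gigabytes; the function name is illustrative, and the estimate ignores tokenizer files and any runtime overhead.

```python
def weight_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 72B parameters at FP16 (assumed 2 bytes each), using the model size listed above.
fp16_gb = weight_size_gb(72e9)
print(f"Estimated FP16 weights: {fp16_gb:.0f} GB")
```

This comes to about 144 GB, close to the 145 GB of disk space listed.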
Use Cases
- High-stakes multilingual tasks
- Legal document drafting
- Academic research
- Premium translation services
- Enterprise knowledge bases
What you get
- Ollama runtime with Qwen 2.5 72B at full precision
- MLX backend for optimized inference
- OpenAI-compatible API endpoint
- Prometheus metrics
Start using it
curl
curl https://dep-<id>.macyou.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mcy_live_<your-key>" \
  -d '{
    "model": "qwen-2.5-72b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
Python (OpenAI SDK)
from openai import OpenAI

# Point the SDK at the deployment's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="mcy_live_<your-key>",
    base_url="https://dep-<id>.macyou.cloud/v1",
)

response = client.chat.completions.create(
    model="qwen-2.5-72b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

# Print tokens as they arrive.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
Tags
LLM, Production, Full Precision