Local LLM Deployments

Qwen 2.5 72B (FP16)

Qwen 2.5 72B at full 16-bit (FP16) precision is aimed at the highest-quality multilingual output among open-weight models. Running without quantization means no quantization-related quality loss: the model runs with its weights as released. Requires 96 GB unified memory, available on the Pro tier.

Pro+ required, from $1199/mo
10 min provisioning
OpenAI-compatible API
Made by Alibaba Cloud
License: Apache 2.0

Technical Specifications

Model size: 72B parameters
Memory required: 96 GB
Speed (M4 Pro): ~8 tok/s
Quantization: FP16
Context window: 33K tokens
Disk space: 145 GB
Runtime: Ollama + MLX
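
As a rough sanity check on the figures above, the sketch below works out the FP16 weight footprint (2 bytes per parameter) and how long a reply of a given length might take at the listed ~8 tok/s. The function names and the 500-token reply length are illustrative assumptions, and the time estimate ignores prompt processing and network latency.

```python
BYTES_PER_FP16_PARAM = 2  # FP16 stores each parameter in 16 bits


def fp16_weight_gb(params_billions: float) -> float:
    """Approximate size of the FP16 weights in decimal gigabytes."""
    return params_billions * 1e9 * BYTES_PER_FP16_PARAM / 1e9


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Estimated time to generate output_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    # 72B parameters at 2 bytes each
    print(f"FP16 weights: ~{fp16_weight_gb(72):.0f} GB")
    # Hypothetical 500-token reply at the listed ~8 tok/s
    print(f"500-token reply: ~{generation_seconds(500, 8.0):.1f} s")
```

The weight estimate comes out at roughly 144 GB, in line with the 145 GB disk space listed. At ~8 tok/s, longer replies take on the order of a minute, which is one reason the examples further down use streaming.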

Use Cases

  • High-stakes multilingual tasks
  • Legal document drafting
  • Academic research
  • Premium translation services
  • Enterprise knowledge bases

What you get

  • Ollama runtime with Qwen 2.5 72B at full precision
  • MLX backend for optimized inference
  • OpenAI-compatible API endpoint
  • Prometheus metrics

Start using it

curl
curl https://dep-<id>.macyou.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mcy_live_<your-key>" \
  -d '{
    "model": "qwen-2.5-72b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
Python (OpenAI SDK)
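from openai import OpenAI

# Point the OpenAI SDK at your deployment instead of the default endpoint.
client = OpenAI(
    api_key="mcy_live_<your-key>",
    base_url="https://dep-<id>.macyou.cloud/v1"
)

# stream=True returns an iterator of incremental chunks.
response = client.chat.completions.create(
    model="qwen-2.5-72b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

# Print each text delta as it arrives; skip chunks that carry no choices.
for chunk in response:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)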

Tags

LLM, Production, Full Precision

Ready to deploy Qwen 2.5 72B (FP16)?

Up and running in 10 minutes on dedicated Apple Silicon.