Local LLM Deployments
Llama 3.3 8B
Meta's Llama 3.3 8B is a fast, lightweight large language model with 8 billion parameters. It runs entirely in 16 GB of unified memory and generates text at about 47 tokens per second on the M4 Pro, fast enough for real-time chat. Great for building chatbots, AI agents, and rapid prototyping.
Starter+ required, from $149/mo
5 min provisioning
OpenAI-compatible API
Made by Meta
License: Llama 3.3 Community License
Technical Specifications
Model size: 8B parameters
Memory required: 16 GB
Speed (M4 Pro): ~47 tok/s
Quantization: Q4_K_M
Context window: 8K tokens
Disk space: 5 GB
Runtime: Ollama + MLX
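The disk and speed figures above can be sanity-checked with a little arithmetic. The sketch below assumes Q4_K_M averages roughly 4.85 bits per weight (a commonly cited approximation, not a number from this page) and treats 8B as exactly 8 billion parameters; the helper names are illustrative.

```python
def quantized_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Approximate wall-clock time to generate `tokens` at a steady rate."""
    return tokens / tokens_per_second


# Assumption: ~4.85 bits per weight on average for Q4_K_M.
size_gb = quantized_size_gb(8e9, 4.85)
# Speed figure from the spec list: ~47 tok/s on the M4 Pro.
reply_seconds = generation_seconds(500, 47)

print(f"weights: ~{size_gb:.2f} GB")
print(f"500-token reply: ~{reply_seconds:.1f} s")
```

Under these assumptions the weights come to a little under 5 GB, in line with the listed disk space, and a 500-token reply takes a bit over ten seconds.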
Use Cases
- Customer support chatbots
- AI coding assistants
- Content generation
- RAG pipelines
- Rapid prototyping
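For the RAG use case, the 8K context window limits how much retrieved text can accompany a question. A minimal sketch of packing retrieved passages into a prompt under a token budget follows. It assumes a rough 4-characters-per-token estimate rather than the model's real tokenizer and takes 8K to mean 8192 tokens; `estimate_tokens`, `pack_context`, and the reserve size are hypothetical names and values.

```python
CONTEXT_WINDOW = 8192   # assumption: "8K tokens" means 8192
REPLY_RESERVE = 1024    # hypothetical: tokens kept free for the model's answer


def estimate_tokens(text: str) -> int:
    """Rough token estimate (assumes ~4 characters per token)."""
    return max(1, len(text) // 4)


def pack_context(question: str, passages: list[str],
                 budget: int = CONTEXT_WINDOW - REPLY_RESERVE) -> list[dict]:
    """Build chat messages, adding ranked passages until the budget is spent."""
    used = estimate_tokens(question)
    kept = []
    for passage in passages:
        cost = estimate_tokens(passage)
        if used + cost > budget:
            break  # stop at the first passage that does not fit
        kept.append(passage)
        used += cost
    context = "\n\n".join(kept)
    return [
        {"role": "system", "content": f"Answer using this context:\n\n{context}"},
        {"role": "user", "content": question},
    ]


messages = pack_context(
    "What is the refund policy?",
    ["Refunds are issued within 30 days.", "x" * 40000, "Shipping takes 5 days."],
)
```

The returned list has the same shape as the `messages` field in the API examples under "Start using it". A production pipeline would count tokens with the model's tokenizer instead of the character heuristic.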
What you get
- Ollama runtime with Llama 3.3 8B pre-loaded
- MLX backend for optimized inference
- OpenAI-compatible API endpoint
- Prometheus metrics
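Prometheus metrics are normally exposed in a plain-text format with one sample per line, plus `# HELP` and `# TYPE` comment lines. The sketch below parses such text into a dictionary. The metric names in the sample are hypothetical, since this page does not list the metrics or the endpoint path a deployment exposes, and the parser is simplified (it ignores timestamps and does not handle `}` inside label values).

```python
import re

# Hypothetical sample; real metric names on a deployment may differ.
SAMPLE = """\
# HELP llm_tokens_generated_total Tokens generated since start.
# TYPE llm_tokens_generated_total counter
llm_tokens_generated_total{model="llama-3.3-8b"} 12345
llm_requests_in_flight 2
"""

# metric name, optional {labels}, then the sample value
LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text: str) -> dict[str, float]:
    """Map 'name{labels}' to its value, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or "")] = float(value)
    return samples


metrics = parse_metrics(SAMPLE)
```

In practice a Prometheus server would scrape the endpoint directly; a parser like this is mainly useful for quick ad hoc checks.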
Start using it
curl
curl https://dep-<id>.macyou.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mcy_live_<your-key>" \
  -d '{
    "model": "llama-3.3-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
Python (OpenAI SDK)
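from openai import OpenAI

# Point the OpenAI SDK at the deployment instead of api.openai.com.
client = OpenAI(
    api_key="mcy_live_<your-key>",
    base_url="https://dep-<id>.macyou.cloud/v1"
)

# stream=True returns an iterator of chunks instead of one full response.
response = client.chat.completions.create(
    model="llama-3.3-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

# Print each piece of text as it arrives; delta.content may be None.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Tags
LLM, Chat, Agents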
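With `"stream": true`, OpenAI-compatible endpoints generally respond with server-sent events: `data:` lines that each carry a JSON chunk, ending with `data: [DONE]`. The sketch below shows how raw output from the curl request under "Start using it" could be reassembled into text without the SDK. It assumes this deployment follows that convention; `collect_stream_text` is a hypothetical helper and the sample lines are made up for illustration.

```python
import json


def collect_stream_text(lines) -> str:
    """Concatenate delta.content from OpenAI-style SSE 'data:' lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Illustrative sample, not captured from a real deployment.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    '',
    'data: [DONE]',
]
text = collect_stream_text(sample)
```

The function accepts any iterable of lines, so it could be fed from a file of saved curl output or from an HTTP client that iterates over the response body line by line.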