Llama 3.3 8B

Meta's Llama 3.3 8B is a fast, lightweight large language model with 8 billion parameters. It fits entirely in 16 GB of unified memory and generates about 47 tokens per second on the M4 Pro, which is fast enough for real-time chat. It is well suited to building chatbots, AI agents, and rapid prototypes.

Starter+ required, from $149/mo

5 min provisioning

OpenAI-compatible API

Made by Meta

License: Llama 3.3 Community License

Technical Specifications


Model size: 8B parameters
Memory required: 16 GB
Speed (M4 Pro): ~47 tok/s
Quantization: Q4_K_M
Context window: 8K tokens
Disk space: 5 GB
Runtime: Ollama + MLX
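
The figures above can be sanity-checked with a little arithmetic. The sketch below is only a rough estimate: the bits-per-weight value is an assumption (Q4_K_M mixes block precisions, and an average near 4.85 bits is commonly cited), and it treats "8K" as 8192 tokens and the ~47 tok/s figure as a steady rate. The function names are illustrative, not part of any API.

```python
# Rough sizing arithmetic for the specifications listed above.
PARAMS = 8e9               # "8B parameters" from the spec list
BITS_PER_WEIGHT = 4.85     # ASSUMPTION: approximate Q4_K_M average, not stated on this page
TOKENS_PER_SECOND = 47     # "~47 tok/s" on the M4 Pro
CONTEXT_TOKENS = 8 * 1024  # ASSUMPTION: "8K tokens" read as 8192


def weights_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` output tokens at a steady decode rate."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    print(f"weights: ~{weights_size_gb(PARAMS, BITS_PER_WEIGHT):.2f} GB")
    print(f"500-token reply: ~{generation_seconds(500, TOKENS_PER_SECOND):.1f} s")
    print(f"8K tokens of output: ~{generation_seconds(CONTEXT_TOKENS, TOKENS_PER_SECOND):.0f} s")
```

Under these assumptions the weights come to a little under 5 GB, which is in line with the listed disk space. The 16 GB memory figure leaves headroom beyond the weights for the context cache and the runtime itself.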

Use Cases

Customer support chatbots
AI coding assistants
Content generation
RAG pipelines
Rapid prototyping

What you get

Ollama runtime with Llama 3.3 8B pre-loaded
MLX backend for optimized inference
OpenAI-compatible API endpoint
Prometheus metrics

Start using it

curl

curl https://dep-<id>.macyou.cloud/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer mcy_live_<your-key>" \
  -d '{
    "model": "llama-3.3-8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="mcy_live_<your-key>",
    base_url="https://dep-<id>.macyou.cloud/v1"
)

response = client.chat.completions.create(
    model="llama-3.3-8b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True
)

# Print each text fragment as it arrives; content is None on chunks without text.
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
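
If the full reply is needed as a single string rather than printed piece by piece, the stream can be accumulated instead. The helper below is a minimal sketch, not part of the OpenAI SDK: `collect_stream` and `fake_chunk` are hypothetical names, and `fake_chunk` only imitates the shape of a stream chunk (`choices[0].delta.content`) so the example runs without a deployment.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:      # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta
        if delta.content:          # content can be None on chunks without text
            parts.append(delta.content)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in that mimics the shape of an SDK stream chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    demo = [fake_chunk(None), fake_chunk("Hel"), fake_chunk("lo!"), SimpleNamespace(choices=[])]
    print(collect_stream(demo))  # prints "Hello!"
```

With a real deployment, passing the `response` object from the example above to `collect_stream(response)` should behave the same way, assuming the endpoint streams chunks in the usual OpenAI-compatible shape.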
