Green Orchestration Framework
A 7-layer pipeline for sustainable AI inference that routes every prompt to the most energy-efficient model capable of answering it accurately. Combines complexity scoring, T5 prompt optimization, carbon-aware routing, and cascade inference into a single unified framework.
By the numbers
Results from running the benchmark suite on 20 prompts across complexity tiers. Source: experiments/benchmark.py
The 7-Layer Pipeline
Every prompt passes through all active layers before any GPU inference runs. Each layer either reduces cost or improves routing accuracy.
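The layered design can be sketched as a simple function chain. This is an illustrative stand-in, not the framework's implementation: only four of the seven stages are named in this document (complexity scoring, T5 prompt optimization, carbon-aware routing, cascade inference), and the scoring heuristic below is a placeholder assumption.

```python
# Hypothetical sketch of the layered orchestration pattern.
# Real layer names, ordering, and logic live in the framework.
from typing import Callable

Layer = Callable[[dict], dict]

def run_pipeline(request: dict, layers: list[Layer]) -> dict:
    """Pass the request through each active layer in order."""
    for layer in layers:
        request = layer(request)
    return request

def optimize_prompt(req: dict) -> dict:
    # The real framework uses a T5 optimizer; here we only normalize whitespace.
    req["prompt"] = " ".join(req["prompt"].split())
    return req

def score_complexity(req: dict) -> dict:
    # Placeholder heuristic: the real scorer is model-based.
    req["complexity_score_100"] = min(100, len(req["prompt"].split()) * 7)
    return req

result = run_pipeline({"prompt": "Explain  how transformers work"},
                      [optimize_prompt, score_complexity])
```

Each layer takes and returns the same request dict, so layers can be toggled on or off without changing the orchestrator.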
REST API
The framework is deployed as a FastAPI server on Hugging Face Spaces, accessible at https://sirenice-greeninfer-backend.hf.space
POST /chat
curl -X POST https://sirenice-greeninfer-backend.hf.space/chat \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain how transformers work",
    "mode": "balanced",
    "context": []
  }'
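The same call from Python, using only the standard library. The payload fields mirror the curl example; nothing beyond what the endpoint documents above is assumed.

```python
import json
import urllib.request

def build_chat_request(prompt: str, mode: str = "balanced", context=None) -> dict:
    """Payload matching the documented /chat request body."""
    return {"prompt": prompt, "mode": mode, "context": context or []}

def post_chat(payload: dict,
              url: str = "https://sirenice-greeninfer-backend.hf.space/chat") -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Live call (requires network access):
# resp = post_chat(build_chat_request("Explain how transformers work"))
# print(resp["model_tier"], resp["energy_mwh"])
```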
Response
{
  "response": "Transformers are neural networks that...",
  "model_tier": "medium",
  "model_name": "Llama 3.1 8B",
  "energy_mwh": 3.82,
  "co2_grams": 0.000757,
  "tokens_saved": 4,
  "original_tokens": 7,
  "reduction_pct": 36,
  "complexity_score_100": 48,
  "complexity_label": "Medium",
  "energy_saved_pct": 92,
  "cascade_path": ["medium"],
  "escalations": 0,
  "optimizer_used": true
}
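In the sample response, a `complexity_score_100` of 48 lands in the "Medium" band and routes to the medium tier. A plausible score-to-tier mapping looks like this; the cut points below are assumptions for illustration, the router's real thresholds are internal.

```python
# Hypothetical tier mapping over the 0-100 complexity score.
# The 34/67 cut points are assumed, not taken from the framework.
def tier_for_score(score: int) -> str:
    if score < 34:
        return "small"
    if score < 67:
        return "medium"
    return "large"
```

Under these assumed thresholds, the sample score of 48 maps to `"medium"`, consistent with the response's `model_tier`.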
GET /health
Returns {"status":"ok","backend":"groq","optimizer":"loaded"} when all services are running.
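A readiness probe only needs to validate the documented payload shape. A minimal check, with the actual HTTP fetch omitted:

```python
import json

def is_healthy(payload: dict) -> bool:
    """True when /health reports the documented ready state."""
    return payload.get("status") == "ok" and payload.get("optimizer") == "loaded"

# Sample payload copied from the /health documentation above.
sample = json.loads('{"status":"ok","backend":"groq","optimizer":"loaded"}')
```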
Python SDK
The Python package wraps the REST API with a clean interface. Install from the GitHub repo.
from greeninfer import GreenInfer

gi = GreenInfer()

# Balanced mode (default)
result = gi.chat("What is quantum computing?")
print(result.response)          # "Quantum computing is..."
print(result.energy_mwh)        # 0.9 (routed to small model)
print(result.model_tier)        # "small"
print(result.co2_grams)         # 0.000178
print(result.energy_saved_pct)  # 97

# Eco mode — always use smallest viable model
result = gi.chat("Write a sorting algorithm", mode="eco")

# Performance mode — prioritize quality
result = gi.chat("Analyze this legal contract...", mode="performance")

# Multi-turn conversation
result = gi.chat("Tell me more", context=gi.history)
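The per-query `energy_mwh` and `co2_grams` fields make session-level accounting straightforward. A sketch, using plain dicts as stand-ins for SDK result objects (the field names come from the examples above; `session_totals` is a hypothetical helper, not part of the SDK):

```python
def session_totals(results: list) -> dict:
    """Sum per-query energy and carbon over a session."""
    return {
        "energy_mwh": round(sum(r["energy_mwh"] for r in results), 3),
        "co2_grams": round(sum(r["co2_grams"] for r in results), 6),
    }

# Values taken from the SDK example and the sample /chat response.
history = [
    {"energy_mwh": 0.9, "co2_grams": 0.000178},
    {"energy_mwh": 3.82, "co2_grams": 0.000757},
]
totals = session_totals(history)
```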
Available Models
Three tiers, all running via Groq API for fast inference. Energy estimates are proxy values calibrated from public benchmark data.
| Model | Tier | Energy/Token | Best For | Traffic |
|---|---|---|---|---|
| Llama 3.2 1B | Small | ~0.04 mWh | Simple facts, definitions, short answers | 55% |
| Llama 3.1 8B | Medium | ~0.18 mWh | Reasoning, summaries, explanations | 30% |
| Llama 3.3 70B | Large | ~0.95 mWh | Code, complex analysis, creative writing | 15% |
Routing a query to the small tier instead of the large tier saves about 98% of its energy (0.9 vs 48 mWh). With 55% of traffic on the small tier, the benchmark's average query uses 4.5 mWh versus the 48 mWh always-large baseline, a 91% reduction on the average case.
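The per-tier savings follow directly from the per-query costs. Reproducing the arithmetic, with the small-tier cost (0.9 mWh) and baseline (48 mWh) from this section and the medium-tier cost (3.82 mWh) from the sample /chat response:

```python
BASELINE_MWH = 48.0  # always-large cost per query
PER_QUERY_MWH = {"small": 0.9, "medium": 3.82, "large": 48.0}

savings_pct = {tier: round((1 - mwh / BASELINE_MWH) * 100)
               for tier, mwh in PER_QUERY_MWH.items()}
# small: 98, medium: 92, large: 0
```

The medium-tier figure (92%) matches the `energy_saved_pct` in the sample response above.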
GreenInfer vs Always-Large
| Metric | GreenInfer | Always Large | Improvement |
|---|---|---|---|
| Avg energy/query | 4.5 mWh | 48 mWh | -91% |
| Simple query cost | 0.9 mWh | 48 mWh | -98% |
| Token efficiency | 35% fewer tokens | Baseline | +35% |
| Carbon per session | ~0.0008g CO₂ | ~0.003g CO₂ | -73% |
| Grid awareness | Yes — ERCOT hourly | None | ✓ |
| Cost to developer | Lower (smaller models) | Fixed high cost | ✓ |