Documentation

Green Orchestration Framework

A 7-layer pipeline for sustainable AI inference that routes every prompt to the most energy-efficient model capable of answering it accurately. Combines complexity scoring, T5 prompt optimization, carbon-aware routing, and cascade inference into a single unified framework.

✓ Open Source · HF Space Live · v0.1 · Research
Benchmark Results

By the numbers

Results from running the benchmark suite on 20 prompts across complexity tiers. Source: experiments/benchmark.py

97% · Energy saved vs. large
73% · Avg reduction, mixed workload
98.9% · Classifier accuracy
183 g · CO₂ avoided (20 prompts)
35% · Avg token reduction
55% · Routing accuracy
Architecture

The 7-Layer Pipeline

Every prompt passes through all active layers before a single GPU inference cycle runs. Each layer either reduces cost or improves routing accuracy.

01
🔬 Complexity Scorer
Assigns a complexity score from 0 to 100 using Shannon entropy of the token distribution, sentence length heuristics, and a fine-tuned DistilBERT classifier trained on 600 labeled prompts. Score maps to one of four tiers: trivial (<15), simple (15–40), medium (40–70), complex (>70).
DistilBERT · 98.9% accuracy · 600 training examples
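As a rough sketch of how an entropy-plus-length scorer of this kind might look (the tier boundaries come from the description above; the DistilBERT component is omitted, and the 0.6/0.4 blend is an invented placeholder, not the framework's actual weighting):

```python
import math
from collections import Counter

# Tier boundaries from the complexity scorer description.
TIERS = [(15, "trivial"), (40, "simple"), (70, "medium"), (101, "complex")]

def shannon_entropy(tokens):
    """Shannon entropy (bits) of the token frequency distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def complexity_score(prompt: str) -> float:
    """Toy 0-100 score from entropy and length heuristics only
    (the real pipeline also blends in the DistilBERT classifier)."""
    tokens = prompt.lower().split()
    if not tokens:
        return 0.0
    entropy = shannon_entropy(tokens)           # ~0-7 bits for short prompts
    length_signal = min(len(tokens) / 50, 1.0)  # saturates at 50 tokens
    score = 100 * (0.6 * min(entropy / 7, 1.0) + 0.4 * length_signal)
    return round(score, 1)

def tier(score: float) -> str:
    """Map a 0-100 score to its complexity tier."""
    return next(label for bound, label in TIERS if score < bound)
```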
02
⚡ Prompt Optimizer
Silently rewrites the user prompt through a fine-tuned T5-small model (GreenPromptsOptimizer) to remove filler words, redundant phrasing, and unnecessary politeness markers before the prompt is sent to inference. Average token reduction: 35%. Zero measurable quality loss on the benchmark set.
T5-small · ~35% token reduction · Hosted on HF
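The real optimizer is a fine-tuned T5-small model; as a toy illustration of the effect it targets (filler removal plus a token-reduction report), a rule-based stand-in might look like this. The `FILLERS` list is invented for the example and is not the model's actual behavior:

```python
# Illustrative filler words of the kind the T5 optimizer learns to strip.
FILLERS = {"please", "kindly", "basically", "actually", "really",
           "could", "would", "you", "just"}

def naive_optimize(prompt: str):
    """Rule-based stand-in for the T5 optimizer: drop filler words
    and report the token reduction as a percentage."""
    words = prompt.split()
    kept = [w for w in words if w.lower().strip("?,.!") not in FILLERS]
    reduction = round(100 * (1 - len(kept) / len(words))) if words else 0
    return " ".join(kept), reduction
```

A fine-tuned model can rewrite phrasing rather than merely delete tokens, which is why the real optimizer claims zero measurable quality loss where a filter like this could not.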
03
🌎 Carbon-Aware Router
Checks hourly ERCOT grid carbon intensity estimates before routing. When grid intensity exceeds 400 gCO₂/kWh (fossil-heavy), the framework auto-activates eco mode and defers or downgrades expensive queries. Below 150 gCO₂/kWh, performance mode can unlock.
ERCOT grid data · Carbon-aware · Hourly updates
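A minimal sketch of the threshold policy described above, assuming the 400 and 150 gCO₂/kWh cut-offs; query deferral and the actual ERCOT polling are omitted:

```python
# Thresholds from the router description (gCO2/kWh).
ECO_THRESHOLD = 400   # above this the grid is fossil-heavy: force eco mode
PERF_THRESHOLD = 150  # below this the grid is clean: performance unlocks

def select_mode(grid_intensity: float, requested: str = "balanced") -> str:
    """Pick an operating mode from current grid carbon intensity.
    Hypothetical policy sketch; the real router can also defer queries."""
    if grid_intensity > ECO_THRESHOLD:
        return "eco"
    if requested == "performance":
        # Performance mode only unlocks when the grid is clean.
        return "performance" if grid_intensity < PERF_THRESHOLD else "balanced"
    return requested
```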
04
🤺 Cascade Engine
Inspired by FrugalGPT, the cascade engine starts with the smallest model that could plausibly answer the query. If the response confidence score falls below a threshold, it escalates to the next tier. This prevents over-spending on simple queries while still accessing large models for complex ones.
FrugalGPT-inspired · Confidence gating · 3-tier cascade
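The escalation loop can be sketched as follows; `answer_fn` and `confidence_fn` are hypothetical stand-ins for the real inference and confidence-scoring calls, and the 0.7 threshold is an assumed example value:

```python
def cascade(prompt, models, answer_fn, confidence_fn, threshold=0.7):
    """FrugalGPT-style cascade: try tiers smallest-first and escalate
    whenever the response confidence falls below the threshold.
    Returns the accepted answer and the path of tiers tried."""
    path = []
    for tier in models:
        path.append(tier)
        answer = answer_fn(tier, prompt)
        # Accept if confident enough, or if this is the last tier.
        if confidence_fn(tier, answer) >= threshold or tier == models[-1]:
            return answer, path
```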
05
📊 Energy Estimator
Token-based proxy estimation using measured energy-per-token constants for each model (Llama 1B: 0.04 mWh/token, 8B: 0.18 mWh/token, 70B: 0.95 mWh/token). Reports mWh used, grams of CO₂ emitted using ERCOT intensity, tokens saved, and a composite Green Efficiency Score 0–100 per response.
Per-token constants · CO₂ tracking · Green Score
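Using the per-token constants quoted above, a proxy estimate might be computed like this; the default grid intensity of 198 gCO₂/kWh is an assumed example value, not a documented constant:

```python
# Proxy energy-per-token constants from the model pool (mWh/token).
MWH_PER_TOKEN = {"small": 0.04, "medium": 0.18, "large": 0.95}

def estimate(tier: str, tokens: int, grid_gco2_per_kwh: float = 198.0):
    """Token-based proxy estimate of energy and emissions.
    CO2 grams = (mWh -> kWh) * grid intensity in gCO2/kWh."""
    mwh = tokens * MWH_PER_TOKEN[tier]
    co2_grams = mwh / 1_000_000 * grid_gco2_per_kwh
    return {"energy_mwh": round(mwh, 2), "co2_grams": round(co2_grams, 6)}
```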
06
💡 Smart Preview
For complex queries (score >38), generates a 2-sentence preview and outline using the small model before running the full answer. User confirms before the expensive inference runs, preventing wasted reruns on misunderstood questions.
Confirm-before-run · Small model preview
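The confirm-before-run flow can be sketched with hypothetical callables standing in for the preview generation, full inference, and user confirmation steps:

```python
PREVIEW_THRESHOLD = 38  # complexity score above which a preview is shown

def answer_with_preview(prompt, score, preview_fn, full_fn, confirm_fn):
    """Sketch of the confirm-before-run flow: complex prompts get a
    cheap small-model preview, and full inference only runs on approval."""
    if score <= PREVIEW_THRESHOLD:
        return full_fn(prompt)            # cheap enough to just run
    preview = preview_fn(prompt)          # 2-sentence outline, small model
    if confirm_fn(preview):
        return full_fn(prompt)
    return preview                        # user declined the expensive run
```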
07
🌿 Carbon Budget Enforcer
Tracks cumulative CO₂ emissions per session against a configurable budget (default 5 g). As usage approaches the budget, the framework automatically shifts to eco mode. This prevents runaway usage and surfaces the true carbon cost of conversational AI.
Per-session budget · Auto eco-mode
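A minimal budget tracker along these lines, assuming an 80% warning line at which eco mode kicks in (the warning ratio is an invented detail; only the 5 g default comes from the description above):

```python
class CarbonBudget:
    """Per-session CO2 budget tracker (default 5 g).
    Shifts to eco mode once usage crosses an assumed 80% warning line."""

    def __init__(self, budget_grams: float = 5.0, warn_ratio: float = 0.8):
        self.budget = budget_grams
        self.warn = warn_ratio * budget_grams
        self.used = 0.0

    def record(self, co2_grams: float) -> str:
        """Add one response's emissions and report the session state."""
        self.used += co2_grams
        if self.used >= self.budget:
            return "blocked"   # budget exhausted
        if self.used >= self.warn:
            return "eco"       # approaching budget: auto eco-mode
        return "ok"
```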
Integration

REST API

The framework is deployed as a FastAPI server on Hugging Face Spaces and is accessible at https://sirenice-greeninfer-backend.hf.space.

POST /chat

cURL
curl -X POST https://sirenice-greeninfer-backend.hf.space/chat \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain how transformers work",
    "mode": "balanced",
    "context": []
  }'

Response

JSON Response
{
  "response": "Transformers are neural networks that...",
  "model_tier": "medium",
  "model_name": "Llama 3.1 8B",
  "energy_mwh": 3.82,
  "co2_grams": 0.000757,
  "tokens_saved": 4,
  "original_tokens": 7,
  "reduction_pct": 36,
  "complexity_score_100": 48,
  "complexity_label": "Medium",
  "energy_saved_pct": 92,
  "cascade_path": ["medium"],
  "escalations": 0,
  "optimizer_used": true
}

GET /health

Returns {"status":"ok","backend":"groq","optimizer":"loaded"} when all services are running.

SDK

Python SDK

The Python package wraps the REST API with a clean interface. Install from the GitHub repo.

Python
from greeninfer import GreenInfer

gi = GreenInfer()

# Balanced mode (default)
result = gi.chat("What is quantum computing?")
print(result.response)       # "Quantum computing is..."
print(result.energy_mwh)     # 0.9 (routed to small model)
print(result.model_tier)     # "small"
print(result.co2_grams)      # 0.000178
print(result.energy_saved_pct)  # 97

# Eco mode — always use smallest viable model
result = gi.chat("Write a sorting algorithm", mode="eco")

# Performance mode — prioritize quality
result = gi.chat("Analyze this legal contract...", mode="performance")

# Multi-turn conversation
result = gi.chat("Tell me more", context=gi.history)
Model Pool

Available Models

Three tiers, all served via the Groq API for fast inference. Energy estimates are proxy values calibrated from public benchmark data.

| Model | Tier | Energy/Token | Best For | Traffic |
|---|---|---|---|---|
| Llama 3.2 1B | Small | ~0.04 mWh | Simple facts, definitions, short answers | 55% |
| Llama 3.1 8B | Medium | ~0.18 mWh | Reasoning, summaries, explanations | 30% |
| Llama 3.3 70B | Large | ~0.95 mWh | Code, complex analysis, creative writing | 15% |

Routing a query to the small model instead of the large one cuts its energy use by roughly 97%. With 55% of traffic landing on the small tier, the average query uses only 4.5 mWh versus a 48 mWh always-large baseline, a 91% reduction in the average case.

Comparison

GreenInfer vs Always-Large

| Metric | GreenInfer | Always Large | Improvement |
|---|---|---|---|
| Avg energy/query | 4.5 mWh | 48 mWh | -91% |
| Simple query cost | 0.9 mWh | 48 mWh | -98% |
| Token efficiency | ~35% fewer tokens | Baseline | +35% |
| Carbon per session | ~0.0008 g CO₂ | ~0.003 g CO₂ | -73% |
| Grid awareness | Yes (ERCOT, hourly) | None | |
| Cost to developer | Lower (smaller models) | Fixed high cost | |