Technical Documentation

Green Orchestration Framework

A 7-layer pipeline that intelligently routes AI inference requests to the most energy-efficient model capable of answering accurately. Built with Python, deployed to Hugging Face Spaces, and integrated with Groq's API.

v0.1.0 · FastAPI Backend · Groq Models · HF Spaces
Overview
Why this exists and how it differs from standard LLM APIs

Most AI applications send every query to the largest, most capable model available. This is like driving a semi-truck to pick up a coffee — it works, but it wastes enormous energy. GreenInfer's Green Orchestration Framework (GOF) applies a simple insight from computer science: match resource allocation to task requirements.

The framework is not an AI model itself. It is a routing and optimization layer — analogous to LangChain but focused on sustainability rather than chaining. It sits between the user and the model pool, analyzing each prompt before any expensive inference runs.

The key original contribution is the Smart Preview system: for complex queries, the framework generates a short 2-sentence summary and bullet outline using the small model first. The user confirms before the full expensive response runs, preventing wasted large-model reruns when users want to refine their question.
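The Smart Preview flow can be sketched as a small wrapper around the SDK. This is a minimal sketch, assuming an `analyze` method that mirrors the `/analyze` endpoint and a `complexity_score_100` key; the helper name and threshold are illustrative, not the shipped API.

```python
# Sketch of the Smart Preview flow (hypothetical helper; the real API
# surface may differ). A cheap small-model preview runs first, and the
# expensive full response only runs after the user confirms.

def smart_preview_chat(gi, prompt, threshold=65, confirm=input):
    """Preview complex prompts with the small model before full inference."""
    analysis = gi.analyze(prompt)            # assumed to mirror POST /analyze
    if analysis["complexity_score_100"] < threshold:
        return gi.chat(prompt)               # cheap enough: answer directly
    preview = gi.chat(
        f"In two sentences plus a bullet outline, summarize how you "
        f"would answer: {prompt}", mode="eco")
    print(preview.response)
    if confirm("Run the full response? [y/N] ").strip().lower() == "y":
        return gi.chat(prompt, mode="performance")
    return preview                           # user refines the question instead
```

The `confirm` parameter is injected so the flow can be driven by a UI callback rather than the terminal.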

The 7 Layers
Every prompt flows through these layers in sequence
01
Complexity Scorer greeninfer/complexity_scorer.py
Assigns a complexity score from 0–100 to every incoming prompt. Uses a rule-based engine as a fast fallback (Shannon entropy + token length + task classification signals) and a fine-tuned DistilBERT classifier as the primary scorer.
Model: sirenice/greeninfer-complexity · 600 training examples · 4 tiers · 98.9% accuracy
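The rule-based fallback combines the three signals named above. This is an illustrative sketch with assumed weights; the shipped `complexity_scorer.py` uses the same signal families but its exact formula is not documented here.

```python
import math
from collections import Counter

# Illustrative rule-based fallback scorer. Weights (40/40/20) and the
# keyword list are assumptions, not the shipped implementation.

def rule_based_complexity(prompt: str) -> int:
    """Score 0-100 from Shannon entropy, token length, and task keywords."""
    tokens = prompt.lower().split()
    counts = Counter(tokens)
    n = len(tokens) or 1
    # Shannon entropy over the token distribution (bits per token)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    length_signal = min(n / 50, 1.0)          # longer prompts score higher
    task_words = {"analyze", "prove", "implement", "compare", "derive"}
    task_signal = 1.0 if task_words & counts.keys() else 0.0
    score = 40 * min(entropy / 6, 1.0) + 40 * length_signal + 20 * task_signal
    return round(min(score, 100))
```

Greetings and short factual questions land in the low band, while long analytical prompts cross into the medium/large bands.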
02
Prompt Optimizer greeninfer/model_registry.py
Silently rewrites user prompts to remove filler words, passive constructions, and redundant phrasing before sending to the inference model. Uses a fine-tuned T5-small model (sirenice/greenpromptsoptimizer). The original and optimized prompts are both logged for transparency.
Average token reduction: ~35% · Model: T5-small fine-tuned on 1000+ prompt pairs
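The kind of rewriting the optimizer performs can be approximated with a simple filler-stripping pass. The shipped optimizer is the fine-tuned T5-small (sirenice/greenpromptsoptimizer); this regex sketch only illustrates the category of edits, and the filler list is an assumption.

```python
import re

# Minimal filler-stripping sketch, NOT the fine-tuned T5 optimizer.
# The phrase list is illustrative.

FILLERS = re.compile(
    r"\b(please|kindly|basically|actually|just|really|very|"
    r"i would like you to|could you|can you)\b\s*",
    re.IGNORECASE)

def strip_fillers(prompt: str) -> tuple[str, int]:
    """Return (optimized prompt, whitespace-tokens saved)."""
    optimized = FILLERS.sub("", prompt).strip()
    saved = len(prompt.split()) - len(optimized.split())
    return optimized, saved
```

As in the real pipeline, both the original and optimized strings remain available, so the rewrite can be logged for transparency.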
03
Carbon Router greeninfer/carbon_router.py
Checks the current grid carbon intensity before routing. Uses hourly ERCOT estimates by default, with support for ElectricityMaps and WattTime APIs. When grid carbon intensity is high (elevated gCO₂/kWh), large-model queries are deferred or downgraded.
Supports: ERCOT (hourly lookup) · ElectricityMaps API · WattTime API
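The downgrade decision reduces to a threshold check. The cutoff value below is an assumption for illustration; the real `carbon_router.py` reads hourly ERCOT estimates or the ElectricityMaps/WattTime APIs to obtain the intensity figure.

```python
# Grid-aware routing sketch. The 450 gCO2/kWh cutoff is illustrative.

HIGH_CARBON_G_PER_KWH = 450.0   # assumed threshold for a "dirty" grid hour

def carbon_adjust(tier: str, grid_intensity_g_per_kwh: float) -> str:
    """Downgrade large-model requests while grid carbon intensity is high."""
    if grid_intensity_g_per_kwh >= HIGH_CARBON_G_PER_KWH and tier == "large":
        return "medium"   # defer heavy inference to a cleaner hour
    return tier
```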
04
Cascade Engine greeninfer/cascade.py
Implements the FrugalGPT-inspired cascade: start with the smallest model tier, evaluate confidence, and only escalate to the next tier if confidence is below the threshold. Low confidence signals include hedging phrases ("I'm not sure", "I don't have enough information") and responses that are too short relative to prompt complexity.
Path: small → medium → large · Configurable confidence threshold · Escalations logged
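The cascade loop and its confidence heuristics can be sketched directly from the description above. Function names, the hedge list, and the length heuristic are illustrative; only the escalation structure (small → medium → large, stop when confident) is taken from the source.

```python
# FrugalGPT-style cascade sketch. Heuristics follow the text above:
# hedging phrases or a too-short answer trigger escalation.

HEDGES = ("i'm not sure", "i don't have enough information")

def low_confidence(response: str, prompt_complexity: int) -> bool:
    text = response.lower()
    too_short = len(response.split()) < prompt_complexity // 2
    return any(h in text for h in HEDGES) or too_short

def cascade(call_model, prompt: str, complexity: int,
            tiers=("small", "medium", "large")):
    """Try tiers in order; escalate while confidence stays low."""
    path = []
    for tier in tiers:
        path.append(tier)
        response = call_model(tier, prompt)
        if tier == tiers[-1] or not low_confidence(response, complexity):
            return response, path
```

`call_model` is the injected inference callable, so the cascade stays model-agnostic and every escalation is visible in the returned path.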
05
Model Registry greeninfer/model_registry.py
Model-agnostic pool that manages the three Groq model tiers plus vision model support. Handles API rate limits, error recovery, and model-specific parameter tuning. New models can be registered without changing orchestrator code.
Small: llama-3.2-1b · Medium: llama-3.1-8b · Large: llama-3.3-70b · Vision: llama-3.2-11b-vision
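A minimal sketch of the registration pattern: tiers map to Groq model IDs plus per-model settings, so adding a model never touches orchestrator code. The `max_tokens` values are assumed parameters for illustration; the model IDs are the ones listed in the Model Pool section.

```python
# Registry sketch. Per-model parameters here (max_tokens) are assumptions.

REGISTRY: dict[str, dict] = {}

def register_model(tier: str, model_id: str, **params) -> None:
    """Register or replace a model for a tier without orchestrator changes."""
    REGISTRY[tier] = {"model_id": model_id, **params}

register_model("small",  "llama-3.2-1b-preview",         max_tokens=512)
register_model("medium", "llama-3.1-8b-instant",         max_tokens=1024)
register_model("large",  "llama-3.3-70b-versatile",      max_tokens=4096)
register_model("vision", "llama-3.2-11b-vision-preview", max_tokens=1024)
```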
06
Energy Estimator greeninfer/energy_estimator.py
Calculates energy consumption per inference using proxy estimation (token count × per-token energy coefficients per model tier) with optional CodeCarbon integration for real GPU measurement. All estimates are per-request and accumulated across the session.
Small: 0.04 µWh/token · Medium: 0.18 µWh/token · Large: 0.95 µWh/token
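The proxy estimate is a direct multiplication using the per-token coefficients above, converted from µWh to mWh. (With CodeCarbon enabled, measured GPU energy replaces this proxy.)

```python
# Proxy energy estimate: token count x per-token coefficient (uWh/token),
# using the published tier coefficients, reported in mWh.

UWH_PER_TOKEN = {"small": 0.04, "medium": 0.18, "large": 0.95}

def estimate_energy_mwh(tier: str, total_tokens: int) -> float:
    """Per-request energy in mWh for a given tier and token count."""
    return total_tokens * UWH_PER_TOKEN[tier] / 1000.0  # uWh -> mWh
```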
07
Orchestrator greeninfer/orchestrator.py
Main coordinator that chains all six layers above. Returns a rich result object containing the response, model tier used, cascade path, complexity score, tokens saved, energy used in mWh, CO₂ in grams, and whether the optimizer was active.
Returns: InferenceResult dataclass with 12 fields including full provenance
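The result object can be reconstructed as a dataclass from the example JSON shown under POST /chat. Field names and types below follow that JSON; the shipped dataclass is described as having 12 fields, so the exact count and ordering here are approximate.

```python
from dataclasses import dataclass

# InferenceResult sketch reconstructed from the /chat example JSON;
# the shipped dataclass may differ in field count and order.

@dataclass
class InferenceResult:
    response: str
    model_tier: str
    model_name: str
    complexity_score_100: int
    complexity_label: str
    original_tokens: int
    tokens_saved: int
    reduction_pct: int
    energy_mwh: float
    co2_grams: float
    energy_saved_pct: int
    cascade_path: list[str]
    escalations: int
    optimizer_used: bool
```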
REST API
Deployed at https://sirenice-greeninfer-backend.hf.space
GET /health
Returns system status, model availability, and Groq connection state.

POST /analyze
Scores a prompt's complexity, runs the optimizer, and returns a routing recommendation — without running inference. Used by Smart Preview.

POST /chat
Full pipeline: analyze + route + infer. Accepts prompt, mode, context history, and optional image attachments. Returns InferenceResult JSON.
{
  "response": "Photosynthesis is the process...",
  "model_tier": "small",
  "model_name": "llama-3.2-1b-preview",
  "complexity_score_100": 18,
  "complexity_label": "Low",
  "original_tokens": 12,
  "tokens_saved": 4,
  "reduction_pct": 33,
  "energy_mwh": 0.87,
  "co2_grams": 0.000172,
  "energy_saved_pct": 98,
  "cascade_path": ["small"],
  "escalations": 0,
  "optimizer_used": true
}
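Calling the deployed endpoint needs only the standard library. The request-body field names (`prompt`, `mode`) are assumed to mirror the SDK parameters; check the backend if they differ.

```python
import json
import urllib.request

# Minimal stdlib client for POST /chat. Body field names are assumptions
# mirroring the SDK's chat() parameters.

CHAT_URL = "https://sirenice-greeninfer-backend.hf.space/chat"

def build_chat_request(prompt: str, mode: str = "balanced") -> urllib.request.Request:
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps({"prompt": prompt, "mode": mode}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def green_chat(prompt: str, mode: str = "balanced") -> dict:
    """Send one prompt through the full pipeline and return the result JSON."""
    with urllib.request.urlopen(build_chat_request(prompt, mode)) as resp:
        return json.load(resp)
```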
Python SDK
Install locally and run inference with full energy transparency
git clone https://github.com/srineshtor21-coder/GreenInfer
cd GreenInfer
pip install -r requirements.txt
export GROQ_API_KEY="your_key_here"
from greeninfer import GreenInfer

# Initialize (auto-loads all layers)
gi = GreenInfer()

# Basic chat
result = gi.chat("Explain quantum entanglement simply")

print(result.response)        # The answer
print(result.model_tier)      # "small" | "medium" | "large"
print(result.energy_mwh)      # e.g. 0.87
print(result.co2_grams)       # e.g. 0.000172
print(result.tokens_saved)    # e.g. 4
print(result.cascade_path)    # e.g. ["small"]

# Eco mode (maximum energy savings)
result = gi.chat("Write a sorting algorithm", mode="eco")

# Performance mode (accuracy priority)
result = gi.chat("Analyze this legal document...", mode="performance")
Model Pool
Three text tiers plus a vision model via the Groq API, all open-source Llama models
GROQ_MODELS = {
    "small":  "llama-3.2-1b-preview",    # 0.9 mWh/query — 55% of traffic
    "medium": "llama-3.1-8b-instant",    # 3.8 mWh/query — 30% of traffic
    "large":  "llama-3.3-70b-versatile", # 48.0 mWh/query — 15% of traffic
    "vision": "llama-3.2-11b-vision-preview"  # image inputs
}

# Routing thresholds (balanced mode)
def route(complexity: int) -> str:
    if complexity < 30:
        return "small"
    if complexity < 65:
        return "medium"
    return "large"
Framework Comparison
How GreenInfer differs from existing tools
Compared against LangChain, FrugalGPT, and a default single-model API, GreenInfer combines all of the following in one layer:

- Energy-aware routing
- Carbon grid integration
- Smart Preview (confirm before generation)
- Token-level energy metrics (at most partial support in the alternatives)
- Prompt optimizer
- Cascade engine (shared with FrugalGPT, which inspired it)
- Per-response Green Score
- Open source
Benchmark Results
From experiments/benchmark.py — 20 prompts, mixed complexity

These numbers come from running the full benchmark suite included in the repository. The 20-prompt test set covers all four complexity tiers: simple factual, explanation, analysis, and code generation.

Energy saved (small tier): 97%
Avg energy reduction: 73%
Routing accuracy: 55%
CO₂ avoided: 183 g
Classifier accuracy: 98.9%
Token reduction: 35%

Note: 55% routing accuracy reflects the challenge of exact-tier matching. The more relevant metric for sustainability is energy savings: even misrouted queries (e.g. routing medium-complexity to small) still save substantial energy while usually producing acceptable answers.
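As a worked check of the headline savings, routing a single simple query to the small tier instead of the large tier, using the per-query figures from the Model Pool section, gives the best-case saving (the benchmark's 97% figure is presumably the average over actual small-tier prompts):

```python
# Best-case single-query saving: small tier vs large tier,
# per-query energy figures from the Model Pool section.

small_mwh, large_mwh = 0.9, 48.0
savings_pct = (1 - small_mwh / large_mwh) * 100
print(f"{savings_pct:.1f}% energy saved")  # prints "98.1% energy saved"
```

This matches the `energy_saved_pct` of 98 in the /chat example response.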