Technical Documentation

Green Orchestration Framework

A 7-layer pipeline that intelligently routes AI inference requests to the most energy-efficient model capable of answering accurately. Built with Python, deployed to Hugging Face Spaces, and integrated with Groq's API.

v0.1.0 · FastAPI Backend · Groq Models · HF Spaces
Overview
Why this exists and how it differs from standard LLM APIs

Most AI applications send every query to the largest, most capable model available. This is like driving a semi-truck to pick up a coffee — it works, but it wastes enormous energy. GreenInfer's Green Orchestration Framework (GOF) applies a simple insight from computer science: match resource allocation to task requirements.

The framework is not an AI model itself. It is a routing and optimization layer — analogous to LangChain but focused on sustainability rather than chaining. It sits between the user and the model pool, analyzing each prompt before any expensive inference runs.

The key original contribution is the Smart Preview system: for complex queries, the framework generates a short 2-sentence summary and bullet outline using the small model first. The user confirms before the full expensive response runs, preventing wasted large-model reruns when users want to refine their question.
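The Smart Preview flow can be sketched as a small wrapper around the SDK. This is a minimal sketch, assuming an `analyze` method that mirrors the `/analyze` endpoint and a `complexity_score_100` key; the helper name and threshold are illustrative, not the shipped API.

```python
# Sketch of the Smart Preview flow (hypothetical helper; the real API
# surface may differ). A cheap small-model preview runs first, and the
# expensive full response only runs after the user confirms.

def smart_preview_chat(gi, prompt, threshold=65, confirm=input):
    """Preview complex prompts with the small model before full inference."""
    analysis = gi.analyze(prompt)            # assumed to mirror POST /analyze
    if analysis["complexity_score_100"] < threshold:
        return gi.chat(prompt)               # cheap enough: answer directly
    preview = gi.chat(
        f"In two sentences plus a bullet outline, summarize how you "
        f"would answer: {prompt}", mode="eco")
    print(preview.response)
    if confirm("Run the full response? [y/N] ").strip().lower() == "y":
        return gi.chat(prompt, mode="performance")
    return preview                           # user refines the question instead
```

The `confirm` parameter is injected so the flow can be driven by a UI callback rather than the terminal.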

The 7 Layers
Every prompt flows through these layers in sequence
01
Complexity Scorer greeninfer/complexity_scorer.py
Assigns a complexity score from 0–100 to every incoming prompt. Uses a rule-based engine as a fast fallback (Shannon entropy + token length + task classification signals) and a fine-tuned DistilBERT classifier as the primary scorer.
Model: sirenice/greeninfer-complexity · 600 training examples · 4 tiers · 98.9% accuracy
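The rule-based fallback combines the three signals named above. This is an illustrative sketch with assumed weights; the shipped `complexity_scorer.py` uses the same signal families but its exact formula is not documented here.

```python
import math
from collections import Counter

# Illustrative rule-based fallback scorer. Weights (40/40/20) and the
# keyword list are assumptions, not the shipped implementation.

def rule_based_complexity(prompt: str) -> int:
    """Score 0-100 from Shannon entropy, token length, and task keywords."""
    tokens = prompt.lower().split()
    counts = Counter(tokens)
    n = len(tokens) or 1
    # Shannon entropy over the token distribution (bits per token)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    length_signal = min(n / 50, 1.0)          # longer prompts score higher
    task_words = {"analyze", "prove", "implement", "compare", "derive"}
    task_signal = 1.0 if task_words & counts.keys() else 0.0
    score = 40 * min(entropy / 6, 1.0) + 40 * length_signal + 20 * task_signal
    return round(min(score, 100))
```

Greetings and short factual questions land in the low band, while long analytical prompts cross into the medium/large bands.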
02
Prompt Optimizer greeninfer/model_registry.py
Silently rewrites user prompts to remove filler words, passive constructions, and redundant phrasing before sending to the inference model. Uses a fine-tuned T5-small model (sirenice/greenpromptsoptimizer). The original and optimized prompts are both logged for transparency.
Average token reduction: ~35% · Model: T5-small fine-tuned on 1000+ prompt pairs
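The kind of rewriting the optimizer performs can be approximated with a simple filler-stripping pass. The shipped optimizer is the fine-tuned T5-small (sirenice/greenpromptsoptimizer); this regex sketch only illustrates the category of edits, and the filler list is an assumption.

```python
import re

# Minimal filler-stripping sketch, NOT the fine-tuned T5 optimizer.
# The phrase list is illustrative.

FILLERS = re.compile(
    r"\b(please|kindly|basically|actually|just|really|very|"
    r"i would like you to|could you|can you)\b\s*",
    re.IGNORECASE)

def strip_fillers(prompt: str) -> tuple[str, int]:
    """Return (optimized prompt, whitespace-tokens saved)."""
    optimized = FILLERS.sub("", prompt).strip()
    saved = len(prompt.split()) - len(optimized.split())
    return optimized, saved
```

As in the real pipeline, both the original and optimized strings remain available, so the rewrite can be logged for transparency.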
03
Carbon Router greeninfer/carbon_router.py
Checks the current grid carbon intensity before routing. Uses hourly ERCOT estimates by default, with support for ElectricityMaps and WattTime APIs. When grid carbon intensity is high (elevated gCO₂/kWh), large-model queries are deferred or downgraded.
Supports: ERCOT (hourly lookup) · ElectricityMaps API · WattTime API
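The downgrade decision reduces to a threshold check. The cutoff value below is an assumption for illustration; the real `carbon_router.py` reads hourly ERCOT estimates or the ElectricityMaps/WattTime APIs to obtain the intensity figure.

```python
# Grid-aware routing sketch. The 450 gCO2/kWh cutoff is illustrative.

HIGH_CARBON_G_PER_KWH = 450.0   # assumed threshold for a "dirty" grid hour

def carbon_adjust(tier: str, grid_intensity_g_per_kwh: float) -> str:
    """Downgrade large-model requests while grid carbon intensity is high."""
    if grid_intensity_g_per_kwh >= HIGH_CARBON_G_PER_KWH and tier == "large":
        return "medium"   # defer heavy inference to a cleaner hour
    return tier
```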
04
Cascade Engine greeninfer/cascade.py
Implements the FrugalGPT-inspired cascade: start with the smallest model tier, evaluate confidence, and only escalate to the next tier if confidence is below the threshold. Low confidence signals include hedging phrases ("I'm not sure", "I don't have enough information") and responses that are too short relative to prompt complexity.
Path: small → medium → large · Configurable confidence threshold · Escalations logged
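The cascade loop and its confidence heuristics can be sketched directly from the description above. Function names, the hedge list, and the length heuristic are illustrative; only the escalation structure (small → medium → large, stop when confident) is taken from the source.

```python
# FrugalGPT-style cascade sketch. Heuristics follow the text above:
# hedging phrases or a too-short answer trigger escalation.

HEDGES = ("i'm not sure", "i don't have enough information")

def low_confidence(response: str, prompt_complexity: int) -> bool:
    text = response.lower()
    too_short = len(response.split()) < prompt_complexity // 2
    return any(h in text for h in HEDGES) or too_short

def cascade(call_model, prompt: str, complexity: int,
            tiers=("small", "medium", "large")):
    """Try tiers in order; escalate while confidence stays low."""
    path = []
    for tier in tiers:
        path.append(tier)
        response = call_model(tier, prompt)
        if tier == tiers[-1] or not low_confidence(response, complexity):
            return response, path
```

`call_model` is the injected inference callable, so the cascade stays model-agnostic and every escalation is visible in the returned path.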
05
Model Registry greeninfer/model_registry.py
Model-agnostic pool that manages the three Groq model tiers plus vision model support. Handles API rate limits, error recovery, and model-specific parameter tuning. New models can be registered without changing orchestrator code.
Small: llama-3.2-1b · Medium: llama-3.1-8b · Large: llama-3.3-70b · Vision: llama-3.2-11b-vision
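A minimal sketch of the registration pattern: tiers map to Groq model IDs plus per-model settings, so adding a model never touches orchestrator code. The `max_tokens` values are assumed parameters for illustration; the model IDs are the ones listed in the Model Pool section.

```python
# Registry sketch. Per-model parameters here (max_tokens) are assumptions.

REGISTRY: dict[str, dict] = {}

def register_model(tier: str, model_id: str, **params) -> None:
    """Register or replace a model for a tier without orchestrator changes."""
    REGISTRY[tier] = {"model_id": model_id, **params}

register_model("small",  "llama-3.2-1b-preview",         max_tokens=512)
register_model("medium", "llama-3.1-8b-instant",         max_tokens=1024)
register_model("large",  "llama-3.3-70b-versatile",      max_tokens=4096)
register_model("vision", "llama-3.2-11b-vision-preview", max_tokens=1024)
```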
06
Energy Estimator greeninfer/energy_estimator.py
Calculates energy consumption per inference using proxy estimation (token count × per-token energy coefficients per model tier) with optional CodeCarbon integration for real GPU measurement. All estimates are per-request and accumulated across the session.
Small: 0.04 µWh/token · Medium: 0.18 µWh/token · Large: 0.95 µWh/token
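The proxy estimate is a direct multiplication using the per-token coefficients above, converted from µWh to mWh. (With CodeCarbon enabled, measured GPU energy replaces this proxy.)

```python
# Proxy energy estimate: token count x per-token coefficient (uWh/token),
# using the published tier coefficients, reported in mWh.

UWH_PER_TOKEN = {"small": 0.04, "medium": 0.18, "large": 0.95}

def estimate_energy_mwh(tier: str, total_tokens: int) -> float:
    """Per-request energy in mWh for a given tier and token count."""
    return total_tokens * UWH_PER_TOKEN[tier] / 1000.0  # uWh -> mWh
```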
07
Orchestrator greeninfer/orchestrator.py
Main coordinator that chains all six layers above. Returns a rich result object containing the response, model tier used, cascade path, complexity score, tokens saved, energy used in mWh, CO₂ in grams, and whether the optimizer was active.
Returns: InferenceResult dataclass with 12 fields including full provenance
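The result object can be reconstructed as a dataclass from the example JSON shown under POST /chat. Field names and types below follow that JSON; the shipped dataclass is described as having 12 fields, so the exact count and ordering here are approximate.

```python
from dataclasses import dataclass

# InferenceResult sketch reconstructed from the /chat example JSON;
# the shipped dataclass may differ in field count and order.

@dataclass
class InferenceResult:
    response: str
    model_tier: str
    model_name: str
    complexity_score_100: int
    complexity_label: str
    original_tokens: int
    tokens_saved: int
    reduction_pct: int
    energy_mwh: float
    co2_grams: float
    energy_saved_pct: int
    cascade_path: list[str]
    escalations: int
    optimizer_used: bool
```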
REST API
Deployed at https://sirenice-greeninfer-backend.hf.space
GET /health
Returns system status, model availability, and Groq connection state.

POST /analyze
Scores a prompt's complexity, runs the optimizer, and returns a routing recommendation — without running inference. Used by Smart Preview.

POST /chat
Full pipeline: analyze + route + infer. Accepts prompt, mode, context history, and optional image attachments. Returns InferenceResult JSON.
{
  "response": "Photosynthesis is the process...",
  "model_tier": "small",
  "model_name": "llama-3.2-1b-preview",
  "complexity_score_100": 18,
  "complexity_label": "Low",
  "original_tokens": 12,
  "tokens_saved": 4,
  "reduction_pct": 33,
  "energy_mwh": 0.87,
  "co2_grams": 0.000172,
  "energy_saved_pct": 98,
  "cascade_path": ["small"],
  "escalations": 0,
  "optimizer_used": true
}
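Calling the deployed endpoint needs only the standard library. The request-body field names (`prompt`, `mode`) are assumed to mirror the SDK parameters; check the backend if they differ.

```python
import json
import urllib.request

# Minimal stdlib client for POST /chat. Body field names are assumptions
# mirroring the SDK's chat() parameters.

CHAT_URL = "https://sirenice-greeninfer-backend.hf.space/chat"

def build_chat_request(prompt: str, mode: str = "balanced") -> urllib.request.Request:
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps({"prompt": prompt, "mode": mode}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def green_chat(prompt: str, mode: str = "balanced") -> dict:
    """Send one prompt through the full pipeline and return the result JSON."""
    with urllib.request.urlopen(build_chat_request(prompt, mode)) as resp:
        return json.load(resp)
```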
Python SDK
Install locally and run inference with full energy transparency
git clone https://github.com/srineshtor21-coder/GreenInfer
cd GreenInfer
pip install -r requirements.txt
export GROQ_API_KEY="your_key_here"
from greeninfer import GreenInfer

# Initialize (auto-loads all layers)
gi = GreenInfer()

# Basic chat
result = gi.chat("Explain quantum entanglement simply")

print(result.response)        # The answer
print(result.model_tier)      # "small" | "medium" | "large"
print(result.energy_mwh)      # e.g. 0.87
print(result.co2_grams)       # e.g. 0.000172
print(result.tokens_saved)    # e.g. 4
print(result.cascade_path)    # e.g. ["small"]

# Eco mode (maximum energy savings)
result = gi.chat("Write a sorting algorithm", mode="eco")

# Performance mode (accuracy priority)
result = gi.chat("Analyze this legal document...", mode="performance")
Model Pool
Three text tiers plus a vision model via the Groq API, all open-source Llama models
GROQ_MODELS = {
    "small":  "llama-3.2-1b-preview",    # 0.9 mWh/query — 55% of traffic
    "medium": "llama-3.1-8b-instant",    # 3.8 mWh/query — 30% of traffic
    "large":  "llama-3.3-70b-versatile", # 48.0 mWh/query — 15% of traffic
    "vision": "llama-3.2-11b-vision-preview"  # image inputs
}

# Routing thresholds (balanced mode)
def route(complexity: int) -> str:
    if complexity < 30:
        return "small"
    if complexity < 65:
        return "medium"
    return "large"
Framework Comparison
How GreenInfer differs from existing tools
Compared against LangChain, FrugalGPT, and a default single-model API, GreenInfer combines all of the following in one layer:

- Energy-aware routing
- Carbon grid integration
- Smart Preview (confirm before generation)
- Token-level energy metrics (at most partial support in the alternatives)
- Prompt optimizer
- Cascade engine (shared with FrugalGPT, which inspired it)
- Per-response Green Score
- Open source
Benchmark Results
From experiments/benchmark.py — 20 prompts, mixed complexity

These numbers come from running the full benchmark suite included in the repository. The 20-prompt test set covers all four complexity tiers: simple factual, explanation, analysis, and code generation.

Energy saved (small tier): 97%
Avg energy reduction: 73%
Routing accuracy: 55%
CO₂ avoided: 183 g
Classifier accuracy: 98.9%
Token reduction: 35%

Note: 55% routing accuracy reflects the challenge of exact-tier matching. The more relevant metric for sustainability is energy savings: even misrouted queries (e.g. routing medium-complexity to small) still save substantial energy while usually producing acceptable answers.
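As a worked check of the headline savings, routing a single simple query to the small tier instead of the large tier, using the per-query figures from the Model Pool section, gives the best-case saving (the benchmark's 97% figure is presumably the average over actual small-tier prompts):

```python
# Best-case single-query saving: small tier vs large tier,
# per-query energy figures from the Model Pool section.

small_mwh, large_mwh = 0.9, 48.0
savings_pct = (1 - small_mwh / large_mwh) * 100
print(f"{savings_pct:.1f}% energy saved")  # prints "98.1% energy saved"
```

This matches the `energy_saved_pct` of 98 in the /chat example response.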