A multi-stage pipeline that combines prompt optimization, complexity analysis, and carbon-aware routing to minimize the energy cost of LLM inference without sacrificing response quality.
Each prompt passes through all layers sequentially before a model is invoked.
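The sequential layer traversal can be sketched as a simple stage chain. This is an illustrative sketch only: the stage names follow the layers listed above, but their bodies (`optimize_prompt`, `analyze_complexity`, `route_by_carbon`) and the request dictionary shape are hypothetical placeholders, not the project's actual interfaces.

```python
from typing import Callable, List

# A stage takes a request dict and returns an updated request dict.
Stage = Callable[[dict], dict]

def run_pipeline(request: dict, stages: List[Stage]) -> dict:
    """Pass the request through every stage, in order, before model invocation."""
    for stage in stages:
        request = stage(request)
    return request

# Placeholder stages (logic and thresholds assumed for illustration).
def optimize_prompt(req: dict) -> dict:
    # Pretend prompt optimization trims ~20% of tokens.
    req["tokens"] = int(req["tokens"] * 0.8)
    return req

def analyze_complexity(req: dict) -> dict:
    # Pick a model tier from a crude token-count proxy.
    req["tier"] = "small" if req["tokens"] < 200 else "large"
    return req

def route_by_carbon(req: dict) -> dict:
    # Attach current grid carbon intensity (stubbed constant here).
    req["grid_intensity"] = 350  # gCO2/kWh
    return req

result = run_pipeline(
    {"tokens": 250},
    [optimize_prompt, analyze_complexity, route_by_carbon],
)
```

Keeping stages as plain callables makes it easy to reorder them or drop one in for A/B testing without touching the dispatch loop.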
Energy is estimated using a combination of token-based proxy metrics and runtime GPU utilization measurements. CO₂ is calculated using real-time grid carbon intensity.
```python
# Energy estimation module

ENERGY_COEFFICIENTS = {
    'small': 4e-5,    # Wh per token
    'medium': 1.8e-4,
    'large': 9.5e-4,
}

def calc_savings(tokens, model_tier):
    """Energy saved (Wh) versus running the same tokens on the large tier."""
    large_wh = tokens * ENERGY_COEFFICIENTS['large']
    actual_wh = tokens * ENERGY_COEFFICIENTS[model_tier]
    return large_wh - actual_wh

def estimate_energy(tokens, model_tier, grid_intensity):
    """
    Estimate energy and CO2 for an inference call.

    tokens: optimized token count
    model_tier: 'small' | 'medium' | 'large'
    grid_intensity: gCO2/kWh (from ERCOT API)
    """
    energy_wh = tokens * ENERGY_COEFFICIENTS[model_tier]
    energy_kwh = energy_wh / 1000
    co2_grams = energy_kwh * grid_intensity
    return {
        'energy_wh': energy_wh,
        'co2_grams': co2_grams,
        'savings_vs_large': calc_savings(tokens, model_tier),
    }
```
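The arithmetic behind the estimate can be checked by hand. The sketch below reuses the `'small'` coefficient from the table above; the token count and the 400 gCO2/kWh grid intensity are illustrative values, not measurements.

```python
# Worked example: 120 tokens on the 'small' tier at 400 gCO2/kWh.
tokens = 120
wh_per_token = 4e-5       # 'small' tier coefficient from the table above
grid_intensity = 400      # gCO2/kWh, illustrative value

energy_wh = tokens * wh_per_token        # 120 * 4e-5 = 0.0048 Wh
energy_kwh = energy_wh / 1000            # 4.8e-6 kWh
co2_grams = energy_kwh * grid_intensity  # ~0.00192 g CO2
```

At these magnitudes, the grid-intensity factor dominates: the same call at a low-carbon hour (say 100 gCO2/kWh) emits a quarter as much CO₂ for identical energy use, which is what makes carbon-aware routing worthwhile.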