Inference BOTEC
I spent about 15 minutes doing some quick back-of-the-envelope calculations to understand the economics (“tokenomics”) of inference for large transformer models. GPT-4.5 summarized it below, and I have edited the result. As a result, I’m pretty sure the labs are not merely serving the models at cost: inference prices likely sit well above cost.
Model and FLOP Analysis
Transformer inference requires roughly $2N$ FLOPs per token for $N$ active parameters (one multiply and one add per weight).
For a model with ~300B active parameters (about GPT-4’s size, if I had to guess from the reported 1.8T-parameter MoE, though newer models likely have even fewer given their faster throughput and lower prices), that comes to $2 \times 3\times10^{11} = 6\times10^{11}$ FLOPs per token, or $6\times10^{17}$ FLOPs per million tokens.
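As a quick sanity check, here is that arithmetic in Python (the 300B active-parameter count is my guess above, not a published figure):

```python
ACTIVE_PARAMS = 300e9  # guessed active parameter count, not a published number

# ~2 FLOPs per active parameter per token (one multiply + one add per weight)
flops_per_token = 2 * ACTIVE_PARAMS       # ~6e11 FLOPs per token
flops_per_mtok = flops_per_token * 1e6    # ~6e17 FLOPs per million tokens

print(f"{flops_per_token:.1e} FLOPs/token, {flops_per_mtok:.1e} FLOPs/Mtok")
```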
GPU Costs and Performance
Using relatively conservative assumptions:
NVIDIA H100 GPU rental: $5/hr
GPU utilization: 50%
H100 performance at FP32: ~67 TFLOPS; at FP16 (tensor cores, with sparsity): ~1979 TFLOPS
Calculating cost per million tokens:
FP32: at 50% utilization, effective throughput is $67 \times 0.5 = 33.5$ TFLOPS.

Thus, for 1 million tokens:

$$\frac{6\times10^{17}\ \text{FLOPs}}{3.35\times10^{13}\ \text{FLOP/s}} \approx 1.8\times10^{4}\ \text{s} \approx 5\ \text{hr}, \quad 5\ \text{hr} \times \$5/\text{hr} \approx \$25 \text{ per million tokens}$$

or, for FP16:

$$\frac{6\times10^{17}\ \text{FLOPs}}{0.5 \times 1.979\times10^{15}\ \text{FLOP/s}} \approx 606\ \text{s} \approx 0.17\ \text{hr}, \quad \approx \$0.84 \text{ per million tokens.}$$
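Putting the whole BOTEC in one place, here is a minimal sketch under the same assumptions (rental rate, utilization, and parameter count are the guesses stated above, not measurements):

```python
# Back-of-the-envelope inference cost per million output tokens.
# All inputs are the post's assumptions, not measured numbers.
GPU_COST_PER_HR = 5.0      # $/hr H100 rental (assumed)
UTILIZATION = 0.5          # fraction of peak FLOPS actually achieved (assumed)
ACTIVE_PARAMS = 300e9      # guessed active parameter count

# ~2 FLOPs per active parameter per token, times 1M tokens
FLOPS_PER_MTOK = 2 * ACTIVE_PARAMS * 1e6   # ~6e17 FLOPs per million tokens

PEAK_TFLOPS = {"FP32": 67.0, "FP16": 1979.0}   # H100 spec-sheet numbers

for precision, tflops in PEAK_TFLOPS.items():
    effective_flops = tflops * 1e12 * UTILIZATION  # FLOP/s actually delivered
    seconds = FLOPS_PER_MTOK / effective_flops     # time to emit 1M tokens
    cost = seconds / 3600 * GPU_COST_PER_HR
    print(f"{precision}: {seconds:,.0f} s -> ${cost:.2f} per million tokens")
```

This prints roughly $24.88 per million tokens at FP32 and $0.84 at FP16. Note that 1979 TFLOPS is NVIDIA’s with-sparsity spec; using the dense FP16 figure (~990 TFLOPS) roughly doubles the FP16 cost to ~$1.68 per million tokens, still below typical per-million-token prices.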
Takeaway
Even with somewhat conservative estimates (rental rates, low-ish utilization, no fancy optimizations), providers charging a few dollars per million output tokens could very well be making a profit—not merely covering costs.
And then on top of this, factor in that the labs spend millions hiring top-notch engineers to optimize inference and utilization, have surely found at least some of DeepSeek’s inference optimizations, and also have access to H20s, etc.