Large language models (LLMs) can be served efficiently on heterogeneous GPU clusters by combining adaptive model quantization with phase-aware partitioning.
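As a rough illustration of the two ideas named above, the following minimal sketch (all class names, GPU specs, and thresholds are hypothetical, not the paper's actual system) assigns the compute-bound prefill phase to high-throughput GPUs and the bandwidth-bound decode phase to the rest, then picks a per-device quantization bit-width so each weight shard fits in device memory.

```python
# Hypothetical sketch: phase-aware partitioning + adaptive per-GPU quantization.
from dataclasses import dataclass


@dataclass
class GPU:
    name: str
    mem_gb: float        # device memory (GB)
    tflops: float        # peak compute, proxy for prefill speed
    bandwidth_gbs: float  # memory bandwidth, proxy for decode speed


def pick_bits(shard_bytes_fp16: float, mem_gb: float) -> int:
    """Adaptive quantization: choose the widest bit-width whose weights fit."""
    for bits in (16, 8, 4):
        if shard_bytes_fp16 * bits / 16 <= mem_gb * 1e9:
            return bits
    raise ValueError("shard does not fit even at 4-bit")


def partition(gpus: list[GPU]) -> dict[str, list[GPU]]:
    """Phase-aware partitioning: prefill is compute-bound, decode is
    bandwidth-bound, so route each phase to the GPUs that favor it."""
    by_compute = sorted(gpus, key=lambda g: g.tflops, reverse=True)
    prefill = by_compute[: max(1, len(gpus) // 2)]
    decode = [g for g in gpus if g not in prefill] or prefill
    decode.sort(key=lambda g: g.bandwidth_gbs, reverse=True)
    return {"prefill": prefill, "decode": decode}


if __name__ == "__main__":
    cluster = [  # hypothetical heterogeneous cluster
        GPU("A100", 80, 312, 2039),
        GPU("V100", 32, 125, 900),
        GPU("T4", 16, 65, 320),
    ]
    shard_bytes_fp16 = 20e9  # hypothetical per-GPU weight shard at FP16
    for phase, devs in partition(cluster).items():
        for g in devs:
            print(phase, g.name, f"{pick_bits(shard_bytes_fp16, g.mem_gb)}-bit")
```

The sketch only conveys the intuition: partitioning decisions follow the phase's bottleneck resource, and quantization precision adapts to each device's memory budget rather than being fixed cluster-wide.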