Zhipu GLM-5 Scaling Pain: KV cache, speculative decoding and agent serving
This page summarizes Zhipu's unusually candid technical post about GLM-5 serving failures under high-load coding-agent traffic. The problem was not simple model quality. It involved inference-state management: KV-cache races in PD-disaggregated serving, read-before-ready timing in HiCache and monitoring signals from speculative decoding. This page reads it as a reminder that scaling intelligence also means scaling the serving system.