Metis: Learning to Schedule Long-Running Applications in Shared Container Clusters at Scale

SC20 Proceedings

Authors: Luping Wang, Qizhen Weng, and Wei Wang (Hong Kong University of Science and Technology); Chen Chen (Hong Kong University of Science and Technology, Huawei Technologies Ltd); and Bo Li (Hong Kong University of Science and Technology)

Abstract: Online cloud services are deployed as long-running applications (LRAs) in containers. Scheduling LRA containers is known to be difficult because they often have complex resource interference and I/O dependencies. Existing schedulers rely on placement constraints and thus fall short in performance.

In this work, we present Metis, a general-purpose scheduler that uses deep reinforcement learning (RL) techniques. Metis eliminates the need to specify placement constraints manually and offers concrete, quantitative scheduling criteria. Because directly training an RL model does not scale, we develop novel hierarchical learning techniques that decompose a complex container placement problem into a hierarchy of subproblems with significantly reduced state and action spaces. We have implemented Metis in Docker Swarm. A deployment on EC2 with real applications shows that, compared with state-of-the-art schedulers, Metis improves request throughput by up to 61%, optimizes various scheduling objectives, and scales easily to a large cluster where 3k containers run on over 700 machines.
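
To make the idea of a reduced action space concrete, the following is a minimal sketch and not the Metis implementation. The abstract does not say how the hierarchy is built, so the recursive split into equal-sized machine groups, the branching factor, and the names `split`, `place`, and `choose` are all assumptions made for illustration. The `choose` callback stands in for a learned policy: at each level it picks one subgroup out of at most `branching` options, rather than picking one machine out of the whole cluster in a single step.

```python
import math
from typing import Callable, List, Tuple


def split(machines: List[int], branching: int) -> List[List[int]]:
    """Partition a machine group into at most `branching` contiguous subgroups."""
    size = math.ceil(len(machines) / branching)
    return [machines[i:i + size] for i in range(0, len(machines), size)]


def place(
    machines: List[int],
    branching: int,
    choose: Callable[[List[List[int]]], int],
) -> Tuple[int, int, int]:
    """Descend the hierarchy until a single machine remains.

    Returns (machine, decisions, largest_action_space), where
    largest_action_space is the most options offered at any one level.
    `choose` is a hypothetical stand-in for a learned policy.
    """
    group = machines
    decisions = 0
    largest_action_space = 0
    while len(group) > 1:
        subgroups = split(group, branching)
        largest_action_space = max(largest_action_space, len(subgroups))
        group = subgroups[choose(subgroups)]
        decisions += 1
    return group[0], decisions, largest_action_space


# Illustrative cluster of 700 machines with an assumed branching factor of 4.
cluster = list(range(700))
machine, decisions, widest = place(cluster, 4, lambda subgroups: 0)
```

Under these assumptions, a flat scheduler would face 700 possible actions for each container, while the hierarchical descent never offers more than 4 options per decision, at the cost of making several decisions per placement.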