Metis: Learning to Schedule Long-Running Applications in Shared Container Clusters at Scale
Cloud and Distributed Computing
Machine Learning, Deep Learning and Artificial Intelligence
Resource Management and Scheduling
Time: Wednesday, 18 November 2020, 3:30pm - 4pm EDT
Description: Online cloud services are deployed as long-running applications (LRAs) in containers. Scheduling LRA containers is known to be difficult because they often have complex resource interference and I/O dependencies. Existing schedulers rely on placement constraints and thus fall short in performance.
In this work, we present Metis, a general-purpose scheduler that uses deep reinforcement learning (RL) techniques. Metis eliminates manual specification of placement constraints and offers concrete, quantitative scheduling criteria. Because directly training an RL model does not scale, we develop novel hierarchical learning techniques that decompose a complex container placement problem into a hierarchy of subproblems with significantly reduced state and action spaces. We have implemented Metis in Docker Swarm. An EC2 deployment with real applications shows that, compared with state-of-the-art schedulers, Metis improves request throughput by up to 61%, optimizes various scheduling objectives, and scales easily to a large cluster in which 3k containers run on over 700 machines.
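To illustrate why a hierarchy shrinks the action space, the following is a minimal sketch and not Metis's actual algorithm. It assumes, purely for illustration, that the machines are split recursively into equally sized groups and that one decision is made per level. The function name `hierarchical_action_sizes` and the `fanout` parameter are hypothetical and do not come from the abstract.

```python
import math


def hierarchical_action_sizes(num_machines: int, fanout: int) -> list[int]:
    """Return the number of choices faced at each level of the hierarchy.

    Assumption (illustrative only): at every level the candidate machines
    are split into at most `fanout` groups, one group is picked, and the
    process repeats inside that group until a single machine remains.
    """
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")

    sizes = []
    remaining = num_machines  # machines still under consideration
    while remaining > 1:
        choices = min(fanout, remaining)  # groups to choose among at this level
        sizes.append(choices)
        remaining = math.ceil(remaining / choices)  # size of the chosen group
    return sizes


if __name__ == "__main__":
    # A flat policy placing one container on a 700-machine cluster
    # chooses among 700 actions in a single step.
    flat_actions = 700
    levels = hierarchical_action_sizes(700, fanout=10)
    print("flat action space:", flat_actions)
    print("per-level action spaces:", levels)
    print("largest single decision:", max(levels))
```

Under these assumptions, a 700-machine placement becomes three small decisions instead of one 700-way decision. The grouping and per-level policies used by Metis itself are not specified in this abstract.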