SC20 Virtual Platform
Metis: Learning to Schedule Long-Running Applications in Shared Container Clusters at Scale
Event Type
Paper
Tags
Cloud and Distributed Computing
Containers
Machine Learning, Deep Learning and Artificial Intelligence
Resource Management and Scheduling
Registration Categories
TP
Time
Wednesday, 18 November 2020, 3:30pm - 4pm EDT
Location
Track 2
Description
Online cloud services are deployed as long-running applications (LRAs) in containers. Scheduling LRA containers is known to be difficult as they often have sophisticated resource interferences and I/O dependencies. Existing schedulers rely on placement constraints and thus fall short in performance.

In this work, we present Metis, a general-purpose scheduler using deep reinforcement learning (RL) techniques. Metis eliminates manual specification of placement constraints and offers concrete quantitative scheduling criteria. As directly training an RL model does not scale, we develop novel hierarchical learning techniques that decompose a complex container placement problem into a hierarchy of subproblems with significantly reduced state and action space. We have implemented Metis in Docker Swarm. An EC2 deployment with real applications shows that, compared with state-of-the-art schedulers, Metis improves request throughput by up to 61%, optimizes various scheduling objectives, and scales easily to a large cluster where 3k containers run on over 700 machines.
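
To illustrate why a hierarchy shrinks the action space, the following is a minimal sketch, not Metis's actual algorithm. It assumes a hypothetical two-level split in which a scheduler first picks a group of machines and then a machine inside that group. Simple greedy rules (most free slots) stand in for the learned RL policies, and the names `Machine`, `place`, and `action_space_sizes` are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    capacity: int  # number of container slots (a simplifying assumption)
    containers: list = field(default_factory=list)

    @property
    def free(self):
        return self.capacity - len(self.containers)


def split_into_groups(machines, group_size):
    """Partition the cluster into fixed-size groups of machines."""
    return [machines[i:i + group_size]
            for i in range(0, len(machines), group_size)]


def place(container, machines, group_size):
    """Two-level placement: choose a group, then a machine within it.

    Both choices use a greedy stand-in (most free slots) where a
    learned policy would sit in an RL-based scheduler.
    """
    groups = split_into_groups(machines, group_size)
    # Level 1: the decision ranges over groups, not individual machines.
    group = max(groups, key=lambda g: sum(m.free for m in g))
    if sum(m.free for m in group) == 0:
        raise RuntimeError("cluster is full")
    # Level 2: the decision ranges only over machines in the chosen group.
    machine = max(group, key=lambda m: m.free)
    machine.containers.append(container)
    return machine.name


def action_space_sizes(n_machines, group_size):
    """Choices per placement: flat scheduler vs. the two-level sketch."""
    n_groups = -(-n_machines // group_size)  # ceiling division
    return n_machines, n_groups + group_size
```

Under these assumptions, a flat scheduler over 700 machines faces 700 choices per container, while the two-level sketch with groups of 28 machines faces 25 group choices followed by 28 machine choices, which is what `action_space_sizes(700, 28)` computes.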