SC20 Virtual Platform
A Parallel Job Scheduling Method to Effectively Use Shared Heterogeneous Systems for Urgent Computations
Presenter
Event Type
Workshop
Tags
Applications
Scientific Computing
Simulation
State of the Practice
Technology Challenge
Registration Categories
W
Time
Friday, 13 November 2020, 5:50pm - 5:55pm EST
Location
Track 10
Description
Lightning talk: Dedicated resources are widely used in HPC systems for urgent decision making because they give urgent computations immediate, exclusive access. Setting up dedicated resources for each urgent computation, however, is not economically viable, because the computations may occur rarely and require huge amounts of resources. Moreover, a disaster can damage the dedicated resources themselves, making it impossible to run the urgent computations that could help mitigate the damage. Given these limitations, using existing shared infrastructures becomes an indispensable approach for HPC systems that support urgent decision making.

Recent studies have shown that providing job scheduling methods that can handle urgent jobs is an important challenge that must be addressed before shared infrastructures can be used for urgent computations. Because shared infrastructures also serve regular workloads, a scheduler must consider both urgent and regular jobs, since each kind can delay the other. Moreover, preempting regular jobs is necessary to guarantee immediate execution of urgent jobs. In heterogeneous systems with coprocessors, preemption becomes more challenging because it relies on system functionalities provided by the host processor. In this ongoing work, we introduce a parallel job scheduling method to effectively use shared heterogeneous systems for urgent computations. To achieve low preemption delays, we employ an in-memory process swapping mechanism to preempt jobs running on coprocessors. Our evaluation with workloads from real HPC systems shows that our method achieves negligible delays for urgent jobs without significantly increasing the response times and slowdowns of regular jobs.
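
The general idea of preempting regular jobs so that an urgent job can start immediately can be illustrated with a toy model. The sketch below is not the scheduling method presented in the talk: the `Job` class, the `start_urgent` and `finish` functions, and the smallest-job-first victim policy are all hypothetical, and suspension is modeled as simple bookkeeping rather than an actual in-memory swap of coprocessor processes.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    urgent: bool = False


def start_urgent(job, running, suspended, total_nodes):
    """Start an urgent job now, suspending regular jobs if nodes are short.

    Returns the list of jobs that were suspended. Suspension stands in for
    swapping a job's processes out; this toy model only moves the job
    between lists.
    """
    free = total_nodes - sum(j.nodes for j in running)
    victims = []
    # Hypothetical policy: suspend the smallest regular jobs first.
    regular = sorted((j for j in running if not j.urgent), key=lambda j: j.nodes)
    for cand in regular:
        if free >= job.nodes:
            break
        victims.append(cand)
        free += cand.nodes
    if free < job.nodes:
        raise RuntimeError("not enough preemptible nodes for urgent job")
    for v in victims:
        running.remove(v)
        suspended.append(v)
    running.append(job)
    return victims


def finish(job, running, suspended, total_nodes):
    """Remove a finished job and resume suspended jobs that now fit."""
    running.remove(job)
    free = total_nodes - sum(j.nodes for j in running)
    resumed = []
    for s in list(suspended):
        if s.nodes <= free:
            suspended.remove(s)
            running.append(s)
            free -= s.nodes
            resumed.append(s)
    return resumed
```

For example, on an 8-node system running regular jobs of 4 and 3 nodes, a 4-node urgent job causes only the 3-node job to be suspended, and that job resumes once the urgent job finishes. A real scheduler would also have to account for the cost of the swap itself, which this model ignores.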