Burst workloads to Nautilus cluster dynamically
Context
This project aims to alleviate the shortage of GPU resources in our HSRN cluster. The solution is to tap into the GPU resources on Nautilus, a HyperCluster for running containerized Big Data applications managed by the UCSD Supercomputer Lab. We have already built a working prototype that can burst ML training workloads from our testing cluster on Polaris (a testing Linux machine) to our namespace in Nautilus via Admiralty, a cluster federation tool. Currently, users must add an annotation to burst workloads manually; ideally, the cluster should do this automatically and dynamically, based on its available capacity.
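For context, the manual flow described above relies on Admiralty's pod election annotation. A minimal sketch of the kind of manifest a user currently has to write is below; the annotation key is Admiralty's `multicluster.admiralty.io/elect`, while the pod name, image, and resource numbers are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job            # placeholder name
  annotations:
    # Admiralty election annotation: marks the pod as a candidate for
    # scheduling across federated clusters (today this is the manual step)
    multicluster.admiralty.io/elect: ""
spec:
  containers:
    - name: trainer
      image: registry.example.com/ml-train:latest  # placeholder image
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
```

The goal of this project is to make the cluster apply this election automatically, so users no longer need to write the annotation themselves.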
Requirements
Acceptance Criteria
- Show that when there are not enough GPU resources on Polaris, the workload is federated to Nautilus
- Show that when there are not enough CPU/memory resources on Polaris, the workload is federated to Nautilus
- Show that when there are enough CPU/memory resources on Polaris, the workload is scheduled locally
- Show that when there are enough GPU resources on Polaris, the workload is scheduled locally (edit the node object and add GPU resource attributes)
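The dynamic version of the criteria above boils down to a scheduling decision: compare a workload's resource requests against what Polaris can still allocate, and burst only when the local cluster cannot fit it. A minimal sketch in Python, assuming resource quantities have already been normalized to plain numbers (a real implementation would read node allocatable capacity from the Kubernetes API; the function and variable names are illustrative):

```python
def should_burst(requested: dict, allocatable: dict) -> bool:
    """Return True if any requested resource exceeds what the local
    cluster (Polaris) can still allocate, i.e. the workload should be
    federated to Nautilus."""
    return any(
        requested.get(res, 0) > allocatable.get(res, 0)
        for res in requested
    )

# Polaris free capacity: no GPUs, 8 CPU cores, 32768 MiB memory.
free = {"nvidia.com/gpu": 0, "cpu": 8, "memory": 32768}

# A GPU job cannot fit locally -> burst to Nautilus.
gpu_job = {"nvidia.com/gpu": 1, "cpu": 2, "memory": 4096}

# A CPU-only job fits locally -> schedule on Polaris.
cpu_job = {"cpu": 4, "memory": 8192}
```

Each acceptance criterion then becomes one assertion over this decision function, with the node's capacity edited to simulate the GPU-present and GPU-absent cases.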
Deliverables
- A subdirectory in the k8s-fed repo containing documentation and a working prototype with a GitLab CI setup that demonstrates dynamic bursting. Please refer to this prototype to get an idea of how it should be done
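The GitLab CI part of the deliverable could be as small as a pipeline that applies the prototype manifests and checks where the pod actually landed. A hypothetical `.gitlab-ci.yml` sketch; the stage names, manifest path, pod name, and runner image are all assumptions, not the actual setup:

```yaml
stages:
  - deploy
  - verify

deploy-prototype:
  stage: deploy
  image: bitnami/kubectl:latest   # assumed image providing kubectl
  script:
    - kubectl apply -f manifests/   # hypothetical manifest directory
    - kubectl wait --for=condition=PodScheduled pod/train-job --timeout=120s

verify-placement:
  stage: verify
  image: bitnami/kubectl:latest
  script:
    # Print the node the pod was scheduled on; a virtual Admiralty node
    # would indicate the workload was federated to Nautilus.
    - kubectl get pod train-job -o jsonpath='{.spec.nodeName}'
```

The verify step would be extended with an explicit pass/fail check per acceptance criterion (local scheduling vs. federation).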
Resources
Edited by Yuheng Lu