Burst workloads to Nautilus cluster dynamically
Context
This project aims to alleviate the shortage of GPU resources in our HSRN cluster. The solution is to tap into the GPU resources on Nautilus, a HyperCluster for running containerized Big Data applications managed by the UCSD Supercomputer Lab. We have already built a working prototype that can burst ML training workloads from our testing cluster on Polaris (a testing Linux machine) to our namespace in Nautilus via Admiralty, a cluster federation tool. Currently, users must add an annotation to burst workloads manually; ideally, the cluster should do this automatically and dynamically, based on its available capacity.
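For context, the manual flow described above relies on Admiralty's pod election annotation. A minimal sketch of the kind of manifest a user currently has to write is below; the annotation key is Admiralty's `multicluster.admiralty.io/elect`, while the pod name, image, and resource numbers are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job            # placeholder name
  annotations:
    # Admiralty election annotation: marks the pod as a candidate for
    # scheduling across federated clusters (today this is the manual step)
    multicluster.admiralty.io/elect: ""
spec:
  containers:
    - name: trainer
      image: registry.example.com/ml-train:latest  # placeholder image
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
```

The goal of this project is to make the cluster apply this election automatically, so users no longer need to write the annotation themselves.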
Requirements
Acceptance Criteria
- Show that when there are not enough GPU resources on Polaris, the workload is federated to Nautilus
- Show that when there are not enough CPU/memory resources on Polaris, the workload is federated to Nautilus
- Show that when there are enough CPU/memory resources on Polaris, the workload is scheduled locally
- Show that when there are enough GPU resources on Polaris, the workload is scheduled locally (edit the node object and add GPU resource attributes)
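The dynamic version of the criteria above boils down to a scheduling decision: compare a workload's resource requests against what Polaris can still allocate, and burst only when the local cluster cannot fit it. A minimal sketch in Python, assuming resource quantities have already been normalized to plain numbers (a real implementation would read node allocatable capacity from the Kubernetes API; the function and variable names are illustrative):

```python
def should_burst(requested: dict, allocatable: dict) -> bool:
    """Return True if any requested resource exceeds what the local
    cluster (Polaris) can still allocate, i.e. the workload should be
    federated to Nautilus."""
    return any(
        requested.get(res, 0) > allocatable.get(res, 0)
        for res in requested
    )

# Polaris free capacity: no GPUs, 8 CPU cores, 32768 MiB memory.
free = {"nvidia.com/gpu": 0, "cpu": 8, "memory": 32768}

# A GPU job cannot fit locally -> burst to Nautilus.
gpu_job = {"nvidia.com/gpu": 1, "cpu": 2, "memory": 4096}

# A CPU-only job fits locally -> schedule on Polaris.
cpu_job = {"cpu": 4, "memory": 8192}
```

Each acceptance criterion then becomes one assertion over this decision function, with the node's capacity edited to simulate the GPU-present and GPU-absent cases.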
Deliverables
- A subdirectory in the k8s-fed repo containing documentation and a working prototype with a GitLab CI setup that demonstrates dynamic bursting. Please refer to this prototype to get an idea of how it should be done
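The GitLab CI part of the deliverable could be as small as a pipeline that applies the prototype manifests and checks where the pod actually landed. A hypothetical `.gitlab-ci.yml` sketch; the stage names, manifest path, pod name, and runner image are all assumptions, not the actual setup:

```yaml
stages:
  - deploy
  - verify

deploy-prototype:
  stage: deploy
  image: bitnami/kubectl:latest   # assumed image providing kubectl
  script:
    - kubectl apply -f manifests/   # hypothetical manifest directory
    - kubectl wait --for=condition=PodScheduled pod/train-job --timeout=120s

verify-placement:
  stage: verify
  image: bitnami/kubectl:latest
  script:
    # Print the node the pod was scheduled on; a virtual Admiralty node
    # would indicate the workload was federated to Nautilus.
    - kubectl get pod train-job -o jsonpath='{.spec.nodeName}'
```

The verify step would be extended with an explicit pass/fail check per acceptance criterion (local scheduling vs. federation).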
Resources
Edited by Yuheng Lu