ceph-mgr memory leak

2023-06-09, on worker27, ceph-mgr reached rss=362 GB and crashed

2023-07-24, on worker30, ceph-mgr reached rss=349 GB and crashed (killed by the kernel OOM killer):

2023-07-23 09:49:51.922 Out of memory: Killed process 3843 (ceph-mgr) total-vm:366969560kB, anon-rss:349527612kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:706988kB oom_score_adj:0
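To compare incidents, the sizes in the kernel line above can be extracted and converted. Below is a minimal Python sketch; the function names (parse_oom_line, kib_to_gib, kib_to_gb) are made up for this note, and the regex only assumes the "Killed process PID (name) field:NkB" layout seen in the line above. Note that the kernel's "kB" unit is 1024 bytes, so the rounded "349 GB" figure corresponds to reading the raw kB value as decimal.

```python
import re

# The kernel log line quoted above, copied verbatim.
OOM_LINE = (
    "2023-07-23 09:49:51.922 Out of memory: Killed process 3843 (ceph-mgr) "
    "total-vm:366969560kB, anon-rss:349527612kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:167 pgtables:706988kB oom_score_adj:0"
)


def parse_oom_line(line):
    """Extract pid, command name and every '<name>:<n>kB' field from an OOM-killer line."""
    proc = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if proc is None:
        raise ValueError("not an OOM-killer 'Killed process' line")
    # Only fields that end in 'kB' are collected (UID and oom_score_adj are skipped).
    sizes_kb = {name: int(value) for name, value in re.findall(r"([\w-]+):(\d+)kB", line)}
    return {"pid": int(proc.group(1)), "comm": proc.group(2), "sizes_kb": sizes_kb}


def kib_to_gib(kib):
    """Kernel 'kB' values are units of 1024 bytes; convert to GiB (2**30 bytes)."""
    return kib / (1024 * 1024)


def kib_to_gb(kib):
    """Convert the same value to decimal GB (10**9 bytes)."""
    return kib * 1024 / 1e9


if __name__ == "__main__":
    info = parse_oom_line(OOM_LINE)
    rss = info["sizes_kb"]["anon-rss"]
    print(info["comm"], info["pid"], f"{kib_to_gib(rss):.1f} GiB", f"{kib_to_gb(rss):.1f} GB")
```

Running the same parser over the kernel logs of each affected worker would give comparable numbers for every incident.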

  • Can we figure out the bug, or do we just wait for Ceph upstream to fix it?
  • Can we limit ceph-mgr memory use so that it crashes sooner and impacts the system less?
  • Can we raise the Ceph components' oom_score (via oom_score_adj) so that they are killed first, rather than the system killing all Kubernetes containers first?
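For the last two questions, here is a rough Python sketch of the two kernel knobs involved: the per-process /proc/<pid>/oom_score_adj file and the cgroup v2 memory.max file. Both are standard Linux interfaces, but everything else here is an assumption: the helper names are invented for this note, the cgroup directory that contains ceph-mgr depends on how Ceph is deployed on the workers, and the 64 GiB limit in the usage comment is only an example value. Writing these files for another user's process typically needs root.

```python
from pathlib import Path

# Range the Linux kernel accepts for oom_score_adj; higher means "kill me first".
OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX = -1000, 1000


def read_oom_score_adj(pid, proc_root="/proc"):
    """Return the current oom_score_adj of a process."""
    return int((Path(proc_root) / str(pid) / "oom_score_adj").read_text().strip())


def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Set oom_score_adj so the OOM killer prefers (or avoids) this process."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj must be in [-1000, 1000], got {value}")
    (Path(proc_root) / str(pid) / "oom_score_adj").write_text(f"{value}\n")


def set_cgroup_memory_max(cgroup_dir, limit_bytes):
    """Cap a cgroup v2 group's memory; exceeding the cap triggers an OOM kill inside that group."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    (Path(cgroup_dir) / "memory.max").write_text(f"{limit_bytes}\n")


# Hypothetical usage (the pid comes from the log line above; the cgroup path is a placeholder):
#   set_oom_score_adj(3843, 900)
#   set_cgroup_memory_max("/sys/fs/cgroup/<ceph-mgr group>", 64 * 1024**3)
```

Values written this way do not survive a daemon restart. If ceph-mgr runs as a systemd unit, the OOMScoreAdjust= and MemoryMax= directives are the persistent equivalents; if it runs as a Kubernetes pod, the kubelet sets oom_score_adj from the pod's QoS class and the memory limit comes from the pod spec.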

Grafana memory usage graph