ceph-mgr memory leak
2023-06-09, on worker27 (hsrn-ed10a-7e12), ceph-mgr reached rss=360 GB and crashed
Before that, the OOM killer killed promtail and node-exporter, since Kubernetes sets a high oom_score_adj on them. It's a master node, so there were no other workloads.
There is not much in the ceph-mgr logs: just pgmap listings with the occasional tcmalloc "large alloc" warning (9 in total), then a ~15 min gap until it crashed:
Jun 09 04:53:30 worker27 bash[3129]: debug 2023-06-09T04:53:30.567+0000 7f076453d700 0 log_channel(cluster) log [DBG] : pgmap v11855: 9825 pgs: 1 active+clean+remapped, 9824 active+clean; 23 TiB data, 67 TiB used, 1.9 PiB / 1.9 PiB avail; 78 MiB/s rd, 1.3 MiB/s wr, 1.57k op/s; 30903/48556425 objects misplaced (0.064%)
Jun 09 04:53:32 worker27 bash[3129]: debug 2023-06-09T04:53:32.631+0000 7f076453d700 0 log_channel(cluster) log [DBG] : pgmap v11856: 9825 pgs: 1 active+clean+remapped, 9824 active+clean; 23 TiB data, 67 TiB used, 1.9 PiB / 1.9 PiB avail; 107 MiB/s rd, 2.2 MiB/s wr, 2.25k op/s; 30903/48556429 objects misplaced (0.064%)
Jun 09 04:53:32 worker27 bash[3129]: debug 2023-06-09T04:53:32.899+0000 7f0748285700 0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
Jun 09 04:53:32 worker27 bash[3129]: debug 2023-06-09T04:53:32.931+0000 7f0741237700 0 [rbd_support INFO root] TrashPurgeScheduleHandler: load_schedules
Jun 09 04:53:32 worker27 bash[3129]: debug 2023-06-09T04:53:32.975+0000 7f0741237700 0 [rbd_support INFO root] load_schedules: kubernetes05-meta, start_after=
Jun 09 04:53:32 worker27 bash[3129]: debug 2023-06-09T04:53:32.975+0000 7f0748285700 0 [rbd_support INFO root] load_schedules: kubernetes05-meta, start_after=
Jun 09 04:53:32 worker27 bash[3129]: debug 2023-06-09T04:53:32.979+0000 7f0741237700 0 [rbd_support INFO root] load_schedules: kubernetes06, start_after=
Jun 09 04:53:32 worker27 bash[3129]: debug 2023-06-09T04:53:32.979+0000 7f0748285700 0 [rbd_support INFO root] load_schedules: kubernetes06, start_after=
Jun 09 04:53:32 worker27 bash[3129]: debug 2023-06-09T04:53:32.979+0000 7f0748285700 0 [rbd_support INFO root] load_schedules: kubernetes08, start_after=
Jun 09 04:53:32 worker27 bash[3129]: debug 2023-06-09T04:53:32.979+0000 7f0741237700 0 [rbd_support INFO root] load_schedules: kubernetes08, start_after=
Jun 09 04:53:34 worker27 bash[3129]: debug 2023-06-09T04:53:34.647+0000 7f076453d700 0 log_channel(cluster) log [DBG] : pgmap v11857: 9825 pgs: 1 active+clean+remapped, 9824 active+clean; 23 TiB data, 67 TiB used, 1.9 PiB / 1.9 PiB avail; 95 MiB/s rd, 1.8 MiB/s wr, 1.98k op/s; 30903/48556429 objects misplaced (0.064%)
Jun 09 04:53:36 worker27 bash[3129]: debug 2023-06-09T04:53:36.707+0000 7f076453d700 0 log_channel(cluster) log [DBG] : pgmap v11858: 9825 pgs: 1 active+clean+scrubbing+deep, 1 active+clean+remapped, 9823 active+clean; 23 TiB data, 67 TiB used, 1.9 PiB / 1.9 PiB avail; 105 MiB/s rd, 2.2 MiB/s wr, 1.98k op/s; 30903/48556451 objects misplaced (0.064%)
...
Jun 09 04:54:53 worker27 bash[3129]: debug 2023-06-09T04:54:53.732+0000 7f076453d700 0 log_channel(cluster) log [DBG] : pgmap v11897: 9825 pgs: 1 active+clean+remapped, 3 active+clean+scrubbing+deep, 9821 active+clean; 23 TiB data, 67 TiB used, 1.9 PiB / 1.9 PiB avail; 30903/48556196 objects misplaced (0.064%)
Jun 09 04:54:53 worker27 bash[3129]: tcmalloc: large alloc 1233903616 bytes == 0x560f0a1b6000 @ 0x7f0898f04760 0x7f0898f25a62 0x7f08993d45c8 0x7f0899404365 0x560ead17311b 0x560ead173440 0x560ead0811ad 0x7f0899481de7 0x7f0899482cd8 0x7f089945f998 0x7f0899482087 0x7f0899482cd8 0x7f089945f998 0x7f0899482087 0x7f0899482cd8 0x7f08993df994 0x7f0899480e5f 0x7f08993e8a2b 0x7f0899484b9f 0x7f08993e0306 0x7f089945fb80 0x7f0899482087 0x7f0899482cd8 0x7f08993e0ea2 0x7f08993e1c7e 0x7f08993f3f00 0x7f08993e8a2b 0x7f0899484b9f 0x7f089945f998 0x7f0899482087 0x7f0899482cd8
Jun 09 04:54:55 worker27 bash[3129]: debug 2023-06-09T04:54:55.748+0000 7f076453d700 0 log_channel(cluster) log [DBG] : pgmap v11898: 9825 pgs: 1 active+clean+remapped, 3 active+clean+scrubbing+deep, 9821 active+clean; 23 TiB data, 67 TiB used, 1.9 PiB / 1.9 PiB avail; 30903/48556196 objects misplaced (0.064%)
...
Jun 09 05:02:26 worker27 bash[3129]: debug 2023-06-09T05:02:26.176+0000 7f076453d700 0 log_channel(cluster) log [DBG] : pgmap v12120: 9825 pgs: 1 active+clean+remapped, 3 active+clean+scrubbing+deep, 9821 active+clean; 23 TiB data, 67 TiB used, 1.9 PiB / 1.9 PiB avail; 30903/48556196 objects misplaced (0.064%)
Jun 09 05:02:28 worker27 bash[3129]: debug 2023-06-09T05:02:28.188+0000 7f076453d700 0 log_channel(cluster) log [DBG] : pgmap v12121: 9825 pgs: 1 active+clean+remapped, 3 active+clean+scrubbing+deep, 9821 active+clean; 23 TiB data, 67 TiB used, 1.9 PiB / 1.9 PiB avail; 30903/48556196 objects misplaced (0.064%)
Jun 09 05:16:21 worker27 systemd[1]: ceph-3e2cd52c-ac4e-11ec-9a64-7934486a0684@mgr.worker27.ocdgnb.service: Main process exited, code=exited, status=137/n/a
Jun 09 05:16:26 worker27 systemd[1]: ceph-3e2cd52c-ac4e-11ec-9a64-7934486a0684@mgr.worker27.ocdgnb.service: Failed with result 'exit-code'
2023-07-24, on worker30 (hsrn-ed2l-rcdc), ceph-mgr reached rss=349 GB and crashed
2023-07-23 09:49:51.922 Out of memory: Killed process 3843 (ceph-mgr) total-vm:366969560kB, anon-rss:349527612kB, file-rss:0kB, shmem-rss:0kB, UID:167 pgtables:706988kB oom_score_adj:0
2024-07-12, on hsrn-ed2a-wwh, ceph-mgr reached rss=371 GB and triggered OOM killer
<4>1 2024-07-12T15:16:26.935034+00:00 hsrn-ed2a-wwh kernel - - - [950671.417544] calico-node invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-997
<4>1 2024-07-12T15:16:38.730445+00:00 hsrn-ed2a-wwh kernel - - - [950671.417554] CPU: 24 PID: 21827 Comm: calico-node Tainted: G S 5.15.0-113-generic #123-Ubuntu
<4>1 2024-07-12T15:16:55.749736+00:00 hsrn-ed2a-wwh kernel - - - [950671.417676] Mem-Info:
<4>1 2024-07-12T15:16:55.749738+00:00 hsrn-ed2a-wwh kernel - - - [950671.417685] active_anon:44115 inactive_anon:98011048 isolated_anon:0
<4>1 2024-07-12T15:16:55.749739+00:00 hsrn-ed2a-wwh kernel - - - [950671.417685] active_file:0 inactive_file:859 isolated_file:113
<4>1 2024-07-12T15:16:55.749739+00:00 hsrn-ed2a-wwh kernel - - - [950671.417685] unevictable:8299 dirty:0 writeback:2
<4>1 2024-07-12T15:16:55.749740+00:00 hsrn-ed2a-wwh kernel - - - [950671.417685] slab_reclaimable:90440 slab_unreclaimable:114498
<4>1 2024-07-12T15:16:55.749740+00:00 hsrn-ed2a-wwh kernel - - - [950671.417685] mapped:2329 shmem:2517 pagetables:211026 bounce:0
<4>1 2024-07-12T15:16:55.749740+00:00 hsrn-ed2a-wwh kernel - - - [950671.417685] kernel_misc_reclaimable:0
<4>1 2024-07-12T15:16:55.749743+00:00 hsrn-ed2a-wwh kernel - - - [950671.417685] free:216883 free_pcp:7097 free_cma:0
<4>1 2024-07-12T15:16:55.749743+00:00 hsrn-ed2a-wwh kernel - - - [950671.417691] Node 0 active_anon:70312kB inactive_anon:195703172kB active_file:0kB inactive_file:1720kB unevictable:33196kB isolated(anon):0kB isolated(file):316kB mapped:8936kB dirty:0kB writeback:4kB shmem:7032kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2048kB writeback_tmp:0kB kernel_stack:25160kB pagetables:411328kB all_unreclaimable? no
<4>1 2024-07-12T15:16:55.749744+00:00 hsrn-ed2a-wwh kernel - - - [950671.417697] Node 1 active_anon:106148kB inactive_anon:196341020kB active_file:0kB inactive_file:1716kB unevictable:0kB isolated(anon):0kB isolated(file):136kB mapped:380kB dirty:0kB writeback:4kB shmem:3036kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:25416kB pagetables:432776kB all_unreclaimable? no
<6>1 2024-07-12T15:16:55.749771+00:00 hsrn-ed2a-wwh kernel - - - [950671.417818] Tasks state (memory values in pages):
<6>1 2024-07-12T15:16:55.749771+00:00 hsrn-ed2a-wwh kernel - - - [950671.417819] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
<6>1 2024-07-12T15:16:55.749783+00:00 hsrn-ed2a-wwh kernel - - - [950671.417873] [ 1839] 0 1839 4356910 12328 1683456 0 -999 containerd
<6>1 2024-07-12T15:16:55.749835+00:00 hsrn-ed2a-wwh kernel - - - [950671.417881] [ 1881] 0 1881 44914 10507 172032 0 0 haproxy
<6>1 2024-07-12T15:16:55.749839+00:00 hsrn-ed2a-wwh kernel - - - [950671.417896] [ 2340] 0 2340 1270894 11966 851968 0 -500 dockerd
<6>1 2024-07-12T15:16:55.749840+00:00 hsrn-ed2a-wwh kernel - - - [950671.417899] [ 2349] 0 2349 1129256 18289 851968 0 -999 kubelet
<6>1 2024-07-12T15:16:55.749849+00:00 hsrn-ed2a-wwh kernel - - - [950671.417937] [ 2789] 0 2789 339597 18158 380928 0 -997 kube-scheduler
<6>1 2024-07-12T15:16:55.749850+00:00 hsrn-ed2a-wwh kernel - - - [950671.417946] [ 2866] 0 2866 2874985 44267 1818624 0 -997 etcd
<6>1 2024-07-12T15:16:55.749853+00:00 hsrn-ed2a-wwh kernel - - - [950671.417957] [ 4286] 1004 4286 128043 33925 962560 0 0 splunkd
<6>1 2024-07-12T15:16:55.749854+00:00 hsrn-ed2a-wwh kernel - - - [950671.417964] [ 4456] 0 4456 256295 10142 282624 0 0 ir_agent
<6>1 2024-07-12T15:16:55.749854+00:00 hsrn-ed2a-wwh kernel - - - [950671.417967] [ 7898] 0 7898 1204314 752972 7340032 0 -997 kube-apiserver
<6>1 2024-07-12T15:16:55.749864+00:00 hsrn-ed2a-wwh kernel - - - [950671.418020] [ 19894] 0 19894 1274073 15852 2117632 0 1000 promtail
<6>1 2024-07-12T15:16:55.749886+00:00 hsrn-ed2a-wwh kernel - - - [950671.418094] [ 207454] 1000 207454 323083 13144 360448 0 999 metrics-server
<6>1 2024-07-12T15:16:55.749891+00:00 hsrn-ed2a-wwh kernel - - - [950671.418126] [1522587] 167 1522587 477924 305548 3567616 0 0 ceph-mon
<6>1 2024-07-12T15:16:55.749940+00:00 hsrn-ed2a-wwh kernel - - - [950671.418175] [1990126] 167 1990126 96429568 90757192 761073664 0 0 ceph-mgr
<6>1 2024-07-12T15:16:55.749947+00:00 hsrn-ed2a-wwh kernel - - - [950671.418205] [2169163] 167 2169163 671598 443872 4657152 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749952+00:00 hsrn-ed2a-wwh kernel - - - [950671.418221] [2171096] 167 2171096 664049 416500 4583424 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749956+00:00 hsrn-ed2a-wwh kernel - - - [950671.418240] [2172929] 167 2172929 663747 395465 4608000 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749960+00:00 hsrn-ed2a-wwh kernel - - - [950671.418262] [2174977] 167 2174977 599421 364945 4079616 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749964+00:00 hsrn-ed2a-wwh kernel - - - [950671.418278] [2176830] 167 2176830 674789 433291 4702208 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749968+00:00 hsrn-ed2a-wwh kernel - - - [950671.418298] [2178770] 167 2178770 681482 453778 4730880 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749972+00:00 hsrn-ed2a-wwh kernel - - - [950671.418321] [2180673] 167 2180673 673394 443724 4685824 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749974+00:00 hsrn-ed2a-wwh kernel - - - [950671.418341] [2182417] 167 2182417 627174 357618 4284416 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749978+00:00 hsrn-ed2a-wwh kernel - - - [950671.418357] [2184258] 167 2184258 784773 526671 5545984 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749982+00:00 hsrn-ed2a-wwh kernel - - - [950671.418373] [2192683] 167 2192683 588733 385544 4009984 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749989+00:00 hsrn-ed2a-wwh kernel - - - [950671.418393] [2196780] 167 2196780 662596 408533 4575232 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749993+00:00 hsrn-ed2a-wwh kernel - - - [950671.418412] [2202502] 167 2202502 685048 436814 4767744 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.749998+00:00 hsrn-ed2a-wwh kernel - - - [950671.418432] [2205214] 167 2205214 701922 475026 4890624 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.750007+00:00 hsrn-ed2a-wwh kernel - - - [950671.418456] [2206881] 167 2206881 583139 322890 3932160 0 0 ceph-osd
<6>1 2024-07-12T15:16:55.750015+00:00 hsrn-ed2a-wwh kernel - - - [950671.418494] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=cri-containerd-8e4d5b652430ebca35a908d2518df8dbb8c1f6fcb0ec6aa5bd6445f53406777e.scope,mems_allowed=0-1,global_oom,task_memcg=/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod710b3740_0ece_4122_b309_2b6692db3a4f.slice/cri-containerd-60c283e689582813fdd04668007ee7701588379d2c75b31110c37fc6c46d9909.scope,task=promtail,pid=19894,uid=0
<3>1 2024-07-12T15:16:55.750016+00:00 hsrn-ed2a-wwh kernel - - - [950671.418576] Out of memory: Killed process 19894 (promtail) total-vm:5096292kB, anon-rss:63408kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:2068kB oom_score_adj:1000
- Can we figure out the bug ourselves, e.g. by capturing tcmalloc heap stats while the leak grows (see the sketch after this list), or do we just wait for Ceph upstream to fix it?
- Can we limit ceph-mgr's memory use so it crashes sooner and impacts the rest of the system less? (see the systemd sketch below)
- Can we raise the Ceph components' oom_score_adj so that they get killed first, rather than the OOM killer going through all the Kubernetes containers? (also in the systemd sketch below)
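For the first question, one low-effort way to narrow it down might be to poll the mgr's tcmalloc heap statistics (and optionally its heap profiler) while the RSS is climbing, and attach the output to an upstream tracker issue. This is only a sketch: the daemon name (mgr.worker27.ocdgnb, taken from the systemd unit name above) and the one-hour profiling window are assumptions, and the profiler adds overhead while it runs.

# Dump tcmalloc heap statistics from the active mgr (adjust the daemon name per host)
ceph tell mgr.worker27.ocdgnb heap stats

# Optionally run the tcmalloc heap profiler for a while, then dump and stop it
ceph tell mgr.worker27.ocdgnb heap start_profiler
sleep 3600   # assumed one-hour window while the leak is growing
ceph tell mgr.worker27.ocdgnb heap dump
ceph tell mgr.worker27.ocdgnb heap stop_profiler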
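For the last two questions, a minimal sketch would be a systemd drop-in on the mgr unit that caps its memory and raises its OOM score. This assumes the cephadm-managed unit's cgroup (and oom_score_adj inheritance) actually covers the containerized mgr process, which depends on the container runtime; the 32G cap and the 500 adjustment are placeholder values, not tested numbers.

# Create a drop-in for the mgr unit (unit name taken from the logs above; the fsid differs per cluster)
systemctl edit ceph-3e2cd52c-ac4e-11ec-9a64-7934486a0684@mgr.worker27.ocdgnb.service

# Drop-in contents:
[Service]
MemoryMax=32G
OOMScoreAdjust=500

If a limit set on the unit does not reach the container's cgroup, an alternative (on cephadm versions that support it) would be extra_container_args in the mgr service spec, passing a memory limit to the container runtime instead.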