[08:15:12] (LVSHighCPU) firing: The host lvs1018:9100 has at least its CPU 24 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[08:20:12] (LVSHighCPU) resolved: The host lvs1018:9100 has at least its CPU 24 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[08:56:12] (LVSHighCPU) firing: The host lvs1018:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:01:12] (LVSHighCPU) resolved: (3) The host lvs1018:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:10:12] (LVSHighCPU) firing: The host lvs1020:9100 has at least its CPU 26 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:15:12] (LVSHighCPU) resolved: The host lvs1020:9100 has at least its CPU 26 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:18:12] (LVSHighCPU) firing: The host lvs1020:9100 has at least its CPU 24 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:23:12] (LVSHighCPU) resolved: The host lvs1020:9100 has at least its CPU 24 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:32:12] (LVSHighCPU) firing: The host lvs1020:9100 has at least its CPU 26 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:37:12] (LVSHighCPU) firing: (3) The host lvs1018:9100 has at least its CPU 38 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:42:12] (LVSHighCPU) resolved: (3) The host lvs1018:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:52:42] (LVSHighCPU) firing: (5) The host lvs1018:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:53:36] vgutierrez: anything to worry about? ^^^
[09:55:02] pybal process seems to be at 100% of one core
[09:55:17] Yep.. noticed it
[09:57:42] (LVSHighCPU) resolved: (4) The host lvs1018:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:57:57] (LVSHighCPU) firing: (4) The host lvs1018:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:58:42] (LVSHighCPU) firing: (3) The host lvs1018:9100 has at least its CPU 22 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[10:00:09] gonna restart pybal on lvs1020 first
[10:02:57] (LVSHighCPU) resolved: (4) The host lvs1018:9100 has at least its CPU 22 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[10:06:32] _joe_: probably unrelated but I2193efe3ac50b16f1fb611f2d1b979d03f7a7449 is causing an alert on pybal
[10:06:42] WARNING - Pool schema_443 is too small to allow depooling.
[10:07:30] <_joe_> vgutierrez: that alert is kinda bogus
[10:07:48] bogus but it won't depool any server if it fails
[10:08:14] <_joe_> vgutierrez: what I am saying is that's false
[10:08:21] <_joe_> anyways
[10:08:28] <_joe_> we can roll that back
[10:15:04] manual testing on lvs1020 shows that's unrelated as expected
[10:18:31] something is messing with those two (lvs1018/lvs1020) though
[10:18:34] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020&viewPanel=31&var-datasource=thanos&var-cluster=lvs
[10:39:12] (LVSHighCPU) firing: The host lvs1020:9100 has at least its CPU 17 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[10:44:12] (LVSHighCPU) resolved: The host lvs1020:9100 has at least its CPU 17 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[11:29:12] (LVSHighCPU) firing: The host lvs1018:9100 has at least its CPU 33 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[11:32:01] that's definitely puzzling me
[11:32:11] can't pinpoint any L7 traffic on upload triggering that
[11:34:12] (LVSHighCPU) resolved: (2) The host lvs1018:9100 has at least its CPU 33 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:00:12] (LVSHighCPU) firing: The host lvs1020:9100 has at least its CPU 17 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:05:12] (LVSHighCPU) resolved: The host lvs1020:9100 has at least its CPU 17 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:16:42] (LVSHighCPU) firing: (4) The host lvs1020:9100 has at least its CPU 17 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:21:42] (LVSHighCPU) resolved: (5) The host lvs1018:9100 has at least its CPU 39 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:38:42] (LVSHighCPU) firing: (2) The host lvs1020:9100 has at least its CPU 11 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:41:57] (LVSHighCPU) firing: (3) The host lvs1018:9100 has at least its CPU 17 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:43:42] (LVSHighCPU) resolved: (3) The host lvs1018:9100 has at least its CPU 17 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:51:57] (LVSHighCPU) firing: (4) The host lvs1018:9100 has at least its CPU 17 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:53:42] (LVSHighCPU) resolved: (3) The host lvs1018:9100 has at least its CPU 17 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[13:13:12] (LVSHighCPU) firing: The host lvs1020:9100 has at least its CPU 5 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[13:18:12] (LVSHighCPU) resolved: (2) The host lvs1018:9100 has at least its CPU 29 saturated - https://bit.ly/wmf-lvscpu - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[13:38:12] (LVSHighCPU) firing: The host lvs1020:9100 has at least its CPU 19 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[13:43:12] (LVSHighCPU) resolved: The host lvs1020:9100 has at least its CPU 19 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[13:44:12] (LVSHighCPU) firing: The host lvs1020:9100 has at least its CPU 9 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[13:49:12] (LVSHighCPU) resolved: (2) The host lvs1020:9100 has at least its CPU 19 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[14:05:12] (LVSHighCPU) firing: The host lvs1020:9100 has at least its CPU 9 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[14:10:12] (LVSHighCPU) resolved: The host lvs1020:9100 has at least its CPU 9 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1020 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[14:16:12] volans: do you have a handy way of dumping a stacktrace for a running python2 process?
[14:18:14] vgutierrez: mmmh depends, are you allowed to restart it for a repro or do you need to attach to an existing one forcibly?
[14:18:30] volans: I can restart it (pybal) on lvs1020
[14:18:38] because pdb doesn't attach to existing ones, you can use gdb but of course you'll get only the C side of things
[14:19:01] perf top suggests that pybal is eating CPU on evals()
[14:19:42] C eval() not python's one right?
[14:19:57] PyEval_EvalFrameEx
[14:20:37] which pybal's branch is running right now?
[14:20:45] 1.15.10
[14:24:10] vgutierrez: I've never used it but we could also try https://pypi.org/project/pdb-attach/, let me have a look if I can spot anything by chance
[14:24:20] in the existing code
[14:26:39] vgutierrez: I see eval's in the monitoring code, is it possible it's a red herring and just plain normal that pybal spends a lot of time in evals?
[14:26:58] volans: not happening on lvs1017 or lvs1019
[14:27:12] lvs1017 CPU usage is around 1%.. 100% in lvs1018
[14:28:14] sure, but what's the percentage of eval in the one working fine?
[14:29:35] less than 2%
[14:29:46] and between 13 and 18% on the affected hosts
[14:31:20] what do affected hosts have in common?
[14:31:32] upload services
[14:31:32] high/low traffic, etc..
[14:31:39] high-traffic2
[14:31:45] both?
[14:31:57] yes
[14:32:04] lvs1018 is high-traffic2
[14:32:04] eqiad only
[14:32:07] and lvs1020 is the secondary
[14:32:09] just in eqiad indeed
[14:32:31] secondary as in standby or they handle both traffic via different MED?
[14:32:38] standby
[14:32:38] standby
[14:32:47] but of course it's healthchecking and monitoring the backend services
[14:32:53] nothing else has changed too
[14:32:58] sure sure, so it seems not traffic-related
[14:33:04] so a buggy backend server could be triggering this
[14:33:12] (buggy in a really flexible way)
[14:33:29] that or monitoring config itself or the bgp config
[14:34:50] maybe unrelated
[14:34:56] but 7 minutes before lvs1018 started struggling
[14:34:57] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 14 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[14:35:30] dbproxy1018 is a backend server for those two lvs
[14:35:47] I don't see pybal 100% cpu on lvs1018/lvs1020 right now
[14:36:48] seems back to normal values from the graph
[14:36:51] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs1018&viewPanel=3&var-datasource=thanos&var-cluster=lvs
[14:37:12] * volans would prefer that graph to not have a fixed Y-axis 0-100% to zoom in vertically :D
[14:37:58] sigh
[14:38:48] that's true and it doesn't match any event apparently
[14:49:38] volans: https://grafana.wikimedia.org/goto/vSRg8PQVk?orgId=1
[14:50:14] nice catch!
[14:50:41] hmmm marostegui isn't here
[14:50:44] worth checking with data-persistence
[14:50:54] and the ongoing wikireplicas issue
[14:51:51] might be related to T337446
[14:51:52] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446
[23:30:34] HTTPS, Traffic, Beta-Cluster-Infrastructure: upload.wikimedia.beta.wmflabs.org certificate expired (May 2023) - https://phabricator.wikimedia.org/T337642 (AlexisJazz)
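Note on the 14:16 question about dumping a stack trace for a running python2 process: since restarting pybal on lvs1020 was an option, one possible approach is to install a signal handler at startup that prints the Python-level stack of every thread on demand. The block below is only a minimal sketch under that assumption. It is not pybal code, and the names `format_all_stacks`, `dump_stacks` and `install_stack_dump_handler`, as well as the choice of SIGUSR1, are hypothetical. It relies only on the standard library (`signal`, `sys._current_frames`, `traceback`) and should run on both Python 2.7 and Python 3 on Unix.

```python
import signal
import sys
import threading
import traceback


def format_all_stacks():
    """Return a string with the current Python stack of every thread."""
    # Map thread identifiers to names so the dump is easier to read.
    names = dict((t.ident, t.name) for t in threading.enumerate())
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):\n" % (ident, names.get(ident, "unknown")))
        chunks.append("".join(traceback.format_stack(frame)))
    return "".join(chunks)


def dump_stacks(signum, frame):
    """Signal handler: write all thread stacks to stderr."""
    sys.stderr.write(format_all_stacks())
    sys.stderr.flush()


def install_stack_dump_handler(signum=signal.SIGUSR1):
    """Register dump_stacks for signum (SIGUSR1 is an assumption, Unix only)."""
    signal.signal(signum, dump_stacks)
```

With the handler installed early in the process (it has to be registered from the main thread), sending the chosen signal to the process, for example with `kill -USR1` and its PID, would write the stacks to stderr without stopping it. Unlike gdb, this shows the Python frames rather than only the C side, but it cannot be attached to a process that is already running.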