[00:58:36] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@s6.service on db2097:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:58:36] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter@s6.service on db2097:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:53:47] I am going to switch pc1 [08:18:36] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s6.service on db2097:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:23:36] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s6.service on db2097:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:33:10] sigh [08:53:36] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s4.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:55:08] fixed db1246 [08:55:39] <3 [08:58:36] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s4.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:58:47] woot [09:13:25] marostegui: I don't know if it's you but db1154 seems to be 
done [09:13:27] *down [09:14:25] yes [09:14:28] i downtimed it [09:14:41] I just upgraded its kernel for the bullseye reboots [09:15:24] it is back now [09:30:59] awesome. Thanks! [10:03:54] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s6.service on db2097:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:37] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@s6.service on db2097:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:37:02] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on ms-be2077:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:41:49] (PuppetZeroResources) firing: Puppet has failed generate resources on backup2007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:42:49] (PuppetZeroResources) firing: Puppet has failed generate resources on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:46:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on ms-be2077:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:46:53] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on backup2006:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:47:48] (PuppetZeroResources) firing: Puppet has failed generate resources on pc2011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:48:15] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:51:57] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on ms-be2058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:56:48] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on ms-be2057:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:02:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:03:36] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:06:49] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on ms-be2057:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:07:06] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on backup2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:09:55] ^-- these are puppetserver issues, see #wikimedia-sre [11:11:53] (PuppetZeroResources) firing: (7) Puppet has failed generate resources on ms-be2057:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:12:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on thanos-be2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:15:05] db2097 crashed while it was backing up x1 [11:16:49] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on backup2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:16:58] (PuppetZeroResources) firing: (10) Puppet has failed generate resources on ms-be1068:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:18:20] PROBLEM - MariaDB sustained replica lag on s4 on db1243 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1243&var-port=9104 [11:20:20] RECOVERY - MariaDB sustained replica lag on s4 on db1243 is OK: (C)2 ge (W)1 ge 0 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1243&var-port=9104 [11:52:48] (PuppetZeroResources) firing: Puppet has failed generate resources on pc2015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:53:18] (PuppetZeroResources) resolved: Puppet has failed generate resources on backup2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [11:54:18] (PuppetZeroResources) resolved: Puppet has failed generate resources on thanos-fe1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:01:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on ms-be2058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:02:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on pc2015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:04:06] PROBLEM - MariaDB sustained replica lag on s4 on db1238 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104 [12:09:08] RECOVERY - MariaDB sustained replica lag on s4 on db1238 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag 
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1238&var-port=9104 [13:23:36] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db1246:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:46] I am going to switch es4 eqiad now [14:28:37] (SystemdUnitFailed) firing: (5) wmf_auto_restart_prometheus-mysqld-exporter@s6.service on db1213:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:38:37] (SystemdUnitFailed) firing: (6) wmf_auto_restart_prometheus-mysqld-exporter@s6.service on db1213:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:37] (SystemdUnitFailed) firing: (6) wmf_auto_restart_prometheus-mysqld-exporter@s6.service on db1213:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:48:37] (SystemdUnitFailed) firing: (6) wmf_auto_restart_prometheus-mysqld-exporter@s6.service on db1213:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:29] marostegui: I was wrong, the current backup sources are: es1022 / es1025 & es2022 / es2025, so I think no action necessary [14:51:36] good :) [14:51:58] for some reason I thought it was es2020, maybe it was that in the past or something [15:47:08] could I get a +1 on a notification toggle please? 
:) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1004691 all is green [15:47:35] thanks marostegui [15:53:37] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db2097:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:56:52] 😮‍💨 [16:03:38] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-mysqld-exporter@s4.service on db2138:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:10:50] these alerts are getting way too noisy :-/ [16:11:15] Yeah, I predicted that [16:11:41] I mean, one doesn't have to be too clever to do it [16:11:56] Those are a bit more noisy today cause I am working with those hosts [16:12:07] I have downtimed them via icinga but oh well [16:12:10] yeah, but it is not about your work [16:12:36] it is that we generally don't care about failing jobs, and when we do we used to have other incidents [16:12:45] *other alerts, sorry [16:13:29] I would for now update the config to send them to the -feed channel as a temporary measure, until there is a better solution, that's my proposal [16:15:17] my point being that there is a flaw in the obs stack IMHO, but we are the ones that are not given enough options to handle them [16:15:42] yeah, these alerts will get ignored eventually [16:16:20] when something has become annoying we could make them warnings or something [16:16:45] but as these are common, we cannot just do that yet, or other things Emperor asked on ticket [16:18:43] we need to work together better among teams, between obs, infra and D.P., I think the split is making things too complicated [16:44:17] (SystemdUnitFailed) firing: (2) wmf_auto_restart_prometheus-mysqld-exporter@s4.service on db1244:9100
- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:23:37] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter@s7.service on db1170:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:39:29] PROBLEM - MariaDB sustained replica lag on s7 on db2169 is CRITICAL: 8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2169&var-port=13317 [18:40:23] PROBLEM - MariaDB sustained replica lag on s7 on db2122 is CRITICAL: 17.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2122&var-port=9104 [18:40:29] RECOVERY - MariaDB sustained replica lag on s7 on db2169 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2169&var-port=13317 [18:40:39] PROBLEM - MariaDB sustained replica lag on s7 on db1227 is CRITICAL: 6.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1227&var-port=9104 [18:41:39] RECOVERY - MariaDB sustained replica lag on s7 on db1227 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1227&var-port=9104 [18:42:25] RECOVERY - MariaDB sustained replica lag on s7 on db2122 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag 
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2122&var-port=9104 [20:28:37] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter@s2.service on db1170:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:28:41] PROBLEM - MariaDB sustained replica lag on s4 on db1247 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1247&var-port=9104 [20:29:41] RECOVERY - MariaDB sustained replica lag on s4 on db1247 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1247&var-port=9104