[00:15:25] FIRING: SystemdUnitFailed: podman-auto-update.service on moss-be2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:15:40] FIRING: SystemdUnitFailed: podman-auto-update.service on moss-be2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:14:02] Going to switch s7 master
[08:15:40] FIRING: SystemdUnitFailed: podman-auto-update.service on moss-be2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:21:03] Ugh, it turns out podman-auto-update races with container lifecycle :(
[09:21:11] https://phabricator.wikimedia.org/P66707
[09:31:58] I think we probably just want to disable it.
[11:07:29] Could I get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1054864 to mask the offending unit, please? cf. T370255
[11:07:30] T370255: podman-auto-update failures - https://phabricator.wikimedia.org/T370255
[12:56:41] volans: so cumin2002 looks good now?
[12:57:30] I didn't touch it, I was waiting for arnaudb to reply on why some hosts didn't get the latest package (there are a few)
[12:58:30] I think I've replied?
[12:58:49] https://phabricator.wikimedia.org/T370029
[12:58:50] afaict it's linked to the last reimage, I don't think this package is in a pinned version/latest
[12:59:15] forgot to reply, indeed
[12:59:18] hold on
[13:01:19] arnaudb: can we just upgrade the package there? (And wherever else it should be)
[13:01:33] yep I think so, let's try on a sample set
[13:04:30] ok
[13:22:42] arnaudb: so you've tested it on cumin2002?
[13:22:45] like is it all good?
[13:22:58] no, I finished something first
[13:23:08] ok!
[13:31:30] checking db-switchover's code volans I don't see any safe path that I could try, maybe we could add a --test to just try to do basic `select @@read_only;` on the nodes?
[13:31:44] much simpler
[13:32:28] >>> from wmfmariadbpy import WMFMariaDB
[13:32:28] ?
[13:32:33] ah :D
[13:32:40] >>> c.execute("select @@hostname")
[13:32:46] from a sudo python3 shell
[13:33:22] {'query': 'select @@hostname', 'host': 'db2121.codfw.wmnet', 'port': 3306, 'database': None, 'success': True, 'numrows': 1, 'rows': (('db2121',),), 'fields': ('@@hostname',)}
[13:33:56] great
[13:45:40] so we are good?
[14:02:13] we should be yes, at least for cumin2002, then I guess we should update the package on all the remaining hosts
[14:02:57] oh checked debmonitor now, I think arn.aud already did it
[14:03:35] just the 2 cloudcumin hosts are missing https://debmonitor.wikimedia.org/packages/python3-wmfmariadbpy-remote
[14:04:05] all the other packages are already up to date
[14:08:44] yeah, let's upgrade it everywhere
[18:23:33] PROBLEM - MariaDB sustained replica lag on s1 on db1219 is CRITICAL: 91.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1219&var-port=9104
[18:24:17] PROBLEM - MariaDB sustained replica lag on s1 on db1186 is CRITICAL: 51 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[18:24:33] PROBLEM - MariaDB sustained replica lag on s1 on db1235 is CRITICAL: 58.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1235&var-port=9104
[18:24:33] PROBLEM - MariaDB sustained replica lag on s1 on db1218 is CRITICAL: 22.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1218&var-port=9104
[18:24:33] PROBLEM - MariaDB sustained replica lag on s1 on db1196 is CRITICAL: 43.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1196&var-port=9104
[18:25:05] PROBLEM - MariaDB sustained replica lag on s1 on db1207 is CRITICAL: 66.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
[18:25:15] PROBLEM - MariaDB sustained replica lag on s1 on db1234 is CRITICAL: 24.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1234&var-port=9104
[18:25:19] PROBLEM - MariaDB sustained replica lag on s1 on db1195 is CRITICAL: 16.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1195&var-port=9104
[18:25:19] PROBLEM - MariaDB sustained replica lag on s1 on db1163 is CRITICAL: 16.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1163&var-port=9104
[18:25:35] PROBLEM - MariaDB sustained replica lag on s1 on db1169 is CRITICAL: 46.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1169&var-port=9104
[18:32:19] RECOVERY - MariaDB sustained replica lag on s1 on db1195 is OK: (C)10 ge (W)5 ge 4.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1195&var-port=9104
[18:33:48] FIRING: [3x] MysqlReplicationLagPtHeartbeat: MySQL instance db1186:9104 has too large replication lag (2m 59s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[18:34:15] RECOVERY - MariaDB sustained replica lag on s1 on db1234 is OK: (C)10 ge (W)5 ge 1.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1234&var-port=9104
[18:35:19] RECOVERY - MariaDB sustained replica lag on s1 on db1163 is OK: (C)10 ge (W)5 ge 1.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1163&var-port=9104
[18:35:33] RECOVERY - MariaDB sustained replica lag on s1 on db1218 is OK: (C)10 ge (W)5 ge 1.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1218&var-port=9104
[18:36:33] RECOVERY - MariaDB sustained replica lag on s1 on db1235 is OK: (C)10 ge (W)5 ge 4.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1235&var-port=9104
[18:36:35] RECOVERY - MariaDB sustained replica lag on s1 on db1196 is OK: (C)10 ge (W)5 ge 4.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1196&var-port=9104
[18:36:35] RECOVERY - MariaDB sustained replica lag on s1 on db1169 is OK: (C)10 ge (W)5 ge 3.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1169&var-port=9104
[18:38:05] RECOVERY - MariaDB sustained replica lag on s1 on db1207 is OK: (C)10 ge (W)5 ge 4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1207&var-port=9104
[18:38:48] RESOLVED: [3x] MysqlReplicationLagPtHeartbeat: MySQL instance db1186:9104 has too large replication lag (2m 59s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[18:40:17] RECOVERY - MariaDB sustained replica lag on s1 on db1186 is OK: (C)10 ge (W)5 ge 4.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1186&var-port=9104
[18:42:35] RECOVERY - MariaDB sustained replica lag on s1 on db1219 is OK: (C)10 ge (W)5 ge 1.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1219&var-port=9104