[08:33:25] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:00] Amir1: I'm preparing the flip of db2179.codfw.wmnet and db2204.codfw.wmnet (the schema change is done)
[09:16:35] morning folks, could somebody take a look at ms-fe1009?
[09:54:31] federico3: noted, I'm around
[12:24:11] PROBLEM - MariaDB sustained replica lag on s4 on db2179 is CRITICAL: 276 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2179&var-port=9104
[12:29:11] RECOVERY - MariaDB sustained replica lag on s4 on db2179 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2179&var-port=9104
[12:33:40] FIRING: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:39:34] (it was me upgrading the host)
[13:03:25] I am trying to get logged in for the meeting, not sure why that's so slow today...
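(Editor's note: the lag check above encodes its thresholds in the RECOVERY line — "(C)10 ge (W)5", i.e. CRITICAL at 10 seconds of sustained replica lag, WARNING at 5. A minimal sketch of that classification logic; the function name and defaults are illustrative, not the actual check's code:)

```python
def classify_replica_lag(lag_seconds: float, warn: float = 5, crit: float = 10) -> str:
    """Map a sustained replica lag reading to an alert level.

    Thresholds mirror the check output above: CRITICAL when lag >= 10s,
    WARNING when lag >= 5s, OK otherwise (hypothetical helper).
    """
    if lag_seconds >= crit:
        return "CRITICAL"
    if lag_seconds >= warn:
        return "WARNING"
    return "OK"
```

(With the 276-second reading from the 12:24:11 PROBLEM line, this yields CRITICAL, matching the alert; by 12:29:11 lag was back to 0 and the check recovered.)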
[14:13:25] RESOLVED: SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:55] FIRING: [2x] SystemdUnitFailed: swift_rclone_sync.service on ms-be1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:41:22] Amir1: good to go for codfw DC switchover of db2204?
[14:47:19] federico3: let's wait until tomorrow
[14:47:25] I'm out for rest of today
[14:47:49] ok
[15:54:55] FIRING: [3x] SystemdUnitFailed: confd_prometheus_metrics.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:59:55] FIRING: [4x] SystemdUnitFailed: confd_prometheus_metrics.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:09:55] FIRING: [5x] SystemdUnitFailed: confd_prometheus_metrics.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:14:55] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on ms-be1071:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:19:55] FIRING: [6x] SystemdUnitFailed: confd_prometheus_metrics.service on ms-be1071:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:18:25] federico3: out of curiosity, is there a reason we have to reapply T399249 to s6?
[17:18:25] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249
[17:18:56] zabe: it was not applied on all hosts
[17:20:09] Alright
[22:35:11] federico3: you can use replicas = ['db1234'] as long as you set the section variable correctly (section = 's4'); it would work just fine
[22:35:23] I know what's going on on ms-be1071: https://grafana.wikimedia.org/d/000000378/ladsgroup-test?from=2025-07-21T09:21:08.893Z&orgId=1&to=2025-08-18T22:23:34.558Z&timezone=utc&viewPanel=panel-29
[22:35:52] https://usercontent.irccloud-cdn.com/file/CXGd4JXf/grafik.png
[22:35:53] I vacuum
[22:39:27] https://www.irccloud.com/pastebin/jE1nXeQZ/
[22:39:29] that is fun
[22:52:21] ugh, I moved a container db to the tanker and didn't fix the issue. I realized we don't have space in /
[22:52:37] /dev/md0 55G 55G 0 100% /
[22:53:27] log clean up time
[22:54:03] which were massive because... there was no space left
[22:54:39] so they logged like crazy and made things even worse
[22:57:55] https://www.irccloud.com/pastebin/dy1l9ZqM/
[23:43:25] https://www.irccloud.com/pastebin/hGF7L7hd/
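(Editor's note: the `df` line pasted at 22:52:37 shows /dev/md0 at 100% usage on /, which is what made the logs balloon in turn. A hedged sketch of the kind of parse that would flag such a filesystem from `df -h` output; this is a hypothetical helper for illustration, not a tool used in the channel:)

```python
def full_filesystems(df_output: str, threshold: int = 95):
    """Return (device, use_percent, mountpoint) for rows at/above threshold.

    Expects `df -h` body lines like the one pasted above:
    /dev/md0  55G  55G  0  100%  /
    Header lines and malformed rows are skipped.
    """
    hits = []
    for line in df_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        pct = fields[4]
        # Only accept a numeric "NN%" use column; skips the "Use%" header.
        if not pct.endswith("%") or not pct[:-1].isdigit():
            continue
        use = int(pct[:-1])
        if use >= threshold:
            hits.append((fields[0], use, fields[5]))
    return hits
```

(Fed the pasted line, this returns `/dev/md0` at 100% mounted on `/` — the condition that triggered the log cleanup described above.)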