[01:00:07] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:24:04] FIRING: PuppetFailure: Puppet has failed on thanos-be2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:47:40] FIRING: SystemdUnitFailed: systemd-timedated.service on thanos-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:00:07] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:24:04] FIRING: PuppetFailure: Puppet has failed on thanos-be2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:11:21] I'm switching the es1 master to es1027 so I can depool es1029 for T372208
[07:11:21] T372208: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208
[07:35:34] I see we have the tradition of disks failing the moment I go on leave again
[07:37:54] next time you leave we'll send you some spare disks, to help with the conjuring
[07:38:44] heh
[07:39:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db2189:9104 has too large replication lag (1d 0h 11m 2s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2189&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[07:43:12] ah, I'll re-downtime it
[07:47:41] FIRING: SystemdUnitFailed: systemd-timedated.service on thanos-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:07] arnaudb: was the switch done?
[08:13:29] I switched in dbctl, did I forget something?
[08:19:39] The DNS change
[08:19:53] Or I didn't get a review
[08:24:04] I forgot something :D
[08:49:55] RESOLVED: SystemdUnitFailed: systemd-timedated.service on thanos-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:57] Rebooting thanos-be2002 to deal with the systemd-timedated.service problem. I see the disks are filling again, so I've pinged T351927 once more
[08:49:57] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927
[09:00:07] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:10:08] That's a failed disk on a host due for decom this quarter, and we have no spares; I'll put in a CR shortly to mark the drive as failed in swift and then downtime that alert for a bit.
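The MysqlReplicationLagPtHeartbeat alerts above report lag derived from a pt-heartbeat table rather than from the replica's own replication status. A minimal sketch of that measurement, assuming the pt-heartbeat defaults (a heartbeat.heartbeat table whose ts column holds the primary's last write time, in UTC) and placeholder host and credentials; the production check is not necessarily implemented this way:

    # Rough sketch, not the production probe: compute replication lag the way a
    # pt-heartbeat-based check does, by comparing the newest heartbeat timestamp
    # written by the primary with the current (UTC) clock on the replica.
    from datetime import datetime, timezone

    import pymysql  # assumption: any MySQL client library would do here

    def heartbeat_lag_seconds(host: str, user: str, password: str) -> float:
        """Return replication lag in seconds as seen via the heartbeat table."""
        # "heartbeat.heartbeat" and the `ts` column are the pt-heartbeat defaults;
        # names and timestamp timezone in production may differ.
        conn = pymysql.connect(host=host, user=user, password=password, database="heartbeat")
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT MAX(ts) FROM heartbeat")
                (ts,) = cur.fetchone()
        finally:
            conn.close()
        last_beat = datetime.fromisoformat(str(ts)).replace(tzinfo=timezone.utc)
        return (datetime.now(timezone.utc) - last_beat).total_seconds()

    if __name__ == "__main__":
        # Placeholder hostname and credentials.
        print(heartbeat_lag_seconds("db2189.example.org", "monitor", "secret"))

Because the comparison is against the primary's last write rather than replica-side state, the reported lag keeps growing while replication is stopped, which is consistent with the 1d+ figure shown for db2189 above.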
[09:17:54] Given that, could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1062355 please? Once it's merged I'll silence the alert.
[09:25:27] thanks m.arostegui :)
[09:25:50] arnaudb: you've got the puppet-merge lock - LMK when you're done and/or feel free to merge my change too?
[09:26:15] it's unlocked, Emperor
[09:26:45] ta
[09:39:35] heads up: I'm depooling and reimaging clouddb1016 (T365424)
[09:39:36] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[10:54:22] clouddb1016 is reimaged and repooled
[12:17:25] FIRING: SystemdUnitFailed: systemd-timedated.service on thanos-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:36:52] is this a consequence of the disk-full stuff, I wonder? I only rebooted that earlier today
[13:07:20] Ah, sdg is sad. Le sigh.
[13:23:48] FIRING: PuppetFailure: Puppet has failed on thanos-be2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:47:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on thanos-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:51:51] I've opened T372406 and will silence alerts for that host for a week to allow the disk to get swapped.
[13:51:52] T372406: /dev/sdg failed on thanos-be2002 - https://phabricator.wikimedia.org/T372406
[21:09:48] FIRING: [9x] MysqlReplicationLagPtHeartbeat: MySQL instance db1199:9104 has too large replication lag (11m 0s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[21:14:48] FIRING: [21x] MysqlReplicationLagPtHeartbeat: MySQL instance db1190:9104 has too large replication lag (15m 36s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[21:19:48] RESOLVED: [21x] MysqlReplicationLagPtHeartbeat: MySQL instance db1190:9104 has too large replication lag (18m 36s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[21:34:48] FIRING: [21x] MysqlReplicationLagPtHeartbeat: MySQL instance db1190:9104 has too large replication lag (11m 35s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[21:44:48] RESOLVED: [21x] MysqlReplicationLagPtHeartbeat: MySQL instance db1190:9104 has too large replication lag (21m 35s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
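For the week-long silence mentioned at 13:51:51, the following is a minimal sketch of creating such a silence through the standard Alertmanager v2 HTTP API. The Alertmanager URL is a placeholder and matching on the generic instance label is an assumption; the actual silence was presumably created via the alerts.wikimedia.org UI or local tooling rather than like this:

    # Rough sketch: create a week-long Alertmanager silence for one host via the
    # v2 API. URL, creator and label matcher below are illustrative placeholders.
    from datetime import datetime, timedelta, timezone

    import requests

    ALERTMANAGER = "http://alertmanager.example.org:9093"  # placeholder, not the real endpoint

    def silence_host(instance: str, comment: str, days: int = 7) -> str:
        """Create a silence matching every alert carrying the given instance label."""
        now = datetime.now(timezone.utc)
        payload = {
            "matchers": [{"name": "instance", "value": instance, "isRegex": False}],
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(days=days)).isoformat(),
            "createdBy": "ops",
            "comment": comment,
        }
        resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=payload, timeout=10)
        resp.raise_for_status()
        # Alertmanager returns the ID of the newly created silence.
        return resp.json()["silenceID"]

    if __name__ == "__main__":
        print(silence_host("thanos-be2002:9100", "sdg failed, disk swap pending (T372406)"))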