[01:00:07] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:24:04] FIRING: PuppetFailure: Puppet has failed on thanos-be2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:47:40] FIRING: SystemdUnitFailed: systemd-timedated.service on thanos-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:00:07] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:24:04] FIRING: PuppetFailure: Puppet has failed on thanos-be2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:11:21] I'm switching the es1 master to es1027 so I can depool es1029 for T372208
[07:11:21] T372208: Degraded RAID on es1029 - https://phabricator.wikimedia.org/T372208
[07:35:34] I see we have the tradition of disks failing the moment I go on leave again
[07:37:54] next time you leave we'll send you some spare disks, to help with the conjuring
[07:38:44] heh
[07:39:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db2189:9104 has too large replication lag (1d 0h 11m 2s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2189&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[07:43:12] ah, I'll re-downtime it
[07:47:41] FIRING: SystemdUnitFailed: systemd-timedated.service on thanos-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:12:07] arnaudb: was the switch done?
[08:13:29] I switched in dbctl, did I forget something?
[08:19:39] The DNS change
[08:19:53] Or I didn't get a review
[08:24:04] I forgot something :D
[08:49:55] RESOLVED: SystemdUnitFailed: systemd-timedated.service on thanos-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:57] Rebooting thanos-be2002 to deal with the systemd-timedated.service problem. I see the disks are filling again, so I've pinged T351927 once more
[08:49:57] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927
[09:00:07] FIRING: PuppetFailure: Puppet has failed on ms-be1058:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:10:08] That's a failed disk on a host due for decom this quarter, and we have no spares; I'll put in a CR shortly to mark the drive as failed in swift and then downtime that alert for a bit.
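The MysqlReplicationLagPtHeartbeat alerts above report lag derived from a pt-heartbeat table rather than from the replica's own replication status. A minimal sketch of that measurement, assuming the pt-heartbeat defaults (a heartbeat.heartbeat table whose ts column holds the primary's last write time, in UTC) and placeholder host and credentials; the production check is not necessarily implemented this way:

    # Rough sketch, not the production probe: compute replication lag the way a
    # pt-heartbeat-based check does, by comparing the newest heartbeat timestamp
    # written by the primary with the current (UTC) clock on the replica.
    from datetime import datetime, timezone

    import pymysql  # assumption: any MySQL client library would do here

    def heartbeat_lag_seconds(host: str, user: str, password: str) -> float:
        """Return replication lag in seconds as seen via the heartbeat table."""
        # "heartbeat.heartbeat" and the `ts` column are the pt-heartbeat defaults;
        # names and timestamp timezone in production may differ.
        conn = pymysql.connect(host=host, user=user, password=password, database="heartbeat")
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT MAX(ts) FROM heartbeat")
                (ts,) = cur.fetchone()
        finally:
            conn.close()
        last_beat = datetime.fromisoformat(str(ts)).replace(tzinfo=timezone.utc)
        return (datetime.now(timezone.utc) - last_beat).total_seconds()

    if __name__ == "__main__":
        # Placeholder hostname and credentials.
        print(heartbeat_lag_seconds("db2189.example.org", "monitor", "secret"))

Because the comparison is against the primary's last write rather than replica-side state, the reported lag keeps growing while replication is stopped, which is consistent with the 1d+ figure shown for db2189 above.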
[09:17:54] Given that, could I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1062355 please? Once it's merged I'll silence the alert.
[09:25:27] thanks m.arostegui :)
[09:25:50] arnaudb: you've got the puppet-merge lock - LMK when you're done and/or feel free to merge my change too?
[09:26:15] it's unlocked, Emperor
[09:26:45] ta
[09:39:35] heads up: I'm depooling and reimaging clouddb1016 (T365424)
[09:39:36] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[10:54:22] clouddb1016 is reimaged and repooled
[12:17:25] FIRING: SystemdUnitFailed: systemd-timedated.service on thanos-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:36:52] is this a consequence of the disk-full stuff, I wonder? I only rebooted that earlier today
[13:07:20] Ah, sdg is sad. Le sigh.
[13:23:48] FIRING: PuppetFailure: Puppet has failed on thanos-be2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[13:47:25] RESOLVED: SystemdUnitFailed: systemd-timedated.service on thanos-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:51:51] I've opened T372406 and will silence alerts for that host for a week to allow the disk to get swapped.
[13:51:52] T372406: /dev/sdg failed on thanos-be2002 - https://phabricator.wikimedia.org/T372406
[21:09:48] FIRING: [9x] MysqlReplicationLagPtHeartbeat: MySQL instance db1199:9104 has too large replication lag (11m 0s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[21:14:48] FIRING: [21x] MysqlReplicationLagPtHeartbeat: MySQL instance db1190:9104 has too large replication lag (15m 36s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[21:19:48] RESOLVED: [21x] MysqlReplicationLagPtHeartbeat: MySQL instance db1190:9104 has too large replication lag (18m 36s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[21:34:48] FIRING: [21x] MysqlReplicationLagPtHeartbeat: MySQL instance db1190:9104 has too large replication lag (11m 35s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[21:44:48] RESOLVED: [21x] MysqlReplicationLagPtHeartbeat: MySQL instance db1190:9104 has too large replication lag (21m 35s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
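For the week-long silence mentioned at 13:51:51, the following is a minimal sketch of creating such a silence through the standard Alertmanager v2 HTTP API. The Alertmanager URL is a placeholder and matching on the generic instance label is an assumption; the actual silence was presumably created via the alerts.wikimedia.org UI or local tooling rather than like this:

    # Rough sketch: create a week-long Alertmanager silence for one host via the
    # v2 API. URL, creator and label matcher below are illustrative placeholders.
    from datetime import datetime, timedelta, timezone

    import requests

    ALERTMANAGER = "http://alertmanager.example.org:9093"  # placeholder, not the real endpoint

    def silence_host(instance: str, comment: str, days: int = 7) -> str:
        """Create a silence matching every alert carrying the given instance label."""
        now = datetime.now(timezone.utc)
        payload = {
            "matchers": [{"name": "instance", "value": instance, "isRegex": False}],
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(days=days)).isoformat(),
            "createdBy": "ops",
            "comment": comment,
        }
        resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=payload, timeout=10)
        resp.raise_for_status()
        # Alertmanager returns the ID of the newly created silence.
        return resp.json()["silenceID"]

    if __name__ == "__main__":
        print(silence_host("thanos-be2002:9100", "sdg failed, disk swap pending (T372406)"))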