[05:32:19] FIRING: [92x] DiskSpace: Disk space thanos-be1001:9100:/srv/swift-storage/sdd1 4.381% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace -> we have had that alert firing every 4 hours for a few days already, what should we do about it?
[07:09:16] I'll poke godog again; per the relevant ticket I thought he was going to silence disk-space alerts until the end of the month, and also stop some process that causes usage to spike
[07:09:48] (ticket - T351927 )
[07:09:49] T351927: Decide and tweak Thanos retention - https://phabricator.wikimedia.org/T351927
[07:14:09] oh, marostegui, since I'm on-call this week and we're still having S4 problems I see over the weekend; is there anything you'd like oncallers to do if it recurs? the incident doc has "set s4 ro until it recovers" as a mitigation
[07:21:40] replied on task
[07:22:10] I'm going to stop/start replication on db1221 to try to make it resume
[07:23:44] coming back
[07:28:14] Emperor: I don't really like that solution because we could forget. But in any case there's not much we can do until the source of all those connections is identified
[07:29:08] marostegui: would you rather oncallers _not_ do so, then?
[07:29:43] Emperor: I think it's fine but we really need to be on top of it and as soon as it recovers we need to set it back to RW
[07:31:24] marostegui: ack
[07:33:40] I've silence the thanos disk-space alerts for 14 days (until beginning of Sep); hopefully the replacement h/w will have arrived by then (although I did see something go past on the procurement ticket suggesting a PO has gone astray)
[07:33:57] s/nce/&d/
[07:36:15] thank you Emperor :)
[12:25:35] I'm gonna depool and reimage clouddb1015 (T365424)
[12:25:35] T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[13:07:01] what happened to the "processlist" metric in Grafana? it seems like it stopped reporting data for most hosts
[13:08:19] e.g. in clouddb1017 it stopped after I reimaged last week, db1184 stops on 2024-06-03 (probably also a reimage/upgrade)
[13:16:13] dhinus: indeed, that's very strange
[13:16:18] arnaudb: can you investigate? ^
[13:16:28] sure!
[13:16:33] I am pretty sure it was there a few days ago
[13:16:39] Because I remember seeing it for s4 master
[13:17:03] arnaudb: thank you
[13:19:13] T372764
[13:19:14] T372764: mariadb monitoring: process list metric missing in grafana - https://phabricator.wikimedia.org/T372764
[13:23:56] thanks!
[13:36:31] dhinus: this confirms that this is a trend: https://grafana.wikimedia.org/goto/GRC978jIg?orgId=1
[13:36:48] that makes the problem traceable, I bet there is some issue with the new arguments we're passing to mysqld exporter
[13:36:54] will continue digging after meeting