[06:30:59] swfrench-wmf: good idea indeed, thanks for reaching out! So you'll be acting during the mw maintenance window next Tuesday, right? I'll double check with Amir1 and Manuel, but we should be OK
[06:54:10] oh, I saw your mail about 17 UTC :)
[11:01:26] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-mysqld-exporter.service on db1160:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:57:17] hot take: https://nextgres.com/res/20240419-The-Future-of-MySQL-is-Postgres.pdf
[12:13:28] fixed the wmf_auto_restart_prometheus thingy
[12:16:26] (SystemdUnitFailed) resolved: wmf_auto_restart_prometheus-mysqld-exporter.service on db1160:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:19:44] regarding the link: sure
[12:21:03] my thoughts haha
[12:58:28] so, Amir1 jynus, soliciting your advice here until the PXE issue is fixed (cf. #wikimedia-sre and #wikimedia-sre-foundations, sorry for my misrouted message!): since db2155 was being reimaged, I'm not sure puppet will be able to resume its normal activity, as I had to add --new to retry my run. Also, a follow-up question: should I resume replication on mariadb with or without puppet running properly?
[12:59:07] (kormat: you're welcome to contribute as well, sorry I forgot to hl you in the previous message ↑ :D)
[13:02:00] arnaudb: evaluate how much time it is going to take to get it fixed and reimage back to bullseye - I think it is more important to have the host running and installed in some cases
[13:02:25] if you didn't touch the disk yet
[13:02:34] I did not
[13:02:35] you can just use the manual setup
[13:02:54] to re-add it to puppet
[13:03:06] manual?
[13:03:09] install_console something something (it is on wikitech)
[13:03:12] oh
[13:03:20] to sign the puppet cert
[13:03:21] so I should not have used the --new?
[13:03:47] no, you had to use it in your case, I am guessing
[13:04:36] this is unrelated to using it or not, but as it deletes the cert, that is how to issue a new one to make the host work again
[13:05:17] see install-console or something like that, or manual install, and check the last step (signing the puppet cert)
[13:05:53] or running the reimage again, up to you :-D
[13:06:34] aha, I'll have to try again with bullseye to see if it goes faster
[13:28:55] elukey: just an FYI, eqiad was completely restarted yesterday (I just realized I forgot to make a note of that anywhere).
[14:38:22] arnaudb: thanks for the follow-up :) yes, exactly - this would be limited to some portion of the 17:00 UTC hour on 4/30
[14:41:42] swfrench-wmf: I've added the subject to our Monday meeting, I'll get back to you if some issue arises!
[14:42:22] is that page re db1234 disk space expected?
[14:42:31] it's a disk issue
[14:42:34] I'm checking
[14:45:02] the server has a hw issue and is identified as such. The downtime was too optimistic! I've added 7 more days, sorry for the page
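The 13:02-13:05 exchange above is the relevant how-to: a retried reimage with --new discards the host's old Puppet certificate, so the host only rejoins Puppet once a new certificate is signed on the CA (the install_console / manual-install steps on wikitech). Below is a minimal sketch of the generic Puppet Server 6+ CA commands only, wrapped in Python purely for illustration; the FQDN is a placeholder and the actual Wikimedia procedure documented on wikitech may differ.

```python
#!/usr/bin/env python3
"""Sketch: re-sign a reimaged host's Puppet certificate on the CA host.
Generic Puppet Server 6+ commands only; the install_console / manual-install
procedure on wikitech is the authoritative path for WMF hosts."""
import subprocess

HOST = "db2155.codfw.wmnet"  # placeholder FQDN for the reimaged host

# List pending certificate signing requests (the reimaged host should show up
# here once its puppet agent has submitted a new CSR).
subprocess.run(["puppetserver", "ca", "list"], check=True)

# Sign the new request so puppet runs can resume on the host.
subprocess.run(["puppetserver", "ca", "sign", "--certname", HOST], check=True)
```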
[14:45:15] 👍
[14:54:32] I've silenced db2155/db2187, which are the s4 sanitarium master and client, for the next 20 hours; I'm having issues reimaging the first one and their replication is stopped, so the second one will alert as well (cc Amir1 kormat)
[14:55:19] sure, feel free to fully disable notifications
[14:55:28] it's a codfw one, so it shouldn't be an issue
[14:55:42] I want to keep it on the back burner, but will do if it does not come back before the weekend
[14:57:12] arnaudb: great, thank you! FYI, unless any issues come up today, I'll be sending a more general email to ops@ (e.g., to catch deployers). We can still reschedule if something comes up later on, but I'd like to get it on folks' radar soon :)
[14:57:32] ack
[14:57:50] swfrench-wmf: maybe send an optional gcal invite? so it's properly identified
[14:59:57] ah, that's an interesting idea - I'll look into that
[15:28:42] db2155 update: it's catching up on replication (i.e. 10k+ seconds behind) → I'll start its replica after it's fully up to date (unless you're advising the opposite, Amir1?)
[15:29:10] nope, sounds good to me
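The closing exchange (15:28-15:29) describes the plan in prose: wait until db2155 has caught up on replication, then start replication on its downstream replica db2187. A minimal sketch of that sequence is below; the hostnames come from the log, but the pymysql dependency, the ~/.my.cnf credentials file, and the polling loop are assumptions for illustration, not the actual WMF DBA tooling.

```python
#!/usr/bin/env python3
"""Sketch: wait for a MariaDB replica to catch up, then start replication
on its downstream replica. Assumes pymysql and credentials in ~/.my.cnf."""
import os
import time

import pymysql

MYCNF = os.path.expanduser("~/.my.cnf")  # assumed credentials file


def seconds_behind(host):
    """Return Seconds_Behind_Master for the given replica (None if unknown)."""
    conn = pymysql.connect(host=host, read_default_file=MYCNF,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row["Seconds_Behind_Master"] if row else None
    finally:
        conn.close()


# Poll the sanitarium master until it is fully caught up ...
while True:
    lag = seconds_behind("db2155.codfw.wmnet")
    if lag == 0:
        break
    print(f"db2155 replication lag: {lag}s")
    time.sleep(60)

# ... then resume replication on the downstream replica (the sanitarium client).
conn = pymysql.connect(host="db2187.codfw.wmnet", read_default_file=MYCNF)
with conn.cursor() as cur:
    cur.execute("START SLAVE")
conn.close()
```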