[00:24:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:24:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on db1125:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:54:07] <arnaudb>	 db1125 is a test nod for Manuel so I guess we should let it aside to avoid loosing his work?
[05:54:12] <arnaudb>	 node*
[07:24:34] <Emperor>	 If it's not being used for anything prod-ish, although it's a long time to leave a node unhappy.
[07:26:15] <arnaudb>	 I've downtimed the host while we decide
[09:02:07] <dhinus>	 heads up I'm upgrading all clouddb* hosts from mariadb 10.6.18 to 10.6.19 (T365424)
[09:02:08] <stashbot>	 T365424: Upgrade clouddb* hosts to Bookworm - https://phabricator.wikimedia.org/T365424
[09:02:41] * volans runs away :D 
[09:04:09] * arnaudb is in a meeting x)
[09:15:43] <Amir1>	 Nah. These updates are minor 😁 update it to 11 because I'm ooo
[09:16:58] <dhinus>	 :D
[10:44:48] <dhinus>	 4 out of 8 clouddbs are upgraded, I'll do the rest after lunch
[10:47:29] <dhinus>	 there's a small gotcha in my upgrade+reboot process, maybe volans has ideas: I have to disable puppet before rebooting, but the reboot-single cookbook doesn't re-enable and this causes icinga alerts to fire immediately after the reboot
[10:47:41] <dhinus>	 this is the process I'm following https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host
[10:48:23] <dhinus>	 I wonder if we could add another option to the reboot-single cookbook (there's already --enable-puppet but that enables it _before_ rebooting I think)
[10:49:31] <volans>	 puppet runs @reboot, so there is no way to enable it before that run unless enabling it before shutting it down
[10:50:18] <dhinus>	 but it could enable it after it detects the boot?
[10:50:36] <volans>	 sure but you're still open to race conditions that might trigger the alerts
[10:51:30] <dhinus>	 hmm but the cookbook could then wait_for_optimal like it does if puppet is enabled
[10:52:07] <dhinus>	 I'm also confused because I add a 1-hour silence before rebooting, but that's also getting cleared at some point
[10:52:19] <volans>	 yes at the end
[10:52:35] <volans>	 with self.alerting_hosts.downtimed
[10:53:20] <dhinus>	 should it delete just its own silence, and not other existing silences? (not sure if it's possible with icinga)
[10:53:43] <volans>	 correct, alertmanager deletes the silence ID with icinga the whole host is cleared
[10:54:05] <volans>	 as there is no easy way to identify specific downtimes
[10:54:08] <dhinus>	 ack
[10:54:22] <volans>	 why do you need puppet disabled?
[10:54:50] <dhinus>	 because I'm running "umount /srv" and I don't want puppet to create files in there
[10:55:00] <volans>	 there are plans to work on a specific reboot+upgrade cookbook for databases but not yet there
[10:55:14] <dhinus>	 maybe we need a custom cookbook yeah
[10:55:31] <volans>	 why the umount?
[10:55:35] * volans trying to get more context
[10:56:06] <dhinus>	 m.anuel was worried of some data corruption I think, let me see if I find the details
[10:56:32] <dhinus>	 "If you don't umount /srv manually, there is a risk that systemd does not wait for the umount to complete and that can lead to data corruption."
[10:56:56] <dhinus>	 (this is in https://wikitech.wikimedia.org/wiki/MariaDB/Rebooting_a_host but came from some old wikitech page, not sure if it's still true)
[10:58:11] <jynus>	 there 2 reasons
[10:58:26] <jynus>	 first, preventing forgetting about stopping all processes on /srv
[10:58:50] <jynus>	 which had happened before (I stopped a mysql instance, but left another running)
[10:58:54] <jynus>	 so that would fail
[10:59:11] <jynus>	 the other is that sometimes there can be a lot of caching on memory
[10:59:43] <jynus>	 and that basically syncs disk before attempting the reboot, which usually makes the reboot faster
[11:00:06] <jynus>	 faster as in, after you ask for a reboot
[11:01:00] <volans>	 wouldn't "sync" suffice instead of umount?
[11:04:31] * dhinus lunch
[11:07:20] <volans>	 AFAIK puppet doesn't mess with any mysql data no?
[11:08:13] * volans about to go for lunhc
[11:08:17] <volans>	 *lunch
[12:00:19] <Amir1>	 arnaudb: jynus are you planning to do any further maint on s2 or s3? I want to test circular replication in one of those
[12:00:38] <Amir1>	 (or s5/s6 if you're done with the switchover)
[12:03:09] <jynus>	 Amir1: could you please wait to talk to volans ?
[12:03:56] <Amir1>	 sure, I will do it tomorrow but I'm in a conference these days and won't be around much
[12:04:48] <jynus>	 I think you should be doing those coordinating with others, if you are not going to be around much
[13:01:16] <dhinus>	 volans: afaict puppet only creates the empty dirs in /srv, so it's not a big issue if it does it on an unmounted volume... but still not ideal
[13:01:57] <dhinus>	 it's in modules/mariadb/manifests/instance.pp
[13:02:47] <volans>	 that's why i was asking if we could prevent from umounting in the first place if the only outcome we want is to flush teh cache to disk, "sync" should be enough, but ofc should be tested
[13:05:24] <arnaudb>	 Amir1: s2/s3 are paused already on my end, I still want to try and do s8/s7/s5 as they have 3 nodes total and I've already swapped some of those codfw masters
[13:05:33] <arnaudb>	 T367781 is up to date
[13:05:36] <stashbot>	 T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[13:09:15] <dhinus>	 volans: I guess the "answer" might be to automate the reboot procedure in a db-specific cookbook
[13:09:41] * arnaudb loves this ↑
[13:09:44] <dhinus>	 even if we fix the alerts, it's still sub-optimal using the reboot-single cookbook because it will wait forever for puppet to succeed, and you have to start the db manually in a separate window
[13:10:34] <volans>	 dhinus: of course we need a dedicated cookbook, but automating it doesn't remove the race condition so I was trying to avoid it all together removing the need to disable puppet in the first place
[13:11:37] <dhinus>	 I agree if we could remove the "puppet-disable" it's nice, but why do you think a dedicated cookbook does not remove the race condition?
[13:12:30] <dhinus>	 the cookbook could handle the reboot internally and wait before removing the icinga silence
[13:12:43] <dhinus>	 (migrating the alerts to alertmanager would also help :P)
[13:15:20] <arnaudb>	 :D
[13:15:25] <volans>	 sure sure
[13:17:42] <dhinus>	 the good news is that these alerts are not paging I think
[13:18:00] <dhinus>	 you'll see a few more today :)
[14:11:03] <dhinus>	 volans jynus not urgent, but I would like your review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072755
[14:12:11] <volans>	 dhinus: the cookbooks related to those hosts are in the prod or wmcs cookbooks repo?
[14:12:25] <volans>	 to understand if those can still be run or need to be moved
[14:12:25] <dhinus>	 prod
[14:12:28] <volans>	 ok
[14:13:51] <dhinus>	 I had an idea to move them, but I closed that as declined T347977
[14:13:52] <stashbot>	 T347977: cloudcumin: allow wmcs-admin to run wikireplicas cookbooks and scripts - https://phabricator.wikimedia.org/T347977
[14:14:23] <dhinus>	 I think we can keep them on prod cumins for now
[14:15:25] <volans>	 question inline
[14:18:08] <jynus>	 dhinus: tecnically it seems safe, I don't know policy-wise
[14:20:13] <dhinus>	 volans: replied
[14:20:43] <dhinus>	 jynus: policy-wise it should be at least safer than the current situation, i.e. it reduces the number of people who have access
[14:21:31] <jynus>	 dhinus: yeah, not blocking on that regard. What I meant is that I added some worries to the ticket, but didn't ask for that solution
[14:21:36] <volans>	 thx, +1ed
[14:21:51] <jynus>	 as in, "it has to be like that"
[14:22:14] <jynus>	 I did the same
[14:22:55] <dhinus>	 jynus: yep I know, I also think there might be "better" solutions, but given there was no clear consensus on the ticket, I think this at least makes things more consistent. we can revisit it in the future
[14:23:03] <jynus>	 +1
[14:23:14] <dhinus>	 thanks both :)