[02:27:55] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:04:35] es2027 is complaining about a systemd unit
[06:06:50] I wonder if it crashed
[06:11:50] it is failing on start
[06:12:24] looks unhealthy
[06:15:36] it looks bad, it was pooled but has been down for hours
[06:18:59] I am going to switch over es3 on codfw (read-only section) to depool es2027
[06:28:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:55] fceratto I have depooled es2027
[06:36:13] it is giving fatal errors
[07:03:34] the prometheus exporter, or did you see anything more?
[07:06:36] errors on the mysql log
[07:06:50] see T404940
[07:06:50] T404940: es2027 database unhealthy - https://phabricator.wikimedia.org/T404940
[07:07:43] I also saw errors on the query killer, but those don't worry me as much (maybe they are normal)
[07:13:11] you can take over the ticket; once depooled, I leave it to the DBAs to handle
[07:15:46] ok thanks
[08:27:57] a note on the checklist for T403966 is "check mariadb services on all codfw for logging oddities", so it was something that eventually was going to be seen
[08:27:57] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[08:42:28] jynus: yes, I was seeing quite a bunch of red herrings
[08:49:13] red herrings?
[09:02:10] when you have a datapoint, message, alert, etc. that is not the main issue and risks sending the investigation down the wrong path
[09:02:49] Yes, what I am asking is: do you think the database is healthy?
[09:05:06] for example, was the host pooled while it was unavailable?
[09:06:02] I saw it was alerting from 23:00 till 06:00; was it depooled during that time?
[09:06:59] I think we have a misunderstanding, as in your ticket you propose a series of fixes, which seem reasonable to me
[09:07:19] but if they are real things to fix, they wouldn't be red herrings
[09:13:09] I added a comment to the task with some next steps - the host was running a long cloning cookbook (as a source) which ended during the night - I'm putting together a list of action items to prevent this
[09:13:41] and yes, it was depooled and the alarms were silenced as part of the cookbook
[09:13:49] ok
[09:14:21] but the alarms triggered, as shown in the ticket body
[09:14:55] so the silencing didn't work or it expired
[09:17:03] then, looking at the alarms, so far I see an error regarding the relay setup etc.
[09:23:12] ok, added another action item around the silencing
[10:28:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:40] federico3: is es2027 healthy now? I want to repool it and close off the maint for dbs until after the switchover
[11:55:10] all ok from me
[11:55:38] I think wmf_auto_restart_prometheus-mysqld-exporter.service can be restarted then so it doesn't fire?
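The triage above boils down to a few shell steps: inspect the failed unit, check the MariaDB log for the fatal errors, and depool the instance. What follows is a minimal sketch, assuming standard systemctl/journalctl usage and the dbctl workflow documented on Wikitech; the error-log path and the commit message are illustrative assumptions, not the exact commands used here.

    # Why is the auto-restart unit marked failed on es2027?
    systemctl status wmf_auto_restart_prometheus-mysqld-exporter.service
    journalctl -u wmf_auto_restart_prometheus-mysqld-exporter.service --since today --no-pager

    # Look for the fatal errors reported in the MariaDB log
    sudo tail -n 100 /var/log/mysql/error.log   # log path is an assumption; check the host's my.cnf

    # Depool the unhealthy instance and commit the change (dbctl workflow per Wikitech)
    sudo dbctl instance es2027 depool
    sudo dbctl config commit -m "Depool es2027: unhealthy, T404940"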
[11:59:39] going to do it
[12:01:16] I restarted it
[12:01:32] and updated the clone script to restart it during the run
[12:07:56] Please make sure you're not doing any db maint in screen/tmux/etc.
[12:11:22] (on my side there's nothing running)
[12:28:08] I reviewed the backup jobs in case some had failed during the network connectivity issue, but all look healthy
[12:35:35] me, after reading T404964: https://i.imgflip.com/a6grze.jpg
[12:35:35] T404964: The load on s7 is too high - https://phabricator.wikimedia.org/T404964
[13:38:11] I realized I hadn't checked if x3 was included in spicerack, and it was
[13:38:49] actually, it wasn't until 2 days ago, I can see
[15:13:29] Yeah. I was added to the patch
[15:31:09] Hi all, I am back at work and will be focusing on clearing my inbox. Please let me know if there is anything urgent that I need to look into
[15:37:03] wb
[18:31:14] hey DBAs, I don't want to interrupt you if you are firefighting or super busy, but at some point I would love to deploy a change that touches default.my.cnf in a way that is a no-op on all hosts in the compiler, and that unblocks future trixie upgrades. I am just looking for an OK to go ahead, I guess. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180999
[23:52:49] mutante: I will try to merge it on Monday, is that fine?
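For the restart done at 11:59-12:01, a small sketch of what clearing the alert might look like, using only the standard systemd CLI. It assumes the wmf_auto_restart_* unit is a timer-driven oneshot whose failed state is what keeps SystemdUnitFailed firing; that is an inference from the alert, not confirmed in the log.

    # Clear the failed state so SystemdUnitFailed resolves, then re-run the unit
    sudo systemctl reset-failed wmf_auto_restart_prometheus-mysqld-exporter.service
    sudo systemctl restart wmf_auto_restart_prometheus-mysqld-exporter.service

    # Verify nothing is left in a failed state on the host
    systemctl is-failed wmf_auto_restart_prometheus-mysqld-exporter.service
    systemctl list-units --state=failed --no-pager

The same restart is presumably what the clone script now performs during its run, per the 12:01:32 message.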