[02:27:55] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:04:35] es2027 is complaining about a systemd unit
[06:06:50] I wonder if it crashed
[06:11:50] it is failing on start
[06:12:24] looks unhealthy
[06:15:36] it looks bad, it was pooled but has been down for hours
[06:18:59] I am going to switch over es3 on codfw (read-only section) to depool es2027
[06:28:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:35:55] fceratto I have depooled es2027
[06:36:13] it is giving fatal errors
[07:03:34] the prometheus exporter, or did you see anything more?
[07:06:36] errors on the mysql log
[07:06:50] see T404940
[07:06:50] T404940: es2027 database unhealthy - https://phabricator.wikimedia.org/T404940
[07:07:43] I also saw errors on the query killer, but those don't worry me as much (maybe they are normal)
[07:13:11] you can take over the ticket; once depooled, I leave it to the DBAs to handle
[07:15:46] ok thanks
[08:27:57] a note on the checklist for T403966 is "check mariadb services on all codfw for logging oddities", so it was something that eventually was going to be seen
[08:27:57] T403966: MariaDB Pre-DC switchover tasks - https://phabricator.wikimedia.org/T403966
[08:42:28] jynus: yes, I was seeing quite a bunch of red herrings
[08:49:13] red herrings?
[09:02:10] when you have a datapoint, message, alert, etc. that is not the main issue and risks sending the investigation down the wrong path
[09:02:49] Yes, what I am asking is: do you think the database is healthy?
[09:05:06] for example, was the host pooled while it was unavailable?
[09:06:02] I saw it was alerting from 23:00 till 06:00; was it depooled during that time?
[09:06:59] I think we have a misunderstanding, as in your ticket you propose a series of fixes, which seem reasonable to me
[09:07:19] but if they are real things to fix, they wouldn't be red herrings
[09:13:09] I added a comment to the task with some next steps - the host was running a long cloning cookbook (as a source) which ended during the night - I'm putting together a list of action items to prevent this
[09:13:41] and yes, it was depooled and the alarms were silenced as part of the cookbook
[09:13:49] ok
[09:14:21] but the alarms triggered, as shown in the ticket body
[09:14:55] so the silencing didn't work or it expired
[09:17:03] then, looking at the alarms, so far I see an error regarding the relay setup etc.
[09:23:12] ok, added another action item around the silencing
[10:28:10] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on es2027:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:51:40] federico3: is es2027 healthy now? I want to repool it and close off the maint for dbs until after the switchover
[11:55:10] all ok from me
[11:55:38] I think wmf_auto_restart_prometheus-mysqld-exporter.service can be restarted then so it doesn't fire?
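The triage above boils down to a few shell steps: inspect the failed unit, check the MariaDB log for the fatal errors, and depool the instance. What follows is a minimal sketch, assuming standard systemctl/journalctl usage and the dbctl workflow documented on Wikitech; the error-log path and the commit message are illustrative assumptions, not the exact commands used here.

    # Why is the auto-restart unit marked failed on es2027?
    systemctl status wmf_auto_restart_prometheus-mysqld-exporter.service
    journalctl -u wmf_auto_restart_prometheus-mysqld-exporter.service --since today --no-pager

    # Look for the fatal errors reported in the MariaDB log
    sudo tail -n 100 /var/log/mysql/error.log   # log path is an assumption; check the host's my.cnf

    # Depool the unhealthy instance and commit the change (dbctl workflow per Wikitech)
    sudo dbctl instance es2027 depool
    sudo dbctl config commit -m "Depool es2027: unhealthy, T404940"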
[11:59:39] going to do it
[12:01:16] I restarted it
[12:01:32] and updated the clone script to restart it during the run
[12:07:56] Please make sure you're not doing any db maint in screen/tmux/etc.
[12:11:22] (on my side there's nothing running)
[12:28:08] I reviewed the backup jobs in case some had failed during the network connectivity issue, but all look healthy
[12:35:35] me, after reading T404964: https://i.imgflip.com/a6grze.jpg
[12:35:35] T404964: The load on s7 is too high - https://phabricator.wikimedia.org/T404964
[13:38:11] I realized I hadn't checked if x3 was included in spicerack, and it was
[13:38:49] actually, it wasn't until 2 days ago, I can see
[15:13:29] Yeah. I was added to the patch
[15:31:09] Hi all, I am back at work and will be focusing on clearing my inbox. Please let me know if there is anything urgent that I need to look into
[15:37:03] wb
[18:31:14] hey DBAs, I don't want to interrupt you if you are firefighting or super busy, but at some point I would love to deploy a change that touches default.my.cnf in a way that is a no-op on all hosts in the compiler, and that unblocks future trixie upgrades. I am just looking for an OK to go ahead, I guess. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180999
[23:52:49] mutante: I will try to merge it on Monday, is that fine?
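For the restart done at 11:59-12:01, a small sketch of what clearing the alert might look like, using only the standard systemd CLI. It assumes the wmf_auto_restart_* unit is a timer-driven oneshot whose failed state is what keeps SystemdUnitFailed firing; that is an inference from the alert, not confirmed in the log.

    # Clear the failed state so SystemdUnitFailed resolves, then re-run the unit
    sudo systemctl reset-failed wmf_auto_restart_prometheus-mysqld-exporter.service
    sudo systemctl restart wmf_auto_restart_prometheus-mysqld-exporter.service

    # Verify nothing is left in a failed state on the host
    systemctl is-failed wmf_auto_restart_prometheus-mysqld-exporter.service
    systemctl list-units --state=failed --no-pager

The same restart is presumably what the clone script now performs during its run, per the 12:01:32 message.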