[01:39:26] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1247:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:39:26] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1247:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:14:22] Amir1: Could you add an ignore for pc8 for now in MW? So I can start adding it to dbctl, at least for eqiad, as that host is ready already
[08:25:48] marostegui: do you want to reply to https://phabricator.wikimedia.org/T391581#10825297 ?
[08:30:59] federico3: can you do it?
[08:43:14] since I opened the task (after getting your input), I think it would be good to have input from you as well in the ticket, also Amir1 if interested, so we can reach consensus
[08:57:27] federico3: I think you should also feel comfortable replying there and addressing further concerns
[09:10:12] ok, I can reply to the task right now
[09:11:17] Thanks!
[09:39:26] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1247:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:40:48] ok for me to deploy? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146942
[09:56:51] marostegui: I'm still holding off on https://phabricator.wikimedia.org/T392806 - should I restart?
[10:03:48] federico3: Yes, did you see the last response there?
[10:06:08] jynus: good from my side
[10:06:15] marostegui: thanks
[10:10:34] I've created https://phabricator.wikimedia.org/T394487 to track those specifically, but will be stalled until a full mw metadata section is migrated
[10:11:54] marostegui: thanks, I'm going to update the task to clarify
[10:14:01] marostegui: if you pool it with weight zero, it should get ignored
[10:16:18] Amir1: ok got thanks
[10:38:39] we should do something about db1147
[10:41:44] upgrade of db1239 seems it went well, but lots of things to test there still
[10:42:13] I think cleaning up the grants helped a lot
[11:08:27] I am going to run a test backup from db1239- it is going to fail to process, but I want to see it executing on the host
[11:23:28] Would anyone have some time to look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146957 please? I _hope_ it's correct preseed setup for the new apus backends with a boss card ; if you want to look at the one that is currently being worked on, sudo install_console apus-be1004.eqiad.wmnet will get you a root shell, feel free to poke around as it'll get reimaged at least once more before going into prod!
[11:25:06] uff, Emperor, you are asking for something that is too complicated
[11:25:24] partman for me is: "hit it until it works"
[11:35:04] marostegui: for https://phabricator.wikimedia.org/T393296 is it ok if I clone from db1188 in s2 onto db1246 starting now, without pooling in the latter host? The host was removed from modules/profile/data/profile/installserver/preseed.yaml - is there any puppet change required before the clone?
[11:38:23] federico3: you can clone I think yes
[11:45:10] jynus: yeah, this is largely based on existing recipes. I will not be entirely surprised if it needs more work yet, but I think it's correct.
[11:54:01] (but I do need a review to proceed at all)
[12:25:45] sigh, that didn't work.
[12:26:23] partman is always so hateful :(
[12:26:39] 'May 16 12:23:55 partman: No matching physical volumes found'
[12:27:02] Hm, I think it's run the early_command wrong
[12:27:37] oh, because it's got an old version of the file, presumably something is missing a puppet run to update it.
[12:28:38] Maint map now finally works https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance
[12:28:57] I'll force a puppet run on profile::installserver::preseed and try again
[12:40:34] Amir1: <3
[12:42:33] bah, nearly, but I missed one thing.
[12:46:39] jynus: sorry, would you mind a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146977 please? It's something I missed in the preseed conf the first time round, but should be more-obviously safe-to-everything-else to change :)
[12:46:45] sure
[12:46:53] TY :)
[12:47:25] I knew that was going to happen, because it happens to me every time
[12:47:32] it is what it is
[12:47:46] Mmm
[13:27:21] marostegui: regarding db1247 T393612 do we want to clone it now and let it run for a while before repooling?
[13:27:22] T393612: db1247 crash or restart - 15:29 on 2025-05-07 - https://phabricator.wikimedia.org/T393612
[13:27:39] federico3: you can clone it today, and repool monday if all goes well
[13:28:12] ok
[13:36:56] ms-backup1002 network card doesn't get link with the latest kernel
[13:37:12] probably I am missing some firmware package or something
[13:37:33] but I will leave it with an older kernel until I debug it
[13:39:26] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db1247:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:13:42] INFO - Backup /srv/backups/snapshots/ongoing/snapshot.s1.2025-05-15--00-00-05 generated correctly.
[14:13:55] ERROR - xtrabackup version mismatch- xtrabackup version: {'major': '10.6', 'minor': 20, 'vendor': 'MariaDB'}, backup version: {'major': '10.11', 'minor': 11, 'vendor': 'MariaDB'}
[14:14:01] All that had to work, works
[16:44:26] FIRING: SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:59:26] RESOLVED: SystemdUnitFailed: swift_dispersion_stats_lowlatency.service on ms-fe2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed