[02:10:25] (SystemdUnitFailed) firing: (2) wmf_auto_restart_redis-server.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:25] (SystemdUnitFailed) firing: (2) wmf_auto_restart_redis-server.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:27:12] ^ the idm1001 is more cosmetic, no user-visible impact, to be fixed with https://gerrit.wikimedia.org/r/1024092
[07:35:25] (SystemdUnitFailed) firing: (2) wmf_auto_restart_redis-server.service on idm1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:55:44] Mail, Infrastructure-Foundations, Znuny: Clean up OTRS/Znuny addresses handles by gsuite - https://phabricator.wikimedia.org/T284145#9743787 (MoritzMuehlenhoff) >>! In T284145#7218511, @Keegan wrote: > @jbond my utmost apologies for not replying to this earlier! These errors can be ignored, they will...
[07:57:03] Mail, Infrastructure-Foundations, Znuny: Clean up OTRS/Znuny addresses handles by gsuite - https://phabricator.wikimedia.org/T284145#9743789 (LSobanski) 1. Let's review if the new Znuny version enabled removal of unused emails and remove them if possible 2. If not, then let's filter the emails in the...
[11:35:25] (SystemdUnitFailed) firing: wmf_auto_restart_redis-server.service on idm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:45:25] (SystemdUnitFailed) resolved: wmf_auto_restart_redis-server.service on idm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:36:37] SRE-tools, collaboration-services, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9744535 (MoritzMuehlenhoff)
[12:48:38] hello! I'm coming with a problem that jynus and I both faced today, they are visible here https://phabricator.wikimedia.org/T361087#9744384 and here: https://phabricator.wikimedia.org/T362746#9744589
[12:51:02] since db2155 was reimaging, I'm not sure puppet will be able to resume its normal activity as I had to add --new to retry my run
[14:50:01] arnaudb: hey just picking up on this
[14:50:27] I think puppet will likely be ok as the reimage did not get very far, but I guess we can deal with that when we get to that point
[14:50:54] what is the current status? I think we need to downgrade the firmware on the NIC in that host to make the reimage work (known issue with the more recent firmware version)
[14:51:29] hey topranks hey I'm using -p7 upon moritzm advice, the server has some issue upon reboot after reimage apparently as I've got a blinking line displayed on ipmi for 700s now
[14:54:57] hmmm
[14:55:19] yeah I see what you mean... were you following the console output of the reimage in general?
[14:55:34] i.e. did you see if the PXEboot worked, did it go into the debian installer screen with the blue background?
[14:56:45] topranks: I've left IPMI right after the first image rebooted properly
[14:57:12] so I might have missed a few screens :D I've connected again upon seeing retries piling up
[14:57:26] we should probably give this another reboot now to see what the boot sequence shows
[14:57:43] I'm guessing the OS didn't properly install and we need to try again - but worth a manual reboot to get more info first I think
[14:57:51] if you are ok for me to do that?
[14:58:35] sure topranks ! go for it
[14:58:46] ok let's see what happens !
[15:01:27] looks like it worked
[15:01:31] odd
[15:02:33] ssh works but doesn't like my pubkey, so seems puppet hasn't set it up
[15:02:36] thanks topranks :) I hope it'll be able to catch up its replication lag :D
[15:02:40] what status is the reimage at now?
[15:02:46] it's not fully reimaged
[15:02:51] cookbook is still running
[15:02:58] I was around the 100+ retry
[15:03:04] the cookbook is waiting on ssh connection?
[15:03:15] it's signing puppet's cert atm
[15:03:30] (if you want to check out there is a tmux session on my account on cumin1002)
[15:04:12] downtime step has been reached
[15:04:13] ok... well that sounds like it is progressing
[15:04:17] let's see how it goes
[15:04:23] yep, will keep you posted!
[15:04:27] not sure what happened there though
[15:04:29] thanks!
[15:04:49] me neither topranks I was not expecting to burn so much time on a reimage :D
[15:29:21] topranks: everything went back to normal, server is catching up on its lag! thanks for the help
[15:31:33] arnaudb: great!
[15:45:22] Mail, Infrastructure-Foundations, Znuny: Clean up OTRS/Znuny addresses handles by gsuite - https://phabricator.wikimedia.org/T284145#9745238 (Keegan) @MoritzMuehlenhoff I cannot say for sure as I have not worked in this area for several years, but I cannot imagine that the situation has changed.
[21:48:25] (SystemdUnitFailed) firing: wmf_auto_restart_prometheus-redis-exporter@6380.service on netbox2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed