[00:00:31] this one didn't fail in a determined way, but failed because it applied a method call to a random unrelated object somewhere in PHP's memory.
[00:01:37] that's a relatively rare one, but pretty scary e.g. hoping we don't e.g. confuse something like $editComment->save('DELETE hello world') with $dbw->query()
[00:02:02] which is what this kind of corruption seems to do.
[09:59:18] I have promoted a Bullseye host to serve pc1 (mysql version is the same as usual) if someone notices something strange, please let me know
[16:04:52] hello folks
[16:05:06] kafka-main1001 was rebooted, all good but I see this warning in icinga
[16:05:07] "enp175s0f1d1 not found. This should never happen. Bailing out"
[16:05:33] indeed we have enp175s0f0 in there
[16:06:07] new OS version or kernel (or related packages?) could possibly have changed the naming to drop the d1 part
[16:06:41] puppet seems to have done it
[16:06:42] New BIOS wasn't it?
[16:06:45] yeah
[16:06:47] related packages probably meaning udev stuff
[16:07:02] new BIOS/NIC (firmware upgrades)
[16:07:18] elukey: probably the old iface name is hardcoded somewhere. If not in puppet config, then in the /etc/network/interfaces file (which would need a manual root edit). Then reboot?
[16:08:21] bblack: will look for it, definitely a little weird (interfaces looks good afaics)
[16:09:11] will check more in depth what the alert looks for
[16:09:49] ah now the warning is gone
[16:09:50] the /e/n/i file unfortunately isn't completely puppet-managed, so it's one of those things that often needs manual fixes in cases like these.
[16:10:48] bblack: could it be that the nagios check was stale?
[16:10:56] I forced a re-run and now it is gone
[16:12:15] maybe!
[16:13:29] thanks for the support :)
[16:16:52] (yep check_eth got updated on kafka-main1001)
[17:16:35] <_joe_> it's a classical case of why puppet exported resources suck :)
[17:17:38] <_joe_> uhm wait, puppet runs at boot, so that would change check_eth, looking at the code now
[17:17:55] <_joe_> unless there is something that fools our facts collector somewhere
[17:18:22] <_joe_> so unless the check happened between the reboot and when puppet modified check_eth, I'm not sure what that might be
[17:21:03] it happened only once for 1001
[17:23:18] <_joe_> so yes it might be
[22:47:23] looks like mx1001 is having mail delivery woes, post kernel upgrade, I am investigating
[22:50:20] T297127 again or different?
[22:50:21] T297127: Incident: 2021-12-03 mx2001->Gmail delivery issues - https://phabricator.wikimedia.org/T297127
[22:51:26] jynus: looks to be the same
[22:57:53] jhathaway: there was a manual iptables command: "ip6tables -I INPUT -s 2620:0:861:102:10:64:16:8 -j ACCEPT". That is what will have been removed by reboot (hope that helps, kind of have to go afk myself)
[22:58:12] from the "manual fix" line on https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-12-03_mx
[22:58:24] mutante: no problem, I am going to revert to the old kernel
[22:58:41] jhathaway: i'm pretty sure it's not the kernel version but the part that it got rebooted
[23:00:02] mutante: hmm, well it was working correctly before on 5.10.46-5, right?
[23:01:37] jhathaway: hmm, I suppose so, yes, go ahead and revert
[23:01:44] do what you were about to do
[23:02:27] jhathaway: it worked on (5.10.0-8) is what I know
[23:02:51] cool, 5.10.0-8 == 5.10.46-5
[23:02:59] ok! ACK then
[23:03:18] trying to figure out how to get on ipmi first..
[23:03:48] ssh root@mx1001.mgmt.eqiad.wmnet
[23:04:05] oh, is that broken? sigh
[23:05:23] jhathaway: i'm an idiot, this is a ganeti VM :)
[23:05:29] it used to be hardware in the past
[23:05:42] ah right, so am I
[23:05:46] so the answer is to use gnt-instance console
[23:05:53] on the right ganeti machine
[23:06:05] if you are on the wrong one it tells you which is the right one
[23:06:14] cool, let me try that
[23:07:30] woohoo, that worked
[23:07:44] sudo gnt-instance console mx1001.wikimedia.org on ganeti1009
[23:07:47] cool
[23:07:52] !log rebooting mx1001 to get old kernel
[23:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:11] if you are lucky and really quick
[23:08:19] you can get to the screen where you select the kernel
[23:08:31] I got it on the second attempt last time
[23:08:45] otherwise you would have to edit grub menu and stuff
[23:09:09] to get out of the console is the weird "ctrl + ]"
[23:10:22] or you can remove the package and reboot again, of course
[23:14:40] thanks, I was able to grab the menu on reboot
[23:15:53] jhathaway: :) great!
[23:22:12] is the mail queue moving again?
[23:22:46] yup looks good, I'll create a new ticket to note the state
[23:23:11] :) cool! thanks. will go afk then
[23:23:17] mutante: thanks for your help!
[23:23:22] yw, ttyl
[23:51:34] legoktm: your listarchive iw link is now added to the map :)
[23:58:34] hauskatze: thanks :)