[05:46:26] * dhinus paged: ToolsToolsDBWritableState
[05:46:37] another occurrence of T349695
[05:46:41] T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695
[05:51:03] I restarted toolsdb and the alert is gone
[05:51:24] * dhinus back to sleep :)
[05:53:41] thx dhinus
[10:39:55] today I will reimage cloudnets, starting from the "standby" node (cloudnet1006)
[10:43:22] I checked "cookbook wmcs.openstack.network.tests" before starting, and it is showing 1 error
[10:43:32] WARNING: failed test: puppetmasters can sync git tree
[10:44:58] re-running it, a second test also failed
[10:45:02] failed test: VM (using floating IP) isn't affected by either routing_source_ip or dmz_cidr
[10:47:19] I created T350466
[10:47:20] T350466: [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466
[10:57:56] dhinus: what are the test hosts that the cloudcontrols should be able to ssh to?
[10:58:23] now everything seems to work so I'm confused
[10:58:35] I put the full SSH commands in the task
[11:01:05] we should probably run those tests every X hours, so that we know roughly how frequently they fail
[11:02:04] maybe
[11:02:17] the wording is confusing, it kind of has a positive/negative outcome in the phrasing
[11:02:19] "puppetmasters can sync git tree"
[11:02:28] "VM (using floating IP) isn't affected by either routing_source_ip or dmz_cidr"
[11:02:55] I've never been a fan of how they modify the live puppetmaster git trees etc
[11:02:55] "puppetmasters git tree sync test" might be easier to grok for the uninitiated
[11:03:53] they can definitely be improved
[11:04:20] two tests that might also help us pinpoint the issue if it fails:
[11:04:21] I'll proceed with the reimage of cloudnet1006, given the tests are fine now
[11:04:38] - check the dns resolver is working ok
[11:04:43] - check ping to eqiad1.bastion.wmcloud.org
[11:05:15] the ssh through the bastion depends on a number of things, it's hard to gather from that failure why it didn't work
[11:05:21] but yep, we can leave it for now if it's happy :)
[11:07:02] the fact that the output of the failing command is not shown or logged also doesn't help
[11:14:59] * taavi lunch
[11:48:43] 2 errors on the first puppet run after the reimage of cloudnet1006
[11:48:46] Failed to call refresh: '/sbin/brctl addif br-internal vlan1105' returned 1 instead of one of [0]
[11:48:56] Failed to call refresh: '/sbin/brctl addif br-external vlan1107' returned 1 instead of one of [0]
[11:49:38] topranks: does that ring any bell?
[11:50:00] does that get fixed on the second run?
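A rough sketch of the two extra checks suggested at 11:04, for anyone who wants to run them by hand. Only the bastion hostname comes from the discussion above; the script itself, its output wording and the timeouts are illustrative and are not part of the wmcs.openstack.network.tests cookbook. It assumes bash, dig (bind9-dnsutils) and ping are available on the host running it.

    #!/bin/bash
    # Hypothetical pre-flight checks to complement the network tests cookbook:
    # 1) is the local DNS resolver answering, 2) is the eqiad1 bastion pingable.
    set -u
    fail=0

    # Resolve the bastion name through the configured resolver; an empty
    # answer (or a timeout) counts as a failure.
    if [ -n "$(dig +time=3 +tries=1 +short eqiad1.bastion.wmcloud.org)" ]; then
        echo "OK: DNS resolver resolved eqiad1.bastion.wmcloud.org"
    else
        echo "FAIL: DNS resolver did not return an answer"
        fail=1
    fi

    # Basic reachability check of the bastion itself.
    if ping -c 3 -W 2 eqiad1.bastion.wmcloud.org > /dev/null; then
        echo "OK: bastion answers ping"
    else
        echo "FAIL: bastion does not answer ping"
        fail=1
    fi

    exit "$fail"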
[11:50:04] still running
[11:50:05] brctl is deprecated, I wonder if it's gone in bullseye
[11:51:08] "ip link set dev vlan1105 master br-internal" is the way to do it now
[11:51:20] let me have a closer look
[11:51:20] actually it's already the second run so I think it will keep on failing
[11:51:44] openstack::neutron::bridge needs updating then
[11:53:04] brctl is there anyway so it's not that
[11:53:47] third puppet run succeeded, but that might just have been because the exec is a refreshonly
[11:53:48] wait we have a run without errors now
[11:53:56] oh ok
[11:53:58] the command ran fine for me manually
[11:54:25] https://www.irccloud.com/pastebin/BPe52Jn0/
[11:55:12] then I have no idea :/
[11:55:37] potentially a race condition, for the command to work the bridge and vlan devices both need to exist
[11:55:45] the reimage cookbook completed successfully
[11:55:51] and are created elsewhere
[11:55:57] is the network config looking fine?
[11:56:20] a race condition is possible yeah
[11:56:48] I think the reimage cookbook will reboot the host in just a second, so we should be fine
[11:58:21] dhinus: yeah at a glance the basic network config looks ok
[11:58:49] you should definitely do a forced failover from the other cloudnet to get openstack to add all the bits it does when that happens, and confirm things work, before trying to reimage the other cloudnet
[11:59:08] yes makes sense
[11:59:22] what's the easiest way to trigger a failover?
[11:59:38] I always just ask Arturo to do it :P
[12:00:03] I also see this error in the cookbook output, probably unrelated: https://phabricator.wikimedia.org/P53136
[12:01:25] dhinus: huh, that is actually my responsibility, it's the netbox interface import script
[12:01:35] ha :)
[12:01:44] I'll see why it's tripping up, but it's simply there to update netbox, so no functional problem
[12:02:03] ok
[12:03:21] I guess stopping all neutron* systemctl units should be enough to trigger a failover?
[12:04:36] trying to fix the puppet ordering issue with this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/971406/
[12:07:13] running a PCC, selecting `auto` seems to have picked quite a few hosts so it'll take a bit
[12:07:45] what's the <| expression |>? I don't think I've seen it before
[12:09:25] and do you agree it's sensible to "systemctl stop neutron*" to trigger a failover?
[12:09:47] it's a resource collector: https://www.puppet.com/docs/puppet/7/lang_collectors
[12:09:55] failover from where?
[12:10:00] from 1005 to 1006
[12:10:08] so we check if 1006 is actually working before reimaging 1005
[12:11:06] cloudnet1005 is currently the active node, and 1006 the standby
[12:12:08] I think yes, stopping the neutron services should do it, but I'm a bit uneasy if the reimage cookbook crashed and the node didn't actually reboot first
[12:12:51] or no, it seems to have booted just fine
[12:13:12] I don't think it crashed, no. it just had an error updating netbox, but it completed with a PASS
[12:13:15] ah
[12:13:44] as long as you know how to fail back quickly in case something goes wrong, I think we're fine
[12:13:55] that's a good question :D
[12:14:51] and just to make sure, are all of the bridges configured properly?
[12:16:16] how would you check the bridges?
[12:16:45] compare them to a working node I guess?
[12:17:11] compare the output of "ip a" or something like that?
[12:17:47] does that show the bridge members?
[12:21:28] I think it does, but I haven't played with this stuff in a long time. "bridge link" seems the way to list the existing bridges
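For reference, a sketch of the iproute2 equivalents of the brctl calls quoted in the failed refreshes above; the bridge and vlan device names are the ones from the log, and this is not the actual content of openstack::neutron::bridge.

    # brctl addif <bridge> <dev>  ->  enslave the vlan device to the bridge
    ip link set dev vlan1105 master br-internal
    ip link set dev vlan1107 master br-external

    # brctl show  ->  list bridge ports and which bridge they belong to
    bridge link show

    # list only the members of a single bridge
    ip link show master br-internal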
"bridge link" seems the way to list the existing bridges [12:21:38] and the output looks similar in 1005 and 1006 [12:22:42] unless neutron has additional hidden things that I don't know about [12:27:05] there is a diff in "brctl show", though [12:28:28] vlan1107 is missing from br-external [12:28:38] "ip link show master " is how to show the bridge names [12:28:59] I wrote saved a bunch of these commands here cos I always forget them: [12:28:59] https://listed.to/@techtrips/37378/linux-bridge-command-examples [12:29:12] I suspect the issue here is I added vlan1105 manually [12:29:28] *just* after it was reported good, I assumed it had fixed itself and my manual command was a no-op [12:29:48] but perhaps it did add the device to the bridge, and vlan1107 was never added to the external one [12:29:51] was the server rebooted? [12:30:37] yes, but I think the bridges were created _after_ the reboot maybe? [12:31:17] again, lot's of puppet ordering issues [12:31:54] first, the first puppet run didn't create the bridges as neutron config failed to provision, this would be fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/971162 [12:32:03] the reboot happens between the first and the second run iirc [12:32:12] server says its up 51 minutes, I ran the command in https://www.irccloud.com/pastebin/BPe52Jn0/ since then [12:32:36] and then the second one created the bridges but didn't add the interfaces to them, this is fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/971406 [12:32:43] obviously we can just add vlan1107 to br-external, but I think what's more important is this happens cleanly on a reboot [12:32:56] and because the 'add to bridge' command has refreshonly => true, that did not get fixed on the third run [12:34:54] I agree this is "hokey" :P [12:35:11] the last thing is fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/971418/ [12:35:33] my gut feeling is it would make more sense to have these done by "pre-up" or "post-up" commands in /etc/network/interfaces [12:35:34] all of this would be much better if we had systemd-networkd or basically anything else than the current ifupdown setup and a horrible puppetization on top of that [12:35:37] where the vlan ints are configured [12:35:54] it is done that way on any subsequent boots [12:36:28] how do you mean? [12:37:51] most of the puppetization uses interface::post_up_command in addition to manual Exec resources to add stuff to /e/n/i [12:39:53] ok yeah. whatever works. I'd probably stick to the "interface:pre_up_command" and "interface::post_up_command" myself [12:39:56] and those post_up_commands only get run after a host reboot, so they have not run yet on 1006? 
[12:39:59] an example of what the pre-up/post-up approach could look like
[12:40:00] https://phabricator.wikimedia.org/P53137$30
[12:40:54] dhinus: correct
[12:41:11] sorry I think I may be wrong
[12:41:31] so my suggestion would be: I'll try rebooting 1006 manually, we check if things look good, and test the failover
[12:41:42] then separately we can test different approaches, but I would test those in codfw first
[12:41:45] there is also /etc/network/interfaces.d/br-internal
[12:42:01] so I think it's falling over because ifupdown is shitty at working out the dependencies
[12:42:22] I think possibly the solution I suggested might still be best, but up to you guys
[12:42:59] dhinus: indeed, if it's just a race condition that exists after reimage, but it'll work on a cold boot, the problem isn't as bad
[12:45:53] I'm rebooting cloudnet1006 then, unless you're testing something there
[12:46:04] fire away
[12:50:08] dhinus: can you also review the grants patch? (https://gerrit.wikimedia.org/r/c/operations/puppet/+/971162)
[12:50:50] taavi: yes, I've already had a look but I wanted to double check more carefully
[12:54:54] +1d
[12:55:24] thanks!
[12:55:28] reboot completed, the interfaces are looking fine
[13:00:49] to test the failover, I just found this note: "Manually stopping the L3 agent service does not induce a failover event." :D
[13:01:02] so let's see if there's a cleaner way
[13:03:25] maybe we need a cookbook to do a neutron failover?
[13:03:34] that would be nice yeah
[13:03:48] but right now I wouldn't know how to write it :D
[13:04:52] I can test in codfw if stopping neutron* is enough or not
[13:08:04] in codfw, I stopped neutron-l3-agent and that caused both hosts to become "active/active" for a while, then the host where I stopped the unit became "standby", so the failover was triggered successfully
[13:08:54] that note about "manually stopping does not induce a failover" seems to be incorrect
[13:11:20] I will do the same in eqiad (systemctl stop neutron-l3-agent), it seems reasonably safe. I have another window open on a cloudcontrol to check the status with "neutron l3-agent-list-hosting-router cloudinstances2b-gw"
[13:13:22] ack
[15:14:07] dhinus: were you seeing the "VM (using floating IP) can connect to wikireplicas from Toolforge" network test fail? Because that one I can explain
[15:15:01] * andrewbogott mostly out today but briefly available
[15:15:27] andrewbogott: no, that one was fine. see the output in T350466
[15:15:28] T350466: [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466
[15:16:14] oh good, I tried to restart the tool for the network test but it took forever to come up, must've finally started working while I was asleep
[15:17:26] that is a weird assortment of errors on that task. All sorted now or are there still mysteries? (I see that the tests are passing now)
[15:26:26] all sorted apparently, but I have no idea why the network tests were failing
[15:37:56] was it right after the failover?
[15:38:10] (if not then I've no idea either)
[15:41:22] no, it was before
[15:41:41] I checked the network tests to see if everything was fine before doing anything
[15:42:15] but as I tried to debug what was wrong, they magically fixed themselves
[15:42:41] and only after that did I continue with the reimage + failover
[15:52:46] that's concerning but I guess there's nothing much to be done right now :/
[15:53:06] I'm about to wander off for a while (while still running ceph draining scripts). Have a good weekend!
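A sketch of the eqiad failover test described around 13:11, using the commands quoted in the log. It assumes admin credentials for the legacy neutron CLI are already set up on the cloudcontrol; the final step is an assumption about bringing the stopped node back, not something stated above.

    # On the currently active cloudnet (cloudnet1005 here): stop the L3 agent
    # so the HA router moves to the standby node.
    sudo systemctl stop neutron-l3-agent

    # On a cloudcontrol: check which cloudnet hosts the router and its HA state.
    # Expect a brief "active/active", then the stopped node showing "standby",
    # as observed in codfw above.
    neutron l3-agent-list-hosting-router cloudinstances2b-gw

    # To bring the stopped node back as a standby, start the agent and re-check.
    sudo systemctl start neutron-l3-agent
    neutron l3-agent-list-hosting-router cloudinstances2b-gw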
[16:28:00] I tweaked a memory setting in ToolsDB, let's see if that stops the OOM crashes
[16:28:15] I also enabled slow query logging for queries taking more than 30 seconds, to see if that highlights anything odd
[16:28:22] more details at T349695
[16:28:23] T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695
[16:28:53] thanks!
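For reference, one way the 30-second slow query logging mentioned above could be switched on at runtime in MariaDB, a sketch only; the actual ToolsDB change (and the memory setting that was tweaked) are documented in T349695 and may well differ from this.

    # Hypothetical: enable the slow query log with a 30s threshold, matching
    # what is described above. Assumes root/socket access to the MariaDB
    # instance; a restart would reset these unless they are also set in the
    # server config.
    sudo mysql -e "SET GLOBAL slow_query_log = 1; SET GLOBAL long_query_time = 30;"

    # Verify the settings took effect.
    sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'slow_query_log%'; SHOW GLOBAL VARIABLES LIKE 'long_query_time';"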