[05:46:26] * dhinus paged: ToolsToolsDBWritableState
[05:46:37] another occurrence of T349695
[05:46:41] T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695
[05:51:03] I restarted toolsdb and the alert is gone
[05:51:24] * dhinus back to sleep :)
[05:53:41] thx dhinus
[10:39:55] today I will reimage cloudnets, starting from the "standby" node (cloudnet1006)
[10:43:22] I checked "cookbook wmcs.openstack.network.tests" before starting, and it is showing 1 error
[10:43:32] WARNING: failed test: puppetmasters can sync git tree
[10:44:58] re-running it, a second test also failed
[10:45:02] failed test: VM (using floating IP) isn't affected by either routing_source_ip or dmz_cidr
[10:47:19] I created T350466
[10:47:20] T350466: [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466
[10:57:56] dhinus: what are the test hosts that the cloudcontrols should be able to ssh to?
[10:58:23] now everything seems to work so I'm confused
[10:58:35] I put the full SSH commands in the task
[11:01:05] we should probably run those tests every X hours, so that we know roughly how frequently they fail
[11:02:04] maybe
[11:02:17] the wording is confusing, it kind of has a positive/negative outcome in the phrasing
[11:02:19] "puppetmasters can sync git tree"
[11:02:28] "VM (using floating IP) isn't affected by either routing_source_ip or dmz_cidr"
[11:02:55] I've never been a fan of how they modify the live puppetmaster git trees etc
[11:02:55] "puppetmasters git tree sync test" might be easier to grok for the uninitiated
[11:03:53] they can definitely be improved
[11:04:20] two tests that might also help us pinpoint the issue if it fails:
[11:04:21] I'll proceed with the reimage of cloudnet1006, given the tests are fine now
[11:04:38] - check the dns resolver is working ok
[11:04:43] - check ping to eqiad1.bastion.wmcloud.org
[11:05:15] the ssh through the bastion depends on a number of things, it's hard to gather from that failure why it didn't work
[11:05:21] but yep, we can leave it for now if it's happy :)
[11:07:02] the fact that the output of the failing command is not shown or logged also doesn't help
[11:14:59] * taavi lunch
[11:48:43] 2 errors on the first puppet run after the reimage of cloudnet1006
[11:48:46] Failed to call refresh: '/sbin/brctl addif br-internal vlan1105' returned 1 instead of one of [0]
[11:48:56] Failed to call refresh: '/sbin/brctl addif br-external vlan1107' returned 1 instead of one of [0]
[11:49:38] topranks: does that ring any bell?
[11:50:00] does that get fixed on the second run?
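A rough sketch of the two extra checks suggested at 11:04, for anyone who wants to run them by hand. Only the bastion hostname comes from the discussion above; the script itself, its output wording and the timeouts are illustrative and are not part of the wmcs.openstack.network.tests cookbook. It assumes bash, dig (bind9-dnsutils) and ping are available on the host running it.

    #!/bin/bash
    # Hypothetical pre-flight checks to complement the network tests cookbook:
    # 1) is the local DNS resolver answering, 2) is the eqiad1 bastion pingable.
    set -u
    fail=0

    # Resolve the bastion name through the configured resolver; an empty
    # answer (or a timeout) counts as a failure.
    if [ -n "$(dig +time=3 +tries=1 +short eqiad1.bastion.wmcloud.org)" ]; then
        echo "OK: DNS resolver resolved eqiad1.bastion.wmcloud.org"
    else
        echo "FAIL: DNS resolver did not return an answer"
        fail=1
    fi

    # Basic reachability check of the bastion itself.
    if ping -c 3 -W 2 eqiad1.bastion.wmcloud.org > /dev/null; then
        echo "OK: bastion answers ping"
    else
        echo "FAIL: bastion does not answer ping"
        fail=1
    fi

    exit "$fail"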
[11:50:04] still running
[11:50:05] brctl is deprecated, I wonder if it's gone in bullseye
[11:51:08] "ip link set dev vlan1105 master br-internal" is the way to do it now
[11:51:20] let me have a closer look
[11:51:20] actually it's already the second run so I think it will keep on failing
[11:51:44] openstack::neutron::bridge needs updating then
[11:53:04] brctl is there anyway so it's not that
[11:53:47] third puppet run succeeded, but that might just have been because the exec is a refreshonly
[11:53:48] wait we have a run without errors now
[11:53:56] oh ok
[11:53:58] the command ran fine for me manually
[11:54:25] https://www.irccloud.com/pastebin/BPe52Jn0/
[11:55:12] then I have no idea :/
[11:55:37] potentially a race condition, for the command to work the bridge and vlan devices both need to exist
[11:55:45] the reimage cookbook completed successfully
[11:55:51] and are created elsewhere
[11:55:57] is the network config looking fine?
[11:56:20] a race condition is possible yeah
[11:56:48] I think the reimage cookbook will reboot the host in just a second, so we should be fine
[11:58:21] dhinus: yeah at a glance the basic network config looks ok
[11:58:49] you should definitely do a forced failover from the other cloudnet to get openstack to add all the bits it does when that happens, and confirm things work, before trying to reimage the other cloudnet
[11:59:08] yes makes sense
[11:59:22] what's the easiest way to trigger a failover?
[11:59:38] I always just ask Arturo to do it :P
[12:00:03] I also see this error in the cookbook output, probably unrelated: https://phabricator.wikimedia.org/P53136
[12:01:25] dhinus: huh, that is actually my responsibility, it's the netbox interface import script
[12:01:35] ha :)
[12:01:44] I'll see why it's tripping up, but it's simply there to update netbox, so no functional problem
[12:02:03] ok
[12:03:21] I guess stopping all neutron* systemctl units should be enough to trigger a failover?
[12:04:36] trying to fix the puppet ordering issue with this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/971406/
[12:07:13] running a PCC, selecting `auto` seems to have picked quite a few hosts so it'll take a bit
[12:07:45] what's the <| expression |>? I don't think I've seen it before
[12:09:25] and do you agree it's sensible to "systemctl stop neutron*" to trigger a failover?
[12:09:47] it's a resource collector: https://www.puppet.com/docs/puppet/7/lang_collectors
[12:09:55] failover from where?
[12:10:00] from 1005 to 1006
[12:10:08] so we check if 1006 is actually working before reimaging 1005
[12:11:06] cloudnet1005 is currently the active node, and 1006 the standby
[12:12:08] I think yes, stopping the neutron services should do it, but I'm a bit uneasy if the reimage cookbook crashed and the node didn't actually reboot first
[12:12:51] or no, it seems to have booted just fine
[12:13:12] I don't think it crashed, no. it just had an error updating netbox, but it completed with a PASS
[12:13:15] ah
[12:13:44] as long as you know how to fail back quickly in case something goes wrong, I think we're fine
[12:13:55] that's a good question :D
[12:14:51] and just to make sure, are all of the bridges configured properly?
[12:16:16] how would you check the bridges?
[12:16:45] compare them to a working node I guess?
[12:17:11] compare the output of "ip a" or something like that?
[12:17:47] does that show the bridge members?
[12:21:28] I think it does, but I haven't played with this stuff in a long time. "bridge link" seems the way to list the existing bridges
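For reference, a sketch of the iproute2 equivalents of the brctl calls quoted in the failed refreshes above; the bridge and vlan device names are the ones from the log, and this is not the actual content of openstack::neutron::bridge.

    # brctl addif <bridge> <dev>  ->  enslave the vlan device to the bridge
    ip link set dev vlan1105 master br-internal
    ip link set dev vlan1107 master br-external

    # brctl show  ->  list bridge ports and which bridge they belong to
    bridge link show

    # list only the members of a single bridge
    ip link show master br-internal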
"bridge link" seems the way to list the existing bridges [12:21:38] and the output looks similar in 1005 and 1006 [12:22:42] unless neutron has additional hidden things that I don't know about [12:27:05] there is a diff in "brctl show", though [12:28:28] vlan1107 is missing from br-external [12:28:38] "ip link show master " is how to show the bridge names [12:28:59] I wrote saved a bunch of these commands here cos I always forget them: [12:28:59] https://listed.to/@techtrips/37378/linux-bridge-command-examples [12:29:12] I suspect the issue here is I added vlan1105 manually [12:29:28] *just* after it was reported good, I assumed it had fixed itself and my manual command was a no-op [12:29:48] but perhaps it did add the device to the bridge, and vlan1107 was never added to the external one [12:29:51] was the server rebooted? [12:30:37] yes, but I think the bridges were created _after_ the reboot maybe? [12:31:17] again, lot's of puppet ordering issues [12:31:54] first, the first puppet run didn't create the bridges as neutron config failed to provision, this would be fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/971162 [12:32:03] the reboot happens between the first and the second run iirc [12:32:12] server says its up 51 minutes, I ran the command in https://www.irccloud.com/pastebin/BPe52Jn0/ since then [12:32:36] and then the second one created the bridges but didn't add the interfaces to them, this is fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/971406 [12:32:43] obviously we can just add vlan1107 to br-external, but I think what's more important is this happens cleanly on a reboot [12:32:56] and because the 'add to bridge' command has refreshonly => true, that did not get fixed on the third run [12:34:54] I agree this is "hokey" :P [12:35:11] the last thing is fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/971418/ [12:35:33] my gut feeling is it would make more sense to have these done by "pre-up" or "post-up" commands in /etc/network/interfaces [12:35:34] all of this would be much better if we had systemd-networkd or basically anything else than the current ifupdown setup and a horrible puppetization on top of that [12:35:37] where the vlan ints are configured [12:35:54] it is done that way on any subsequent boots [12:36:28] how do you mean? [12:37:51] most of the puppetization uses interface::post_up_command in addition to manual Exec resources to add stuff to /e/n/i [12:39:53] ok yeah. whatever works. I'd probably stick to the "interface:pre_up_command" and "interface::post_up_command" myself [12:39:56] and those post_up_commands only get run after a host reboot, so they have not run yet on 1006? 
[12:39:59] an example of what the pre-up/post-up approach could look like
[12:40:00] https://phabricator.wikimedia.org/P53137$30
[12:40:54] dhinus: correct
[12:41:11] sorry I think I may be wrong
[12:41:31] so my suggestion would be: I'll try rebooting 1006 manually, we check if things look good, and test the failover
[12:41:42] then separately we can test different approaches, but I would test those in codfw first
[12:41:45] there is also /etc/network/interfaces.d/br-internal
[12:42:01] so I think it's falling over because ifupdown is shitty at working out the dependencies
[12:42:22] I think possibly the solution I suggested might still be best, but up to you guys
[12:42:59] dhinus: indeed, if it's just a race condition that exists after reimage, but it'll work on a cold boot, the problem isn't as bad
[12:45:53] I'm rebooting cloudnet1006 then, unless you're testing something there
[12:46:04] fire away
[12:50:08] dhinus: can you also review the grants patch? (https://gerrit.wikimedia.org/r/c/operations/puppet/+/971162)
[12:50:50] taavi: yes, I've already had a look but I wanted to double check more carefully
[12:54:54] +1d
[12:55:24] thanks!
[12:55:28] reboot completed, the interfaces are looking fine
[13:00:49] to test the failover, I just found this note: "Manually stopping the L3 agent service does not induce a failover event." :D
[13:01:02] so let's see if there's a cleaner way
[13:03:25] maybe we need a cookbook to do a neutron failover?
[13:03:34] that would be nice yeah
[13:03:48] but right now I wouldn't know how to write it :D
[13:04:52] I can test in codfw if stopping neutron* is enough or not
[13:08:04] in codfw, I stopped neutron-l3-agent and that caused both hosts to become "active/active" for a while, then the host where I stopped the unit became "standby", so the failover was triggered successfully
[13:08:54] that note about "manually stopping does not induce a failover" seems to be incorrect
[13:11:20] I will do the same in eqiad (systemctl stop neutron-l3-agent), it seems reasonably safe. I have another window open on a cloudcontrol to check the status with "neutron l3-agent-list-hosting-router cloudinstances2b-gw"
[13:13:22] ack
[15:14:07] dhinus: were you seeing the "VM (using floating IP) can connect to wikireplicas from Toolforge" network test fail? Because that one I can explain
[15:15:01] * andrewbogott mostly out today but briefly available
[15:15:27] andrewbogott: no, that one was fine. see the output in T350466
[15:15:28] T350466: [openstack] Network tests are failing in eqiad - https://phabricator.wikimedia.org/T350466
[15:16:14] oh good, I tried to restart the tool for the network test but it took forever to come up, must've finally started working while I was asleep
[15:17:26] that is a weird assortment of errors on that task. All sorted now or are there still mysteries? (I see that the tests are passing now)
[15:26:26] all sorted apparently, but I have no idea why the network tests were failing
[15:37:56] was it right after the failover?
[15:38:10] (if not then I've no idea either)
[15:41:22] no, it was before
[15:41:41] I checked the network tests to see if everything was fine before doing anything
[15:42:15] but as I tried to debug what was wrong, they magically fixed themselves
[15:42:41] and only after that did I continue with the reimage + failover
[15:52:46] that's concerning but I guess there's nothing much to be done right now :/
[15:53:06] I'm about to wander off for a while (while still running ceph draining scripts). Have a good weekend!
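A sketch of the eqiad failover test described around 13:11, using the commands quoted in the log. It assumes admin credentials for the legacy neutron CLI are already set up on the cloudcontrol; the final step is an assumption about bringing the stopped node back, not something stated above.

    # On the currently active cloudnet (cloudnet1005 here): stop the L3 agent
    # so the HA router moves to the standby node.
    sudo systemctl stop neutron-l3-agent

    # On a cloudcontrol: check which cloudnet hosts the router and its HA state.
    # Expect a brief "active/active", then the stopped node showing "standby",
    # as observed in codfw above.
    neutron l3-agent-list-hosting-router cloudinstances2b-gw

    # To bring the stopped node back as a standby, start the agent and re-check.
    sudo systemctl start neutron-l3-agent
    neutron l3-agent-list-hosting-router cloudinstances2b-gw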
[16:28:00] I tweaked a memory setting in ToolsDB, let's see if that stops the OOM crashes
[16:28:15] I also enabled slow query logging for queries taking more than 30 seconds, to see if that highlights anything odd
[16:28:22] more details at T349695
[16:28:23] T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695
[16:28:53] thanks!
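For reference, one way the 30-second slow query logging mentioned above could be switched on at runtime in MariaDB, a sketch only; the actual ToolsDB change (and the memory setting that was tweaked) are documented in T349695 and may well differ from this.

    # Hypothetical: enable the slow query log with a 30s threshold, matching
    # what is described above. Assumes root/socket access to the MariaDB
    # instance; a restart would reset these unless they are also set in the
    # server config.
    sudo mysql -e "SET GLOBAL slow_query_log = 1; SET GLOBAL long_query_time = 30;"

    # Verify the settings took effect.
    sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'slow_query_log%'; SHOW GLOBAL VARIABLES LIKE 'long_query_time';"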