[04:53:38] arturo, dhinus: update, I got things working sightly more but the network tests mostly fail still. Next step is probably to upgrade one or both cloudnet hosts and see if that clears more of the tests. [04:53:48] But I need to go to sleep so will leave that to you (or future me) [07:08:17] https://www.irccloud.com/pastebin/rC8V9Ary/ [07:11:25] this works https://www.irccloud.com/pastebin/2ldPS9C6/ [07:25:50] weird [07:25:52] morning [07:27:37] morning :) [07:38:50] I think that's because I'm part of the `ops` group and you are not :/ [07:39:16] not sure why we allow sudoing to root, but not to other users directly [08:09:24] this alert has been going on for a bit, anyone is looking at it? (I expected it to have been one of the network hiccups, but it did not clear) [08:09:24] https://alerts.wikimedia.org/?q=team%3Dwmcs&q=alertname%3DProbeDown [08:25:00] I will delay the cloudgw operation because the codfw1dev instability [08:25:56] dcaro: I'm not looking into it, is in toolsbeta, no? [08:26:15] yep [08:26:28] ok, I can look in a bit [08:26:33] if you don't beat me to it [08:26:46] ack, I'm doing something else too, but I might take a look after yep :) [08:27:35] morning [08:27:57] o/ [08:31:07] fourohfour on toolsbeta was logging 'uWSGI listen queue of socket ":8000" (fd: 4) full'. I restarted it [08:31:36] we should probably move that to buildservice + gunicorn [08:37:21] looking for reviews of: https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-cli/+/962628 https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-cli/+/963295 [08:48:56] oh, that error happened several times before [08:54:03] yep, but it's been much less frequent recently [08:54:24] dhinus: would you like to work with me to fix neutron @ codfw1dev? [09:03:36] gitlab is down for maintenance xd, time to relax [09:04:35] arturo: sure, give me 5 mins! [09:22:46] arturo: so where would you start? a.ndrew suggested to upgrade cloudnet hosts [09:22:58] yeah, lets do that! [09:22:58] I can do it with the cookbook [09:23:11] I guess the neutron DB was updated and also the neutron api no? [09:23:42] are they in cloudcontrols? [09:25:22] yes I see the cookbook runs "neutron-db-manage upgrade heads" for cloudcontrols [09:26:04] I will run "cookbook wmcs.openstack.cloudcontrol.upgrade_openstack_node --fqdn-to-upgrade cloudnet2005-dev.codfw.wmnet --task-id T341285" [09:26:04] T341285: Upgrade cloud-vps openstack to version 'Antelope' - https://phabricator.wikimedia.org/T341285 [09:29:01] the wiki says I should update the "standby" cloudnet first, but the command to find it fails :/ [09:29:14] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Openstack_upgrade#Upgrading_cloudnet_nodes [09:30:31] since the deployment is basically down, the order doesn't matter in this case [09:31:02] I'll start with that one [09:31:12] I wonder if we should find a new version for that command though [09:31:20] the command to figure out which neutron l3 agent is the primary requires the neutron API and the neutron l3 agent to be up and running [09:31:48] eh I'm guessing the cloudgw meet isn't happening? [09:32:03] it's failing with a deprecation notice "neutron CLI is deprecated and will be removed in the Z cycle." [09:32:20] but maybe you're right it would still work if the agent is up [09:33:02] I started the upgrade in 2005 [09:33:13] (cloudnet2005-dev) [09:35:17] topranks: yes, sorry not happening. 
The codfw1dev setup is down for openstack upgrades and we cannot test the cloudgw patches [09:35:26] sure np [09:36:06] in the meantime, I see puppet is still failing on cloudcontrol2001-dev https://puppetboard.wikimedia.org/node/cloudcontrol2001-dev.codfw.wmnet [09:36:08] dhinus: also yes, AFAIK there is no equivalent for that particular command in the openstack CLI [09:40:34] dhinus: did you check the keystone logs? it seems to be throwing 500 (when nova-conductor tries to start up) [09:42:02] it's complaining about keys not found [09:42:03] keystone.server.flask.application keystone.exception.KeysNotFound: An unexpected error prevented the server from fulfilling your request. [09:42:14] (from https://logstash.wikimedia.org/app/dashboards#/view/8aa679f0-d52e-11eb-81e9-e1226573bad4?_g=h@865c245&_a=h@af42e7d) [09:43:26] dcaro: no I didn't see those logs, thanks [09:43:27] neutron-api and cinder-api fail due to not having the arguments `--http-socket` it seems, though that might be a different issue [09:43:54] cloudnet2005 is upgraded and puppet is fine [09:44:05] keystone may not be running at all because of T348157 [09:44:05] T348157: keystone: segfaults in debian bookworm - https://phabricator.wikimedia.org/T348157 [09:44:38] see the comment in that task, the segfault only happened once, there was something with the startup scripts that a.ndrew was trying to fix [09:45:07] oh, ok [09:47:06] I will update the other cloudnet in the meantime, but yes keystone has something not right [09:47:09] then yes, I think the keysnotfound problem that david commented is the main problem here [09:48:50] ok I think I know what's going on [09:48:58] we lost the fermet keys cluster-wide [09:48:59] $ sudo ls /etc/keystone/fernet-keys/ [09:49:07] none of the cloudcontrol servers has anything in there [09:49:31] hmm not even in the ones that have not been upgraded? [09:50:05] there's a service copying them around, did that fail? [09:50:06] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Procedures_and_operations#Rotating_or_revoking_keystone_fernet_tokens [09:50:15] I will just generate new ones [09:51:01] https://www.irccloud.com/pastebin/7KvLjsFC/ [09:51:42] that looks good [09:53:33] nice [09:55:10] cloudnets are both upgraded [09:56:12] \o/ [09:56:22] neutron seems happy [09:57:36] puppet will start failing on prometheus VMs, as the alerts code is stored in gitlab, and gitlab is down [09:57:40] (fyi) [10:25:18] * dcaro lunch [10:34:24] the neutron router doesn't work as expected and I can't figure out why just yet [10:37:33] I'm still seeing the errors dcar.o reported earlier (--http-socket) but I'm not sure it's related [10:52:58] the thing that puzzles me is that the IP is assigned in the neutron l3 network namespace, but it is reported as unreachable [10:53:04] maybe there is some route missing [10:53:45] the IP == the external IP of the neutron gateway router == 185.15.57.10 [11:07:45] (got distracted with other thing) [11:48:59] gitlab is back up :) (it seems) [11:54:22] topranks: may cloudsw be filtering traffic for cloudnet <-> cloudcontrol in codfw? [11:55:39] mmm [11:55:41] missing vlan? 
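(For reference, on the empty /etc/keystone/fernet-keys/ directories above: a minimal sketch of checking and re-seeding the key repository on a single cloudcontrol. keystone-manage fernet_setup/fernet_rotate are the stock upstream commands, but the keystone user/group flags and the manual copy target are assumptions for illustration; in this deployment the key-copying service mentioned above is what normally keeps the other cloudcontrols in sync.)

    sudo ls -l /etc/keystone/fernet-keys/    # should contain numbered key files (0, 1, 2, ...)
    # if empty, re-create the key repository on one cloudcontrol
    sudo keystone-manage fernet_setup --keystone-user keystone --keystone-group keystone
    sudo keystone-manage fernet_rotate --keystone-user keystone --keystone-group keystone
    # all cloudcontrols must carry identical keys; a manual copy would look roughly
    # like this, though normally the sync service handles it
    sudo rsync -a /etc/keystone/fernet-keys/ root@<other-cloudcontrol>:/etc/keystone/fernet-keys/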
[11:56:09] shouldn't but but let me look [11:57:22] sry that's the L2 only one, nah definitely not [11:57:45] this is what I see [11:57:47] cloudcontrol side [11:57:49] https://www.irccloud.com/pastebin/HWWkVxJA/ [11:58:06] cloudnet side [11:58:08] https://www.irccloud.com/pastebin/Bv40ZRlW/ [11:58:22] the icmp echo reply should be on the wire, but is never getting back into cloudcontrol [11:58:39] wrong route somewhere? [11:59:17] https://www.irccloud.com/pastebin/iFgNZbIm/ [11:59:55] topranks: note this is cloudcontrol2001-dev [11:59:57] not cloudgw [12:00:16] I was going to ask if the traffic between cloudcontrols and cloudnets goes through cloudgw xd [12:00:26] arturo: sry my bad [12:01:10] there could be some asymmetry going on [12:01:29] cloudcontrol -> cloudgw -> cloudnet [12:01:32] then for the return [12:01:36] cloudnet -> cloudcontrol [12:01:41] dcaro: https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/24 [12:01:52] arturo: you can check the mac address and/or the route I guess [12:03:04] nah this is direct on the same vlan, and there is two way comms [12:03:31] dhinus: are you acting on cloudnet200x servers? [12:03:39] not right now [12:03:51] I haven't touched them since I ran the upgrade cookbook [12:03:55] ack [12:04:16] my ssh session froze in cloudnet2006-dev for whatever reason [12:06:49] what traffic are you thinking of here? [12:07:03] oh, this may be filtered in cloudgw actually [12:07:06] i.e. what is connecting to what? [12:07:17] root@cloudcontrol2001-dev:~# ping 185.15.57.10 [12:07:22] the cloudnet 185.15.57.10 range is only reachable from cloudgw [12:08:08] I think there is a misconfigured filter in cloudgw [12:08:16] 12:07:35.372075 IP 185.15.57.10 > 172.20.5.5: ICMP echo reply, id 5496, seq 2, length 64 [12:08:23] this return packet gets dropped in cloudgw [12:08:31] ok [12:11:44] what you need to bear in mind is the netns on the cloudnet has no direct connection to cloud-private [12:12:07] * arturo nods [12:12:10] so all comms from the qrouter netns follow the default to get to 172.20.x.x, same as an internet direction [12:12:14] https://www.irccloud.com/pastebin/CSAeqPUH/ [12:12:33] I think cloud-private is unreachable from the vrf-cloudgw as well? [12:13:06] topranks: proof: `aborrero@cloudgw2003-dev:~ 2m4s 130 $ sudo ip vrf exec vrf-cloudgw ping 172.20.5.5` [12:13:33] aborrero@cloudgw2003-dev:~ 2s 1 $ sudo ip vrf exec vrf-cloudgw ip route get 172.20.5.5 [12:13:33] 172.20.5.5 via 10.192.20.1 dev eno1 src 10.192.20.7 uid 0 [12:14:18] wait [12:14:25] this is missing all the routes, no? [12:15:12] cloudgw2003-dev needs a reboot, is missing routes, is in an inconsistent state [12:15:16] doing so now [12:15:19] vrf-cloudgw is directly connected to the cloudvrf on the switches [12:15:36] this is wrong [12:15:36] the main/default netns on cloudgw has no link to it [12:15:37] https://www.irccloud.com/pastebin/GSO7PByZ/ [12:16:23] huh, I don't see that [12:16:26] https://www.irccloud.com/pastebin/AR3GIszt/ [12:16:41] mmm wrong command on my side? [12:17:53] yes, wrong command on my side, I see now the same as you [12:18:17] actually yeah - ip route will only show the routing table it's asked for [12:18:17] regardless of the context executed [12:18:17] so you need to specify the vrf or table as parameter to "ip route" always, regardless of the context he command executes in [12:21:13] I can't explain why this doesn't work [12:21:28] https://www.irccloud.com/pastebin/X6DxyfQV/ [12:24:04] something weird is happening. 
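(A quick reference for the per-VRF checks being done here, since a bare `ip route` only prints the main table: these are standard iproute2/tcpdump/nftables invocations, with the interface, table and address names taken from the pastes above, so adjust them to the host at hand.)

    sudo ip route show vrf vrf-cloudgw                     # routes as the cloudgw VRF sees them
    sudo ip route show table cloudgw                       # same table, addressed by name
    sudo ip vrf exec vrf-cloudgw ip route get 172.20.5.5   # path a packet would take from inside the VRF
    sudo tcpdump -ni eno1 icmp and host 172.20.5.5         # watch request/reply on the wire at each hop
    sudo nft monitor trace                                 # print nftables trace events (needs a rule that sets nftrace)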
[12:24:10] https://www.irccloud.com/pastebin/tk3Iq3CC/ [12:24:33] ^^ this is 5 minutes ago - gets to cloudsw at least [12:24:35] and now: [12:24:39] https://www.irccloud.com/pastebin/JJPiTwMV/ [12:26:54] ok topranks let's reboot the server? [12:27:09] If that's an option I think worthwhile yes [12:27:13] ok, doing now [12:30:48] ok so cloudgw2002 took the floating VIPs [12:30:53] https://www.irccloud.com/pastebin/xNzujB55/ [12:31:04] https://www.irccloud.com/pastebin/YNBmbyPv/ [12:31:22] ^^ this looks right and is probably cloudcontrol not allowing the ICMP from 208.80.153.189 [12:32:21] it does work from the 185.15.59.9 IP on cloudgw2002 (i.e. the inside one facing cloudnet) [12:32:25] https://www.irccloud.com/pastebin/FPikf3pw/ [12:33:12] tracing the package in the nftables firewall, I see [12:33:41] trace id 84d6a8a7 inet cloudgw trace_chain packet: oif "vrf-cloudgw" ip saddr 208.80.153.189 ip daddr 172.20.5.5 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 8644 ip protocol icmp ip length 84 icmp code net-unreachable icmp id 37543 icmp sequence 1 @th,64,96 36108339914365381725534816512 [12:33:52] note the `icmp code net-unreachable` [12:34:39] also works from the cloudnet, so this not filtered by cloudgw: [12:34:49] https://www.irccloud.com/pastebin/2MslzIkn/ [12:35:49] can you run a tcpdump or similar on cloudsw? [12:35:55] if that's a network unreachable the cloudgw is generating and sending back to the cloudcontrol we'd need to know what the original packet was that it was generated in response to [12:35:57] arturo: lol [12:36:03] :-) [12:36:09] no that's not really a thing, traffic is handled by asic [12:36:15] but it's not the cloudsw there are no filters [12:38:09] arturo: The 208.80.153.184/29 ("cloudgw-trasnport") subnet between cloudgw and cloudsw is just used for transport [12:38:30] ok [12:38:40] The only reason you may need to permit it, or care if it's blocked, is if you want the cloudgw to initiate connections in the cloudgw-vrf to something [12:39:34] But if we're talking about other traffic here it's a red herring [12:40:12] ok, so I see this [12:40:20] aborrero@cloudgw2002-dev:~ 10s 1 $ sudo ip vrf exec vrf-cloudgw ping -c1 172.20.5.5 [12:40:26] 12:39:41.458061 IP cloudgw2002-dev.codfw1dev.wikimediacloud.org > cloudcontrol2001-dev.private.codfw.wikimedia.cloud: ICMP echo request, id 25929, seq 1, length 64 [12:40:41] yeah that matches the pattern I mention above [12:40:44] that's the packet leaving cloudgw [12:40:51] then, in cloudcontrol2001-decv [12:41:09] i.e. you're pinging from 208.80.153.189 [12:41:25] if you care about that you may need to allow it on the cloudcontrol, which is gonna block it [12:41:36] mmm [12:41:39] but cloudcontrol has this [12:41:42] root@cloudcontrol2001-dev:~# iptables-save -c | grep -i icmp [12:41:42] [90:6912] -A INPUT -p icmp -j ACCEPT [12:41:48] from the other IP on the cloudgw, and from the cloudnet, the traffic works [12:41:50] https://www.irccloud.com/pastebin/FPikf3pw/ [12:42:11] sorry, yeah I assumed it was blocked, [12:42:51] if you want to allow traffic originated by cloudgw from 208.80.153.189 to reach things on cloud-private the other hosts need a route for 208.80.153.184/29 pointing to 172.20.5.1 [12:43:08] as is it will get sent out using default to 10.x wmf prod gateway [12:43:52] wait ... now this works? [12:43:53] https://www.irccloud.com/pastebin/jangita2/ [12:44:07] so maybe is that cloudgw2003-dev was misconfigured and that's all [12:44:19] after the other day tests [12:44:31] topranks: I can confirm this! 
[12:44:34] this was all a misconfiguration [12:44:43] https://www.irccloud.com/pastebin/O3oJlr9z/ [12:44:46] dhinus: ^^^ [12:44:54] a reboot of cloudgw2003-dev solved the problem [12:45:09] everything works now [12:45:27] do we want to proceed with the cloudgw changes that we had scheduled for earlier today? [12:45:39] T347469 [12:45:39] T347469: cloudgw improvements - https://phabricator.wikimedia.org/T347469 [12:46:10] i.e, this and friends [12:46:10] https://gerrit.wikimedia.org/r/c/operations/puppet/+/922105 [12:46:20] to clarify - looking at cloudcontrol, it blocks the traffic from 208.80.153.189 due to missing route, and rp_filter on its cloud-private interface [12:46:28] net.ipv4.conf.vlan2151.rp_filter = 1 [12:46:42] ack [12:47:01] But - again - I don't see why you want or need to care about the 208.80.153.184/29 network, doesn't seem at all related to the original problem [12:47:17] I think it was just a cross check [12:47:30] yeah it's just it gets used in the test pings etc. [12:47:56] but anyway - looks like something funky happened on cloudgw2002, that "network unreachable" I got was odd [12:48:33] as for the changes I'm around to help out if we want to proceed [12:49:11] ok, lets proceed [12:49:23] would you like to +1?= [12:50:02] yep, done [12:50:24] my only fear here is that the /etc/network/interfaces file that is built with those classes has things in an order ifupdown doesn't like [12:50:51] we are about to find out [12:51:04] I'll do codf1dev first (cloudgw2003/2002) [12:51:40] yeah 100% let's not touch eqiad until we are fully happy [12:52:51] ok [12:53:31] topranks: maybe, shall we merge the 3 patches in a batch? [12:53:41] for less server reboots, etc [12:54:07] that's a good idea actually [12:54:27] it might be easier to rollback as well [12:54:29] please +1 here https://gerrit.wikimedia.org/r/c/operations/puppet/+/963298 [12:54:37] and here https://gerrit.wikimedia.org/r/c/operations/puppet/+/963311 [12:57:55] done [12:58:01] thanks, merging, puppet is disabled in eqiad1's cloudgw [12:58:06] cool [13:00:05] ( I had to rebase the patches, the CI is taking a bit) [13:01:51] solid 3 minutes for the CI -_- [13:03:22] ok running puppet in cloudgw2003-dev (inactive node) [13:05:53] fixing puppet resource name clash [13:06:18] ah ok [13:06:28] https://gerrit.wikimedia.org/r/c/operations/puppet/+/963723 [13:06:50] another 3 minutes of CI -_- [13:10:29] rebooting cloudgw2003-dev to see how ifupdown works [13:10:37] hold on if you can [13:10:47] :-( already in flight [13:10:50] ok [13:11:11] it's ok, sometimes if the order of things is wrong iupdown will bail and leave us with no ssh [13:11:11] puppet agent run felt OK [13:11:13] https://www.irccloud.com/pastebin/uWLCbhPz/ [13:11:16] but we can get in through recovery console [13:12:55] mmmm [13:13:09] aborrero@cumin1001:~ 130 $ sudo install_console cloudgw2003-dev.eqiad.wmnet [13:13:10] cloudgw2003-dev.eqiad.wmnet: is not a valid hostname [13:13:30] aborrero@cumin1001:~ 1 $ sudo install_console cloudgw2003-dev.mgmt.eqiad.wmnet [13:13:30] cloudgw2003-dev.mgmt.eqiad.wmnet: is not a valid hostname [13:13:30] aborrero@cumin1001:~ 1 $ sudo install_console cloudgw2003-dev [13:13:30] cloudgw2003-dev: is not a valid hostname [13:13:33] that's weird [13:13:38] em I don't think we need to go there just yet? [13:13:42] am I making a typo? [13:13:46] arturo: try .codfw.wmnet [13:13:52] I typically just ssh to the mgmt fqdn, so maybe that? 
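(On the rp_filter point a few lines up: a short sketch of how the drop on the cloudcontrol side can be confirmed. The sysctl and interface names are the ones quoted in this conversation; the last line only shows what a return route via cloud-private would look like, it was not added, since the conclusion here is that 208.80.153.184/29 does not need to be reachable.)

    sudo sysctl net.ipv4.conf.vlan2151.rp_filter    # 1 = strict reverse-path filtering on cloud-private
    ip route get 208.80.153.189                     # shows the reply leaving via the default (prod) gateway instead
    # a return route via cloud-private would look like this (illustration only)
    sudo ip route add 208.80.153.184/29 via 172.20.5.1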
[13:13:53] oh right [13:14:06] cloudgw2003-dev.mgmt.codfw.wmnet [13:14:21] it works now, was a typo, thanks taavi [13:14:39] topranks: server is back online [13:14:47] ok, no need for my panic :P [13:14:54] and a good sign :) [13:15:11] https://www.irccloud.com/pastebin/W6WNMrOg/ [13:15:39] " post-up ip route add table cloudgw default via 208.80.153.185 dev vlan2120" [13:15:50] is under the vrf def in the interfaces file [13:16:00] should be under vlan2120 I think [13:16:13] ok, let me verify in place then I'll write a puppet patch [13:16:41] catching up now, so was the config broken before I upgraded the cloudcontrol yesterday? [13:17:05] topranks: I'll reboot, ok? [13:17:36] there is more going on though perhaps [13:18:03] dhinus: unclear, but anyway, now about to (hopefully not) be broken in other ways [13:18:06] topranks: what do you see? [13:18:17] no give it a shot [13:18:32] it couldn't create the vrf device, but perhaps it just didn't cos post-up command wrong? [13:18:35] ok, rebooting [13:19:11] mmmm [13:19:14] https://www.irccloud.com/pastebin/4RUSxt6I/ [13:19:31] we may need an `auto` in there, instead of `allow-hotplug` [13:19:43] not sure what the allow-hotplug is gonna do in this context yeah [13:20:03] I think that's injected by the puppet base interface module [13:20:15] vlan ints have auto, yeah was gonna ask what is controlling which is being used [13:20:33] if it gets to tricky we may have to revert to the erb template [13:20:34] yes [13:20:36] https://www.irccloud.com/pastebin/t7ZL4bqg/ [13:23:42] dhinus, arturo, any upgrade fallout I can help with, or is all the action in the network now? [13:24:00] andrewbogott: no actions required at the moment, thanks [13:24:02] I read the backscroll and am now trying not to ask the question "but then how did this ever work" [13:25:22] arturo: I manually changed "allow-hotplug" to "auto" in the interfaces file and "ifup vrf-cloudgw" worked [13:25:36] topranks: cool, cooking a patch [13:25:56] I'm guessing there is no alternative to "interface::manual" we can use to get it with 'auto' not 'allow-hotplug'? [13:26:04] or cool, if you have a plan [13:28:08] topranks: https://gerrit.wikimedia.org/r/c/operations/puppet/+/963727 [13:29:32] topranks: I'll reboot again before merge to make sure this is it [13:30:02] good idea [13:30:10] ok, rebooting now [13:35:02] topranks: I get now [13:35:09] ip route add 172.16.128.0/24 table cloudgw nexthop via 185.15.57.10 dev vlan2107 [13:35:09] Error: Nexthop has invalid gateway. 
[13:35:09] ifup: failed to bring up vlan2107 [13:37:30] it's failed to add the IP to the vlan2107 interface [13:37:47] which I don't understand, cos it did so just fine for vlan2120, and they have very similar config [13:38:07] iface vlan2107 inet manual [13:38:11] s [13:38:12] vs [13:38:14] iface vlan2120 inet static [13:38:14] one thing I do see [13:38:23] https://www.irccloud.com/pastebin/6Ndxgzb2/ [13:38:39] ^^ the last command should come before the "ip route" commands [13:38:44] ack [13:38:46] fixing that now [13:39:07] arturo: I merged your patch [13:39:09] the manual/static may be an issue too [13:39:17] andrewbogott: thanks [13:41:08] the "method" parameter is set for interface::tagged to manual for the vlan2107 one [13:42:21] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/wmcs/cloudgw.pp#60 [13:42:32] ack [13:43:04] think we can just remove the 'method' completely, not passed on the wan one [13:45:44] https://gerrit.wikimedia.org/r/c/operations/puppet/+/963734 [13:45:55] I got tired of manually having to download the latest debian from the toolforge clis MRs I'm testing: https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/84 [13:49:17] jbond: I conclude (indirectly) that you are cleaning up Buster-specific puppet things today? [13:49:34] Unfortunately the cloudweb* hosts are still running Buster because of wikitech. [13:52:10] andrewbogott: now thats not the case. i just decomissioned the old puppetdb hosts which where buster [13:52:18] s/now/no/ [13:52:50] huh, ok, I guess I'll hunt elsewhere for puppet breakage :) [13:52:53] are you seeing issues. my chantges should only affect useres of role::puppetdb which afaik is only used in production [13:53:04] i can take a look its possible i missed something [13:53:10] dcaro: neat, reviewed [13:54:03] jbond: I think I see what it is, ignore me for now [13:54:29] ok wel the puppet office hours starts in ~5 mins so feel free to bring it to that :) [13:55:08] topranks: I think I'm ready to declare cloudgw2003-dev is OK and move to cloudgw2002-dev [13:55:36] jbond: it was this :) https://gerrit.wikimedia.org/r/c/operations/puppet/+/963737 [13:55:56] ack :) [13:56:10] I saw a complaint about buster packages and leapt to conclusions [13:59:13] arturo: yes looks good to me let's move on to cloudgw2002-dev [13:59:16] ceph is starting to get kinda full :/, I'm keeping a close eye, but let me know if you see anything [13:59:23] topranks: running puppet [13:59:43] topranks: vrrp switched primary to cloudgw2003-dev [14:00:20] topranks: I'm rebooting cloudgw2002-dev now [14:00:36] ok [14:03:13] arturo: sorry to interrupt the thing you're in the middle of. Can you add to your last-minute list of things to do writing a little wikitech page about the openstack bpos and how the automation works? Just a paragraph or two would be plenty. [14:03:18] Unless that's already documented someplace [14:06:15] andrewbogott: I don't think there is need to write special docs. It is a mirror like the other mirrors the WMF mirror service hosts [14:06:30] To learn a little more about installing on prod. With pulling down from the web restricted, how do I pull down random things like kolla? [14:07:00] topranks: does cloudgw2002-dev looks good to you? [14:07:05] topranks: it looks good to me [14:07:39] Rook: an internal docker registry [14:08:15] And the surrounding kolla code? 
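(Pulling the /etc/network/interfaces findings from the last hour into one illustrative fragment: the vrf device wants `auto` rather than `allow-hotplug`, the `post-up` route belongs under the interface that carries it, and the vlan stanza needs `inet static` so the address is configured before the routes reference it. The real file is puppet-generated and most options are omitted, so treat this purely as a sketch of the ordering.)

    # illustrative fragment only - the real file is generated by puppet
    auto vrf-cloudgw
    iface vrf-cloudgw inet manual
        ...

    auto vlan2120
    iface vlan2120 inet static
        ...
        # the per-table default route lives under the interface, not under the vrf stanza
        post-up ip route add table cloudgw default via 208.80.153.185 dev vlan2120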
[14:09:39] Rook: a fork on gitlab, maybe [14:09:57] arturo: ok [14:10:36] Rook: you're breaking new ground so the short answer is that we don't know for sure. But a gitlab fork is the most common way so far for that kind of thing [14:10:38] topranks: I'll have a break then proceed with eqiad1. This looks good to me overall [14:10:49] * arturo back later [14:10:50] sry was afk for a few mins [14:11:01] let me check but I'm guessing 2002 is good too, will let you know [14:12:23] arturo: yep looks good to me, only last thing perhaps is another reboot of cloudgw2003-dev, to make sure 2002 takes over and is ok, but I can't see why it wouldn't be [14:16:05] Rook: andrewbogott: if kolla is written in python and has non-trivial dependencies (i.e. more than a couple of python3-* packages to install via puppet), I'm fairly sure the preferred way is with an apt package, scap or the python_deploy::venv class in puppet which is very similar to scap except it uses a cookbook. [14:17:47] all of them make dependency management sort of a pain for rapid testing (although hopefully it can be tested outside production :P) [14:17:53] I think following taavi's response, when I find that there are a large number of dependencies it is pulling down I use something that I can find in apt, scap, or puppet? (Puppet has some special way of getting around networking stuff?) [14:17:53] Basically I need to fork/repackage everything I use? [14:23:00] if it's done via an apt package, then all of the dependencies must be packaged in debian too. for the other methods we would have a locally hosted mirror of the main git repo, and an another local git repo that has all of the python wheels committed [14:24:08] but yeah, the key idea for all methods is that everything in production can be installed from the local debian mirror, the wikimedia debian repository, any of the wikimedia git hosting services (gerrit/gitlab) or the wikimedia docker registry, and everything is managed via puppet and not by hand [14:57:32] andrewbogott: your q yesterday about keystone-admin sent me in a rabbit hole and now I have a question: do we even need the keystone-admin service? it looks like it's another copy of the same service, just running on a different port [14:58:59] I found this in https://docs.openstack.org/api-ref/identity/v3/#what-s-new-in-version-3-0-grizzly (2013): "Former “Service” and “Admin” APIs are consolidated into a single core API" [14:59:11] we run two services but the init.d files are identical [15:03:03] dhinus: that's a good question. There used to be a bunch of hard-coded checks within the keystone code like if is_admin {} and then sections of functionality didn't work if you weren't using the admin API [15:03:13] but as to how that flag was set... I'm unsure [15:03:24] I suspect that some of those admin blocks are still around in the code though [15:03:35] it's worth more research [15:04:01] * andrewbogott watching the Connect keynote now [15:05:19] hmm, for builds-cli pypi release failed complaining about auth missing, I tried setting the vars (TWINE_*) to the same values the repo has locally (copy-pasting both the name and value), and it worked for me :/, if anyone has ideas I'd appreciate them [15:06:57] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/963752 [15:07:01] dcaro: do you have a link? [15:07:21] yes https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-cli/-/jobs/149428 [15:12:46] hmm [15:13:20] dcaro: I wonder if you also need to configure the tag pattern as protected in gitlab. 
it's missing from my instructions, but it is set in toolforge-weld https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-weld/-/settings/repository [15:13:32] hmm, maybe, let me try [15:14:00] topranks: I'm continuing now with eqiad1 in a couple minutes [15:17:48] taavi: that was it yes! (now it failed because it already exists xd) I'll update the instructions [15:21:01] https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/21 fix [15:22:54] dhinus andrewbogott any ongoing work in codfw1dev? [15:23:58] I'm hitting [15:24:01] aborrero@cloudcontrol2004-dev:~ 15s $ sudo wmcs-openstack server list --all-projects [15:24:01] Failed to contact the endpoint at https://openstack.codfw1dev.wikimediacloud.org:29292 for discovery. Fallback to using that endpoint as the base url. [15:24:01] Failed to contact the endpoint at https://openstack.codfw1dev.wikimediacloud.org:29292 for discovery. Fallback to using that endpoint as the base url. [15:24:01] The image service for :codfw1dev-r exists but does not have any supported versions. [15:24:24] arturo: everything is still mid-upgrade as far as I know? [15:24:32] That's similar to a message I was seeing last night [15:24:44] I mean, in 2023-10-05 15:11:44 things were OK [15:25:00] and starting that time, my network tests have started to fail again [15:25:13] in that case I know nothing [15:25:40] arturo: not touching anything right now [15:25:45] ok [15:26:57] please keep codfw1dev stable (and eqiad1) this afternoon [15:32:04] the ceph drainage might require a different approach to avoid getting any osd too full [15:33:09] I might have to drain the whole rack at once, that might create extra traffic that might slow down ceph, ending up in the NFS hiccup on tools side :/ [15:34:32] so, I guess next week? [15:49:51] FYI I finished my work with cloudgw in eqiad1, so you should feel free to work (and break!) codfw1dev again! [15:49:52] cc andrewbogott dhinus [15:52:02] thanks arturo [15:52:26] both eqiad1 and codfw1dev are left in a known-good working state network-wise [15:55:51] * taavi off [15:57:10] * arturo offline [16:08:52] andrewbogott: I think your patch for the keystone init scripts might have fixed cloudcontrol2001-dev and broke the others that haven't been upgraded yet :) [16:09:17] that seems likely but they were broken already [16:09:20] due to the db schema upgrade [16:09:26] or at least I assume they were [16:09:32] makes sense [16:09:38] once you start the upgrade you pretty much have to forge ahead :) [16:09:49] shall I run the cookbook for all cloudcontrols then? [16:10:07] sure [16:10:19] or you could attend Wikimedia Connect, as you prefer :) [16:10:42] there is still a keystone-related error in Puppet for 2001, not sure what's the issue [16:12:07] running the cookbook in 2004 [16:12:28] * andrewbogott running puppet on 2001 to see what's happening but also mostly not here [16:13:13] andrewbogott: no rush, I'm not working tomorrow so I will run those cookbooks now, and probably not much more until Monday [16:13:15] the puppet errors I see are all nova things [16:13:25] yeah but nova is complaining about keystone :) [16:13:31] (or so it seems) [16:13:32] ok. 
I can take a stab at cleaning things up later on [16:29:03] 2004 is upgraded, and puppet is failing with the same error as 2001 [16:29:10] galera is in sync [16:29:17] I will proceed with the last one (cloudcontrol2005-dev) [16:30:18] we'll see if nova becomes happy after keystone is upgraded everywhere [16:46:39] all cloudcontrols in codfw are now running Antelope packages [16:46:46] puppet is still failing on all 3 of them [16:52:59] yeah, there are no keys installed, let me see if I can initialize [16:56:26] that happened also this morning :/, there might be some already around [16:56:39] there was a service copying the keys from one to the other no? is that broken? [16:56:59] is that fernet keys again? [16:57:06] or different keys? [16:57:26] I thought it was fernet, but maybe not xd /me jumps to conclusions [16:57:36] also how did you find that the problem is related to keys? [16:58:26] the keys are all gone, I'm not sure why [16:58:35] dhinus: I didn't, just looked and saw that there aren't any [16:58:51] this morning a.rturo pointed to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Procedures_and_operations#Rotating_or_revoking_keystone_fernet_tokens [16:58:59] which seemed to fix the issue for another service [16:59:27] The keys need to be in sync between all three services [16:59:37] so we need to initialize them in one place (2001, done) and then copy them to the other two hosts [16:59:46] after that (in theory) timer jobs will keep them updated and in sync [17:03:24] jbond: https://gerrit.wikimedia.org/r/c/operations/puppet/+/963719 seems to have broken puppet runs on cloud puppetdbs [17:03:33] Oct 05 16:36:24 tools-puppetdb-1 puppet-agent[18206]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'profile::sre::os_reports::host' (file: /etc/puppet/modules/profile/manifests/sre/os_updates.pp, line: 5) on node tools-puppetdb-1. [17:04:13] * jbond looking [17:04:45] I think it might be a hiera default/ordering issue (as we have our own kinda thing) [17:04:48] I'm logging off, thanks dcaro andrewbogott for looking into the codfw errors :) [17:04:59] 👍 [17:06:04] dcaro: ack, likely just needs some sane default, I'll send a patch soon [17:06:39] thanks! gtg now though [17:06:47] sure thing leave it with me [17:06:51] enjoy your evening [17:06:52] thanks a lot :) [17:06:56] np [17:07:00] * dcaro off [17:42:17] fyi puppetdb in cloud should be working now, tested with tools-puppetdb-1 [18:53:36] To verify: on the systems in https://phabricator.wikimedia.org/T342456 I shouldn't expect pip to work? I'm not sure if I mistakenly was referring to some other systems over the last day. I mean the systems in that ticket.
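(On that final question about pip: as taavi explains above, production hosts do not reach out to PyPI, so Python dependencies come either from the internal apt repositories or from wheels that were built elsewhere, committed to a local git repo and deployed with scap or python_deploy::venv. A minimal sketch of the wheel-based flow, with ./wheels and requirements.txt as placeholder names:)

    # somewhere with internet access: pre-build the dependency wheels
    pip wheel --wheel-dir ./wheels -r requirements.txt
    # on the production host: install only from that local directory, never from pypi.org
    pip install --no-index --find-links ./wheels -r requirements.txt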