[09:24:58] folks - I am going to decom the links from cloudsw1-c8-eqiad to asw2-b-eqiad
[09:25:41] there are no longer any cloud hosts in row B outside the dedicated cloud racks, so the links aren't needed (see T330479)
[09:25:41] T330479: Move WMCS servers out of eqiad row B - https://phabricator.wikimedia.org/T330479
[09:32:11] topranks: ack
[10:30:41] topranks: I see some potential packet filtering happening in codfw1dev for the traffic between cloudgw and cloudsw
[10:30:56] nah
[10:31:03] let me double check but I don't think so
[10:31:47] topranks: this is the packet I'm interested in
[10:31:48] aborrero@cloudgw2002-dev:~ 5s $ sudo tcpdump -i any icmp and host 172.16.130.64
[10:31:53] but it could be routing as well
[10:32:05] because it's true that I don't see the packet leaving the cloudgw host
[10:32:26] no there is nothing there
[10:32:44] there are filters on the CRs, but nothing applied on the cloudsw itself
[10:33:55] this routing may not be correct, no?
[10:33:58] https://www.irccloud.com/pastebin/2lYCloig/
[10:34:21] traffic comes into the cloudgw from that ip
[10:34:25] https://www.irccloud.com/pastebin/AYZ57bIg/
[10:34:44] I would expect one CIDR to be routed to cloudsw and the other to neutron
[10:36:14] the problem I'm experiencing is that a VM in the IPv4-only network in codfw1dev cannot talk to the DNS server
[10:36:34] for example networktests-vxlan-ipv4only-nofloating.testlabs.codfw1dev.wikimedia.cloud
[10:36:41] elements = { 172.16.128.0/23 counter packets 0 bytes 0 }
[10:36:47] ^^ this is the problem in nftables
[10:36:50] https://www.irccloud.com/pastebin/9OcnoUiJ/
[10:37:08] 172.16.30.129 is not part of 172.16.128.0/23
[10:37:17] oh right
[10:37:21] a rp filter in there
[10:37:34] when was that added?
[10:37:40] I don't remember that rpfilter in there
[10:37:50] was that added as part of the PAWS problems in december?
[10:38:11] I think we removed those filters, but let me double check
[10:38:20] * arturo suddenly fears `git annotate` pointing at himself
[10:38:32] sorry I'm not sure if that is the cause
[10:39:03] mmm I don't see nftables _dropping_ the packet
[10:39:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1105036
[10:39:08] indeed though no route back to 172.16.130.x so that's your problem
[10:39:18] yeah nft should allow it actually, that def is just used for NAT
[10:39:38] but the issue is that the traffic has no route back (and yes, the rp filter means it's not being forwarded, but even if it were, the reply wouldn't get back)
[10:39:59] ok, yeah, thanks!
[10:40:10] on the switches we have a /21 routed
[10:40:14] https://www.irccloud.com/pastebin/nMC95TBf/
[10:40:51] now I just need to figure out how to inject the route
[10:40:53] probably replacing both of these with that single /21 is needed:
[10:40:57] 172.16.128.0/24 via 185.15.57.10 dev vlan2107
[10:40:58] 172.16.129.0/24 via 185.15.57.10 dev vlan2107
[10:41:04] (in vrf-cloudgw)
[10:41:28] and also likely you should update the nft to include it - but it should be able to get to the private ranges without that (just not internet)
[10:42:01] ok
[10:42:06] I think the routing config is in here:
[10:42:06] modules/profile/manifests/wmcs/cloudgw.pp
[10:43:44] hieradata/role/codfw/wmcs/cloudgw.yaml seems to have the correct configuration
[10:43:52] so this might be a hiera resolution problem after all
[10:45:21] oh no, it is not correct, it is missing the IPv4-only subnet
[10:46:10] thank you both for the assistance!
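To make the routing fix above concrete, here is a minimal shell sketch of how the missing return route could be verified and injected by hand on the cloudgw (assuming a reasonably recent iproute2). The table and device names (vrf-cloudgw, vlan2107, 185.15.57.10) come from the messages above, and the consolidated /21 is the option suggested at 10:40:53; the actual fix went through hiera/puppet rather than one-off commands:

    # show the routes the cloudgw VRF currently knows about
    sudo ip route show vrf vrf-cloudgw

    # confirm there is no return route for the VM's address (the one
    # being tcpdumped at 10:31:48)
    sudo ip route get 172.16.130.64 vrf vrf-cloudgw

    # hypothetical one-off fix: collapse the two /24s into the single
    # /21 that the switches already route (the durable fix lives in
    # hieradata/role/codfw/wmcs/cloudgw.yaml, applied via puppet)
    sudo ip route del 172.16.128.0/24 via 185.15.57.10 dev vlan2107 vrf vrf-cloudgw
    sudo ip route del 172.16.129.0/24 via 185.15.57.10 dev vlan2107 vrf vrf-cloudgw
    sudo ip route add 172.16.128.0/21 via 185.15.57.10 dev vlan2107 vrf vrf-cloudgw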
[11:44:46] topranks: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135689
[11:45:51] arturo: looks good to me +1
[11:46:43] does that not need the v6 ranges?
[11:47:10] that particular CIDR is for the VXLAN IPv4-only subnet
[11:47:49] thanks!
[12:45:10] arturo: fwiw I was re-reading the scrollback (I was rushing out earlier)
[12:45:20] the rp_filter is enabled in the kernel by default on interfaces
[12:45:28] https://www.irccloud.com/pastebin/2l7QevGA/
[12:45:52] it's disabled in some places but tbh unless it's causing a problem I don't think there is any issue
[12:45:53] yeah, I think we set that via puppet somewhere
[12:46:13] it's enabled on the two physical interfaces and also the vrf (so I guess anything through the l3mdev device)
[12:46:21] the problems have been fixed now, BTW
[12:46:26] anyway it should be fine I think, we just need the right routes in the right places
[12:46:38] but yeah you won't see drops due to this in nftables output
[12:46:50] ok cool that's great :)
[13:04:10] I have another example of rp filtering now
[13:04:43] root@cloudservices2004-dev:~# tcpdump -i any host 172.16.130.92
[13:04:43] 13:02:33.141146 vlan2151 In IP 172.16.130.92.39910 > cloudservices2004-dev.private.codfw.wikimedia.cloud.ldap: Flags [S], seq 28799728, win 42300, options [mss 1410,sackOK,TS val 2870440553 ecr 0,nop,wscale 9], length 0
[13:04:58] root@cloudservices2004-dev:~# ip route get 172.16.130.92
[13:04:58] 172.16.130.92 via 10.192.20.1 dev eno1 src 10.192.20.10 uid 0
[13:05:10] https://www.irccloud.com/pastebin/zEOruVOq/
[13:09:17] topranks: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135715
[14:29:36] does anybody have an idea of why the CI is failing on this one? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135467
[14:29:52] also, how can I reproduce a CI run locally using exactly the same versions/deps?
[14:31:53] dhinus: I have seen that before
[14:32:04] lol, what a cryptic failure
[14:32:10] the way the repo infers which py version to use for each py file is a bit cumbersome
[14:32:34] I know how to fix it, let me look for an example
[14:32:37] I've been banging my head against it for more than an hour :(
[14:32:58] ended up in a rabbit hole trying to run rake locally and/or with the integration/config dockerfile
[14:33:27] iirc you have to add the shebang with the python version
[14:33:36] nah, it's just the logic for python version detection
[14:33:39] yeah, the shebang
[14:33:43] that was the trick
[14:34:03] where?
[14:34:07] even if the file is not really a script (for example, a lib file), it still needs the shebang, otherwise it defaults to py2
[14:35:01] the logic is in rake_modules/taskgen.rb
[14:35:07] def sort_python_files
[14:35:19] Hm, I've run into that shebang thing but I thought there was an explanatory error message?
[14:35:26] * dcaro vanishes again...
[14:35:54] o/
[14:35:55] o/ quick hi and cya soon
[14:35:58] Anyway, yeah, #!/usr/bin/python3 is definitely necessary to make CI happy, if not necessarily sufficient
[14:36:03] * andrewbogott waves at dcaro
[14:36:25] * dhinus waves
[14:36:46] also modules/openstack/files/clientpackages/py2/mwopenstackclients.py could maybe be dropped entirely
[14:36:49] I haven't added any new file though
[14:36:53] why do we still have py2 stuff?
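For context on the rp_filter behaviour discussed above: with strict reverse-path filtering the kernel drops a packet whose source address would not be routed back out the interface it arrived on, and those drops never appear in nftables counters. A quick way to inspect this on a host (the interface names below are taken from the tcpdump and ip output above; the last command is a hypothetical manual override, since at WMF such settings are managed via puppet):

    # per-interface rp_filter setting: 0 = off, 1 = strict, 2 = loose
    sysctl -a 2>/dev/null | grep '\.rp_filter'

    # the asymmetry that trips strict mode here: the SYN arrives on
    # vlan2151, but the route back to its source points out eno1
    ip route get 172.16.130.92

    # hypothetically loosening the check on one interface (do not do
    # this by hand on puppet-managed hosts)
    sudo sysctl -w net.ipv4.conf.vlan2151.rp_filter=2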
[14:37:13] dhinus: it's for files in which there is a diff, per the logic in that rake_modules file
[14:37:30] so even if your change is unrelated, you need to add the shebang if it was not present
[14:38:03] modules/openstack/files/clientpackages/mwopenstackclients.py doesn't have a shebang
[14:38:22] and needs one because your patch updates it
[14:38:38] ack let me try that
[14:39:54] arturo: that was to support Buster VMs, I think? So only recently useless.
[14:41:11] andrewbogott: ack
[14:41:39] btw what does everybody use to debug CI failures locally for operations/puppet?
[14:41:57] I usually just wait for gerrit to compile
[14:42:19] s/compile/run whatever it runs/
[14:42:46] dhinus: ./utils/run_ci_locally.sh
[14:43:00] ah that's what I was looking for :)
[14:43:19] TIL!
[14:43:31] :-( yeah, it has been hidden from me for years
[14:43:40] huh, in the past I've used the docker setup described at https://wikitech.wikimedia.org/wiki/Puppet/Testing#Rspec
[14:44:12] same error with the shebang :(
[14:44:17] maybe it's the py2 file?
[14:44:23] dhinus: likely!
[14:44:35] see if you can drop it?
[14:45:39] I'm trying to remove the change on that file first
[14:45:48] dhinus: if you find any py2 files in your adventures you can just 'git rm' them in an earlier patch
[14:46:50] andrewbogott: ack
[14:47:58] is our top-level DNS really managed by 'Donuts Inc.' of Bellevue?
[14:48:50] * arturo now wants a donut
[14:54:05] Do they have donuts in Andalusia?
[14:55:00] yes
[14:55:01] ok the CI is now happy (shebang + removing the py2 script)
[14:55:17] thanks everyone, I was going crazy :)
[14:56:06] cake or raised?
[15:28:04] arturo: chuckonwu: while you're playing with tofu & cicd, you might want to explore this gitlab feature https://docs.gitlab.com/user/infrastructure/iac/mr_integration/
[15:28:21] if you find the right JSON incantation, you should be able to get gitlab to visualize this nice summary https://docs.gitlab.com/user/infrastructure/iac/img/terraform_plan_widget_v13_2.png
[15:28:47] oh my, I want that
[15:28:52] hehe
[15:29:08] I remembered I saw that months ago, and it took me a while to find it again in the docs!
[15:30:49] would you like to experiment with that here? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning
[15:31:30] I want to finish a few other things first, but if you don't try it yourself I will do it next week :)
[15:31:52] ack, I'll let you do it
[15:31:59] Thanks dhinus, I'll try it out later
[15:32:57] chuckonwu: please do, if you get stuck arturo and I will try to help you :)
[15:34:37] I'm going to restart tools-k8s-worker-nfs-76
[15:34:51] andrewbogott: can you review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135467 ? no rush though
[15:39:05] lgtm. Also you've reminded me to rip out puppet code for bobcat
[15:41:13] yep I also noticed that
[15:41:47] I need a +1 also on the other patch in the stack
[15:42:54] ok, done
[15:43:01] thanks!
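A hedged illustration of the CI failure mode above: per the discussion, sort_python_files in rake_modules/taskgen.rb sorts the changed .py files by their shebang, and any touched file without one is linted as python2, which fails with a cryptic error. A simple one-liner (not part of the repo, just a sketch) to spot offenders before pushing:

    # list .py files touched by the last commit whose first line is not
    # a python3 shebang (these would land in the py2 lint bucket)
    git diff --name-only HEAD~1 -- '*.py' | while read -r f; do
      [ -f "$f" ] || continue   # skip files deleted by the commit
      head -n1 "$f" | grep -q '^#!.*python3' || echo "missing py3 shebang: $f"
    done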
[15:43:18] hm, surprised that that py2 file is not referred to anywhere -- we must have done a partial cleanup already
[15:45:12] probably
[15:46:05] dhinus: I could not resist :-)
[15:46:12] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/merge_requests/14
[15:46:43] https://usercontent.irccloud-cdn.com/file/dJepqWFs/image.png
[15:49:46] cc chuckonwu
[15:50:28] arturo: :D
[15:51:46] now it's my turn to be baffled by CI https://integration.wikimedia.org/ci/job/operations-puppet-tests-bullseye/8536/console
[15:52:15] I guess it's just deciding to lint a bunch of files that aren't in the patch
[15:52:25] arturo: in the docs they say "select View Full Log to go to the plan output", but I don't see it in your MR
[15:52:36] maybe it's getting swallowed by some command?
[15:52:57] ah no I see it now: No changes. Your infrastructure matches the configuration.
[15:53:12] yeah
[15:53:23] I was looking for "Plan: " but that's only if there are changes
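For anyone wanting to reproduce the plan widget from arturo's MR: the "right JSON incantation" per the GitLab IaC docs linked above is a jq filter that reduces the machine-readable plan to create/update/delete counts, declared as a terraform artifact report. A sketch adapted for tofu under stated assumptions (the job name, stage, and image are guesses, and the jq filter is recalled from the GitLab docs, so check them before use):

    plan:
      stage: build
      image: alpine:3.20   # assumption: any image with tofu and jq installed
      script:
        - tofu init
        - tofu plan -out=plan.cache
        # reduce the JSON plan to the create/update/delete counts the
        # MR widget renders (filter adapted from the GitLab docs)
        - tofu show --json plan.cache | jq -r '([.resource_changes[]?.change.actions?]|flatten)|{"create":(map(select(.=="create"))|length),"update":(map(select(.=="update"))|length),"delete":(map(select(.=="delete"))|length)}' > plan.json
      artifacts:
        reports:
          terraform: plan.json   # this report type is what makes the widget appear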