[06:35:08] morning [06:36:09] seems like we have a dead disk on cloudcephmon1004 [07:39:53] morning [07:54:40] morning! [08:28:42] so I guess one of the highlights of the day is that I will try enabling the VXLAN/dualstack network on Cloud VPS eqiad1 [08:28:52] topranks: are you online today? [08:32:28] i.e, this patch: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/204 [08:34:41] created T392458 for the cloudcephmon1004 disk issue [08:34:41] T392458: hw troubleshooting: disk failure (sdb) on coludcephmon1004 - https://phabricator.wikimedia.org/T392458 [08:39:21] arturo: yes I'm around today [08:39:30] great [08:41:38] so yeah I'll be able to help out / test things if needed [08:41:52] ok I'll send a few patches your way soon [08:41:54] I did want to talk about the dns change, just to get my head around all of the networks mentioned [08:41:55] ok [08:42:10] dns reverses are less important obviously - above patch seems fine [08:42:12] yeah, I'm just fixing the DNS one [08:44:45] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/205 [08:46:30] topranks: also, this one: https://gerrit.wikimedia.org/r/c/operations/puppet/+/113779 needs review [08:47:30] Ok let me take a look [08:47:38] On the DNS one I'm confused what these networks are [08:47:48] 20a2:ec80:a000:0000::/64 [08:47:58] 20a2:ec80:a000:0100::/64 [08:48:14] yeah I dropped them in the last patch version [08:48:23] neither are in Netbox, and somewhat don't match the numbering scheme used for 2a02:ec80:a000:1::/64 [08:48:40] ah ok.... my gitlab skills letting me down there still looking at the first revision I think [08:49:09] ok [08:49:21] the only reason I added them in the first patch version is because there are defined for codfw1dev [08:49:22] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/blob/main/resources/codfw1dev-r/cloudinfra-codfw1dev/dns.tf?ref_type=heads#L7 [08:51:17] ok right, yeah I'm not sure that's correct but no major harm [08:52:28] I get a weird gerrit permission error when I try to open the above gerrit link [08:52:51] topranks: oh, same here! [08:52:56] https://usercontent.irccloud-cdn.com/file/bfg345WP/image.png [08:52:57] ok [08:53:00] not just me then [08:53:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137793 [08:53:14] yep that works thanks [08:53:17] the link was incomplete [08:53:26] yep yep [08:53:37] weird it mentions permissions makes it seem worse [08:54:52] dropping the confusing IPv6 reverse zones in codfw1dev: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/206 [08:55:46] arturo: I created this documentation page to refer people to https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_IP_space [08:55:51] arturo: can you run pcc for the above patch against the cloudgw? so I can get a bettter picture of how it translates on the actual box? [08:56:15] topranks: https://puppet-compiler.wmflabs.org/output/1137793/3602/ [08:56:22] thanks [08:57:22] topranks: LGTM [08:59:15] yeah +1 makes sense [08:59:38] ok, I will merge and roll-rollout to cloudgw [09:13:16] cloudgw1004 (standby) seems to have rebooted fine with the new IPv6 settings [09:13:34] so I will reboot cloudgw1003 (primary) next [09:21:25] there is a typo I think? [09:21:33] topranks: where? 
[09:22:05] cmooney@cloudgw1004:~$ ip -br -6 addr show dev vlan1107 scope global [09:22:05] vlan1107@eno12399np0 UP 2a02:ec80:a100:fe04::1:1/64 2a02:ec80:a000:fe04::1004:1/64 [09:22:20] oh I see it [09:22:20] ^^^ this network should be 2a02:ec80:a000:fe04::/64 [09:22:27] a000 not a100 [09:22:27] :a100 vs :a000 [09:22:29] yes [09:23:07] I should have spotted earlier sry [09:23:46] same mistake on the VIP but not the dedicated IP [09:23:49] cmooney@cloudgw1004:~$ ip -br -6 addr show dev vlan1120 scope global [09:23:50] vlan1120@eno12399np0 UP 2a02:ec80:a100:fe03::2/64 2a02:ec80:a000:fe03::1004:1/64 [09:24:37] topranks: I think this fixes both: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138297 [09:25:32] yep +1 looks good [09:25:35] because is the keepalived config, we don't need a reboot, just a service restart [09:25:36] thanks, merging [09:33:52] looks a lot better now [09:34:13] I will merge https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/205 next [09:37:22] I see another problem on the cloudsw, overlapping IP addr [09:37:43] ? [09:41:06] cloudsw1-d5-eqiad had 2a02:ec80:a100:fe03::2 on it's irb.1120 interface [09:41:25] realistically the keepalived VIP on that vlan should have been ::3 [09:41:33] anyway I'm moving the cloudsw to ::3 now [09:41:58] topranks: maybe it is better to change the cloudgw side then? [09:42:03] https://netbox.wikimedia.org/ipam/prefixes/1099/ip-addresses/ [09:42:11] ^^ this is the current I'm happy either way [09:42:21] possibly better the cloudsw is the first two, then the cloudgw the third? [09:42:42] sure [09:43:28] I guess in codfw is different because we only have 1 switch [09:44:19] yes [09:44:35] it also shows why using Netbox is a good idea for this stuff :) [09:44:47] this was mostly copy-pasted from codfw, so that's why :2 was used I guess [09:45:03] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138304 [09:46:52] yeah no probs [09:47:13] lgtm +1 [09:47:35] thanks, merging [09:50:56] topranks: change to :3 is live on cloudgw [09:52:07] ok cool I have updated the routes on the cloudsw too [09:52:47] arturo: there is another change we need [09:52:51] again a little diff than codfw [09:52:55] ok [09:53:16] won't break anything, just for optimal routing [09:53:17] https://phabricator.wikimedia.org/P75288 [09:53:40] both cloudgw have same static route, but we should do it like the v4 ones and have each use their local cloudsw as the default [09:54:51] I see [09:58:26] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138306 <-- running PCC [09:59:05] this now looks good to me: [09:59:06] https://phabricator.wikimedia.org/P75289 [09:59:50] yeah, looks good to me as well [10:01:20] topranks: I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138306 should be ready to merge [10:02:25] yep pcc looks good [10:07:09] giving cloudgw1004 a reboot to make sure it boots with the correct routing config [10:07:30] topranks: anything else before I merge this one? 
https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/204 [10:08:47] ah ok [10:08:58] I was just gonnna say cloudgw1004 went offline [10:08:59] cool [10:09:06] let's wait till it comes back and take a look [10:09:18] but yeah I think all the infra side of it down to the cloudgw is working anyway [10:11:07] cloudgw1004 is back online now [10:13:05] ok, merging the neutron change now [10:13:19] heads up, potential for outage [10:14:11] yep +1 [10:14:33] merging [10:15:09] merged, changes are live [10:15:58] mmm [10:16:08] I see a few netowrk tests failing [10:16:23] I can't access the VMs [10:16:25] I will rollback [10:16:59] ok [10:17:26] it feels like the neutron router gets its routing messed up [10:17:32] (in IPv4 anyway) [11:16:51] * arturo back [11:22:23] I will start now to gradually introduce network resources [11:22:31] first one: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/208 [11:23:30] taavi: would you like to approve? [11:27:41] looking [11:28:35] arturo: +1 [11:28:44] thanks, merging [11:30:06] merged, everything seems fine [11:33:34] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/209 [11:38:07] (if you prefer I can self-merge) [11:40:50] arturo: i'm ok with you self merging but can review as well [11:41:03] ok [11:42:32] (approved) [11:42:39] ok thanks [11:44:27] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/210 next one (I will ping you a dozen times! :-P) [11:45:03] +1 [11:47:08] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/211 [11:48:03] approved [11:49:18] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/212 [11:49:45] wait [11:49:47] hmm [11:49:56] why is that changing enable_snat [11:50:45] what do you mean? [11:50:57] I see the router is now misbehaving [11:51:32] look at the tofu plan [11:51:36] ~ enable_snat = true -> false [11:52:02] I see [11:52:09] although that should be off, we do snat at cloudgw and not at neutron level I think [11:52:48] so if I run tofu now [11:52:50] tofu plan [11:52:51] I get [11:52:53] https://www.irccloud.com/pastebin/RxFCCRwB/ [11:52:59] so the router has somehow changed config [11:53:09] and that matches https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/blob/f3a59c4ee034236f4272a473312337a2fd212d6f/resources/eqiad1-r/admin/routers.tf#L6 [11:53:13] most likely in the last patch, this: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/211 [11:53:14] try applying that change? [11:53:34] sure [11:53:39] but why did it change? :-( [11:53:58] change applied [11:53:58] very good question [11:54:29] plan is now a noop [11:54:40] good [11:54:48] I bet the openstack API resets the neutron router state after adding a new interface [11:54:59] which is what latest commit did [11:55:00] when you before said you saw something misbehaving, what did you see? [11:55:15] exactly that, could not reach VMs without floating IPs [11:55:25] I can reach them now, after that last tofu apply [11:55:51] all is green at the moment [11:55:59] do you want to move forward to the next patch? [11:56:06] yes [11:56:29] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/212 [11:56:34] ok, can you run tofu plan again? 
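For reference, the enable_snat drift described above can also be inspected and reverted straight from the OpenStack CLI instead of waiting for a tofu apply. A minimal sketch, assuming the eqiad1 router is named cloudinstances2b-gw (per the gateway's DNS name elsewhere in this log) and that its external network is wan-transport-eqiad; both names are assumptions here:
# show the router's external gateway block, including the SNAT flag
openstack router show cloudinstances2b-gw -c external_gateway_info -f yaml
# flip neutron-level SNAT back off; --disable-snat only applies together
# with --external-gateway, so the existing external network is repeated
openstack router set --external-gateway wan-transport-eqiad --disable-snat cloudinstances2b-gw
# a tofu plan should then come back as a no-op again
tofu plan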
that should hide the unnecessary diff in https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/212#note_136895 [11:56:39] ah, just did [11:56:51] +1 [11:56:59] ok, thanks, merging [11:57:29] hopefully we discovered the original problem and we are past it [11:58:26] yea, as long as there aren't any other surprise neutron functions that will change that flag [11:58:38] btw, this is something that would been very hard to spot without tofu [11:59:00] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/213 [11:59:04] yes [11:59:10] I'm very glad we have tofu, honestly [11:59:29] approved [12:01:53] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/214 I think this one could also be potentially sensitive [12:02:31] ok, let's give it a try then [12:02:59] merging [12:04:01] (no tofu diff afterwards) [12:04:09] (network seems green) [12:04:26] 172.16.16.1 now responds to ping [12:04:40] great [12:04:52] 2a02:ec80:a000:1::1 still results in a routing loop from the outside [12:05:30] 172.16.16.1 has no external connectivity though [12:05:31] (not sure if that's expected at this stage or not) [12:05:52] same, not sure if expected [12:05:59] topranks: yeah, so far I was just pinging it from a VM [12:06:06] I'm mostly focusing on the rest of the network surviving the changes [12:06:09] yeah that's fine [12:06:15] same with 172.16.8.1 which is there [12:06:20] arturo: sounds good [12:06:21] it could be cloudgw rules even [12:06:27] arturo: yeah +1 [12:06:30] https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/215 [12:06:34] the routing loop is also fine in a way [12:06:41] traffic goes from cloudgw to cloudnet, cloudnet fires it back [12:07:03] routing loop? [12:07:04] hopefully fixed when cloudnet adds the 2a02:ec80:a000:1::/64 network [12:07:16] arturo: +1, I think that patch should maybe fix the loop even [12:07:21] missed that the last one added that for v4 only [12:07:22] yeah [12:07:56] arturo: yeah cloudnet is looping traffic back to the cloudgw [12:07:57] https://phabricator.wikimedia.org/P75299 [12:08:02] but I think that should be expected right now [12:08:09] topranks: try now? [12:08:25] looks good! [12:08:27] looks better from my end [12:08:33] 🎉 [12:08:44] https://phabricator.wikimedia.org/P75300 [12:10:22] cool [12:10:54] I think all network changes are done [12:10:58] this is all that remains [12:10:59] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/216 [12:11:52] ooh [12:12:12] +1 [12:13:22] so I guess we discovered this enable_snat bit flip by a cosmic ray was one of the original problem s [12:14:22] blame one: 1) cosmic ray 2) tofu openstack provider 3) gophercloud lib 4) openstack API [12:14:36] e) all of the above [12:14:53] heh [12:15:30] did you work out what happened? [12:15:47] same things applied one-by-one "just worked"? [12:16:04] topranks: when adding a new address to the neutron router, it would somehow change internal config to start doing some weird and unwanted NAT [12:16:13] hmm ok [12:16:19] for some reason adding the external v6 interface to the neutron router flipped the neutron-level SNAT switch to on [12:16:31] is traceroute showing PTRs now? [12:16:31] for everything? 
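A quick way to double-check the /56-delegation-versus-/64-zone question raised above, from any host with dig. The nibble zone names below are derived from 2a02:ec80:a000::/56 and 2a02:ec80:a000:1::/64; ns0.wikimedia.org is assumed to be one of the production authoritative servers carrying the delegation:
# the parent zone should hand out a referral (NS records in the authority
# section) for the delegated /56 reverse zone
dig +norecurse NS 0.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa @ns0.wikimedia.org
# designate itself serves the /64 zone, so that is where the SOA lives
dig +short SOA 1.0.0.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa @ns0.openstack.eqiad1.wikimediacloud.org.
# and +trace shows the full chain a recursor would walk
dig +trace -x 2a02:ec80:a000:1::1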
that is weird [12:16:43] probably a bug I guess [12:16:47] arturo: we need to merge this [12:16:47] https://gerrit.wikimedia.org/r/c/operations/dns/+/1113527 [12:16:50] if you an +1 [12:16:53] *can [12:17:23] LGTM, +1'd [12:18:29] cmooney@cumin1002:~$ dig +short -x 2a02:ec80:a000:1::1 @ns0.openstack.eqiad1.wikimediacloud.org. [12:18:29] vxlan-dualstack.cloudinstances2b-gw.svc.eqiad1.wikimedia.cloud. [12:18:34] ^^ looks good that side [12:19:06] hmm, does it matter that the delegations are for the /56s but the zones (and so the SOAs) in designate are for the /64s? [12:20:19] arturo: I'm going to try spinning up a new VM in testlabs with dualstack connectivity [12:20:42] taavi: recursors generally just request A records, and follow NS if they get those back [12:20:55] so it ought to work, though it is not ideal [12:20:58] taavi: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/-/merge_requests/16 [12:21:02] AAAA records in this case [12:21:50] I think overall still better delegating the entire ranges in the WMF auth dns so we don't have to touch it again if more openstack networks are added [12:22:24] arturo: hmm, does https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/eqiad/wmcs/cloudgw.yaml#13 need updating first? [12:22:41] looks good! [12:22:46] https://www.irccloud.com/pastebin/eFjaZyRv/ [12:22:57] taavi: maybe? for NAT, no? [12:23:04] topranks: cool [12:23:36] arturo: that has the legacy VLAN /21 only, why would it not need the vxlan network /21s too? [12:23:43] taavi: most likely yes [12:24:20] yeah cloudgw won't let them through [12:24:34] postrouting will only match that [12:25:00] counter packets 9967791 bytes 460632414 snat ip to 185.15.56.1 comment "routing_source_ip" [12:27:17] fixing that now [12:28:52] taavi topranks https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138336 [12:29:40] lgtm [12:29:49] well drives me nuts they are not in order. but lgtm :P [13:04:34] hmm, deployed that to both cloudgws and cloudservices nodes but networktests-vxlan-dualstack still can't talk to DNS [13:05:15] I still need to reboot cloudgw1003 [13:05:17] doing it now [13:06:00] ah, that was it :P [13:07:51] networktests-vxlan-dualstack is missing a PTR record [13:10:02] maybe the reverse zone is missing? [13:10:49] anyway, I think that VM needs to be re-created, the initial setup timed out due to the original network issues and now it's in some weird state where it doesn't know its project [13:11:24] was that VM created using https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning/ or you by hand? 
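On the cloudgw side, whether a prefix is actually covered by the source-NAT rule can be confirmed on the box itself. A rough sketch: only the routing_source_ip rule comment is taken from the output quoted above; the exact table/chain layout and the 172.16.16.0/21 vxlan range (inferred from the 172.16.16.1 gateway and the VM's 172.16.17.248 address) are assumptions:
# locate the SNAT rule(s) and the source prefixes they match
sudo nft list ruleset | grep -B3 'comment "routing_source_ip"'
# after the hiera change plus a puppet run, the new vxlan range should
# appear in the rule's saddr match as well
sudo nft list ruleset | grep '172.16.16.0/21'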
[13:11:34] created via tofu [13:12:01] ok, then I will delete via horizon, and re run the gitlab pipeline [13:12:09] ack [13:15:12] VM was re-created by tofu [13:16:09] I just updated the network tests for eqiad1 [13:16:18] if you want a list of things to update [13:16:19] cookbook wmcs.openstack.network.tests --cluster-name eqiad1 [13:16:26] will report a hundred little things to update [13:17:22] hrm, puppet is still failing with [13:17:24] > Apr 23 13:17:06 networktests-vxlan-dualstack cloud-init[1381]: Info: Creating a new SSL certificate request for networktests-vxlan-dualstack..eqiad1.wikimedia.cloud [13:17:30] so something's not setting the hostname properly [13:17:52] also, the puppetmaster may need additional security group rules for the new subnets [13:17:54] aha, that's in /etc/hosts, so cloud-init maybe [13:18:46] yeah there are lots of tiny things [13:18:58] it'd be great if we could move those to tofu/puppet as we fix them [13:19:55] ah [13:21:55] also, typos https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138353 [13:22:34] re-creating the VM once more [13:27:47] now it's running puppet [13:29:03] what was the fix? [13:29:25] needed a puppet run on cloudcontrols to add the new IP range to the password allowlist [13:29:35] ok [13:29:51] so the VM is now up, can talk to dns/puppet/ldap/etc [13:30:01] next step is to run the network tests against it? [13:30:30] yeah, with that last typo fix we went from [13:30:33] [2025-04-23 13:18:44] INFO: --- passed tests: 36 [13:30:33] [2025-04-23 13:18:44] INFO: --- failed tests: 42 [13:30:34] [2025-04-23 13:18:44] INFO: --- total tests: 78 [13:30:35] to [13:31:03] [2025-04-23 13:30:53] INFO: --- passed tests: 62 [13:31:03] [2025-04-23 13:30:53] INFO: --- failed tests: 16 [13:31:03] [2025-04-23 13:30:53] INFO: --- total tests: 78 [13:31:08] oh, and we're still missing the reverse DNS entry [13:31:39] I need to grab food, be back later [13:31:44] ok, ttyl [13:37:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138356/ should fix the reverse DNS [13:54:28] no, that doesn't seem to have helped [13:56:12] you restarted all the sink agents? (probably puppet did that but I'm not 100% certain) [13:57:06] i ran puppet on cloudcontrol1* which should have done that [13:57:13] Notice: /Stage[main]/Openstack::Designate::Service/Service[designate-sink]: Triggered 'refresh' from 1 event [13:59:07] yep, that should do it [13:59:51] I just woke up... is the network broken enough that we should cancel meetings to troubleshoot? [14:00:08] no! [14:00:16] this is all about the new v6 networks that we've been introducing today [14:00:33] the old networks that everyone is actually using work just fine [14:00:37] oh great! [14:01:17] We had catastrophic network failures all the other times we did anything with ipv6 so I'm trained to expect the worst [14:03:41] yeah, that was because for some reason neutron flipped the neutron-level SNAT toggle to enabled when we added the router v6 external interface [14:03:50] but once you turn that back off everything went back to normal [14:05:19] sharp eyes, finding that [14:05:24] more like tofu [14:06:20] oh, a tofu run flipped it back? That's fancy [14:06:26] yeah! [14:09:22] does anyone happen to know what's the difference between wmcs_nova_fixed_ptr and nova_fixed_multi in designate? 
[14:10:09] I think fixed_ptr is the thing that arturo added to handle v6 [14:10:12] * andrewbogott checks [14:10:27] in codfw: enabled_notification_handlers: "nova_fixed, wmcs_nova_fixed_ptr, wmf_sink" [14:10:33] in eqiad1: enabled_notification_handlers: "nova_fixed_multi, wmf_sink" [14:10:35] yes [14:10:41] correct [14:10:45] nova_fixed_multi is very clearly IPv4 only? [14:10:58] arturo: should I just copy the list of handlers from codfw1dev to eqiad1? [14:11:16] yes, but also, Is not that in puppet? [14:11:23] it's a hiera key [14:11:28] that has different values in different deployments [14:11:43] yes, then definitely enable in eqiad1 [14:11:55] today was the day to enable it :-) [14:12:25] (unrelated: has anyone touched/investigated cloudcephmon1004 since it started emailing us yesterday?) [14:12:36] i think d.caro made a task for dc-ops [14:12:41] I created a task for dc-ops yep [14:12:47] arturo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138371/ [14:13:37] (what is `nova_fixed` then?) [14:14:08] I see it T392458 now [14:14:08] T392458: hw troubleshooting: disk failure (sdb) on coludcephmon1004 - https://phabricator.wikimedia.org/T392458 [14:14:51] taavi: 'nova_fixed' sounds like a typo, although that doesn't explain how anything works with that set [14:15:13] next question: does anything work in codfw1dev? [14:15:20] it did on Monday [14:15:40] ok, and codwf1dev has nova_fixed enabled [14:15:44] taavi: codfw1dev works perfectly network wise [14:16:21] please merge [14:16:39] the nova-fixed-ptr code is here: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/designate-sink-wmcs-nova-fixed-ptr [14:16:50] it specifically handles PTR records only [14:16:55] the others handle other records [14:17:03] ok, merging [14:17:21] arturo: so nova_fixed_multi is fully unused now? [14:17:36] andrewbogott: I think so, yes [14:17:38] i'm already working on a patch to drop it from puppet [14:17:46] ok [14:18:24] anyway, running puppet for the config change [14:19:34] cleanup patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1138373 [14:20:41] LGTM [14:20:55] re-creating the VM once more [14:22:07] the extra time I spent creating https://gitlab.wikimedia.org/repos/cloud/cloud-vps/networktests-tofu-provisioning seems to be paying off :-) [14:22:56] taavi@runko:~ $ host networktests-vxlan-dualstack.testlabs.eqiad1.wikimedia.cloud [14:22:56] networktests-vxlan-dualstack.testlabs.eqiad1.wikimedia.cloud has address 172.16.17.248 [14:22:56] networktests-vxlan-dualstack.testlabs.eqiad1.wikimedia.cloud has IPv6 address 2a02:ec80:a000:1::ed [14:22:56] taavi@runko:~ $ host 2a02:ec80:a000:1::ed [14:22:56] d.e.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1.0.0.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa domain name pointer networktests-vxlan-dualstack.testlabs.eqiad1.wikimedia.cloud. 
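If the sink handler list ever needs checking again, both the rendered config and the service can be inspected directly on a cloudcontrol. A small sketch; the /etc/designate path is assumed, the option and service names are the ones quoted above:
# the hiera key is rendered into the designate config under the same name
sudo grep -r enabled_notification_handlers /etc/designate/
# designate-sink is the process consuming the nova notifications
sudo systemctl status designate-sink --no-pager
sudo journalctl -u designate-sink -n 50 --no-pager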
[14:23:09] 🎉 [14:23:12] arturo: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/217 [14:23:45] taavi: conflicts with https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/206 [14:25:17] that one also needs a rebase [14:25:55] rebased [14:26:54] I'll merge mine first [14:27:51] ok, updated mine [14:28:20] +1'd [14:29:37] deployed [14:30:43] [2025-04-23 14:25:04] INFO: --- passed tests: 48 [14:30:43] [2025-04-23 14:25:04] INFO: --- failed tests: 30 [14:30:43] [2025-04-23 14:25:04] INFO: --- total tests: 78 [14:33:04] now [14:33:10] [2025-04-23 14:32:54] INFO: --- passed tests: 63 [14:33:10] [2025-04-23 14:32:54] INFO: --- failed tests: 15 [14:33:10] [2025-04-23 14:32:54] INFO: --- total tests: 78 [14:53:13] andrewbogott: i was wrong about the MTU thing, it's not visible in netbox but it's set on the switch side [14:53:45] taavi: I'm unable to check the quarry k8s cluster as I did last time (from the quarry-bastion, using roo.k's credentials), has something changed? [14:53:54] Unable to connect to the server: dial tcp 172.16.2.84:6443: connect: no route to host [14:53:58] maybe the IP or something [14:54:01] the credentials are now in andrew's home directory! [14:54:14] xd, okok [14:54:30] the system could, uh, maybe not be as tied to a specific user as it currently is [14:55:33] taavi: re: mtu, great thanks for double-checking [14:55:51] re: quarry, paws is deployed from ~root, we can do that with quarry as well [14:56:04] I mean, not perfect but slightly better? [14:56:19] probably yep [14:56:48] chuckonwu: arturo: is there a way to run `tofu plan` locally with the toolforge tofu repo? [14:57:03] Let's do that next deployment, and then add some obvious note in my and rook's $home [14:57:04] taavi: no, only meant for the gitlab ci/cd pipeline [14:57:25] ok, and it seems like it's in toolsbeta only at this point? [14:57:26] I'm just crafting a patch to enable it to work on tools [14:57:44] oh great [15:01:03] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/4 with this, only creds should be missing [15:04:53] I will generate them [15:18:45] credentials sorted ✅ [15:19:55] taavi: I just merged the MR, so you should feel free to follow up with the VM [15:20:29] maybe create a module/vms in a similar fashion to the work chuckonwu is doing with DNS on https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/3 [15:22:44] hmm, passing the needed data between objects can get very hard with those sorts of maps I think [15:26:42] let me show [15:26:48] the idea [15:27:35] this is using the same pattern that metricsinfra and the tofu registry are using: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/5 [15:29:52] taavi: ok [15:30:56] taavi: +1d [15:32:14] hmm, currently the DNS module does not allow provisioning both A and AAAA records for the same name [15:35:30] what do you mean? 
[15:36:16] in toolforge tofu-provisioning, the dns module takes a dict of {record name => record data}, so you can't specify two records of different types for a specific name [15:36:47] taavi: do it like this: https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/blob/main/resources/eqiad1-r/cloudinfra/dns.tf?ref_type=heads#L175 [15:37:12] will not work due to https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/blob/main/modules/dns/main.tf?ref_type=heads#L33 [15:37:21] oh wiat [15:37:28] diff keys [15:38:12] right [15:38:13] thanks [15:38:42] anyway, need to go, will continue later [15:38:54] you may need to import the whatever zone as well [15:39:02] which is exactly what chuckonwu is doing for toolsbeta [15:50:26] spotted a similar problem to what you detected earlier taavi [15:50:29] https://www.irccloud.com/pastebin/jyZ3qOiq/ [15:50:40] was it a cloud-init issue? [15:51:02] yeah, you need to re-create the Vm [15:51:09] ack [15:56:32] * arturo offline [15:56:50] want me to clean up those dns leaks or are they of interest ipv6wise? [15:57:12] they are not [15:57:47] ok, will delete and see if new ones appear! [15:58:05] have a good evening! [16:12:04] * dcaro off [16:12:06] cya tomorrow [16:12:17] * andrewbogott waves [16:13:04] topranks, do you happen to remember if the QOS settings you added to keep ceph's health check pings working are also present in the codfw1dev cluster? [16:14:26] andrewbogott: yeah they should be, they were applied at the puppet level so I assume the same config is present there [16:14:37] it's less of an issue in codfw as there is only one switch [16:14:45] ok! I am probably seeing something else then. thx [16:14:55] we could still saturate outbound towards a given server, but there are no links connecting switches to max out [16:15:43] I was seeing a new osd flap between up and down but I didn't actually investigate much. I'll see if I can make it happen again today :) [16:17:03] ok yeah [16:17:13] just had a quick look and the traffic prioritization seems to be ok [16:17:25] but levels overall are quite low so it's not really doing anything [16:18:46] how about now, did the graph get all jumpy? [16:19:24] I'm only doing one drive this time, it seems to be staying up so far [16:21:07] which host is it? [16:21:18] cloudcephosd2004-dev is the new one [16:21:39] it's much much bigger than the old osd nodes so the balancing algorithm is likely to do some dramatic things [16:21:52] At one point ceph informed me the rebalancing would complete after -1 days [16:22:31] oh I guess that was just division by 0. Now it's a little bit consistent. [16:22:57] ok, there, it just dropped from 'up' to 'down' [16:23:15] the others got busy for a minute there [16:23:19] https://grafana.wikimedia.org/goto/hIraTfJHR [16:23:23] 2004 not doing much though [16:24:49] for tomorrow: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/6 [16:24:49] ceph thinks it's unreachable so I wouldn't expect it to do much now :( [16:25:33] there are basically no drops on the switch level [16:25:33] is anything saturated during that spike? 
[16:25:36] https://grafana.wikimedia.org/goto/OuesofJNg [16:25:37] ok [16:25:38] nope [16:25:44] well then what the heck [16:26:01] closest thing is outbound towards cloudcephosd2002 [16:26:10] I wonder if there's a way to ask ceph "why do you think this is down" [16:26:21] one thing which might make things "fun" here is the new host (2004) is connected at 10G, the others are all at 1G [16:27:00] oh, actually -- during that spike was there any activity on 2004? I'm wondering if it just isn't working at all and ceph is just wrong when it says it's up and pooled. [16:27:31] 2004 is the only one that didn't spike in usage, seems around the same as it was [16:27:50] hmmm [16:28:24] any chance the data-plane network just isn't up at all for that host? [16:28:50] yeah it's not set up right [16:29:19] * andrewbogott depools again [16:29:22] it has no connection to the "cloud-storage" vlan (cluster network in ceph terminology) [16:29:58] well that 100% explains what I'm seeing [16:30:19] Is that a switch thing or something in the host config? [16:31:19] should be ok now [16:31:24] cmooney@cloudcephosd2004-dev:~$ ping 192.168.4.3 [16:31:24] PING 192.168.4.3 (192.168.4.3) 56(84) bytes of data. [16:31:24] 64 bytes from 192.168.4.3: icmp_seq=1 ttl=64 time=12.7 ms [16:31:24] 64 bytes from 192.168.4.3: icmp_seq=2 ttl=64 time=7.57 ms [16:31:35] host config was ok, switch port was not configured with any vlan [16:32:52] great! Let's see if I can pool it now... [16:36:22] so far it's staying up and the # of objects to be moved is decreasing which seems promising. [16:47:59] yeah there is a steady stream of traffic out to 2004 on the newly-enabled port [17:42:04] seems to be somewhat consistent now. Thanks for fixing! I'll probably be tomorrow before I know for sure if things are settling properly. [19:11:12] dhinus and cteam: I'm not 100% sure that the current dns leaks aren't the leak-detector failing; please don't do a cleanup until we can check with artu.ro to see if having two ipv4s on that host(s) is unexpected. [19:11:43] ack [20:19:45] not expected [20:22:36] VMs in the dualstack subnet only have 1 IPv4 (and 1 IPv6) [20:23:13] I guess the leak detector doesn't understand that? This was not updated in codfw [20:58:21] In this case it's showing two ipv4s. [20:58:23] ["Found 2 ptr recordsets for the same VM: networktests-vxlan-dualstack.testlabs.eqiad1.wikimedia.cloud. ['185.18.16.172.in-addr.arpa.', '248.17.16.172.in-addr.arpa.']" [21:48:35] the VM was recreated multiple times today, with the same name [21:49:21] ok, that's probably all it is then. thx [21:49:43] btw I'm stealing T377126 from you, I imagine it'll need some switch/bgp magic tomorrow to finalize [21:49:43] T377126: replace cloudlb2001-dev with cloudlb2004-dev - https://phabricator.wikimedia.org/T377126 [22:17:32] ack
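Closing note on the cloudcephosd2004-dev flapping above: since it turned out to be cluster-network reachability rather than link saturation, a short checklist for the next time an OSD bounces between up and down. Standard ceph/iproute2 commands only; the 192.168.4.x cluster-network addressing comes from the ping above, and <osd-id> is a placeholder for whichever OSD is flapping:
# overall state and the reason ceph last marked something down
sudo ceph -s
sudo ceph health detail | head -n 30
# confirm the OSD registered an address on the cluster (back) network
sudo ceph osd metadata <osd-id> | grep -E '"back_addr"|"hostname"'
# and that the storage vlan is actually reachable from the host
ip -br addr show | grep 192.168.4
ping -c 3 192.168.4.3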