[06:36:53] greetings
[06:37:11] dcaro: yes confirmed with your reproducer in the task I'm having the same problem locally with lima-kilo
[07:27:55] provisioning new trixie bastions in toolforge: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/76
[07:42:01] LGTM
[07:51:44] godog: found the issue, it's the network policy for metrics-http on the loki side, it's filtering for 127.0.0.1, it should not as the pods have their own ips and requests don't come from localhost
[07:51:59] it seems like the toolforge project is out of CPU quota :/
[07:52:59] dcaro: nice find! out of curiosity how/when did that change ?
[07:53:26] no idea xd, just tested in my lima-kilo changing that and now it worked
[07:53:49] sweet
[07:56:02] anyhow, today and tomorrow I'm ooo, so I'll handle it on monday, unless someone wants to get to it before :)
[07:56:20] I saw T404282 and had a look, I think it's ok to resolve but lmk what you think.
[07:56:20] T404282: KernelErrors Server cloudcephosd1041 logged kernel errors - https://phabricator.wikimedia.org/T404282
[08:03:15] dcaro: ack!
[08:03:51] volans: interesting, yeah seems ~benign ?
[08:04:23] I was kinda surprised to not find anything in the bmc's logs
[08:33:18] morning
[08:35:11] o/
[08:36:57] volans: left a comment, we can resolve it for now but it's the second time it happens on that same DIMM
[09:16:05] related, I reviewed ALL the KernelErrors phab tasks and I think we can just delete the alert, I created T404300
[09:16:05] T404300: Remove KernelErrors alerts - https://phabricator.wikimedia.org/T404300
[09:20:36] neat, thank you dhinus !
[10:56:53] dhinus: T404325 is because I closed the previous one too early?
[10:56:54] T404325: KernelErrors Server cloudcephosd1041 logged kernel errors - https://phabricator.wikimedia.org/T404325
[10:58:07] volans: ah yes that's a problem with how the alert is designed, we need to ack the alert in alertmanager
[10:58:14] otherwise it will fire again and reopen a task
[10:58:58] it will resolve after 24h because of how it's defined
[10:59:26] yeah I figured but didn't think about it earlier, sorry
[10:59:41] yeah I also forgot I needed to ack!
[11:00:19] acked and resolved as duplicate
[11:00:19] thx
[13:10:16] FYI I'm taking a look at the k8s-worker-nfs-53 alert also in the context of T404322
[13:10:16] T404322: wmf-auto-restart can get wedged on nfs4 mounts even when the filesystem is excluded - https://phabricator.wikimedia.org/T404322
[13:10:36] will reboot once done
[13:44:58] {{done}}
[13:51:41] compare the memory graphs on https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=now-30m&to=now&timezone=utc&var-server=cloudcephosd1016&var-datasource=000000026&var-cluster=wmcs&refresh=5m
[13:51:47] with https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=now-3h&to=now&timezone=utc&var-server=cloudcephosd1017&var-datasource=000000026&var-cluster=wmcs&refresh=5m
[13:52:16] 1016 is running reef, 1017 is running quincy. The memory usage is all spiky on 1017 and nice and smooth on 1016.
[13:52:34] I'm hoping that means they made the new release better, and not that it's just not doing anything
[13:55:36] andrewbogott: which panel(s) were you looking at from the dashboards ?
[13:56:08] Memory: saturation is the most obvious difference (and most likely to be evidence of a bugfix)
[13:56:25] but the network: utilization graph is also dramatically different
[13:56:54] oooh wait I have those set to different timescales
[13:56:56] * andrewbogott headdesk
[13:57:47] ok! now with the same scale it's only the memory:saturation that looks different to me
[13:57:55] and better with reef.
[13:57:58] So, I'll take it :)
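For the graph comparison above, a minimal sketch of pulling the same metric for both hosts over one shared window via the Prometheus HTTP API, so the comparison isn't skewed by mismatched dashboard time ranges; the Prometheus URL, the instance label format, and the MemAvailable metric are assumptions for illustration, not the exact expression behind the Memory: saturation panel:

```python
# Minimal sketch: query one shared 3h window for both hosts so the comparison
# is apples-to-apples. PROM is a placeholder endpoint; the instance label
# regex assumes node_exporter-style labels like "cloudcephosd1016:9100".
import time
import requests

PROM = "https://prometheus.example.org"  # placeholder endpoint
END = time.time()
START = END - 3 * 3600
STEP = "60s"

def mem_available(host):
    """Return [(timestamp, value), ...] samples of MemAvailable for one host."""
    query = f'node_memory_MemAvailable_bytes{{instance=~"{host}.*"}}'
    resp = requests.get(
        f"{PROM}/api/v1/query_range",
        params={"query": query, "start": START, "end": END, "step": STEP},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return result[0]["values"] if result else []

for host in ("cloudcephosd1016", "cloudcephosd1017"):
    values = mem_available(host)
    if values:
        mean = sum(float(v) for _, v in values) / len(values)
        print(f"{host}: {len(values)} samples, mean MemAvailable ~ {mean / 2**30:.1f} GiB")
```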
[14:01:44] fyi godog, when we ran version P with debian bookworm, RAM use grew without bounds until the OOM killer fired. So I'm on edge about moving nodes to bookworm even though it's a later ceph (R) version this time around.
[14:02:28] does anybody know how to track where exactly these requests originate? T404347
[14:02:29] T404347: WMCS is sending millions of invalid requests to Europeana.eu servers - https://phabricator.wikimedia.org/T404347
[14:03:43] andrewbogott: ack, fair enough!
[14:04:14] dhinus: as far as I know we do not currently have a great way to track this unless there's an obvious user agent set. That said, there's a 60% chance that it's iabot, so I would cc max/cyberpower and ask 'this you?'
[14:05:01] hmmmm actually if it's a search query then I'd drop those odds to 30%. Still worth asking though.
[14:06:29] are there any logs of the NATting or is there a place where one could do tcpdump on the internal interface and see the originating internal ip?
[14:10:43] I don't think we have logs; tcpdump might be possible although we haven't had luck tracking such things in the past. topranks might have suggestions (and/or discouragement)
[14:11:36] we have NAT logs since recently, .. but since that site is behind cloudflare we're going to need some more detail to distinguish it from all other traffic to cf
[14:12:07] ooh, nice! I mean, re: nat logs
[14:13:11] dhinus: so for options: 1) ask the reporter for timestamp+source port combinations, and match those to NAT logs, or 2) try to look at which node is making recdns queries for that domain
[14:14:09] browsing phab tasks suggests that it's probably a wikidata-adjacent project, although that doesn't narrow things down a whole lot.
[14:14:56] andrewbogott: I doubt it's IABot, since that runs on nodes with floating IPs so it would not be using the generic NAT address
[14:15:49] taavi: yep, that makes sense. I'm just trained by a history of "why do you keep loading pages on our site?" emails about iabot -- this is clearly not that.
[14:19:58] taavi: 1) asked, for 2) where would I look, in our DNS servers?
[14:21:50] dhinus: one of the cloudservices* boxes acts as the active dns recursor, basically you'd need to tcpdump the incoming dns traffic there
[14:22:46] I see, and hope there's a request flying right now. I wonder if they come in bursts or not, I asked
[16:01:00] for dns logging: T404373
[16:01:00] T404373: Log DNS queries from Cloud VPS clients - https://phabricator.wikimedia.org/T404373
[16:01:09] thanks taavi
[16:10:42] MRs for updating toolforge bastions: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/77
[16:45:48] dhinus, andrewbogott: re: T404150 i'm not sure how to proceed. i think using a public floating ip on wan-transport-eqiad for the lb based service would make it much easier for us to get gitlab-cloud-runner back onto wmcs, so i would prefer that. but it seems like y'all are saying there's a different way to do it. is that with an lb on octavia and a web proxy?
[16:45:49] T404150: Additional floating IPs for gitlab-cloud-runner testing in testlabs project - https://phabricator.wikimedia.org/T404150
[16:46:57] not sure how to make that happen, and am a little worried that having the lb managed in k8s and the web proxy managed elsewhere (tofu) is going to make this a bit brittle. unless we can ensure the lb ip is static (not sure if that's possible when not using floating ips?)
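On the static LB address worry just above, a minimal sketch (kubernetes Python client) of requesting a fixed VIP via spec.loadBalancerIP; whether the Octavia cloud-provider behind Magnum honors that field here is exactly the open question, and the service name, selector, namespace and address below are made up:

```python
# Minimal sketch, assuming the kubernetes Python client and a reachable
# kubeconfig. spec.loadBalancerIP is deprecated in newer Kubernetes in favor
# of provider-specific annotations, and providers that don't support it will
# simply ignore it, in which case the VIP can still change on re-provisioning.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="gitlab-runner-ingress"),  # placeholder
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "ingress-nginx"},                       # placeholder
        ports=[client.V1ServicePort(port=443, target_port=8443)],
        # Ask for a fixed VIP so a web proxy pointing at it survives the
        # Service being deleted and recreated.
        load_balancer_ip="172.16.0.42",                           # placeholder
    ),
)
v1.create_namespaced_service(namespace="gitlab-runner", body=svc)
```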
[16:53:16] andrewbogott: are you still using the dns-recursor-on-a-VM experiment added in T374830 or can I rip out the puppet code for that?
[16:53:17] T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org - https://phabricator.wikimedia.org/T374830
[16:53:48] taavi, dduvall, sorry I won't be able to respond for an hour or so, in meetings
[16:54:07] np
[16:55:10] let's see...
[16:55:47] dduvall: tell me what process you're imagining for a rebuild, that might help me understand.
[16:55:54] Are you thinking you'd have two floating IPs and do blue/green?
[16:57:45] i'm using the existing terraform/tofu in https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner but refactored with a `wmcs` module instead of the existing `digitalocean` one
[16:58:23] the `wmcs` module i have is pretty basic so far
[16:58:26] https://www.irccloud.com/pastebin/FDSkDKQS/
[16:59:32] for now, i just need a place to test the tofu. i'm using testlabs for that because i had access and it has enough quota :). i think at the bare minimum i just need one more floating ip for the lb
[17:00:11] i can get by with a tunnel for the k8s api when testing
[17:00:28] dduvall: I haven't played with Magnum/Octavia enough but I was expecting you could still provision a LoadBalancer in k8s without using the floating ip quotas (then how to reach that LB is another question...)
[17:00:44] once i reach a proof of concept, i imagine we would request a new project with appropriate quotas for this, and we would want two floating ips in that project
[17:01:28] I don't think it's a problem to bump the quota up but I think it's worth making sure we understand if it's a strict requirement or not, mostly to learn for the future
[17:02:17] andrewbogott: if you add a +1 to that quota request I can bump up the quota tomorrow
[17:02:22] dhinus: right. seems like a web proxy in front of the lb would be the way. if we can make it so the lb has a static/fixed ip, that seems doable. the scenario i'm worried about is one where 1) the lb is provisioned; 2) we create a web proxy to point to it; 3) something happens and k8s reprovisions the service with a new lb and ip
[17:02:33] can we do that in some project that is not testlabs?
[17:03:25] taavi: another project is fine by me. i was just using testlabs because that's what we had used when testing this a while back, i think at the 2023 offsite?
[17:03:42] and yeah, it had enough quota for the nodes
[17:04:19] Magnum assumes that public floating ips are a normal thing.
[17:04:27] and also, this is a poc, so i wasn't sure an additional project was a good thing before knowing it would work out
[17:06:32] bd808: does magnum/octavia directly request a floating ip from the networking api? the part that confuses me is that once I created an Octavia LB and I didn't think it used a floating ip, but maybe I just didn't notice and I had enough quota
[17:08:02] dhinus: The Heat automation in Magnum makes the Octavia LB and tries to assign it a public IP. The whole thing is a non-transparent mechanism that lets you either enable or disable ingress.
[17:08:17] it doesn't know how to use IPv6 either
[17:08:28] ah that makes sense. can we somehow tell magnum/octavia to use a private floating IP, so that we don't waste the public ones we have?
[17:08:46] not that I have found, no
[17:08:51] ack
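A minimal sketch, assuming openstacksdk with a clouds.yaml entry (the cloud name below is a placeholder), for listing what Octavia and Neutron actually allocated in the project, which helps confirm whether Magnum grabbed a public floating IP regardless of the template settings:

```python
# Minimal sketch: dump the project's Octavia load balancers and floating IPs.
# Assumes credentials for the project are configured under the (placeholder)
# clouds.yaml entry "eqiad1".
import openstack

conn = openstack.connect(cloud="eqiad1")

print("Octavia load balancers:")
for lb in conn.load_balancer.load_balancers():
    print(f"  {lb.name}: vip={lb.vip_address} "
          f"provisioning={lb.provisioning_status} operating={lb.operating_status}")

print("Floating IPs in the project:")
for fip in conn.network.ips():
    print(f"  {fip.floating_ip_address} -> fixed={fip.fixed_ip_address} "
          f"status={fip.status}")
```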
[17:09:39] I have no objections to bumping up the quotas, not sure if there's another good project we can use that is not testlabs
[17:09:41] the whole thing is both very opinionated and very fragile
[17:09:48] I see :)
[17:10:04] that matches my impression of magnum so far :P
[17:10:27] I don't know if the newer provisioning system that andrewbogott was trialing in codfw fixes any of this
[17:10:36] I have to log off, please leave some +1 or comments in the quota request and I'll change the quotas tomorrow!
[17:10:48] dhinus: thank you!
[17:11:14] yw, sorry if this is taking a bit longer but we're still all understanding how magnum works :)
[17:11:40] * dhinus off
[17:11:50] oh np there. this is a spike to tease out the blockers so it is going as planned :)
[17:13:44] bd808: i am curious what would happen if i used a different `external-network` label
[17:15:11] hmm, maybe not "The name or network ID of a Neutron network to provide connectivity to the external internet for the cluster. This network must be an external network, i.e. its attribute ‘router:external’ must be ‘True’. The servers in the cluster will be connected to a private network and Magnum will create a router between this private network and the external network."
[17:18:19] dduvall: yeah, I think it might be possible for us to make a fake external network to point things at, but doing that might also confuse other OpenStack bits.
[17:19:34] and we would have to manage the web proxy independently which is not desirable
[17:20:19] I keep battling thoughts of abandoning Magnum and instead working on Tofu + Puppet integration to build k8s clusters, but E_NOTIME and it feels like NIH a bit too.
[17:20:44] one day i want to write a kubernetes controller to manage cloud vps web proxy based on service or ingress resources
[17:21:16] taavi: that would be cool
[17:21:28] I really want a k8s cluster for deployment-prep and Magnum unfortunately is not going to make it easy to integrate the Puppet provided secrets that a copy of WikiKube will need to expose.
[17:21:37] is this a thing in our current magnum deployment https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/octavia-ingress-controller/using-octavia-ingress-controller.md ?
[17:22:12] dduvall: it should be, yes
[17:22:42] i wonder if we can replace our home-rolled nginx ingress with that. it might simplify things
[17:23:03] i.e. get rid of our nginx ingress and cert-manager and external-dns
[17:23:16] you can poke around in the cluster in the zuul project to see what is there if you'd like
[17:23:43] https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning should show you how to get in
[17:23:59] nice. i've been poaching your code already :)
[18:02:16] ok, I still need lunch but I'm reading the backscroll...
[18:02:49] Is one of the issues that you can't launch an octavia lb at all because it demands a floating IP?
[18:09:24] If yes, that's a thing that's fixed in the codfw1dev deployment, and might be a reason to accelerate providing that driver in eqiad. But I certainly don't mind granting an IP in the meantime.
[18:09:45] andrewbogott: the loadbalancer was created but the external ip of the service in k8s is stuck at pending
[18:10:52] ok, yeah. And setting floating_ip_enabled=false in the template doesn't do anything?
[18:11:08] i have that already
[18:11:11] ok.
[18:11:23] the capi-helm driver totally ignores that setting, I guess the heat driver does too :/
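A minimal sketch (kubernetes Python client; the service and namespace names are placeholders) for checking whether a LoadBalancer Service ever gets an ingress address or stays stuck at <pending>, as described above:

```python
# Minimal sketch: poll a LoadBalancer Service's status until an ingress
# address shows up or we give up. Assumes a working kubeconfig for the
# Magnum-built cluster; names below are placeholders.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def wait_for_ingress(name, namespace, timeout=300):
    """Return the external IP/hostname once assigned, or None on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        svc = v1.read_namespaced_service(name, namespace)
        lb_status = svc.status.load_balancer
        ingress = (lb_status.ingress if lb_status else None) or []
        if ingress:
            return ingress[0].ip or ingress[0].hostname
        time.sleep(10)
    return None

addr = wait_for_ingress("ingress-nginx-controller", "ingress-nginx")
print(f"external address: {addr or 'still <pending>'}")
```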
[18:12:22] can you re-link me to the quota request? francesco asked me to +1 it but didn't link as far as I can see
[18:12:38] oh, nm, found it
[18:13:25] andrewbogott: thanks!
[18:13:32] as i said above i think this should be in a project that is not testlabs
[18:15:11] taavi: I think this is just for proof of concept... dduvall are you ready to try this in a proper project? And do you need a new project for that?
[18:15:38] i'm not quite ready to go beyond poc
[18:16:28] if we assess that magnum is going to be sufficient, then i would want to set up a formal cluster and hammer on it a bit
[18:17:23] i mean testlabs is a staging ground for wmcs infrastructure specifically, not a generic place to test stuff for anyone? otherwise we'll end up in a situation where we have lots of resources getting hogged up with no real accountability or visibility of what is happening there
[18:17:25] "hammer on it a bit" == run a bunch of jobs, probably on nodes with better iops, and compare ci job performance
[18:18:00] projects are cheap, there's no reason not to create a new project for this and then just delete it if you wind up not wanting it.
[18:18:15] i can do that
[18:19:06] i'll clean up in testlabs and file a project request
[18:19:46] thx, link your request here, taavi can +1 and I'll create after I eat something. And please mention the 2 fips in the request so we can track it :)
[18:20:20] will do! thanks for the help y'all
[18:20:30] i might request some additional flavors as well
[18:20:52] * taavi also wonders whether this could just go into gitlab-runners or whether these runners are somehow different enough
[18:26:30] gitlab-runners is managed by collab services and runners there are currently serving some ci jobs, so i think i'd rather do testing elsewhere. if the poc is successful, that might be a good option
[18:26:54] sorry for intruding into testlabs. i will be sure to clean up
[19:00:06] taavi, andrewbogott: https://phabricator.wikimedia.org/T404386 (thank you)
[19:00:24] * dduvall will be lunching soon as well
[19:02:55] the `g4.cores8.ram24.disk20.ephemeral90.4xiops` flavor i mentioned is based on what we have in integration. the ephemeral90 is probably not necessary for the perf testing, only the iops
[19:06:08] actually, some of the services will be utilizing PVs (buildkitd, reggie, docker-hub-mirror) so they will likely be bound by the cinder volume constraints anyway. hmm, i will amend the flavor part of the request
[19:46:49] dduvall: I made about 30 mistakes on the way to creating this but I think the quotas are set now, lmk what's missing.
[21:35:48] andrewbogott: looks great! thanks so much
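For the "lmk what's missing" check, a minimal sketch using openstacksdk's quota helpers to dump what the new project actually got; the cloud entry and project name below are placeholders, not the real ones from the request:

```python
# Minimal sketch: print the effective compute, network and volume quotas for
# the new project. "eqiad1" and "gitlab-cloud-runner-poc" are placeholders.
import openstack

conn = openstack.connect(cloud="eqiad1")
project = "gitlab-cloud-runner-poc"

compute = conn.get_compute_quotas(project)
print(f"instances={compute.instances} cores={compute.cores} ram={compute.ram}MB")

# Network quotas include the floating IP allowance requested in the task;
# printing the whole objects avoids guessing at individual field names.
print("network quotas:", conn.get_network_quotas(project))
print("volume quotas:", conn.get_volume_quotas(project))
```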