[06:36:53] greetings
[06:37:11] dcaro: yes confirmed with your reproducer in the task I'm having the same problem locally with lima-kilo
[07:27:55] provisioning new trixie bastions in toolforge: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/76
[07:42:01] LGTM
[07:51:44] godog: found the issue, it's the network policy for metrics-http on the loki side, it's filtering for 127.0.0.1, it should not as the pods have their own ips and requests don't come from localhost
[07:51:59] it seems like the toolforge project is out of CPU quota :/
[07:52:59] dcaro: nice find! out of curiosity how/when did that change ?
[07:53:26] no idea xd, just tested in my lima-kilo changing that and now it worked
[07:53:49] sweet
[07:56:02] anyhow, today and tomorrow I'm ooo, so I'll handle it on monday, unless someone wants to get to it before :)
[07:56:20] I saw T404282 and had a look, I think it's ok to resolve but lmk what you think.
[07:56:20] T404282: KernelErrors Server cloudcephosd1041 logged kernel errors - https://phabricator.wikimedia.org/T404282
[08:03:15] dcaro: ack!
[08:03:51] volans: interesting, yeah seems ~benign ?
[08:04:23] I was kinda surprised to not find anything in the bmc's logs
[08:33:18] morning
[08:35:11] o/
[08:36:57] volans: left a comment, we can resolve it for now but it's the second time it happens on that same DIMM
[09:16:05] related, I reviewed ALL the KernelErrors phab tasks and I think we can just delete the alert, I created T404300
[09:16:05] T404300: Remove KernelErrors alerts - https://phabricator.wikimedia.org/T404300
[09:20:36] neat, thank you dhinus !
[10:56:53] dhinus: T404325 is because I closed the previous one too early?
[10:56:54] T404325: KernelErrors Server cloudcephosd1041 logged kernel errors - https://phabricator.wikimedia.org/T404325
[10:58:07] volans: ah yes that's a problem with how the alert is designed, we need to ack the alert in alertmanager
[10:58:14] otherwise it will fire again and reopen a task
[10:58:58] it will resolve after 24h because of how it's defined
[10:59:26] yeah I figured but didn't think about it earlier, sorry
[10:59:41] yeah I also forgot I needed to ack!
[11:00:19] acked and resolved as duplicate
[11:00:19] thx
[13:10:16] FYI I'm taking a look at the k8s-worker-nfs-53 alert also in the context of T404322
[13:10:16] T404322: wmf-auto-restart can get wedged on nfs4 mounts even when the filesystem is excluded - https://phabricator.wikimedia.org/T404322
[13:10:36] will reboot once done
[13:44:58] {{done}}
[13:51:41] compare the memory graphs on https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=now-30m&to=now&timezone=utc&var-server=cloudcephosd1016&var-datasource=000000026&var-cluster=wmcs&refresh=5m
[13:51:47] with https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=now-3h&to=now&timezone=utc&var-server=cloudcephosd1017&var-datasource=000000026&var-cluster=wmcs&refresh=5m
[13:52:16] 1016 is running reef, 1017 is running quincy. The memory usage is all spiky on 1017 and nice and smooth on 1016.
[13:52:34] I'm hoping that means they made the new release better, and not that it's just not doing anything
[13:55:36] andrewbogott: which panel(s) were you looking at from the dashboards ?
[13:56:08] Memory: saturation is the most obvious difference (and most likely to be evidence of a bugfix)
[13:56:25] but the network: utilization graph is also dramatically different
[13:56:54] oooh wait I have those set to different timescales
[13:56:56] * andrewbogott headdesk
[13:57:47] ok! now with the same scale it's only the memory:saturation that looks different to me
[13:57:55] and better with reef.
[13:57:58] So, I'll take it :)
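For the graph comparison above, a minimal sketch of pulling the same metric for both hosts over one shared window via the Prometheus HTTP API, so the comparison isn't skewed by mismatched dashboard time ranges; the Prometheus URL, the instance label format, and the MemAvailable metric are assumptions for illustration, not the exact expression behind the Memory: saturation panel:

```python
# Minimal sketch: query one shared 3h window for both hosts so the comparison
# is apples-to-apples. PROM is a placeholder endpoint; the instance label
# regex assumes node_exporter-style labels like "cloudcephosd1016:9100".
import time
import requests

PROM = "https://prometheus.example.org"  # placeholder endpoint
END = time.time()
START = END - 3 * 3600
STEP = "60s"

def mem_available(host):
    """Return [(timestamp, value), ...] samples of MemAvailable for one host."""
    query = f'node_memory_MemAvailable_bytes{{instance=~"{host}.*"}}'
    resp = requests.get(
        f"{PROM}/api/v1/query_range",
        params={"query": query, "start": START, "end": END, "step": STEP},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return result[0]["values"] if result else []

for host in ("cloudcephosd1016", "cloudcephosd1017"):
    values = mem_available(host)
    if values:
        mean = sum(float(v) for _, v in values) / len(values)
        print(f"{host}: {len(values)} samples, mean MemAvailable ~ {mean / 2**30:.1f} GiB")
```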
[14:01:44] fyi godog, when we ran version P with debian bookworm, RAM use grew without bounds until the OOM killer fired. So I'm on edge about moving nodes to bookworm even though it's a later ceph (R) version this time around.
[14:02:28] does anybody know how to track where exactly these requests originate? T404347
[14:02:29] T404347: WMCS is sending millions of invalid requests to Europeana.eu servers - https://phabricator.wikimedia.org/T404347
[14:03:43] andrewbogott: ack, fair enough!
[14:04:14] dhinus: as far as I know we do not currently have a great way to track this unless there's an obvious user agent set. That said, there's a 60% chance that it's iabot, so I would cc max/cyberpower and ask 'this you?'
[14:05:01] hmmmm actually if it's a search query then I'd drop those odds to 30%. Still worth asking though.
[14:06:29] are there any logs of the NATting or is there a place where one could do tcpdump on the internal interface and see the originating internal ip?
[14:10:43] I don't think we have logs; tcpdump might be possible although we haven't had luck tracking such things in the past. topranks might have suggestions (and/or discouragement)
[14:11:36] we have NAT logs since recently, .. but since that site is behind cloudflare we're going to need some more detail to distinguish it from all other traffic to cf
[14:12:07] ooh, nice! I mean, re: nat logs
[14:13:11] dhinus: so for options: 1) ask the reporter for timestamp+source port combinations, and match those to NAT logs, or 2) try to look at which node is making recdns queries for that domain
[14:14:09] browsing phab tasks suggests that it's probably a wikidata-adjacent project, although that doesn't narrow things down a whole lot.
[14:14:56] andrewbogott: I doubt it's IABot, since that runs on nodes with floating IPs so it would not be using the generic NAT address
[14:15:49] taavi: yep, that makes sense. I'm just trained by a history of "why do you keep loading pages on our site?" emails about iabot -- this is clearly not that.
[14:19:58] taavi: 1) asked, for 2) where would I look, in our DNS servers?
[14:21:50] dhinus: one of the cloudservices* boxes acts as the active dns recursor, basically you'd need to tcpdump the incoming dns traffic there
[14:22:46] I see, and hope there's a request flying right now. I wonder if they come in bursts or not, I asked
[16:01:00] for dns logging: T404373
[16:01:00] T404373: Log DNS queries from Cloud VPS clients - https://phabricator.wikimedia.org/T404373
[16:01:09] thanks taavi
[16:10:42] MRs for updating toolforge bastions: https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/77
[16:45:48] dhinus, andrewbogott: re: T404150 i'm not sure how to proceed. i think using a public floating ip on wan-transport-eqiad for the lb based service would make it much easier for us to get gitlab-cloud-runner back onto wmcs, so i would prefer that. but it seems like y'all are saying there's a different way to do it. is that with an lb on octavia and a web proxy?
[16:45:49] T404150: Additional floating IPs for gitlab-cloud-runner testing in testlabs project - https://phabricator.wikimedia.org/T404150
[16:46:57] not sure how to make that happen, and am a little worried that having the lb managed in k8s and the web proxy managed elsewhere (tofu) is going to make this a bit brittle. unless we can ensure the lb ip is static (not sure if that's possible when not using floating ips?)
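On the static LB address worry just above, a minimal sketch (kubernetes Python client) of requesting a fixed VIP via spec.loadBalancerIP; whether the Octavia cloud-provider behind Magnum honors that field here is exactly the open question, and the service name, selector, namespace and address below are made up:

```python
# Minimal sketch, assuming the kubernetes Python client and a reachable
# kubeconfig. spec.loadBalancerIP is deprecated in newer Kubernetes in favor
# of provider-specific annotations, and providers that don't support it will
# simply ignore it, in which case the VIP can still change on re-provisioning.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="gitlab-runner-ingress"),  # placeholder
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "ingress-nginx"},                       # placeholder
        ports=[client.V1ServicePort(port=443, target_port=8443)],
        # Ask for a fixed VIP so a web proxy pointing at it survives the
        # Service being deleted and recreated.
        load_balancer_ip="172.16.0.42",                           # placeholder
    ),
)
v1.create_namespaced_service(namespace="gitlab-runner", body=svc)
```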
[16:53:16] andrewbogott: are you still using the dns-recursor-on-a-VM experiment added in T374830 or can I rip out the puppet code for that?
[16:53:17] T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org - https://phabricator.wikimedia.org/T374830
[16:53:48] taavi, dduvall, sorry I won't be able to respond for an hour or so, in meetings
[16:54:07] np
[16:55:10] let's see...
[16:55:47] dduvall: tell me what process you're imagining for a rebuild, that might help me understand.
[16:55:54] Are you thinking you'd have two floating IPs and do blue/green?
[16:57:45] i'm using the existing terraform/tofu in https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner but refactored with a `wmcs` module instead of the existing `digitalocean` one
[16:58:23] the `wmcs` module i have is pretty basic so far
[16:58:26] https://www.irccloud.com/pastebin/FDSkDKQS/
[16:59:32] for now, i just need a place to test the tofu. i'm using testlabs for that because i had access and it has enough quota :). i think at the bare minimum i just need one more floating ip for the lb
[17:00:11] i can get by with a tunnel for the k8s api when testing
[17:00:28] dduvall: I haven't played with Magnum/Octavia enough but I was expecting you could still provision a LoadBalancer in k8s without using the floating ip quotas (then how to reach that LB is another question...)
[17:00:44] once i reach a proof of concept, i imagine we would request a new project with appropriate quotas for this, and we would want two floating ips in that project
[17:01:28] I don't think it's a problem to bump the quota up but I think it's worth making sure we understand if it's a strict requirement or not, mostly to learn for the future
[17:02:17] andrewbogott: if you add a +1 to that quota request I can bump up the quota tomorrow
[17:02:22] dhinus: right. seems like a web proxy in front of the lb would be the way. if we can make it so the lb has a static/fixed ip, that seems doable. the scenario i'm worried about is one where 1) the lb is provisioned; 2) we create a web proxy to point to it; 3) something happens and k8s reprovisions the service with a new lb and ip
[17:02:33] can we do that in some project that is not testlabs?
[17:03:25] taavi: another project is fine by me. i was just using testlabs because that's what we had used when testing this a while back, i think at the 2023 offsite?
[17:03:42] and yeah, it had enough quota for the nodes
[17:04:19] Magnum assumes that public floating ips are a normal thing.
[17:04:27] and also, this is a poc, so i wasn't sure an additional project was a good thing before knowing it would work out
[17:06:32] bd808: does magnum/octavia directly request a floating ip from the networking api? the part that confuses me is that once I created an Octavia LB and I didn't think it used a floating ip, but maybe I just didn't notice and I had enough quota
[17:08:02] dhinus: The Heat automation in Magnum makes the Octavia LB and tries to assign it a public IP. The whole thing is a non-transparent mechanism that lets you either enable or disable ingress.
[17:08:17] it doesn't know how to use IPv6 either
[17:08:28] ah that makes sense. can we somehow tell magnum/octavia to use a private floating IP, so that we don't waste the public ones we have?
[17:08:46] not that I have found, no
[17:08:51] ack
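A minimal sketch, assuming openstacksdk with a clouds.yaml entry (the cloud name below is a placeholder), for listing what Octavia and Neutron actually allocated in the project, which helps confirm whether Magnum grabbed a public floating IP regardless of the template settings:

```python
# Minimal sketch: dump the project's Octavia load balancers and floating IPs.
# Assumes credentials for the project are configured under the (placeholder)
# clouds.yaml entry "eqiad1".
import openstack

conn = openstack.connect(cloud="eqiad1")

print("Octavia load balancers:")
for lb in conn.load_balancer.load_balancers():
    print(f"  {lb.name}: vip={lb.vip_address} "
          f"provisioning={lb.provisioning_status} operating={lb.operating_status}")

print("Floating IPs in the project:")
for fip in conn.network.ips():
    print(f"  {fip.floating_ip_address} -> fixed={fip.fixed_ip_address} "
          f"status={fip.status}")
```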
[17:09:39] I have no objections to bumping up the quotas, not sure if there's another good project we can use that is not testlabs
[17:09:41] the whole thing is both very opinionated and very fragile
[17:09:48] I see :)
[17:10:04] that matches my impression of magnum so far :P
[17:10:27] I don't know if the newer provisioning system that andrewbogott was trialing in codfw fixes any of this
[17:10:36] I have to log off, please leave some +1 or comments in the quota request and I'll change the quotas tomorrow!
[17:10:48] dhinus: thank you!
[17:11:14] yw, sorry if this is taking a bit longer but we're still all understanding how magnum works :)
[17:11:40] * dhinus off
[17:11:50] oh np there. this is a spike to tease out the blockers so it is going as planned :)
[17:13:44] bd808: i am curious what would happen if i used a different `external-network` label
[17:15:11] hmm, maybe not "The name or network ID of a Neutron network to provide connectivity to the external internet for the cluster. This network must be an external network, i.e. its attribute ‘router:external’ must be ‘True’. The servers in the cluster will be connected to a private network and Magnum will create a router between this private network and the external network."
[17:18:19] dduvall: yeah, I think it might be possible for us to make a fake external network to point things at, but doing that might also confuse other OpenStack bits.
[17:19:34] and we would have to manage the web proxy independently which is not desirable
[17:20:19] I keep battling thoughts of abandoning Magnum and instead working on Tofu + Puppet integration to build k8s clusters, but E_NOTIME and it feels like NIH a bit too.
[17:20:44] one day i want to write a kubernetes controller to manage cloud vps web proxy based on service or ingress resources
[17:21:16] taavi: that would be cool
[17:21:28] I really want a k8s cluster for deployment-prep and Magnum unfortunately is not going to make it easy to integrate the Puppet provided secrets that a copy of WikiKube will need to expose.
[17:21:37] is this a thing in our current magnum deployment https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/octavia-ingress-controller/using-octavia-ingress-controller.md ?
[17:22:12] dduvall: it should be, yes
[17:22:42] i wonder if we can replace our home-rolled nginx ingress with that. it might simplify things
[17:23:03] i.e. get rid of our nginx ingress and cert-manager and external-dns
[17:23:16] you can poke around in the cluster in the zuul project to see what is there if you'd like
[17:23:43] https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning should show you how to get in
[17:23:59] nice. i've been poaching your code already :)
[18:02:16] ok, I still need lunch but I'm reading the backscroll...
[18:02:49] Is one of the issues that you can't launch an octavia lb at all because it demands a floating IP?
[18:09:24] If yes, that's a thing that's fixed in the codfw1dev deployment, and might be a reason to accelerate providing that driver in eqiad. But I certainly don't mind granting an IP in the meantime.
[18:09:45] andrewbogott: the loadbalancer was created but the external ip of the service in k8s is stuck at pending
[18:10:52] ok, yeah. And setting floating_ip_enabled=false in the template doesn't do anything?
[18:11:08] i have that already
[18:11:11] ok.
[18:11:23] the capi-helm driver totally ignores that setting, I guess the heat driver does too :/
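A minimal sketch (kubernetes Python client; the service and namespace names are placeholders) for checking whether a LoadBalancer Service ever gets an ingress address or stays stuck at <pending>, as described above:

```python
# Minimal sketch: poll a LoadBalancer Service's status until an ingress
# address shows up or we give up. Assumes a working kubeconfig for the
# Magnum-built cluster; names below are placeholders.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def wait_for_ingress(name, namespace, timeout=300):
    """Return the external IP/hostname once assigned, or None on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        svc = v1.read_namespaced_service(name, namespace)
        lb_status = svc.status.load_balancer
        ingress = (lb_status.ingress if lb_status else None) or []
        if ingress:
            return ingress[0].ip or ingress[0].hostname
        time.sleep(10)
    return None

addr = wait_for_ingress("ingress-nginx-controller", "ingress-nginx")
print(f"external address: {addr or 'still <pending>'}")
```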
[18:12:22] can you re-link me to the quota request? francesco asked me to +1 it but didn't link as far as I can see
[18:12:38] oh, nm, found it
[18:13:25] andrewbogott: thanks!
[18:13:32] as i said above i think this should be in a project that is not testlabs
[18:15:11] taavi: I think this is just for proof of concept... dduvall are you ready to try this in a proper project? And do you need a new project for that?
[18:15:38] i'm not quite ready to go beyond poc
[18:16:28] if we assess that magnum is going to be sufficient, then i would want to set up a formal cluster and hammer on it a bit
[18:17:23] i mean testlabs is a staging ground for wmcs infrastructure specifically, not a generic place to test stuff for anyone? otherwise we'll end up in a situation where we have lots of resources getting hogged up with no real accountability or visibility of what is happening there
[18:17:25] "hammer on it a bit" == run a bunch of jobs, probably on nodes with better iops, and compare ci job performance
[18:18:00] projects are cheap, there's no reason not to create a new project for this and then just delete it if you wind up not wanting it.
[18:18:15] i can do that
[18:19:06] i'll clean up in testlabs and file a project request
[18:19:46] thx, link your request here, taavi can +1 and I'll create after I eat something. And please mention the 2 fips in the request so we can track it :)
[18:20:20] will do! thanks for the help y'all
[18:20:30] i might request some additional flavors as well
[18:20:52] * taavi also wonders whether this could just go into gitlab-runners or whether these runners are somehow different enough
[18:26:30] gitlab-runners is managed by collab services and runners there are currently serving some ci jobs, so i think i'd rather do testing elsewhere. if the poc is successful, that might be a good option
[18:26:54] sorry for intruding into testlabs. i will be sure to clean up
[19:00:06] taavi, andrewbogott: https://phabricator.wikimedia.org/T404386 (thank you)
[19:00:24] * dduvall will be lunching soon as well
[19:02:55] the `g4.cores8.ram24.disk20.ephemeral90.4xiops` flavor i mentioned is based on what we have in integration. the ephemeral90 is probably not necessary for the perf testing, only the iops
[19:06:08] actually, some of the services will be utilizing PVs (buildkitd, reggie, docker-hub-mirror) so they will likely be bound by the cinder volume constraints anyway. hmm, i will amend the flavor part of the request
[19:46:49] dduvall: I made about 30 mistakes on the way to creating this but I think the quotas are set now, lmk what's missing.
[21:35:48] andrewbogott: looks great! thanks so much
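For the "lmk what's missing" check, a minimal sketch using openstacksdk's quota helpers to dump what the new project actually got; the cloud entry and project name below are placeholders, not the real ones from the request:

```python
# Minimal sketch: print the effective compute, network and volume quotas for
# the new project. "eqiad1" and "gitlab-cloud-runner-poc" are placeholders.
import openstack

conn = openstack.connect(cloud="eqiad1")
project = "gitlab-cloud-runner-poc"

compute = conn.get_compute_quotas(project)
print(f"instances={compute.instances} cores={compute.cores} ram={compute.ram}MB")

# Network quotas include the floating IP allowance requested in the task;
# printing the whole objects avoids guessing at individual field names.
print("network quotas:", conn.get_network_quotas(project))
print("volume quotas:", conn.get_volume_quotas(project))
```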