[07:52:53] Morning!
[07:56:49] morning. I'm looking at some reported paws issues, T400542
[07:56:50] T400542: [Bug] PAWS server not starting - https://phabricator.wikimedia.org/T400542
[08:01:37] ack, I was looking at the tools probes flapping this morning, it seems there are spikes in traffic
[08:09:31] morning. there were also some reports of toolforge errors yesterday in -cloud
[08:10:40] I quickly checked alerts.wm.o yesterday evening and I didn't see anything firing apart from a couple of nodes with processes in D state
[08:11:40] paws is "fixed" with a reboot
[08:11:44] yep, there was a big spike on one specific node
[08:11:55] taavi: 👍 thanks
[08:16:00] I had a very brief look at toolforge yesterday as well, it looked like there was some high-traffic tool (geohack?) that was taking ages to respond, and that was then causing everything else to run out of worker slots
[08:17:15] hmmm, I think I might have killed one prometheus node
[08:17:16] xd
[08:17:34] (suddenly the results from my queries changed quite a lot)
[08:18:10] geohack was consistently the tool with the most qps when I looked just now, but now the data I get is quite different ... :S
[08:22:00] nm, was using the wrong status code (ctrl+z works inside the input field :) )
[08:22:57] I don't see a big change for it over time though
[08:23:01] https://usercontent.irccloud-cdn.com/file/c67Ntd4p/image.png
[08:27:08] I don't see any traffic spike at the ingress level
[08:28:28] hmm... not when aggregated, but when splitting it up geohack seems to have some
[08:28:50] vs
[08:28:51] https://usercontent.irccloud-cdn.com/file/Y4hfv9it/image.png
[08:29:11] https://usercontent.irccloud-cdn.com/file/cEoiNo2G/image.png
[08:30:28] it might just get diluted in the sum probably
[08:31:58] this tells a different story though
[08:32:00] https://usercontent.irccloud-cdn.com/file/dOcW5gpt/image.png
[09:01:55] hmm... now tools-prometheus-9 is a bit unresponsive
[09:17:30] yep, it's trying to swap but can't, getting bogged down
[09:17:45] https://usercontent.irccloud-cdn.com/file/hpHQ2q3e/image.png
[09:18:04] I'll force-reboot it
[09:22:38] back up and running
[09:25:50] novafullstack has been failing since last night, no idea why
[09:30:33] I think that peak in connections that did not end up in requests was a DoS of sorts (they opened connections but did not really request anything, and they ended up closing as 499/client closed)
[09:31:19] dhinus: the last run failed trying to ssh to the VM it seems (from logstash)
[09:31:37] `SSH waited for 895 (of 900)`
[09:31:52] well, the next round xd
[09:31:56] `SSH waited for 906 (of 900)`
[09:32:28] yes I just found similar logs in cloudcontrol1007
[09:36:32] it's failing to start because of the puppet cert
[09:36:40] Error: Could not request certificate: The CSR retrieved from the master does not match the agent's public key.
[09:36:49] *failing to start puppet
[09:42:24] hmm...
[09:42:31] is it normal that it thinks its name is buildvm?
[09:42:35] https://www.irccloud.com/pastebin/KPSej5R4/
[09:42:42] maybe cloud-init failed to change the hostname
[09:42:55] no
[09:43:02] that sounds like a problem with the metadata api!
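For this kind of hostname mismatch, a couple of generic checks can show what the metadata service is actually handing out versus what cloud-init did at boot. This is a sketch, not necessarily what was run here; the endpoint is the standard EC2-compatible metadata path that OpenStack exposes to VMs:

```
# From inside the affected VM: ask the metadata service which hostname it
# currently serves (EC2-compatible endpoint exposed by OpenStack).
curl -s http://169.254.169.254/latest/meta-data/hostname
# And check what cloud-init believed it should set during boot.
grep -i hostname /var/log/cloud-init.log
```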
[09:43:44] hmm
[09:43:45] 2025-07-28 09:14:27,573 - cc_update_hostname.py[DEBUG]: Updating hostname to buildvm-7fa025d8-39b1-4b67-9228-ba2bb6414f38.admin.eqiad1.wikimedia.cloud (buildvm-7fa025d8-39b1-4b67-9228-ba2bb6414f38)
[09:43:48] ohhh
[09:43:56] yep, that sounds like a likely issue :)
[09:46:46] they were up and happy on all the cloudcontrols, not really logging anything though, I restarted them just in case
[09:47:24] which one? both nova and neutron run a -metadata service, and I think it's the neutron one that's been having issues lately
[09:48:46] https://www.irccloud.com/pastebin/8d50oHnX/
[09:48:50] found the log
[09:49:09] I was checking nova-api-metadata
[09:55:38] hmm, interesting
[09:55:41] `Jul 19 13:03:13 cloudnet1006 neutron-metadata-agent[3245698]: OSError: [Errno 24] Too many open files: '/etc/neutron/policy.d'`
[09:58:27] it's a few days old though, no big issues since then on that node
[10:02:22] aha, this is new
[10:02:28] `Jul 28 10:02:02 cloudnet1005 neutron-metadata-agent[1497959]: OSError: [Errno 24] Too many open files: '/etc/neutron/policy.d'` on the other cloudnet
[10:02:33] (cloudnet1005)
[10:02:58] huh, that's worth looking into
[10:05:46] I think it has the standard open file limit
[10:05:49] https://www.irccloud.com/pastebin/uQq3awme/
[10:06:37] /lib/systemd/system/neutron-metadata-agent.service sets LimitNOFILE=65535
[10:07:21] nice, that'll be easier to grep for in puppet
[10:08:30] hmm...
[10:09:04] I think that might be the default or something xd
[10:09:25] (grep does not give a clear result right away, looking)
[10:09:34] you mean a default in the package? if so, seems very likely to me
[10:09:45] it'd be relatively trivial to put an override in openstack::neutron::metadata_agent
[10:10:36] I was expecting it to be set by puppet...
[10:11:14] systemd units are generally shipped as a part of the packaging?
[10:12:32] yep, though as our openstack puppet setup has historically been quite "invasive", I was expecting it to still be
[10:17:06] nice, we have systemd::service in puppet that allows declaring overrides
[10:45:09] got this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1173352
[10:45:13] will merge after lunch
[13:11:21] paws-nfs is full again :/
[13:11:27] deployed
[13:11:29] (full = 85%)
[13:12:52] dcaro: friendly reminder to archive the weekly etherpad and change the IRC topic :)
[13:13:11] thanks!
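The Gerrit change referenced above isn't quoted in this log; for reference, the usual shape of such a limit bump is a systemd drop-in like the following. This is only a sketch of what a puppet systemd::service override ultimately renders — the limit value here is an arbitrary example, not the one chosen in the actual patch:

```
# Sketch: raise the open-file limit for neutron-metadata-agent via a
# systemd drop-in. 131072 is an illustrative value, not the real one.
mkdir -p /etc/systemd/system/neutron-metadata-agent.service.d
cat > /etc/systemd/system/neutron-metadata-agent.service.d/override.conf <<'EOF'
[Service]
LimitNOFILE=131072
EOF
systemctl daemon-reload
systemctl restart neutron-metadata-agent
```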
that sounds like a nice thing to automate xd
[13:13:27] (and kinda non-trivial, due to the many systems involved)
[13:13:46] oh, I already archived it
[13:14:21] LOL yes maybe we need a clinicdutybot :)
[13:16:27] https://xkcd.com/1205/
[13:17:12] novafullstack worked :)
[13:17:13] Jul 28 13:16:50 cloudcontrol1007 nova-fullstack[891767]: @cee: {"ecs.version": "1.7.0", "log.level": "INFO", "log.origin.file.line": 73, "log.origin.file.name": "nova-fullstack", "log.origin.file.path": "/usr/local/sbin/nova-fullstack", "log.origin.function": "add_stat", "labels": {"test_hostname": "fullstackd-20250728094529"}, "message": "cloudvps.novafullstack.verify.success => 1.000000 1753708610", "process.name": "MainProcess", "process.thread.id": 891767, "process.thread.name": "MainThread", "timestamp": "2025-07-28T13:16:50.520107"}
[13:17:40] we get ~21 hours I'd say then
[13:17:55] (to automate the weekly archival and such)
[13:18:22] (if we keep the same process for 5 years xd)
[13:18:35] the xkcd is also missing the z-axis with "satisfaction" :D
[13:18:45] fair point
[13:19:09] hahaha
[13:26:06] the fullstack failures are pretty much always because of T395742, was today's any exception?
[13:26:07] T395742: Neutron metadata service failing for all VMs - https://phabricator.wikimedia.org/T395742
[13:26:22] I just restarted the service and deleted the VMs before reading the backscroll here :(
[13:27:09] ok, now I'm caught up!
[13:35:14] dhinus: regarding your question about puppet on codfw1dev cloudcontrols: I'm still doing test/dev there and it's nice to not have to make my config changes on all three hosts. I turned things on and back off again last night so at least the elapsed time since the last puppet run is shorter now.
[13:38:04] andrewbogott: ack, I was just not sure whether there was something that needed fixing :)
[13:44:51] who would like to help me debug an almost-working magnum cluster? It comes up and starts a whole lot of things, but some of the pods are in CrashLoopBackOff and I'm worried that that reflects a generally dysfunctional network
[13:45:14] sure
[13:46:20] on cloudcontrol2010-dev,
[13:46:26] export KUBECONFIG=/home/andrew/config
[13:47:11] the pod I'm most interested in is openstack-cinder-csi-controllerplugin-547d87898c-khmfz, because that's the one the upstream dev noticed right away when I showed him my output
[13:47:28] but I suspect that many of the failed pods are failing for similar reasons
[13:48:12] the hosting VMs themselves are in the 'admin' project because I'm lazy
[13:49:28] > W0728 13:46:40.798452 10 main.go:127] Failed to GetOpenStackProvider : Post "https://openstack.codfw1dev.wikimediacloud.org:25000/v3/auth/tokens": dial tcp: lookup openstack.codfw1dev.wikimediacloud.org on 172.24.0.10:53: server misbehaving
[13:49:34] is this the error I'm supposed to be looking at?
[13:51:02] It's not the one I'm looking at but it might be important
[13:51:17] I've been staring at the very brief
[13:51:18] kubectl logs --namespace openstack-system openstack-cinder-csi-controllerplugin-547d87898c-khmfz
[13:51:50] try `kubectl logs -n openstack-system openstack-cinder-csi-controllerplugin-547d87898c-khmfz cinder-csi-plugin`
[13:51:53] In theory this is possible without barbican but I'm pretty sure the devs have barbican, so it's very possible that some kind of secret-sharing failure is the issue.
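Since the GetOpenStackProvider failure above is a DNS lookup error against what is presumably the in-cluster DNS service IP (172.24.0.10), a quick in-cluster resolution test can separate coredns problems from keystone/barbican ones. A generic sketch — the pod name and image choice are arbitrary, not commands taken from this cluster:

```
# Run a throwaway pod and try to resolve the keystone endpoint through the
# cluster's own DNS, exercising the same path the failing plugin uses.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup openstack.codfw1dev.wikimediacloud.org
```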
[13:52:20] which then leads me to `kubectl logs -n kube-system coredns-674b8bbfcf-5nff9` struggling to talk to powerdns
[13:53:36] so, betraying my ignorance -- that 'cinder-csi-plugin' on the end means 'show me the logs of the cinder-csi-plugin service within the specified pod'?
[13:54:45] s/service/container/, but yes
[13:55:12] cool
[13:55:27] ok, so this is a firewall issue or an allowed-client list thing...
[13:56:23] well, maybe
[13:56:29] or routing
[13:58:21] what is 172.18.111.205 actually?
[13:59:02] cluster-internal ip of the coredns pod
[13:59:36] does the outside network (cloudservices in this case) see that IP or some natted thing?
[13:59:48] it should be natted to the node ip
[14:00:01] speaking of which, why is this cluster in the legacy vlan network?
[14:00:27] It doesn't need to be, you're just seeing the latest attempt where I was trying to see if behavior varied based on network
[14:00:40] idn't
[14:00:42] it didn't
[14:02:09] if it's really natted to the node IP then this should just work. It's not like other VMs can't talk to DNS...
[14:02:11] * andrewbogott makes sure that is true
[14:03:03] it is...
[14:04:44] taavi: the other weird thing you'll see (which is a to-do) is that the new driver uses a floating IP network to assign the k8s server IP. That means that at the moment it's using an actual public IP because I haven't yet made a pool of floating internal IPs for it to use.
[14:15:52] andrewbogott: the feeling I'm getting is that there is some layer somewhere that's blocking access to private ip address space
[14:16:32] a layer within the k8s cluster, right? Nothing in particular to do with the dns server?
[14:16:36] yes
[14:17:03] ok. Want me to rebuild with the dual-stack network and see if it's different, or does that seem unrelated?
[14:17:47] give me a second
[15:25:59] andrewbogott: is https://github.com/azimuth-cloud/capi-helm-charts the driver you're trying out?
[15:28:02] yes. It's duplicated on chartmuseum.magnum.codfw1dev.wikimedia.cloud
[15:28:19] ok, the issue is that https://github.com/azimuth-cloud/capi-helm-charts/blob/bfbd1b99c4ca9e77e8741c54972589555c17a016/charts/openstack-cluster/values.yaml#L61 conflicts with what we use internally
[15:28:29] we need to override those somehow
[15:28:56] dang, ok
[15:30:08] toolsbeta puppet seems to be failing, has anyone touched anything there?
[15:30:43] no, the issue there is "TypeError: BaseGenericPlugin.__init__() got an unexpected keyword argument 'application_credential_id'" but I've no idea why that suddenly started appearing
[15:32:51] yep
[15:33:02] maybe Raymond_Ndibe was changing some hiera stuff?
[15:33:50] oh no, that might have been me a bit ago
[15:33:51] looking
[15:34:19] andrewbogott: does that give you enough clues to go forward?
[15:34:51] probably! I can at least test with a forked chart and see if that gets me to the finish line.
[15:35:05] yep, I added an extra entry in clouds.yaml to test the credentials, and it seems the puppet-enc cli uses the latest entry in the clouds.yaml (I think)
[15:35:18] wait, no, it just does not understand the yaml key
[15:36:11] that's weird
[15:36:40] anyhow, fixed
[15:45:06] taavi: I don't totally follow why the 172.16.0.0/13 thing is a problem; isn't everything in the calico network natted and so invisible/irrelevant to external networks?
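To expand on the trailing container-name argument discussed above: a pod can hold several containers, and the final argument (or `-c`) picks which one's logs to read. Generic kubectl usage shown against the pod named earlier, not output captured from this cluster:

```
# List the containers inside the pod, then read logs from a specific one;
# the trailing name (or -c) selects a container, not a service.
kubectl get pod -n openstack-system openstack-cinder-csi-controllerplugin-547d87898c-khmfz \
  -o jsonpath='{.spec.containers[*].name}'
kubectl logs -n openstack-system openstack-cinder-csi-controllerplugin-547d87898c-khmfz \
  -c cinder-csi-plugin
```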
[15:46:51] the problem is that both the VMs and other infrastructure (in particular, our recursive DNS server) are inside that range, and so because calico thinks those are now inside the cluster it doesn't apply the egress nat and similar
[15:47:30] plus there's the possibility of calico assigning a pod the ip of the dns recursor for example, which would break everything trying to talk to the real DNS server from inside the cluster
[15:47:34] oh, of course.
[15:47:34] ok
[15:47:47] So ideally it would be e.g. 10.1.0.0/13
[15:47:57] so that it's obviously non-overlapping
[15:48:35] yeah, using something in 10.0.0.0/8 or 192.168.0.0/16 would be fine. I don't think it strictly needs to be a /13, the comment seems to suggest they just cut that chunk of private v4 space in half
[15:49:10] yeah, it's a whole lot of addresses
[15:49:55] for example toolforge uses a /16 for pods, which itself is more than enough
[15:50:41] * andrewbogott nods
[16:18:41] taavi, any practice with 'helm push'? This is telling me "Chart.yaml file is missing" even though it is not missing
[16:19:28] do you have an example?
[16:23:24] /home/labtestandrew on chartmuseum.magnum.codfw1dev.wikimedia.cloud
[16:23:40] I'm trying to
[16:23:41] helm push openstack-cluster.0.16.0-wm.tgz oci://magnum-chartmuseum.codfw1dev.wmcloud.org/charts/openstack-cluster
[16:26:52] looking at `tar -ztvf openstack-cluster.0.16.0-wm.tgz` the file names in there seem a bit different (they have a leading `./`) compared to `openstack-cluster-0.16.0.tgz`, which I guess is messing something up
[16:27:58] ok, let me try re-tarring...
[16:28:44] hm, that at least does something different!
[16:28:49] (try in subdir 'too' to see)
[16:28:53] 'foo'
[16:29:07] Error: unexpected status from POST request to https://magnum-chartmuseum.codfw1dev.wmcloud.org/v2/charts/openstack-cluster/openstack-clusterwm/blobs/uploads/: 404 Not Found
[16:29:13] I wonder if it doesn't like my weird version string
[16:29:51] oh, no, it's even sillier than that
[16:31:55] nope, still fails even without my typo
[16:42:08] * andrewbogott -> doctor things, will revisit this silly helm issue later
[17:08:05] * dhinus off
[17:47:35] * dcaro off
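The helm issue above is left unresolved in this log. Given the leading `./` entries spotted in the hand-made tarball, one hedged approach is to let helm rebuild the archive from the unpacked chart directory instead of tarring by hand; the paths and filenames below are illustrative, and `helm push` appends the chart name from Chart.yaml to the OCI path it is given:

```
# Sketch: unpack the modified chart, then have helm produce a canonical
# archive (entries rooted at the chart directory name, no leading ./).
mkdir -p /tmp/chart-src
tar -xzf openstack-cluster.0.16.0-wm.tgz -C /tmp/chart-src
# The top-level directory name inside the tarball may differ; adjust the path.
helm package /tmp/chart-src/openstack-cluster
# The output filename comes from the name/version in Chart.yaml.
helm push openstack-cluster-0.16.0-wm.tgz oci://magnum-chartmuseum.codfw1dev.wmcloud.org/charts
```

Separately, if the target really is ChartMuseum rather than an OCI registry, a 404 on the `/v2/.../blobs/uploads/` endpoint would be expected regardless of packaging; ChartMuseum's native upload path is the `helm cm-push` plugin or a plain HTTP POST to its `/api/charts` endpoint.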