[09:30:52] please leave tools-k8s-worker-nfs-36 alone, I'd like to take a look
[09:30:55] at the D state
[09:35:31] ack
[09:58:24] oh, sorry, I might have just rebooted it
[09:58:52] I was testing https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1177960
[09:58:57] godog: ^
[09:59:34] I'll leave the next one for you to debug
[10:02:02] dcaro: all good, thank you
[10:03:35] fyi. wmcs I'm going to start the kyverno upgrade on tools, avoid doing deployments or running the functional tests during this time (I'll be running them on a loop)
[10:03:47] cteam ^
[10:07:21] ack
[10:07:39] hmpf
[10:07:42] Error: UPGRADE FAILED: cannot patch "cleanuppolicies.kyverno.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "cleanuppolicies.kyverno.io" is invalid: status.storedVersions[0]: Invalid value: "v2alpha1": must appear in spec.versions && cannot patch "clustercleanuppolicies.kyverno.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io
[10:07:42] "clustercleanuppolicies.kyverno.io" is invalid: status.storedVersions[0]: Invalid value: "v2alpha1": must appear in spec.versions && cannot patch "policyexceptions.kyverno.io" with kind CustomResourceDefinition: CustomResourceDefinition.apiextensions.k8s.io "policyexceptions.kyverno.io" is invalid: status.storedVersions[0]: Invalid value: "v2alpha1": must appear in spec.versions
[10:07:45] :/
[10:08:40] that did not happen on toolsbeta
[10:08:49] bummer :(
[10:11:13] need help? did that result in an outage or other incident?
[10:11:40] currently testing, no issues found so far, it seems it did not cause an outage
[10:12:14] the functional tests fail on the maintain-harbor side though
[10:13:04] the user/tool tests are passing so far
[10:14:48] it did update all the kyverno container images it seems
[10:21:59] hmpf.... crds are the worst
[10:23:47] anyhow, it seems that it failed to patch the CRD to store only the newer "v2beta1" version for `cleanuppolicies` and `policyexceptions`
[10:23:54] it does have the versions now:
[10:23:58] https://www.irccloud.com/pastebin/JZZCO010/
[10:24:07] but it's storing both
[10:24:09] https://www.irccloud.com/pastebin/7x7IL0d6/
[10:25:54] btw. tests for tools pass, looking into the maintain-harbor errors, though I suspect it might not be related
[10:27:41] hmmm... in toolsbeta we don't have `v2alpha1`
[10:27:45] https://www.irccloud.com/pastebin/1aPFnjcT/
[10:27:57] but in lima-kilo it's there
[10:28:08] https://www.irccloud.com/pastebin/gjCyAZEZ/
[10:28:28] oh, I think I have not upgraded in lima-kilo
[10:28:37] (rebuilt it yesterday)
[10:30:09] and upgraded without issues :/
[10:30:27] and now it removed v2alpha1
[10:30:29] https://www.irccloud.com/pastebin/vgCVdl2T/
[10:31:44] I'm tempted to leave it like this until we remove it after the k8s upgrade
[11:53:09] For the prometheus stats, it also happened in toolsbeta
[12:00:23] I think it might be getting confused as all the policies are named the same, and now it's not exposing the namespace in the metric (that it exposed before), so it can't differentiate them
[12:12:26] hmm... the docs say it should be there https://release-1-13-0.kyverno.io/docs/monitoring/policy-rule-info-total/
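Stepping back to the storedVersions failure at 10:07 for a moment, a minimal sketch of how the stale stored version can be inspected and cleared by hand. The CRD names and the target `v2beta1` come from the log; the `kubectl patch` itself is only the generic fix for this class of error (it needs kubectl >= 1.24 for `--subresource`) and is an assumption about what the manual CRD upgrade mentioned later actually looked like.

```sh
# what the API server currently records as stored versions for each affected CRD
for crd in cleanuppolicies.kyverno.io clustercleanuppolicies.kyverno.io policyexceptions.kyverno.io; do
  kubectl get crd "$crd" -o jsonpath='{.metadata.name}{": "}{.status.storedVersions}{"\n"}'
done

# once no stored object still uses v2alpha1, drop it from storedVersions so the chart
# upgrade can patch spec.versions (repeat per CRD)
kubectl patch crd cleanuppolicies.kyverno.io --subresource=status --type=merge \
  -p '{"status":{"storedVersions":["v2beta1"]}}'
```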
[12:14:24] curling the background controller metrics shows them there
[12:14:37] `kyverno_policy_changes_total{otel_scope_name="kyverno",otel_scope_version="",policy_background_mode="true",policy_change_type="created",policy_name="toolforge-kyverno-pod-policy",policy_namespace="tool-aalertbot",policy_type="namespaced",policy_validation_mode="enforce"} 1`
[12:14:54] maybe the prometheus config
[12:16:32] oh, that's a different metric name
[12:18:26] this one does not have namespace
[12:18:28] `kyverno_policy_rule_info_total{otel_scope_name="kyverno",otel_scope_version="",policy_background_mode="true",policy_name="toolforge-kyverno-pod-policy",policy_type="namespaced",policy_validation_mode="enforce",rule_name="toolforge-validate-pod-policy",rule_type="validate",status_ready="true"} 1`
[12:29:32] oh, it might be disabled by config https://github.com/kyverno/kyverno/blob/a3050f07c05da834ab51227f72b91c0e64d21db0/charts/kyverno/values.yaml#L449
[12:30:41] that feels like something that indeed has the potential to blow up the unique metric count
[12:31:21] yep https://github.com/kyverno/kyverno/commit/eb72b04d2c130fabd39bdd0eec0ea36fd8a08c70
[12:37:46] I created a couple of subtasks to follow up
[12:56:03] I manually upgraded those crds (it said it was ok), and I'm redeploying kyverno, it's taking its time
[12:56:15] iirc last time it timed out
[12:58:31] yep, same
[12:58:40] https://www.irccloud.com/pastebin/KL5cdi8e/
[13:02:05] the hook that timed out actually just runs the `kyverno migrate ...` command xd
[13:04:31] though it's probably this one that iterates over all namespaces and deletes all the policyreports: https://github.com/kyverno/kyverno/blob/release-1.13/charts/kyverno/templates/hooks/post-upgrade-clean-reports.yaml
[13:04:44] (it was removed in a later version)
[13:08:04] yep, that one seems to be the culprit, it takes a long time to iterate through all namespaces
[13:14:46] yep, it got to the letter `s` before timing out the helm deployment
[13:16:23] sounds like we should up the timeout then?
[13:23:43] yep, looking at the configuration options
[13:23:59] I think it's a setting in helmfile
[13:30:08] I found an option in helmfile.yaml, `helmDefaults.timeout`, let's see if it works
[13:34:55] it got past it \o/
[13:35:04] https://www.irccloud.com/pastebin/qG4LLXUu/
[13:46:28] neat!
[13:49:42] taavi: I realised I'll be out tomorrow for the decision meeting on T398285 -- I don't have a strong preference, I'll leave a comment in the task
[13:49:43] T398285: Decision request - Reuse toolforge user tools central logging for toolforge infrastructure logging - https://phabricator.wikimedia.org/T398285
[13:57:45] taavi, are all your Trixie fixes merged now?
[13:57:55] any objections to me capturing dhcp traffic on cloudnet100[56] re: T400223 ?
[13:57:56] T400223: Investigate daily disconnections of IRC bots hosted in Toolforge - https://phabricator.wikimedia.org/T400223
[13:58:32] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1177403 is still pending, otherwise yes
[13:59:56] godog: I think it's fine, for how long do you need to capture?
[14:00:13] dhinus: let's say 36h max
[14:02:51] godog: sgtm, make sure andrewbogott knows how to kill it, just in case something breaks when we're all out
[14:03:04] makes sense! will update the task
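One possible shape for the capture agreed on above, as a sketch: the 36h cap comes from the discussion, while the output path and the assumption that the DHCP agents sit in the usual neutron `qdhcp-<network-id>` namespaces on the cloudnet hosts are mine.

```sh
# find the DHCP namespaces on the cloudnet host
ip netns list | grep qdhcp

# one capture per namespace, hard-capped at 36h, DHCP traffic only
sudo ip netns exec qdhcp-<network-id> \
  timeout 36h tcpdump -ni any -w "/tmp/dhcp-$(hostname).pcap" 'udp port 67 or udp port 68'
```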
[14:04:47] godog: +1 from me
[14:06:04] got back the stats for toolsbeta :), deploying the config changes for tools kyverno https://grafana-rw.wmcloud.org/d/kyverno/kyverno?orgId=1&from=now-1h&to=now&timezone=browser&var-DS_PROMETHEUS_KYVERNO=P6466A70779AF0C39
[14:06:05] cheers, and of course one tcpdump per network namespace ... joy
[14:07:05] mmhh or maybe not actually, nevermind
[14:10:00] andrewbogott: I used this today to reboot the workers https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1177960
[14:10:01] id
[14:10:02] xd
[14:10:43] godog: since it touches every server... can you +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1177403 ?
[14:11:36] dcaro: nice! So that automatically picks which hosts to reboot from prometheus?
[14:12:03] yep, not ideal, but better than copy-pasting
[14:12:35] hmm... it could ask for confirmation though :/, I might add that
[14:32:10] godog: nevermind, now it's merged
[14:56:09] andrewbogott: can you leave a +1 in T401347? then I'll create the project
[14:56:10] T401347: Trove for cluebotng-review? - https://phabricator.wikimedia.org/T401347
[14:56:38] yep, done
[14:56:42] thanks!
[15:03:27] andrewbogott: hah very efficient re: 1177403
[15:04:30] yeah! I thought mor.itz was on leave but I must've imagined that
[15:05:20] anyone know how to create a promtool test rule with thousands of series? (we have an alert that counts the number of series)
[15:07:40] hmm, I can make the limit dynamic, like `count(count(kyverno_policy_rule_info_total{policy_validation_mode="enforce",rule_type="validate",status_ready="true"}==1) by (policy_namespace)) < (count(kube_namespace_created) / 2)`
[15:08:10] that will make it work in tools also, and if namespaces are getting deleted in bulk, we would notice anyhow
[15:08:47] as far as I'm aware there's no way to meta-program test rules like that, if you really need it though you can totally auto-generate the test rule
[15:08:55] and yes a simpler solution is probably best
[15:12:23] open for reviews if anyone is interested :) https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/34
[15:12:28] (the alert change)
[15:31:11] phew ok finally getting somewhere with T400223
[15:31:12] T400223: Investigate daily disconnections of IRC bots hosted in Toolforge - https://phabricator.wikimedia.org/T400223
[15:32:15] tl;dr we must make sure /etc/machine-id is actually unique amongst VMs
[15:32:24] andrewbogott: ^
[15:32:30] that feels somewhat familiar
[15:33:15] T351507 says that was fixed in mid 2024, I wonder if the workers pre-date that
[15:33:16] T351507: VMs in Cloud VPS share the same machine-id - https://phabricator.wikimedia.org/T351507
[15:34:22] could be yeah, good find
[15:35:06] it doesn't look like existing VMs with the same id were fixed?
[15:40:13] ok I'm definitely too tired to fix it now, I'll take a look tomorrow unless someone beats me to it
[15:40:32] from a quick cumin run about ~50% of tools VMs have the same machine-id
[15:43:05] great find godog, I think we never fixed existing VMs, no, it looked like they were working fine... until they didn't :)
[15:44:43] cheers
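A sketch of the duplicate check and the per-VM fix under discussion; the cumin selector is an assumption (adjust to the real alias/backend), and the regeneration steps are the standard systemd recipe rather than whatever cookbook ends up doing this.

```sh
# spot duplicates across the fleet (identical outputs == shared machine-id);
# the O{...} openstack selector is an assumption, use whatever the cloud-cumin setup expects
sudo cumin 'O{project:tools}' 'cat /etc/machine-id'

# on an affected VM: regenerate a unique id
sudo rm -f /etc/machine-id
sudo systemd-machine-id-setup
# on Debian /var/lib/dbus/machine-id is normally a symlink to /etc/machine-id; verify
ls -l /var/lib/dbus/machine-id
# a reboot is likely needed so anything that cached or derived state from the old id picks up the new one
```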
[15:46:03] oops, apparently I missed mono from the initial patch for T400255. follow-up incoming
[15:46:05] T400255: Build Trixie based Toolforge pre-built images - https://phabricator.wikimedia.org/T400255
[15:48:59] https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+/1178019
[15:51:33] which is trivial enough that I'm just merging it
[15:58:37] Are we convinced that the machine-id issue is only happening on VMs that predate the fix?
[16:04:02] andrewbogott: worth double checking I guess, but my understanding from that old task was that we fixed the issue for jelto's use case
[16:04:41] mine too
[16:04:55] the majority of the k8s workers pre-date that fix
[16:05:04] yeah so that adds up
[16:05:55] I don't have specific proof that all the older VMs have the issue and all the newer VMs don't, but that would be in line with what we're observing I think
[16:06:37] https://wikitech.wikimedia.org/w/index.php?title=Module:Toolforge_images/data.json&curid=452250&diff=2332139&oldid=2232143
[16:07:37] ^ this should, thanks to some wikitech template magic, update all the example commands and other documentation to use the latest language versions
[16:07:57] can anyone else ssh into bastion-codfw1dev with your personal key?
[16:08:33] taavi-clouddev@bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org: Permission denied (publickey).
[16:09:00] same here
[16:09:07] Aug 12 16:04:26 bastion-codfw1dev-04 sssd[471]: dbus[471]: arguments to dbus_server_get_address() were incorrect, assertion "server != NULL" failed in file ../../../dbus/dbus-server.c line 840.
[16:09:16] sssd is failing to start with a very worrying-sounding error
[16:09:19] yep
[16:09:24] maybe the ldap server is down
[16:10:02] hm, nope
[16:10:19] was it upgraded lately or something?
[16:10:39] (it seems as if sssd was trying to use a different version of the dbus libs)
[16:10:51] andrewbogott: in unrelated news, trixie images for toolforge are out. should I wait for you to finish the cloud-vps image to send a single combined announcement?
[16:11:19] yes, probably -- the base images should be ready soon
[16:11:37] for example, sssd works properly on the trixie VM I just built...
[16:11:57] want to try rebooting the bastion?
[16:12:08] any reviewers for the alert https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/34 ? if I merge it soon-ish I'll be able to test it with the current tools status (otherwise we will have to wait for kyverno to fail again to test in production)
[16:14:09] sssd package versions are the same on that bastion as on a different working host
[16:14:17] dcaro: looks reasonable, so +1
[16:15:17] and dbus
[16:15:26] thanks!
[16:16:00] oh but the bastion doesn't have dbus-user-session
[16:22:24] the temptation to throw away this bastion and just build a fresh one is very strong
[16:23:22] * dcaro nods in sympathy
[16:24:25] do it on trixie :D
[16:24:45] 🚢
[16:25:29] want to make sure we can actually build a working sssd on bookworm first...
[16:28:14] can
[16:43:13] ok, trixie bastion is live, you'll need to adjust your host keys accordingly
[16:44:02] 🎉
[16:50:18] woo-hoo
[16:50:39] andrewbogott: can you approve this for the trove project? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/260
[16:50:49] * dcaro off
[16:50:52] side note: I don't think we need the network policies in this case
[16:51:09] so we should maybe modify the cookbook to skip creating those
[16:51:28] cya tomorrow! (or later 🏖️!)
[16:51:34] trove-only projects also raise so many other questions, like: can they graduate to full projects later?
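The bastion comparison described above (broken host vs. a known-good one), written out as a minimal sketch; nothing exotic, it just retraces the checks mentioned in the log.

```sh
# is sssd up, and what is it complaining about?
systemctl status sssd
journalctl -u sssd -b --no-pager | tail -n 20

# compare the relevant packages against a working host; dbus-user-session turned out
# to be missing on the old bastion
dpkg -l sssd dbus dbus-user-session
```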
[16:51:43] dcaro: see you next week :)
[16:52:41] I also started thinking: should we consider "object-storage only projects", and so on?
[16:53:28] tl;dr I think we should go ahead with this one, but we need a clearer strategy for the future
[16:54:35] dhinus: looks ok to me!
[16:54:51] please note the plan shows a change in codfw, resetting the number of floating ips
[16:54:52] I have, in turn, this silly patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178038/1/modules/ldap/manifests/client/sssd.pp
[16:55:04] is it gonna break your tests with magnum?
[16:55:29] dhinus: it's OK, I was trying an active/active thing but I'm done with that for now.
[16:55:39] ack
[16:56:52] btw an incremental/testing version of that sssd patch is why the old bastion broke, i think
[16:57:06] sssd just won't start at all if it doesn't like the conf permissions. Picky.
[17:03:20] taavi: what is the magic sauce to generate this? https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org
[17:03:41] I found "gen_fingerprint" but it gives me an empty output, maybe because of trixie?
[17:03:43] dhinus: run `gen_fingerprints` on the host
[17:04:03] I thought gen_fingerprints was broken on cloud, but maybe it's just broken on trixie?
[17:04:19] oh I guess it might need updating for the sshd config changes happening in our puppet setup in trixie+
[17:04:43] makes sense
[17:04:54] no rush
[17:05:00] i might have a simple fix, one second
[17:07:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178043
[17:08:22] I'll apply that patch manually to the new bastion and see if it works
[17:09:12] yay
[17:09:14] thanks :)
[17:11:15] can I delete the fingerprints for the old bastion-codfw1dev-04 or do you plan to resurrect it?
[17:11:43] Um... let's keep them in case there's a trixie disaster
[17:11:58] should bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org have an AAAA record now as well?
[17:12:50] can't hurt
[17:12:59] well, I guess it could. But let's do it anyway
[17:13:12] page updated: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org
[17:25:44] trixie images are now public in both eqiad1 and codfw1dev
[17:25:52] Going to refresh bookworm and bullseye while I'm at it
[17:26:51] oh, wait, i take it back, going to do one more build for eqiad1
[17:50:57] andrewbogott: the trove project is created, and I added a new section https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#Creating_Trove-only_projects
[17:51:11] I also linked to that section from the clinic duties wiki
[17:53:21] great, thank you!
[17:54:45] and with that, I'm off! see you next week :)
[17:57:38] * andrewbogott waves
[18:00:52] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178038 broke puppet on tools-sgebastion-10 (buster)
[18:01:15] dammit
[18:01:19] what was the mode on Buster?
[18:01:27] Aug 12 17:44:33 tools-sgebastion-10 puppet-agent[28602]: (/Stage[main]/Ldap::Client::Sssd/File[/etc/sssd/sssd.conf]/mode) mode changed '0600' to '0640'
[18:01:33] 0600 if I'm reading that correctly
[18:01:57] ...is buster not 'le' bookworm?
[18:02:56] it is, but it is not ge('bullseye')
[18:04:05] ooooh
[18:04:07] ok will fix
[18:07:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1178055/1/modules/ldap/manifests/client/sssd.pp
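A quick sanity check, as a sketch, for the buster bastion once the follow-up patch (1178055) is merged: confirm puppet no longer flips the mode and that sssd still starts with it.

```sh
sudo puppet agent --test                     # should converge without touching /etc/sssd/sssd.conf
stat -c '%a %U:%G %n' /etc/sssd/sssd.conf    # expect 600 on buster
systemctl is-active sssd
```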
[18:07:51] can you do a pcc?
[18:08:26] sure
[18:15:58] *annoyed*
[18:23:49] ok, finally found one of each OS that pcc actually believes in https://puppet-compiler.wmflabs.org/output/1178055/4677/