[08:28:08] another theory: what if the cluster was bootstrapped with the feature flag enabled, but it was later removed from the kubeadm config, then some upgrade dropped the flag from the apiserver in toolsbeta, but the feature is still activated
[08:29:33] also on lima-kilo?
[08:29:50] the kubeadm configmap in toolsbeta doesn't mention it
[08:30:32] I hate that I can no longer find the docs for k8s 1.24 online, to know things like what their defaults were back then
[08:31:51] if you remember the url maybe the wayback machine might help
[08:33:29] https://web.archive.org/web/20240305110217/https://v1-24.docs.kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
[08:35:17] wait
[08:35:31] https://www.irccloud.com/pastebin/z5NmN5Ep/
[08:35:36] so toolsbeta shows the same problem
[08:35:57] and so does lima-kilo
[08:35:59] https://www.irccloud.com/pastebin/6WfZPDAt/
[08:36:25] so the question is: why does webservice shell have this problem, while webservice start generates a valid deployment spec?
[08:36:39] interesting, it worked for me in lima-kilo
[08:36:44] I might be missing some setup :/
[08:37:11] https://www.irccloud.com/pastebin/qNrIvhlK/
[08:37:29] which version of toolforge webservice?
[08:37:51] oooohhh, wrong version of the client yes
[08:38:07] awesome, so that will be caught by functional tests
[08:39:49] https://www.irccloud.com/pastebin/Y99uf6EX/
[08:40:13] so webservice start generates a Deployment that can have a procMount entry
[08:40:19] but not webservice shell
[08:42:45] hmm, so the problem is creating the pod directly, instead of through a deployment?
[08:43:35] very weird
[08:46:57] with this diff, the shell gets created
[08:47:00] https://www.irccloud.com/pastebin/FbQbrRLH/
[08:47:07] https://www.irccloud.com/pastebin/Pi6Z2egH/
[08:47:21] given this is a non-default feature flag, I think I will drop that entry entirely
[08:48:03] added the test https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/347, we can't merge until solved though xd
[08:48:44] I was thinking about that value also, but jobs-api has the same value too
[08:48:53] so I guessed it was ok
[08:49:02] maybe the value in the deployment is different than in the pod?
[08:49:59] yep
[08:50:02] https://www.irccloud.com/pastebin/vREq8M7K/
[08:50:12] from https://kubernetes.io/docs/concepts/security/pod-security-standards/
[08:50:26] the value is `Default`, not `DefaultProcMount`
[08:51:06] yeah, the documentation is misleading
[08:51:14] because here https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#securitycontext-v1-core it says `DefaultProcMount`
[08:51:57] https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/43 <-- proposal
[08:52:27] they use `Default` here too https://kyverno.io/policies/pod-security/baseline/disallow-proc-mount/disallow-proc-mount/
[08:53:16] feels to me like a typo (`DefaultProcMount` would be the golang variable name I'd set for the actual string value xd)
[08:53:30] https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/97 <-- similar proposal
[08:54:18] yep
[08:54:20] https://www.irccloud.com/pastebin/zZjanDDw/
[08:54:39] https://github.com/kubernetes/api/blob/master/core/v1/types.go#L7465
[08:54:41] :-(
[08:56:10] that definitely explains why I introduced the code in the first place. I think the different behavior in jobs/webservice/shell remains a mystery, though
[08:56:11] +1d
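To make the value confusion concrete: the Pod API accepts the literal strings `Default` and `Unmasked` for `securityContext.procMount`, while `DefaultProcMount` is only the name of the Go constant that holds `"Default"` (the types.go link above). Below is a minimal sketch of a pod spec fragment carrying the valid value, using the official kubernetes Python client purely as an illustration; the client library, uid and image are assumptions, not necessarily what tools-webservice actually uses.

```python
# Sketch only: a container security context with the valid procMount wire
# value "Default". Sending "DefaultProcMount" (the Go constant name) makes
# the apiserver reject the pod, which is what broke `webservice shell`.
from kubernetes import client

security_context = client.V1SecurityContext(
    run_as_user=52503,                # hypothetical tool uid
    run_as_group=52503,               # hypothetical tool gid
    allow_privilege_escalation=False,
    proc_mount="Default",             # valid values: "Default", "Unmasked"
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="shell-example"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="webservice",
                image="example-image:latest",  # hypothetical image
                security_context=security_context,
            )
        ]
    ),
)
```

Since ProcMountType is a non-default feature gate, dropping the field entirely (as the two MRs above do) is just as valid; the sketch only illustrates why `DefaultProcMount` was rejected.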
[08:56:38] might be the deployment vs pod part, maybe the deployment just ignores the value or something if it's wrong
[08:56:42] (or unsupported)
[08:57:00] we can play with that :)
[08:57:35] that makes a lot of sense
[08:57:50] after all the deployment just contains a template
[08:58:06] it doesn't make sense for k8s to generate an invalid pod from a template it doesn't understand
[08:59:07] hmm, cronjobs are created with `Default`
[08:59:09] https://www.irccloud.com/pastebin/xW3WgbMO/
[08:59:30] yeah, from the template
[08:59:32] makes sense
[09:00:07] thanks for the assistance dcaro, I'll update the tickets, redeploy etc
[09:00:09] https://www.irccloud.com/pastebin/liV44nbN/
[09:00:16] hmm, deployments too :/
[09:00:34] if the docs for the latest k8s versions still have the typo we may want to send a bug report
[09:02:44] I think they do, you can also send a patch :)
[09:03:01] will do
[09:12:25] oh! T362050 now works on gitlab MR descriptions :)
[09:12:25] T362050: toolforge: review pod templates for PSP replacement - https://phabricator.wikimedia.org/T362050
[09:15:41] you mean the link?
[09:17:46] yep, now `TXXXX` will show as a link in the UI
[09:17:56] (or it does for me at least)
[09:22:58] yeah, same here
[09:25:08] tools-k8s-ingress-9 seems down (alert)
[09:25:58] went away
[09:26:41] was it rebooted?
[09:26:46] https://www.irccloud.com/pastebin/73rM7LC2/
[09:27:00] dcaro: ongoing OVS migration
[09:27:14] aaaahhh, okok
[09:27:20] * dcaro should look in feed
[09:27:20] that new alert makes me lose heartbeats
[09:27:33] maybe we should make it a bit less sensitive
[09:27:37] maybe we should have a warning one for short losses xd
[09:27:39] and only trigger after, say, 15 minutes
[09:27:54] (or add the silence to the cookbook)
[09:28:23] the cookbook already silences the normal hostdown alert, but the labels are different on the haproxy one :/
[09:29:29] something similar happens with ceph, I remember having to do some shenanigans (I still have to redo that part as the monitoring code changed xd)
[09:30:06] I used to add a `service` label iirc, and silence by that
[09:30:21] might not be useful for this one though, might be too wide
[09:31:34] dhinus: has the toolsdb replica caught up with the primary already?
[09:32:19] taavi: apparently not :/
[09:32:33] still going up
[09:32:51] fun
[09:33:15] currently at 3.67 days, it's not the first time it gets so high, but this seems like quite a bad transaction
[09:33:39] I think we could possibly skip it, but I'd wait a couple more days
[09:33:57] how long do we keep binlogs for?
[09:34:01] 14 days
[09:34:24] and worst-case scenario, recreating a new replica from scratch now takes about 12 hours
[09:35:06] I'm tempted to create a cookbook so you could just launch it, and when the new replica is ready drop the old one
[10:32:38] so some openstack apis turn object-like data into a string with a format like this:
[10:32:42] "port_details": "admin_state_up='True', device_id='1870a647-224b-4b60-b16f-64714e15cf7b', device_owner='compute:nova', mac_address='fa:16:3e:18:0e:85', name='', network_id='7425e328-560c-4f00-8e99-706f3fb90bb4', status='ACTIVE'", 
[10:33:01] do we have a parser for that format in wmcs-cookbooks or in python more generally?
[10:33:44] :-(
[10:34:00] why can't it emit JSON?
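A minimal sketch of the kind of parser being asked about for that flattened `port_details` string, assuming the quoted values never contain single quotes themselves (true of the example above, though not guaranteed for every OpenStack field); note that everything stays a plain string, including `'True'`.

```python
import re

# Matches key='value' pairs in strings like the "port_details" example above.
# Assumes values never contain single quotes themselves.
_PAIR_RE = re.compile(r"(\w+)='([^']*)'")


def parse_flattened_details(raw: str) -> dict[str, str]:
    """Parse an OpenStack "key='value', key='value'" string into a dict."""
    return dict(_PAIR_RE.findall(raw))


port_details = (
    "admin_state_up='True', device_id='1870a647-224b-4b60-b16f-64714e15cf7b', "
    "device_owner='compute:nova', mac_address='fa:16:3e:18:0e:85', name='', "
    "network_id='7425e328-560c-4f00-8e99-706f3fb90bb4', status='ACTIVE'"
)
details = parse_flattened_details(port_details)
assert details["status"] == "ACTIVE"
assert details["name"] == ""                 # empty values stay empty strings
assert details["admin_state_up"] == "True"   # a string, not a bool
```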
[10:35:14] no idea
[10:38:58] that looks quite scrambled xd
[10:40:09] it's not even just `repr(myobj)`, so you can't evaluate it directly
[10:40:58] (unless the object itself was already a bit messed up, that `'True'` string might actually be a string in the code, not a bool?)
[10:41:22] i mean i can just write a simple parser for it, but that just feels like an unnecessary thing to do if i can avoid it
[10:41:54] there's some code dealing with cloudnets, though I don't remember if there's a parser there for it
[10:51:59] * dcaro lunch
[12:30:41] I think I've run out of things to test/check/verify before setting kyverno policies to Enforce
[12:31:01] btw, I'm just migrating the last toolforge k8s worker node to OVS :P
[12:31:10] taavi: nice
[12:33:41] yeah great milestone :)
[12:38:05] \o/
[12:39:20] arturo: happy to help if you want a second pair of eyes, do you have a list of things you tested?
[12:39:48] I don't have a written-down list, but I can brain-dump
[12:40:29] 1) re-read the upstream docs about what happens if you set policies to enforce while there are offending resources defined in the cluster. This is the case of webservices defined by older versions of the CLI (same for jobs)
[12:40:58] 2) performance impact of changing thousands of policies from Audit to Enforce
[12:43:00] 3) templates generated by newer jobs/webservices are valid for the new policies
[12:43:21] 4) functional tests pass pre/post changes, in lima-kilo
[12:43:32] am I missing something obvious?
[12:48:52] * dhinus paged: metricsinfra alertmanager down
[12:49:11] VM migrating
[12:49:21] hm?
[12:49:24] Get "https://metricsinfra-alertmanager-rw.wmcloud.org/api/v2/status": context deadline exceeded
[12:49:37] proxy-03 migrated to OVS
[12:50:15] arturo: do you have a list of the current objects not matching the policies?
[12:50:59] arturo: there's keepalived, the migration should cause maybe a split-second outage
[12:51:05] that URL loads just fine for me
[12:51:13] dcaro: I don't have that list but we could craft it from `kubectl get policyreport -A`
[12:51:45] nice 2.31-13+deb11u10
[12:51:52] oops T368394
[12:51:52] T368394: MetricsinfraAlertmanagerDown - https://phabricator.wikimedia.org/T368394
[12:53:02] arturo: do you expect the migration to take the host down for more than 5 mins?
[12:54:03] "karma_alertmanager_up" went from 1 to 0 at 12:42
[12:54:15] (~12 min ago)
[12:54:27] ^that's for me UTC-local timezone xd
[12:54:38] dhinus: no :-(
[12:54:50] dcaro: sorry, should've used UTC :)
[12:55:55] is this an incident? not really affecting users, is it?
[12:56:33] I won't be doing the change for T368141 today. I need to have the full work day ahead of me. I'll try tomorrow early in the morning
[12:56:33] T368141: toolforge: kyverno: change policies to Enforce - https://phabricator.wikimedia.org/T368141
[12:58:43] dhinus: I'd say it is, though not user-facing per se (it means we have no alerts, so we have no visibility)
[12:59:04] * arturo food
[12:59:08] ok I'll be the IC, creating a doc
[12:59:08] though proxy-03 might also affect any cloudvps web proxy, no?
[12:59:34] that URL loads just fine for me, this seems like a karma glitch?
[12:59:48] https://prometheus-alerts.wmcloud.org/?q=%40state%3Dactive
[12:59:51] ^that works yeps
[13:00:06] the alert is back to green
[13:00:07] the alert is gone also
[13:03:00] I guess that in summary I would say that MetricsinfraAlertmanagerDown should be an incident if not planned, yes (I'm ok with it being a #page also)
[13:03:21] puppet errors popped up now on tools projects
[13:03:52] dcaro: please type # page (with a space) or something instead, people stalk the actual thing on irc
[13:04:01] i'll have a look
[13:04:31] RhinosF1: sorry
[13:04:31] ping?
[13:04:43] oh, now, I thought I lost connection to irc again
[13:05:12] dcaro: agreed, pages should be incidents. I am resolving the incident without creating a doc for this time, as it was really short
[13:05:32] we can use T368394 for tracking what happened
[13:05:33] T368394: MetricsinfraAlertmanagerDown - https://phabricator.wikimedia.org/T368394
[13:06:53] looks like https://gerrit.wikimedia.org/r/c/operations/puppet/+/1049446 caused a bunch of 'Systemd restart for sssd-nss failed!' failures
[13:07:12] so the alert itself will clear with the next puppet run, although it'd be interesting to figure out why the restarts failed
[13:07:38] RhinosF1: would you find it useful if we used # page in this channel when there's an actual page (like 20 mins ago)? or would it be noise? I generally use "/me paged" (copying from what I saw other people use)
[13:07:47] taavi: thanks
[13:08:10] dhinus: if there's an actual incident then actual alerts or humans wanting attention is fine
[13:08:18] dhinus: the alertmanager-irc-relay message on -feed already includes the magic string
[13:08:45] taavi: right, so probably that one is enough
[13:09:21] if one page has gone off already, more notifications don't really matter. I know a fair few members of SRE stalk the # page and if you don't have klaxon it nearly always gets attention during business hours
[13:10:13] RhinosF1: gotcha
[13:10:58] btw, VictorOps alerts always take a bit longer, but they do autoresolve. it just went back to green.
[13:19:57] @dcaro re: the package version script, as a nice-to-have feature, I'm thinking it would be useful to show the versions in different colors depending on whether they are the same or different than in toolforge-deploy. wdyt?
[13:22:04] blancadesal: I thought the same :) +1 from me (I was also thinking that eventually it would be useful in production too, so keep an eye on making it re-usable there and such, not yet very relevant though)
[13:26:20] dcaro: 👍 I'll put it on my todo list then, but will probably not get to it immediately
[13:27:51] np
[14:19:37] "stray dns records" are increasing (from 12 to 20), is it expected?
[14:21:10] maybe it's caused by the VM migrations?
[14:21:17] sounds weird though
[14:25:17] dhinus: that's likely just people deleting old buster VMs
[14:25:38] The numbers at https://os-deprecation.toolforge.org/ went down a bit
[14:34:16] fyi we have a new envvars-api dashboard (copied the builds-api one that you made dhinus) https://grafana-rw.wmcloud.org/d/8H1LfdwSz/envvars-api?orgId=1&from=now-7d&to=now
[14:56:23] dcaro: nice! it's missing the tags (that will make it appear in the dropdowns from other dashboards)
[14:56:52] 🤦‍♂️ they were in the json :/, maybe it can't be set from the raw json
[14:57:05] arturo, dcaro: thanks for crawling into that deep rabbit hole to fix `webservice shell`. Computers were obviously a mistake. ;)
[14:57:40] bd808: I hope you enjoyed the explanation of the mystery :-P
[15:27:44] arturo: if I do `kubectl get event -A --watch` on a toolforge bastion I see some 'PolicyViolation' events popping up, is that something to be worried about?
[15:28:06] taavi: will review
[15:28:16] I don't know
[15:32:15] I see ~6.5k objects failing to validate (I think):
[15:32:17] https://www.irccloud.com/pastebin/Rg93KqPc/
[15:32:51] or maybe validation errors in the sense of "rules that don't pass", more than objects that don't pass
[15:33:29] can you find any that are not related to the `fsGroup` field?
[15:33:50] there may be a typo somewhere
[15:33:52] maybe, that needs to dump the whole report right? (describe maybe?)
[15:33:55] I could not find any running this
[15:33:56] aborrero@tools-k8s-control-7:~$ sudo -i kubectl get event -A | grep PolicyViolation | grep -v fsGroup
[15:35:25] well, it is one of the few attributes not marked as optional in the kyverno policy:
[15:35:26] https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/blob/7412b2b09f890e20e0b188770ccf799a8d3587a8/maintain_kubeusers/resources/kyverno_pod_policy.yaml.tpl#L77
[15:36:27] let me change that
[15:36:59] it seems all of them are about that, yes
[15:37:02] https://www.irccloud.com/pastebin/EQqxOLMY/
[15:38:44] what happens if you don't set that field?
[15:39:36] If unset, the Kubelet will not modify the ownership and permissions of any volume.
[15:39:43] A special supplemental group that applies to all containers in a pod. Some volume types allow the Kubelet to change the ownership of that volume to be owned by the pod: 1. The owning GID will be the FSGroup 2. The setgid bit is set (new files created in the volume will be owned by FSGroup) 3. The permission bits are OR'd with rw-rw---- If unset, the Kubelet will not modify the ownership and permissions of any volume.
[15:40:20] I don't know what all that means, but the 'if unset' case seems safe to me
[15:40:26] so that seems irrelevant with our hostPath mounts?
[15:41:05] we mainly care that the tool pod runs as the tool user/group id, but I think the runAsUser/runAsGroup fields set that already
[15:42:54] agreed
[15:43:00] https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/50
[15:44:25] I'll deploy this change, then review tomorrow for any additional policyreports
[15:47:23] could any of you please +1 that change?
[15:47:35] it might even be bad for us if it's able to change the ownership of mounts to the tool ones xd
[15:48:03] taavi: examples of VMs I can't access are tf-bastion.testlabs.codfw1dev.wikimedia.cloud, cloudinfra-db.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud -- that's unrelated to ovs things, right?
[15:48:14] no idea
[15:48:18] have you tried rebooting them?
[15:48:50] andrewbogott: not even with serial console?
[15:49:20] I haven't investigated yet, just checking to make sure things aren't in a known in-between state.
[15:50:04] ack
[15:51:03] dcaro: thanks for the +1
[15:51:07] I'm deploying now
[15:53:37] * arturo is enjoying recently discovered terminator layouts
[15:57:13] I moved to byobu/tmux so I could re-join the layouts when sshing on my laptop xd
[15:57:57] but I used terminator for a long time as my terminal
[15:58:04] quite useful :)
[15:59:12] ok, new policy being deployed by maintain-kubeusers with the fsGroup change
[15:59:23] tomorrow I will check again, and set to enforce
[15:59:33] does kyverno emit any prometheus metrics?
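For reference on the fsGroup point above: what the tool pods keep relying on is the pod-level `runAsUser`/`runAsGroup`, and leaving `fsGroup` unset simply means the kubelet won't try to re-own any volumes, which is the behavior wanted for the hostPath mounts used here. A short sketch of that shape, again assuming the kubernetes Python client and hypothetical uid/gid values:

```python
# Sketch only: pod-level security context without fsGroup, matching the
# maintain-kubeusers policy change discussed above. The uid/gid values are
# hypothetical placeholders for a tool's LDAP ids.
from kubernetes import client

pod_security_context = client.V1PodSecurityContext(
    run_as_user=52503,    # pod processes run as the tool user
    run_as_group=52503,   # and as the tool group
    run_as_non_root=True,
    # fsGroup intentionally omitted: when unset, the kubelet does not modify
    # the ownership or permissions of any volume, and hostPath volumes are
    # not subject to fsGroup-based re-owning in any case.
)
```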
[15:59:45] yes
[16:00:18] https://kyverno.io/docs/monitoring/
[16:00:22] but it's not enabled at the moment
[16:00:34] we probably want to get those to tools-prometheus
[16:00:48] sure
[16:02:42] ^ yep, both being up and stats of failures and such would be really helpful when tracking changes :)
[16:03:42] agreed
[16:03:52] * arturo offline
[16:14:38] taavi: fyi, my (ambitious) plan for today is to make cloudvirt200[56]-dev into ovs nodes, and drain and decom cloudvirt200[12]-dev. Do you think it's useful to leave some linuxbridge things running in codfw1dev for now?
[16:19:11] nah, we can go fully ovs there I think
[16:20:21] ok
[16:24:58] taavi: also, when you have a minute please note which of these need to be saved vs. deleted: https://etherpad.wikimedia.org/p/taavi's_orphans (or delete them yourself if you're feeling ambitious)
[16:26:04] andrewbogott: feel free to delete it all
[16:26:35] Ok!
[17:05:19] this is new: `Project redirects instance redirects-nginx02 is down`, is that something you are working on?
[17:06:56] not me
[17:21:57] just rebooted it, but I'm not sure what it's supposed to do (I'm not a member of the project)
[17:23:38] oh, it redirects a bunch of *.wmflabs.org to the right things (like horizon for example)
[17:24:41] it's working as expected (I think), we might want to monitor and upgrade the VMs there at some point
[17:24:45] * dcaro off
[17:33:47] taavi: any idea where I would find the keyholder passphrase for enc-1.cloudinfra-codfw1dev.codfw1dev.wikimedia.cloud ?
[17:37:59] dcaro: mutante started the os upgrade in the redirects project last week, but hit some snag. I haven't gotten around to looking at it yet, but it is on my list of TODOs for the week.
[17:38:35] this is yet another project that could be made "official" by handing it off to WMCS if y'all are somehow looking for more things to do. ;)
[19:17:59] andrewbogott: most likely in a plain text file in the "private" git repo. if not, it's probably somewhere in my password manager and I should move it to the private repo or to pwstore
[20:55:57] If it's in the private repo I don't see it (or don't know where to look)
[21:07:45] found it :)