[06:51:41] good morning :)
[06:52:00] so, to pick up from last week's istio deployment, this is what get events returns to me
[06:52:03] Error creating: pods "istiod-7c49cdc6bd-" is forbidden: unable to validate against any pod security policy: []
[06:52:11] that explains why istio pods are not up :)
[07:02:09] elukey: I guess you defined no PSP
[07:04:12] hm.. I thought you had PSPs disabled in the ml cluster in the first place
[07:11:40] hello :)
[07:12:28] I think that psp and stuff like GlobalNetworkPolicies were left TBD
[07:17:29] I am reading https://istio.io/v1.9/docs/ops/deployment/requirements/, but it seems that by default (without the CNI plugin) istio runs an init-container that requires NET_ADMIN to operate
[07:24:26] I am going to try to see if I can come up with some psp settings for ml-serve
[07:24:55] knative may need something similar as well
[07:25:14] (and kfserving, which creates pods when needed etc.)
[07:38:28] elukey: if you want to test your setup first, you may disable PSPs via hiera for your cluster
[07:42:06] jayme: ah this would be nice indeed, I guess that I'd need to change puppet + restart the kubemasters right?
[07:42:50] elukey: Relevant services should auto-restart on puppet run
[07:42:57] nice
[07:42:58] if that's not the case, let me know :)
[07:45:12] I'd be a little worried though about how to test the psp afterwards (I guess probably removing the whole stack and re-deploying would be the best test)
[07:45:51] I'd like to avoid discovering on a sunday morning in say 2 months that I need an obscure psp rule :D
[07:46:10] (because a pod for some reason dies and restarts)
[07:46:52] elukey: pods that are running will not be affected when you enable PSP again, that's correct. But if you kubectl delete, they will error out again if you miss PSPs
[07:47:34] if you want to be sure, you should clean out your etcd after you're done testing and re-deploy everything from scratch
[07:48:18] but: as PSPs are going to go away in the future anyway, investing much time into that might not be worth it
[07:52:11] I was reading that yes, from 1.21 right?
[07:52:43] Is it due to a simpler replacement or just that PSPs are too heavy/not useful in general?
[08:08:43] PodSecurityPolicy is deprecated as of Kubernetes v1.21, and will be removed in v1.25. It's planned to be replaced by a new admission controller (https://github.com/kubernetes/enhancements/issues/2579). My understanding is that the new admission controller should be more lightweight and simpler to use, and for more advanced use cases you have to use an external admission controller
[08:15:42] so it's a net loss of functionality for us
[08:15:48] as usual :P
[08:23:26] yeah...it
[08:23:56] it's bad and we probably have to settle for yet another complex piece of software to manage that instead
[08:33:29] I used Open Policy Agent (OPA) in the past to replace PSPs. But OPA and the policy language (Rego) really are yet another complex piece of software
[08:35:28] yeah, seems like that is evolving into the de-facto standard.
[08:36:06] we probably want something like that anyway in the future to ensure things like "nothing is running with tag :latest" etc.
[08:36:43] but for now, at least in prod clusters, we can stick to the PSP implementation we already have and keep the migration as a future-fun-project :P
[08:44:25] serviceops, GitLab, SRE, vm-requests: codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (Dzahn) @MoritzMuehlenhoff No worries, 100% with you here. The only reason to do it like that was that I didn't have access to look it up and was planning to use it as an example...
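For reference on the NET_ADMIN point above: without the Istio CNI plugin, the istio-init container rewrites iptables rules inside the pod's network namespace, so a PSP covering the istio-system namespace has to allow the NET_ADMIN and NET_RAW capabilities. A minimal sketch of what such a policy could look like (the name, capability list and volume list are assumptions for illustration, not the policy that was eventually merged into deployment-charts):

```yaml
# Sketch only: a PSP for istio-system permitting the istio-init capabilities.
# All names here are hypothetical.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: istio-system-psp
spec:
  privileged: false
  allowPrivilegeEscalation: true        # istio-init runs iptables as root
  allowedCapabilities:
    - NET_ADMIN
    - NET_RAW
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - configMap
    - emptyDir
    - secret
    - projected
```

The policy alone is not enough: the istio-system service accounts also need a Role or ClusterRole granting the `use` verb on it, which is the part that gets wired up through helmfile_psp.yaml as discussed further down.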
[08:56:08] jayme: what is the plan for this coming fiscal? K8s 1.20?
[08:56:21] if so I think that we should be really good with psps for a while
[08:56:53] maybe the ML team will need to jump to the latest kubernetes due to a breaking change + regression-bugs in istio/knative/kfserving
[08:57:15] so 1.20 might not be up to date enough
[08:57:21] * elukey stops the sarcasm here
[08:57:49] elukey: we've not decided on a version yet tbh
[08:58:22] I am saying it as a half joke, istio 1.9.5 (fixes all outstanding CVEs) does not officially support 1.16
[08:58:35] they say "use it and enjoy the unsupported land"
[08:59:00] knative 0.18 is the last one for 1.16, no idea if newer versions work
[08:59:10] and kfserving in theory should be relatively good on 1.16
[08:59:41] so if you want to experiment with 1.2x on a cluster you know which one to choose :D
[09:28:57] eheh, yeah. I think we do want to go 1.2x this FY but we will have to do the helm2 -> helm3 migration first as helm2 does not support k8s >1.16
[09:34:59] makes sense yes
[09:35:59] <_joe_> and that migration consists of "how to model RBAC for deployments in a way that makes sense for us"
[09:36:14] <_joe_> AIUI, given I use helm3 without issues for all of our charts on my minikube
[09:43:09] serviceops, MW-on-K8s, SRE: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (Joe)
[09:43:49] serviceops, MW-on-K8s, SRE, observability: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Joe) a: Joe→None
[09:46:09] serviceops, MW-on-K8s, SRE, User-jijiki: Enable TLS termination on the mwdebug deployment. fix the service definition in the chart - https://phabricator.wikimedia.org/T284421 (Joe) Open→Resolved Boldly resolving, I think this was done.
[09:46:15] serviceops, MW-on-K8s, SRE, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe)
[09:46:58] serviceops, MW-on-K8s, SRE, Patch-For-Review: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (Joe) a: jijiki
[09:47:20] serviceops, MW-on-K8s, SRE, Patch-For-Review: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (Joe) @jijiki I think this task is resolved as well, correct?
[09:50:53] serviceops, MW-on-K8s, SRE, Patch-For-Review: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (jijiki) Open→Resolved
[09:50:56] serviceops, MW-on-K8s, SRE, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (jijiki)
[09:52:14] serviceops, MW-on-K8s, SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) First tests with the staging version of the mwdebug deployment, and I get the following non-encouraging timings (in ms, approximated from multiple runs): | page | k8s staging | mwd...
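Picking up _joe_'s point that the helm2 -> helm3 migration is mostly about modelling RBAC for deployments: with helm3 there is no Tiller, so whatever credentials run `helm` need namespace-scoped rights of their own. A purely illustrative sketch of that shape (the namespace, user name and resource list are invented for the example, not the model that was actually adopted):

```yaml
# Illustrative only: namespace-scoped RBAC for a hypothetical helm3 deployer.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deploy
  namespace: mwdebug              # example namespace
rules:
  - apiGroups: ["", "apps", "batch", "networking.k8s.io"]
    resources: ["pods", "services", "configmaps", "secrets", "deployments", "jobs", "networkpolicies"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deploy
  namespace: mwdebug
subjects:
  - kind: User
    name: mwdebug-deploy          # hypothetical deploy user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deploy
  apiGroup: rbac.authorization.k8s.io
```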
[09:57:06] serviceops, MW-on-K8s, SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) @joe appservers are running onhost memcached, which can be a factor for this specific test: https://phabricator.wikimedia.org/T263958#6510350
[10:09:46] serviceops, CX-cxserver, Wikidata, wdwb-tech, Language-Team (Language-2021-April-June): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (Nikerabbit) Usually there is a request...
[10:09:51] jayme: one question about psp - should I try to add the ones for the istio-system namespace to helmfile_psp.yaml or is there a better scope/placement? (I still haven't understood how helmfile_psp.yaml fits into the deployment-charts/helm machinery, namely where/how it gets called)
[10:10:34] of course calico and core-dns run in kube-system, so I'd need to create something different for the istio-system namespace
[10:29:47] ahh I see from https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Add_RBAC/PSPs
[10:30:19] so probably I should find a way to create a pod-security-policy for ml-serve
[10:30:22] okok
[10:37:56] old but interesting - https://github.com/istio/istio/issues/6806
[10:38:17] in theory, since we don't use sidecar injection, it should be easier to design a psp
[10:38:31] also avoiding stuff like https://github.com/istio/istio/tree/master/cni
[10:38:43] but for future use cases, things might get a little complicated
[10:39:31] (bbl)
[12:37:13] serviceops, SRE, docker-pkg, Patch-For-Review: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (Jelto) a: Jelto
[12:50:29] serviceops, SRE, docker-pkg, Patch-For-Review: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (Jelto) I merged and rolled out the [additional parameter for intra-service dependencies in systemd::timer::job](https://gerrit.wikimedia.org/r/701525) and [the chang...
[13:01:05] elukey: yeah, that might get more complicated. As you've probably figured out already, the helmfile_psp.yaml is included in the helmfile.yaml
[13:02:45] the thing is: it's currently not separated by cluster. So you would have to figure out a way to have that separated
[13:12:06] I did the spot check of appserver type distribution eqiad vs codfw before the switch later on: we are going 63 -> 64 machines for app and API and 24 -> 22 machines for jobrunner/videoscaler. Only minor differences in the weight/pool status for the dedicated videoscaler/jobrunner pairs we did. https://phabricator.wikimedia.org/P16732
[13:12:19] legoktm: ^ just to confirm
[13:28:48] jayme: lovely
[13:30:44] <_joe_> elukey: usually the cluster name is exposed as .Environment.Name in helmfile
[13:31:22] <_joe_> and you can change values for each environment anyways
[13:31:53] <_joe_> so you could put feature flags in helmfile_psp.yaml and/or include it only depending on the environment; it should be doable
[13:35:28] mutante: ack, thanks
[13:36:59] _joe_: is it OK if I un-exclude the thanos-* services for the switchover? see https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/701484/1/cookbooks/sre/switchdc/services/__init__.py
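To make _joe_'s 13:30-13:31 suggestion concrete, here is a rough sketch of how helmfile_psp.yaml could pick per-cluster PSP values through the environment name. The release name, chart and file layout are assumptions for illustration, not the actual deployment-charts structure:

```yaml
# helmfile_psp.yaml (sketch, not the real file): per-environment PSP values,
# selected via .Environment.Name as suggested above.
releases:
  - name: pod-security-policies
    namespace: kube-system
    chart: wmf-stable/raw                      # hypothetical chart
    values:
      - psp/common.yaml                        # rules shared by all clusters
      - psp/{{ .Environment.Name }}.yaml       # e.g. psp/ml-serve-eqiad.yaml
```

Alternatively, the whole include could be gated per environment from the main helmfile.yaml, which is the "include it only depending on the environment" variant mentioned above.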
[13:41:32] _joe_: ack thanks will try
[13:42:43] <_joe_> legoktm: uhm yes, but I don't really see the point of conflating the switchover of the o11y layer and the services one
[13:42:54] <_joe_> we can do it by hand this time
[13:43:03] <_joe_> given we have 20 minutes to go time
[13:43:25] <_joe_> btw, who is going to actually do the honors? jayme or legoktm?
[13:44:51] jayme hopefully, if he's still up for it :)
[13:44:51] pressing enter you mean? I thought that was what I volunteered for last monday :)
[13:46:31] jayme: btw, you should run the cookbooks in a tmux session, `sudo -i tmux new -s switchdc`, and then the rest of us can watch with `sudo -i tmux attach -rt switchdc`
[13:46:49] sure
[13:47:19] legoktm: fyi https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1916943&oldid=1916942
[13:47:48] oops, ty
[13:50:55] good morning 👋
[13:52:20] I did set up the switchdc tmux session on cumin1001. Please join with `sudo -i tmux attach -rt switchdc`
[13:52:30] (if you like :))
[13:52:47] * rzl in
[13:52:52] * _joe_ too
[13:53:13] in and lurking
[13:53:19] and please use decent sized terminals :D
[13:53:41] in
[13:54:27] are we coordinating in here, then? we should let the non-serviceopsen know if so :)
[13:54:43] I think we should
[13:54:52] can we move to -operations please?
[13:55:05] ack
[14:08:39] serviceops, SRE, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Legoktm) > and also `filerepo_file_foreign_description` This is something that @ladsgroup and @...
[17:23:42] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (Cmjohnson) a: Cmjohnson→Jclark-ctr @Jclark-ctr Can you please cable mc1039, 1051-1054. I could not find mc1039 so please update netbo...
[17:41:21] serviceops, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Legoktm)
[17:56:06] serviceops, RESTBase, Datacenter-Switchover: Figure out plan for restbase-async w/r database switchover - https://phabricator.wikimedia.org/T285711 (Legoktm)
[19:12:57] serviceops, MW-on-K8s, Release-Engineering-Team, SRE, Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (jeena) A multi-version image with the portals directory has been published to the...
[19:17:10] serviceops, MW-on-K8s, Release-Engineering-Team, SRE, Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (Legoktm) So...is there a reason the portals need to be deployed with MediaWiki? T...
[19:35:32] serviceops, SRE, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (Legoktm) I posted an update on today's switchover to wikitech-l: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/XI57Z6T...
[20:07:33] serviceops, MW-on-K8s, Release-Engineering-Team, SRE, Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (jeena) T238747 for hosting portals as a service hasn't been completed due to not ha...
[21:26:32] serviceops, MW-on-K8s, Release-Engineering-Team, SRE, Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (Joe) >>! In T285325#7182067, @Legoktm wrote: > So...is there a reason the portals...
[21:33:18] serviceops, Release Pipeline, Wikimedia-Portals, Release-Engineering-Team (Seen): Migrate www.wikipedia.org (and other www portals) to be its own service - https://phabricator.wikimedia.org/T238747 (Krinkle)
[22:56:25] serviceops, SRE, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (Legoktm) I did a successful run-through of the live-test mode just now, where we "switch" from codfw -> eqiad. The only issue I ran into is T285519#718237...