[06:51:41] good morning :)
[06:52:00] so, to pick up from last week's istio deployment, this is what get events returns to me
[06:52:03] Error creating: pods "istiod-7c49cdc6bd-" is forbidden: unable to validate against any pod security policy: []
[06:52:11] that explains why istio pods are not up :)
[07:02:09] elukey: I guess you defined no PSP
[07:04:12] hm.. I thought you had PSPs disabled in the ml cluster in the first place
[07:11:40] hello :)
[07:12:28] I think that psp and stuff like GlobalNetworkPolicies were left TBD
[07:17:29] I am reading https://istio.io/v1.9/docs/ops/deployment/requirements/, but it seems that by default (without the CNI plugin) istio runs an init-container that requires NET_ADMIN to operate
[07:24:26] I am going to try to see if I can come up with some psp settings for ml-serve
[07:24:55] knative may need something similar as well
[07:25:14] (and kfserving, which creates pods when needed etc.)
[07:38:28] elukey: if you want to test your setup first, you may disable PSPs via hiera for your cluster
[07:42:06] jayme: ah this would be nice indeed, I guess that I'd need to change puppet + restart the kubemasters right?
[07:42:50] elukey: Relevant services should auto-restart on puppet run
[07:42:57] nice
[07:42:58] if that's not the case, let me know :)
[07:45:12] I'd be a little worried though about how to test the psp afterwards (I guess probably removing the whole stack and re-deploying would be the best test)
[07:45:51] I'd like to avoid discovering on a sunday morning in say 2 months that I need an obscure psp rule :D
[07:46:10] (because a pod for some reason dies and restarts)
[07:46:52] elukey: pods that are running will not be affected when you enable PSP again, that's correct. But if you kubectl delete, they will error out again if you miss PSPs
[07:47:34] if you want to be sure, you should clean out your etcd after you're done testing and re-deploy everything from scratch
[07:48:18] but: as PSPs are going to go away in the future anyway, investing much time into that might not be worth it
[07:52:11] I was reading that yes, from 1.21 right?
[07:52:43] Is it due to a simpler replacement or just that PSPs are too heavy/not useful in general?
[08:08:43] PodSecurityPolicy is deprecated as of Kubernetes v1.21, and will be removed in v1.25. It's planned to be replaced by a new admission controller (https://github.com/kubernetes/enhancements/issues/2579). My understanding is that the new admission controller should be more lightweight and simpler to use, and for more advanced use cases you have to use an external admission controller
[08:15:42] so it's a net loss of functionality for us
[08:15:48] as usual :P
[08:23:26] yeah...it
[08:23:56] it's bad and we probably have to settle for yet another complex piece of software to manage that instead
[08:33:29] I used Open Policy Agent (OPA) in the past to replace PSPs. But OPA and the policy language (Rego) really are yet another complex piece of software
[08:35:28] yeah, seems like that is evolving into the de-facto standard.
[08:36:06] we probably want something like that anyway in the future to ensure things like "nothing is running with tag :latest" etc.
[08:36:43] but for now, at least in prod clusters, we can stick to the PSP implementation we already have and keep the migration as a future-fun-project :P
[08:44:25] serviceops, GitLab, SRE, vm-requests: codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (Dzahn) @MoritzMuehlenhoff No worries, 100% with you here. The only reason to do it like that was that I didn't have access to look it up and was planning to use it as an example...
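For reference on the NET_ADMIN point above: without the Istio CNI plugin, the istio-init container rewrites iptables rules inside the pod's network namespace, so a PSP covering the istio-system namespace has to allow the NET_ADMIN and NET_RAW capabilities. A minimal sketch of what such a policy could look like (the name, capability list and volume list are assumptions for illustration, not the policy that was eventually merged into deployment-charts):

```yaml
# Sketch only: a PSP for istio-system permitting the istio-init capabilities.
# All names here are hypothetical.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: istio-system-psp
spec:
  privileged: false
  allowPrivilegeEscalation: true        # istio-init runs iptables as root
  allowedCapabilities:
    - NET_ADMIN
    - NET_RAW
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
    - configMap
    - emptyDir
    - secret
    - projected
```

The policy alone is not enough: the istio-system service accounts also need a Role or ClusterRole granting the `use` verb on it, which is the part that gets wired up through helmfile_psp.yaml as discussed further down.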
[08:56:08] jayme: what is the plan for this coming fiscal? K8s 1.20?
[08:56:21] if so I think that we should be really good with psps for a while
[08:56:53] maybe the ML team will need to jump to the latest kubernetes due to a breaking change + regression-bugs in istio/knative/kfserving
[08:57:15] so 1.20 might not be up to date enough
[08:57:21] * elukey stops the sarcasm here
[08:57:49] elukey: we've not decided on a version yet tbh
[08:58:22] I am saying it as a half joke, istio 1.9.5 (fixes all outstanding CVEs) does not officially support 1.16
[08:58:35] they say "use it and enjoy the unsupported land"
[08:59:00] knative 0.18 is the last one for 1.16, no idea if newer versions work
[08:59:10] and kfserving in theory should be relatively good on 1.16
[08:59:41] so if you want to experiment with 1.2x on a cluster you know which one to choose :D
[09:28:57] eheh, yeah. I think we do want to go 1.2x this FY but we will have to do the helm2 -> helm3 migration first as helm2 does not support k8s >1.16
[09:34:59] makes sense yes
[09:35:59] <_joe_> and that migration consists of "how to model RBAC for deployments in a way that makes sense for us"
[09:36:14] <_joe_> AIUI, given I use helm3 without issues for all of our charts on my minikube
[09:43:09] serviceops, MW-on-K8s, SRE: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (Joe)
[09:43:49] serviceops, MW-on-K8s, SRE, observability: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Joe) a: Joe→None
[09:46:09] serviceops, MW-on-K8s, SRE, User-jijiki: Enable TLS termination on the mwdebug deployment. fix the service definition in the chart - https://phabricator.wikimedia.org/T284421 (Joe) Open→Resolved Boldly resolving, I think this was done.
[09:46:15] serviceops, MW-on-K8s, SRE, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe)
[09:46:58] serviceops, MW-on-K8s, SRE, Patch-For-Review: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (Joe) a: jijiki
[09:47:20] serviceops, MW-on-K8s, SRE, Patch-For-Review: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (Joe) @jijiki I think this task is resolved as well, correct?
[09:50:53] serviceops, MW-on-K8s, SRE, Patch-For-Review: Add all redis and memcached backends to mw on k8s automatically - https://phabricator.wikimedia.org/T284420 (jijiki) Open→Resolved
[09:50:56] serviceops, MW-on-K8s, SRE, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (jijiki)
[09:52:14] serviceops, MW-on-K8s, SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (Joe) First tests with the staging version of the mwdebug deployment, and I get the following non-encouraging timings (in ms, approximated from multiple runs): | page | k8s staging | mwd...
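Picking up _joe_'s point that the helm2 -> helm3 migration is mostly about modelling RBAC for deployments: with helm3 there is no Tiller, so whatever credentials run `helm` need namespace-scoped rights of their own. A purely illustrative sketch of that shape (the namespace, user name and resource list are invented for the example, not the model that was actually adopted):

```yaml
# Illustrative only: namespace-scoped RBAC for a hypothetical helm3 deployer.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deploy
  namespace: mwdebug              # example namespace
rules:
  - apiGroups: ["", "apps", "batch", "networking.k8s.io"]
    resources: ["pods", "services", "configmaps", "secrets", "deployments", "jobs", "networkpolicies"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deploy
  namespace: mwdebug
subjects:
  - kind: User
    name: mwdebug-deploy          # hypothetical deploy user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: deploy
  apiGroup: rbac.authorization.k8s.io
```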
[09:57:06] serviceops, MW-on-K8s, SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) @joe appservers are running onhost memcached, which can be a factor for this specific test: https://phabricator.wikimedia.org/T263958#6510350
[10:09:46] serviceops, CX-cxserver, Wikidata, wdwb-tech, Language-Team (Language-2021-April-June): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (Nikerabbit) Usually there is a request...
[10:09:51] jayme: one question about psp - should I try to add the ones for the istio-system namespace to helmfile_psp.yaml or is there a better scope/placement? (I still haven't understood how helmfile_psp.yaml fits into the deployment-charts/helm machinery, namely where/how it gets called)
[10:10:34] of course calico and core-dns run in kube-system, so I'd need to create something different for the istio-system namespace
[10:29:47] ahh I see from https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Add_RBAC/PSPs
[10:30:19] so probably I should find a way to create a pod-security-policy for ml-serve
[10:30:22] okok
[10:37:56] old but interesting - https://github.com/istio/istio/issues/6806
[10:38:17] in theory, since we don't use sidecar injection, it should be easier to design a psp
[10:38:31] also avoiding stuff like https://github.com/istio/istio/tree/master/cni
[10:38:43] but for future use cases, things might get a little complicated
[10:39:31] (bbl)
[12:37:13] serviceops, SRE, docker-pkg, Patch-For-Review: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (Jelto) a: Jelto
[12:50:29] serviceops, SRE, docker-pkg, Patch-For-Review: Refresh all images in production-images - https://phabricator.wikimedia.org/T284431 (Jelto) I merged and rolled out the [additional parameter for intra-service dependencies in systemd::timer::job](https://gerrit.wikimedia.org/r/701525) and [the chang...
[13:01:05] elukey: yeah, that might get more complicated. As you've probably figured out already, the helmfile_psp.yaml is included in the helmfile.yaml
[13:02:45] the thing is: it's currently not separated by cluster. So you would have to figure out a way to have that separated
[13:12:06] I did the spot check of appserver type distribution eqiad vs codfw before the switch later on: we are going 63 -> 64 machines for app and API and 24 -> 22 machines for jobrunner/videoscaler. Only minor differences in the weight/pool status for the dedicated videoscaler/jobrunner pairs we did. https://phabricator.wikimedia.org/P16732
[13:12:19] legoktm: ^ just to confirm
[13:28:48] jayme: lovely
[13:30:44] <_joe_> elukey: usually the cluster name is exposed as .Environment.Name in helmfile
[13:31:22] <_joe_> and you can change values for each environment anyways
[13:31:53] <_joe_> so you could put feature flags in helmfile_psp.yaml and/or include it only depending on the environment; it should be doable
[13:35:28] mutante: ack, thanks
[13:36:59] _joe_: is it OK if I un-exclude the thanos-* services for the switchover? see https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/701484/1/cookbooks/sre/switchdc/services/__init__.py
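To make _joe_'s 13:30-13:31 suggestion concrete, here is a rough sketch of how helmfile_psp.yaml could pick per-cluster PSP values through the environment name. The release name, chart and file layout are assumptions for illustration, not the actual deployment-charts structure:

```yaml
# helmfile_psp.yaml (sketch, not the real file): per-environment PSP values,
# selected via .Environment.Name as suggested above.
releases:
  - name: pod-security-policies
    namespace: kube-system
    chart: wmf-stable/raw                      # hypothetical chart
    values:
      - psp/common.yaml                        # rules shared by all clusters
      - psp/{{ .Environment.Name }}.yaml       # e.g. psp/ml-serve-eqiad.yaml
```

Alternatively, the whole include could be gated per environment from the main helmfile.yaml, which is the "include it only depending on the environment" variant mentioned above.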
[13:41:32] _joe_: ack thanks will try
[13:42:43] <_joe_> legoktm: uhm yes, but I don't really see the point of conflating the switchover of the o11y layer and the services one
[13:42:54] <_joe_> we can do it by hand this time
[13:43:03] <_joe_> given we have 20 minutes to go time
[13:43:25] <_joe_> btw, who is going to actually do the honors? jayme or legoktm?
[13:44:51] jayme hopefully, if he's still up for it :)
[13:44:51] pressing enter you mean? I thought that was what I volunteered for last monday :)
[13:46:31] jayme: btw, you should run the cookbooks in a tmux session, `sudo -i tmux new -s switchdc`, and then the rest of us can watch with `sudo -i tmux attach -rt switchdc`
[13:46:49] sure
[13:47:19] legoktm: fyi https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1916943&oldid=1916942
[13:47:48] oops, ty
[13:50:55] good morning 👋
[13:52:20] I did set up the switchdc tmux session on cumin1001. Please join with `sudo -i tmux attach -rt switchdc`
[13:52:30] (if you like :))
[13:52:47] * rzl in
[13:52:52] * _joe_ too
[13:53:13] in and lurking
[13:53:19] and please use decent sized terminals :D
[13:53:41] in
[13:54:27] are we coordinating in here, then? we should let the non-serviceopsen know if so :)
[13:54:43] I think we should
[13:54:52] can we move to -operations please?
[13:55:05] ack
[14:08:39] serviceops, SRE, Wikimedia-Incident: June 2021: appservers accumulating active php-fpm workers, requiring rolling restarts to avoid user-visible latency impact - https://phabricator.wikimedia.org/T285634 (Legoktm) > and also `filerepo_file_foreign_description` This is something that @ladsgroup and @...
[17:23:42] serviceops, DC-Ops, SRE, ops-eqiad, User-jijiki: (Need By: TBD) rack/setup/install mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T274925 (Cmjohnson) a: Cmjohnson→Jclark-ctr @Jclark-ctr Can you please cable mc1039, 1051-1054. I could not find mc1039 so please update netbo...
[17:41:21] serviceops, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Legoktm)
[17:56:06] serviceops, RESTBase, Datacenter-Switchover: Figure out plan for restbase-async w/r database switchover - https://phabricator.wikimedia.org/T285711 (Legoktm)
[19:12:57] serviceops, MW-on-K8s, Release-Engineering-Team, SRE, Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (jeena) A multi-version image with the portals directory has been published to the...
[19:17:10] serviceops, MW-on-K8s, Release-Engineering-Team, SRE, Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (Legoktm) So...is there a reason the portals need to be deployed with MediaWiki? T...
[19:35:32] serviceops, SRE, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (Legoktm) I posted an update on today's switchover to wikitech-l: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/XI57Z6T...
[20:07:33] serviceops, MW-on-K8s, Release-Engineering-Team, SRE, Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (jeena) T238747 for hosting portals as a service hasn't been completed due to not ha...
[21:26:32] serviceops, MW-on-K8s, Release-Engineering-Team, SRE, Patch-For-Review: Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (Joe) >>! In T285325#7182067, @Legoktm wrote: > So...is there a reason the portals...
[21:33:18] serviceops, Release Pipeline, Wikimedia-Portals, Release-Engineering-Team (Seen): Migrate www.wikipedia.org (and other www portals) to be its own service - https://phabricator.wikimedia.org/T238747 (Krinkle)
[22:56:25] serviceops, SRE, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (Legoktm) I did a successful run-through of the live-test mode just now, where we "switch" from codfw -> eqiad. The only issue I ran into is T285519#718237...