[08:33:00] Lift-Wing, SRE, Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11828470 (DPogorzelski-WMF) the issue can be reproduced locally with a simple kserve "hello world"
[11:15:48] (CR) Kevin Bazira: [C:+1] "Thank you for working on this. I've tested it locally and it reduces the logs: https://phabricator.wikimedia.org/P90912" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1270939 (https://phabricator.wikimedia.org/T416384) (owner: AikoChou)
[11:18:18] (CR) Ozge: [C:+1] python/logging_utils: add configurable framework logger level overrides [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1270939 (https://phabricator.wikimedia.org/T416384) (owner: AikoChou)
[12:45:57] Lift-Wing, SRE, Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11829192 (DPogorzelski-WMF) the issue seems to be solved locally by simply appending the securityContext to the container, but the same doesn't seem to work on...
[13:02:18] Lift-Wing, SRE, Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11829398 (elukey) Is there a diff in this call between local and staging? ` kubectl get mutatingwebhookconfiguration -n kserve -o json | \ jq -r '.items[] |...
[13:22:33] Lift-Wing, SRE, Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11829523 (DPogorzelski-WMF) adding seccompProfile: type: RuntimeDefault to the chart values and handling it in the configmap patch seems to sol...
[13:29:49] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[13:29:49] Deployment gpt-oss-safeguard-20b-predictor-00001-deployment in experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[13:29:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=experimental&var-deployment=gpt-oss-safeguard-20b-predictor-00001-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:34:49] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[13:34:49] Deployment gpt-oss-safeguard-20b-predictor-00001-deployment in experimental at eqiad has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - ...
[13:34:49] https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=eqiad&var-cluster=k8s-mlserve&var-namespace=experimental&var-deployment=gpt-oss-safeguard-20b-predictor-00001-deployment - https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[13:47:42] Lift-Wing, SRE, Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11829657 (DPogorzelski-WMF) nvm, i had a typo, it doesn't actually solve anything. i'll keep looking
[13:50:38] Lift-Wing, SRE, Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11829670 (DPogorzelski-WMF) >>! In T423149#11829398, @elukey wrote: > Is there a diff in this call between local and staging? > > ` > kubectl get mutatingwebh...
[15:21:07] Lift-Wing, SRE, Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11830310 (DPogorzelski-WMF) ok it works, this missing bit was workloadType: initContainer in ` apiVersion: serving.kserve.io/v1alpha1 kind: ClusterStorageC...
[15:23:52] Lift-Wing, SRE, Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11830337 (elukey) Very weird, I recall that we removed the workloadType because it wasn't in the CRD spec, I am very confused.
[15:35:31] Lift-Wing, SRE, Machine-Learning-Team (Q4 FY2025-26): Fix securityContext propagation in liftwing - https://phabricator.wikimedia.org/T423149#11830431 (DPogorzelski-WMF) i think the difference lies in the fact that without initContainer field the ClusterStorageContainer is not used at all to construc...
[16:35:42] Lift-Wing, Machine-Learning-Team, ServiceOps new, ServiceOps-SharedInfra, and 5 others: Reroute LiftWing endpoints - https://phabricator.wikimedia.org/T422804#11830736 (Clement_Goubert) Routes merged into the rest-gateway, initial tests look good: `lang=shell $ curl -H 'Host: api.wikimedia.org' h...
[16:36:00] Lift-Wing, Machine-Learning-Team, ServiceOps new, ServiceOps-SharedInfra, and 5 others: Reroute LiftWing endpoints - https://phabricator.wikimedia.org/T422804#11830737 (Clement_Goubert)
[16:37:34] I have a question for when we start moving liftwing traffic over from the api-gateway to the rest-gateway. Do you want us to do a progressive rollout for each model, or do a 100% flip of each model?
[16:39:00] Here's the granularity we're going for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1271804
[16:40:47] How careful do you want us to be, basically :D
[17:28:25] o/ claime currently most of our team members are offline so we'll get back to you tomorrow about this
[17:29:51] just to clarify sth: all the traffic from api.wikimedia.org will be routed through the rest gateway instead of the api gateway, correct? that means that ppl will still be able to make requests like they do right now
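
Editorial sketch for the 16:37:34 question (progressive rollout per model versus a 100% flip): the snippet below is only an illustration of the difference between the two options, not how the rest-gateway or the linked puppet change implements routing. The function name `routes_to_rest_gateway`, the `request_id` key, and the percentage parameter are all hypothetical. It assumes a deterministic hash bucket per request, so that raising the percentage only ever moves additional requests to the new gateway.

```python
import hashlib


def routes_to_rest_gateway(model: str, request_id: str, percent: int) -> bool:
    """Hypothetical helper: decide whether a request for `model` is served
    by the new gateway, given that `percent` of its traffic has been moved.

    percent=100 corresponds to a full flip; anything in between is a
    progressive rollout.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash model + request id so the decision is stable across calls.
    digest = hashlib.sha256(f"{model}:{request_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < percent


if __name__ == "__main__":
    # Assumed example values; the model name is a placeholder.
    for pct in (0, 10, 50, 100):
        moved = sum(
            routes_to_rest_gateway("example-model", str(i), pct)
            for i in range(10_000)
        )
        print(f"{pct:3d}% configured -> {moved} of 10000 requests moved")
```

Because the bucket depends only on the model and request key, a request that is already routed to the new gateway at a lower percentage stays there when the percentage is raised, which is the property that usually makes a stepwise rollout easy to reason about.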