[05:56:15] (PS1) Kevin Bazira: article-country: update naming for prediction classification change stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1115160 (https://phabricator.wikimedia.org/T382295)
[06:01:52] (CR) Kevin Bazira: [C:+2] article-country: send prediction results to weighted tags stream (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1114600 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[07:56:21] good morning!
[08:15:02] good morning folks
[09:07:02] Bonan matenon!
[09:15:12] hey folks! If you are ok I'd proceed with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1115322 in staging
[09:28:43] (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1115160 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[09:39:07] (PS1) Ilias Sarantopoulos: docs: correct apple silicon instructions for hf image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1115326
[09:39:57] deployed thanks :)
[09:40:11] I am going to kill some pods here and there in staging to verify that nothing complains
[09:41:16] (CR) Kevin Bazira: [C:+2] "thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1115160 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[09:43:31] ack, thank you Luca
[09:44:37] (Merged) jenkins-bot: article-country: update naming for prediction classification change stream [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1115160 (https://phabricator.wikimedia.org/T382295) (owner: Kevin Bazira)
[09:45:55] ah!
Found an issue
[09:45:56] Error creating: pods "enwiki-damaging-predictor-default-00028-deployment-7d76447ztcxk" is forbidden: violates PodSecurity "restricted:latest": seccompProfile (pod or containers "istio-validation", "storage-initializer", "kserve-container", "queue-proxy", "istio-proxy" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
[09:55:07] ok I thought this was fixed but apparently not
[09:55:12] need to dig a bit into it
[09:55:21] all containers are missing seccomp
[10:17:35] I am reverting, there is more work to be done sigh
[10:17:41] I'll add some thoughts to the task
[10:47:39] ok so this is better
[10:47:43] Invalid value: "The edited file failed validation": [ValidationError(InferenceService.spec.predictor.securityContext): unknown field "allowPrivilegeEscalation" in io.kserve.serving.v1beta1.InferenceService.spec.predictor.securityContext, ValidationError(InferenceService.spec.predictor.securityContext): unknown field "capabilities" in
[10:47:48] io.kserve.serving.v1beta1.InferenceService.spec.predictor.securityContext
[10:48:14] https://github.com/kserve/kserve/blob/release-0.11/pkg/apis/serving/v1beta1/podspec.go#L148
[10:51:10] I am not getting where the spec is defined
[10:53:38] In theory kserve now should support a full securityContext spec for pods, but maybe we need to upgrade to a newer version
[11:01:38] ufff yes I found it
[11:01:41] not supported
[11:02:28] so 0.11 doesn't support those two fields
[11:02:32] lemme see 0.12
[11:06:21] ouch
[11:06:49] nope, checking 0.13
[11:07:32] we can go all the way up to 0.14.1
[11:09:56] even 0.14 doesn't seem to support it
[11:10:19] isaranto: basically I am checking https://github.com/kserve/kserve/releases/download/v0.11.0/kserve.yaml
[11:10:26] that is what we add into the kserve chart
[11:10:29] at least IIRC
[11:10:51] and then I checked under "predictor" -> securityContext
[11:11:04] that should specify the bits that we need at the pod level
[11:11:47] ah but
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.25/#podsecuritycontext-v1-core
[11:12:00] so we're looking for these fields right?
[11:12:00] InferenceService.spec.predictor.securityContext.allowPrivilegeEscalation and
[11:12:00] InferenceService.spec.predictor.securityContext.capabilities
[11:13:06] I think that those cannot be set at pod level
[11:13:11] at least from the link above
[11:14:14] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10507902 (elukey) Tried to enforce the restricted PSS, this is the result of killing a revscoring damaging pod in staging: ` Error creati...
[11:14:47] our original issue was only with seccompProfile though
[11:14:57] so maybe I just need to set that at pod level
[11:19:50] lemme do a quick test
[11:26:14] fails to reconcile predictor: fails to update knative service: admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.securityContext, spec.template.spec.securityContext.seccompProfile
[11:38:38] ah it may be knative, one level deeper
[11:46:05] we may have an issue sigh
[11:46:06] https://github.com/knative/serving/blob/release-1.7/config/core/configmaps/features.yaml#L103
[11:46:45] https://github.com/knative/serving/commit/e82287df024cac9346869ec349ac181b8960b202
[11:46:52] from knative 1.8.0 it is allowed
[11:47:00] and we have 1.7.x
[11:47:33] no bueno
[11:51:12] Machine-Learning-Team, Kubernetes, Patch-For-Review: Migrate ml-staging/ml-serve clusters off of Pod Security Policies - https://phabricator.wikimedia.org/T369493#10508041 (elukey) I tried to set only the seccompProfile settings only, but I got a validation error from the kserve webhook: ` fails to...
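A minimal sketch of the check behind the PodSecurity error quoted earlier in this log, assuming the restricted Pod Security Standard accepts a seccomp profile set either on the pod or on every container, with a container-level value overriding the pod-level one. The function names and the example spec are hypothetical; the container names come from the error message, and their split between initContainers and containers is an assumption. This is not the admission controller's actual code.

```python
# Illustrative only: mimics the seccompProfile rule named in the error message.
ALLOWED_TYPES = {"RuntimeDefault", "Localhost"}


def seccomp_type(security_context):
    """Return securityContext.seccompProfile.type, or None if it is unset."""
    profile = (security_context or {}).get("seccompProfile") or {}
    return profile.get("type")


def containers_missing_seccomp(pod_spec):
    """List containers whose effective seccomp type is not an allowed value."""
    pod_level = seccomp_type(pod_spec.get("securityContext"))
    offenders = []
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key) or []:
            own = seccomp_type(container.get("securityContext"))
            # A container-level setting takes precedence over the pod-level one.
            effective = own if own is not None else pod_level
            if effective not in ALLOWED_TYPES:
                offenders.append(container["name"])
    return offenders


# Hypothetical pod spec using the container names from the log.
spec = {
    "initContainers": [{"name": "istio-validation"}, {"name": "storage-initializer"}],
    "containers": [
        {"name": "kserve-container"},
        {"name": "queue-proxy"},
        {"name": "istio-proxy"},
    ],
}
missing_before = containers_missing_seccomp(spec)

# Setting the profile once at pod level covers every container that has none.
spec["securityContext"] = {"seccompProfile": {"type": "RuntimeDefault"}}
missing_after = containers_missing_seccomp(spec)
```

With no profile anywhere, all five containers are reported; a single pod-level RuntimeDefault clears them, which is the setting knative 1.7.x refuses to accept in spec.template.spec.securityContext.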
[11:52:31] updated the task and restored the ml-staging-state
[11:52:45] some more non-trivial work is needed :(
[12:05:35] :( thanks for checking all this
[12:57:33] dang, if I'd gotten more progress on knative updates we'd be ahead of this :-/
[13:01:17] (CR) Gkyziridis: [C:+1] "Thnx for fixing it." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1115326 (owner: Ilias Sarantopoulos)
[13:02:07] (PS2) Ilias Sarantopoulos: docs: correct apple silicon instructions for hf image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1115326
[13:21:58] (CR) Ilias Sarantopoulos: [C:+2] docs: correct apple silicon instructions for hf image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1115326 (owner: Ilias Sarantopoulos)
[13:22:43] (Merged) jenkins-bot: docs: correct apple silicon instructions for hf image [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1115326 (owner: Ilias Sarantopoulos)
[13:50:46] double checked with https://github.com/knative/serving/releases/tag/knative-v1.8.0
[13:50:53] Services may now set seccompProfile in SecurityContext to allow users to comply with the restricted Pod Security Standards best-practice (#13401, @evankanderson)
[13:51:16] and the patch was https://github.com/knative/serving/pull/13401/files
[13:51:52] so in theory we could try to backport this to our knative production images
[13:59:00] I'll see if our version would build with that patch applied
[13:59:38] already trying
[13:59:53] ah
[14:00:10] hopefully it applies cleanly
[14:00:42] even if not, it _probably_ wouldn't need much massaging. the patch at https://github.com/knative/serving/pull/13401/files looks fairly simple
[14:08:24] all right it seems that it builds
[14:08:32] I am rebuilding all the images etc..
[14:08:38] and I'll send a code patch in a few
[14:09:43] Machine-Learning-Team, Observability-Metrics, SRE Observability (FY2024/2025-Q3): Gap in metrics rendered from Thanos Rules - https://phabricator.wikimedia.org/T352756#10508363 (fgiunchedi) Open→Resolved a:fgiunchedi I'm tentatively resolving this task since we've tracked down and fix...
[14:12:53] I also did a quick local checkout & patch to see if the tests pass, and they do
[14:18:12] https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/1115394
[14:18:25] Looking
[14:20:03] +1'd with one minor nit (typo)
[14:22:11] thanks fixed :)
[14:22:16] ok to merge/build/deploy?
[14:22:19] in staging of course
[14:22:38] Sure!
[14:23:34] Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173 (isarantopoulos) NEW
[14:24:19] Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10508467 (isarantopoulos)
[14:38:42] Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10508609 (isarantopoulos)
[14:38:57] Lift-Wing, Machine-Learning-Team: Use rocm/vllm image on Lift Wing - https://phabricator.wikimedia.org/T385173#10508613 (isarantopoulos)
[15:34:41] very good news that https://phabricator.wikimedia.org/T352756 was fixed!
[15:34:49] so no more gaps in the metrics
[15:35:06] I think we can restart adding SLOs, probably pyrra is now the best target
[15:35:34] hooray!
[15:36:07] and agreed re: Pyrra
[15:37:36] Machine-Learning-Team, observability: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390#10509091 (elukey) Status: * The issue with GAP in metrics was fixed!! \o/ Next steps: * Given what is written in the task's description, how should ML go forward with Pyrra? (...
[15:54:26] \o/
[17:18:14] Going afk folks, have a nice evening/rest of day
[17:24:18] o/
[17:24:20] me too in a bit
[17:24:30] I haven't deployed knative, I'll do it on Monday
[17:24:34] tomorrow I am off :)
[17:48:32] enjoy!
[18:18:23] Machine-Learning-Team, Data-Engineering, Event-Platform: Create new mediawiki.page_links_change stream based on fragment/mediawiki/state/change/page - https://phabricator.wikimedia.org/T331399#10509906 (Ladsgroup) In my volunteer capacity, I would love to have a stream of external links added (e.g. l...