[04:39:29] (CR) Santhosh: Recommend articles to translate based on topic (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: Santhosh)
[07:03:15] (PS8) Santhosh: major: modernize the codebase, keep only translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484)
[07:03:48] (CR) Santhosh: "Adding more params to the search generator has the disadvantage of query continue and requires more and more api calling. However, we need" [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484) (owner: Santhosh)
[07:07:20] (PS9) Santhosh: major: modernize the codebase, keep only translation recommendations [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484)
[07:07:25] (CR) Santhosh: major: modernize the codebase, keep only translation recommendations (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484) (owner: Santhosh)
[07:29:56] Good morning o/
[07:34:32] (PS1) Kevin Bazira: readability_model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1053608 (https://phabricator.wikimedia.org/T369344)
[07:53:28] (CR) Ilias Sarantopoulos: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1053608 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[08:01:48] (CR) Kevin Bazira: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1053608 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[08:13:31] (CR) Kevin Bazira: [C:+2] "Thanks for the review :)" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1053608 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[08:14:13] (Merged) jenkins-bot: readability_model: migrate to src dir [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1053608 (https://phabricator.wikimedia.org/T369344) (owner: Kevin Bazira)
[09:03:07] I haven't merged the hf patch; I'll wait a bit in case someone else wants to take a look, and then I'll merge it
[09:06:35] (CR) AikoChou: [C:+1] "LGTM!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1052845 (https://phabricator.wikimedia.org/T369359) (owner: Ilias Sarantopoulos)
[09:11:46] (PS10) Ilias Sarantopoulos: huggingface: simplify requirements [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1052845 (https://phabricator.wikimedia.org/T369359)
[09:13:57] Morning!
[09:15:09] I'll be rolling out the updates that move kserve to modern network policies in codfw in a bit. There shouldn't be any disruption, so let me know if you see any.
[09:18:05] This will also roll out the drop of the example policies that Luca merged the patch for, as well as the update to knative-serving-(controller|autoscaler|activator):1.7.2-2
[09:27:11] morning o/
[09:27:43] ack, I'll be using ml-staging to deploy gemma2 with the latest changes
[09:28:00] Roger, no plans for breaking that today ;)
[09:28:03] (CR) Ilias Sarantopoulos: [C:+2] "Thanks for the review!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1052845 (https://phabricator.wikimedia.org/T369359) (owner: Ilias Sarantopoulos)
[09:29:00] (Merged) jenkins-bot: huggingface: simplify requirements [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1052845 (https://phabricator.wikimedia.org/T369359) (owner: Ilias Sarantopoulos)
[09:38:15] And the update fails with an internal error :-/
[09:39:33] https://phabricator.wikimedia.org/P66282 for those who want to follow along
[09:44:51] :(
[09:45:30] Looks like the new policy is missing a few bits
[09:47:04] Let's see if I can at least push the examples drop and the upgrade changes, to see if it's really the NP that is breaking stuff
[09:55:44] I think this may be due to the ports list changing from 9443, 443, 6443 to just 9443, but I am not sure yet how to test that hypothesis
[10:25:37] klausman: you can do two things: 1) verify via nsenter that the ports are indeed blocked
[10:25:44] 2) kubectl edit etc..
[10:26:22] well, the change never deploys / is rolled back immediately, so I never get the chance to use nsenter
[10:27:12] and kubectl edit of a whole policy seems a little scary on a prod cluster
[10:43:16] I need to eat something so I can think better. bbiab
[10:43:42] not saying to do it in prod, but on ml-staging
[10:43:52] or are you seeing this only in prod?
[10:47:06] ah yes, I see the backscroll, it is prod
[10:47:18] yeah, staging works fine
[10:47:35] then it shouldn't be an issue with the network policy, in theory
[10:48:01] ok, I mixed up things, this is knative
[10:48:33] Thing is that the deletion of examples and the update of the controller|autoscaler|activator are in the update for prod as well. I tried to deploy them one by one, but checking out older versions of the repo didn't work (the NP is still in there)
[10:49:01] you can use -l name=kserve to isolate the kserve one
[10:49:05] * isaranto afk - lunch
[10:49:18] in admin_ng I mean
[10:49:39] ah, I should've thought of that
[10:50:30] for the knative-serving one, I can check later on, but what I'd do is: try to deploy, and then, in the time that passes before the rollback, check the webhook logs
[10:50:33] (in the new pods)
[10:50:45] there is probably something that errors out
[10:50:50] Yeah, I will try that
[10:50:52] okok
[10:51:06] going to lunch, ping me if needed and I'll check later :)
[10:54:14] ack! enjoy your lunch :)
[11:47:12] oh well, the hf/gemma deployment is failing :( , I'm gonna try to fix it by working on an easy way to test locally on M1: https://phabricator.wikimedia.org/P66283
[11:47:27] basically installing the CPU version of torch on top of the base image
[12:01:28] isaranto: ahh, that's the same issue I got yesterday
[12:02:13] yes.. sorry, I should have tested more thoroughly. Perhaps it is due to the accelerate version bump
[12:02:49] aiko: you weren't getting the error though, right? iirc you were getting everything up to the warnings
[12:02:56] although perhaps it was hanging
[12:49:17] elukey: when you mentioned -l, what command did you mean?
[12:50:45] ah, I had it to the right of apply, it must be on the left
[12:51:27] a-ha! so the NP change applied cleanly
[12:51:51] exactly, yes :)
[12:52:24] So it's probably the deletion of examples that is the actual failure. Talk about barking up the wrong tree :)
[13:07:44] it is, yes; in your paste knative-serving was mentioned
[13:08:25] if you try to deploy, the error should be clear in the new webhook pods
[13:08:37] one thing that we may need to do is remove the examples manually from the config-maps
[13:08:44] Currently running that with kubetail | tee logfile
[13:19:45] grmbl, getting logs of deleted containers is not possible?
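The `-l name=kserve` isolation suggested above can be sketched as follows. The environment name (`ml-serve-codfw`) is an assumption, not taken from the log; the point is that `-e` and `-l` are global helmfile flags, so they go before the subcommand (the "to the left of apply" detail noted above).

```shell
# Sketch: scope an admin_ng helmfile run to just the kserve release via a
# label selector, so the NetworkPolicy change can be tested in isolation.
# The environment name below is an assumption, not from the log.
hf_cmd() {
  # Selectors (-l) and environments (-e) are global flags, so they must
  # appear before the subcommand ("to the left of apply").
  echo "helmfile -e $1 -l name=$2 $3"
}
CMD=$(hf_cmd ml-serve-codfw kserve diff)
echo "$CMD"
# Against a real cluster this would be run as-is, e.g.:
#   helmfile -e ml-serve-codfw -l name=kserve diff   # then: ... apply
```

This only composes and prints the command line; running it for real requires the deployment-charts checkout and cluster credentials.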
[13:20:09] I see pod/webhook-75dcb9bfff-2skh6 was created and deleted in the events log, but trying to get its logs fails with "pod not found"
[13:21:47] you need to get its logs while it bootstraps
[13:21:56] so before the deployment fails
[13:22:34] If I could predict its name, that would be easier :-S
[13:22:55] it is just a matter of kubectl get pods -n knative-serving though
[13:23:44] there is also something in https://logstash.wikimedia.org/goto/3e3350814f1e0d93b31c9a6ca883c9a8 but it's not that easy to find the logs
[13:25:53] klausman: if you want to start the deployment, I can check with you
[13:26:01] isaranto: right! up to the warnings
[13:27:37] I just did another apply and caught the logs of one activator (it's still terminating). They're in ~klausman/deploy.log
[13:28:48] I think the culprit is in the webhook logs though
[13:29:38] I can give that another try, the activator was just the first one I saw
[13:30:45] (deploying now)
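Since the webhook pods only live for the few seconds before the rollback, one way around having to predict their names is to watch the namespace and tail matching pods as they appear. A sketch, assuming the `knative-serving` namespace from the discussion; the live watch/tail loop is shown commented out, and only the pure-shell name filter is exercised on sample output:

```shell
# Sketch: grab logs from short-lived pods in knative-serving before the
# failed deploy rolls them back. The watch/tail loop needs a live cluster,
# so it is commented out; the name filter below is plain shell.
#
#   kubectl -n knative-serving get pods --no-headers -w \
#     | awk '$1 ~ /^webhook-/ {print $1}' \
#     | while read -r pod; do
#         kubectl -n knative-serving logs -f "$pod" > "/tmp/$pod.log" 2>&1 &
#       done
#
# The filter on its own, fed sample "kubectl get pods" output:
pick_webhook_pods() {
  awk '$1 ~ /^webhook-/ {print $1}'
}
PODS=$(printf '%s\n' \
  'webhook-75dcb9bfff-2skh6 0/1 ContainerCreating' \
  'activator-6f9c8bf55-abcde 1/1 Running' | pick_webhook_pods)
echo "$PODS"
```

The same pattern works for the activator and autoscaler pods by changing the name prefix in the awk expression.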
could not be established [14:00:33] mmm weird [14:00:39] but helmfile returns https://phabricator.wikimedia.org/P66282 right? [14:06:32] yes [14:07:41] That paste is slightly reformatted (\n and indent) for easier reading, but otherwise exactly what helfmfile apply errors out with [14:09:22] I have rummaged some more and it's not quite clear which actual component fails. The autoscaler in the "before state" shows the above "try the latest version" error, after the update it tries to get a leader election going, but then fails with a Go error that indicates the code is trying to use an already-closed connection. [14:24:40] (03CR) 10Isaac Johnson: Recommend articles to translate based on topic (031 comment) [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1052950 (https://phabricator.wikimedia.org/T367873) (owner: 10Santhosh) [14:37:40] klausman: sorry meetings - IIUC what's happening is that the new configs are validated by the istio webhook (via k8s API) and somehow they fail, returning an error [14:38:12] I think that it may be what I've seen in staging, namely that the example config checksum is validated and it fails [14:38:19] now I am not sure if I can find proof of that [14:41:42] What did you do to make them pass in staging? [14:44:54] I removed manually one example config that triggered the issue [14:44:58] but the error was more clear [14:45:10] we can try to do it, nothing really problematic [14:45:33] maybe the error here is more murky because there is some actual tyraffic in the cluster, muddying the whole autoscaling thing [14:45:46] I think we should give the delete a go, see if it changes things. [14:47:11] maybe we could try with config-autoscaler, the first one in https://phabricator.wikimedia.org/P66282 - if it disappears from the list of logs, then we should have a confirmation [14:47:27] (after another deploy attempt I mean) [14:47:43] So what is the object type you deleted? 
[14:50:36] 06Machine-Learning-Team: Simplify dependencies in hf image - https://phabricator.wikimedia.org/T369359#9973818 (10isarantopoulos) Tested the updated image in ml-staging using the GPU and got the following error: ` 2024-07-11 11:00:24.531 1 kserve INFO [storage.py:download():66] Copying contents of /mnt/models to... [14:51:05] I kubectl edited config-something (don't recall which one) removing the _example data [14:51:11] basically like the change is doing [14:51:17] ah, alright [14:51:39] configmaps? or configurations.serving.knative.dev? [14:53:38] ah, knative-serving (NS) doesn't have any objects of the latter type, I'll edit the confimap [14:54:03] yep [14:54:46] did you delete the whole data: stanza, or only the _example subkey? [14:56:52] the latter [14:58:59] ok, edited, trying a deploy [15:00:54] A new and exciting error: Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: cannot patch "config-autoscaler" with kind ConfigMap: admission webhook "config.webhook.serving.knative.dev" denied the request: validation failed: the update modifies a key in "_example" which is probably not what you want. Instead, copy the respective setting to the [15:00:56] top-level of the ConfigMap, directly below "data" && cannot patch "config-istio" with kind ConfigMap: Internal error occurred: failed calling webhook "config.webhook.istio.networking.internal.knative.dev": failed to call webhook: Post "https://net-istio-webhook.knative-serving.svc:443/config-validation?timeout=10s": context deadline exceeded: cannot patch "config-defaults" with kind [15:00:58] ConfigMap: admission webhook "config.webhook.serving.knative.dev" denied the request: validation failed: the update modifies a key in "_example" which is probably not what you want. 
Instead, copy the respective setting to the top-level of the ConfigMap, directly below "data" [15:02:04] trying a deploy after also deleteing the whole data: subtree [15:03:31] in theory data should stay there but be empty [15:04:01] also it is weird that you modified one config and more than one complain about it [15:04:30] well, now it failed again, but said it couldn't roll back, and now there is no diff :-/ [15:04:59] Error: UPGRADE FAILED: release knative-serving failed, and has been rolled back due to atomic being set: cannot patch "config-autoscaler" with kind ConfigMap: admission webhook "config.webhook.serving.knative.dev" denied the request: validation failed: the update modifies a key in "_example" which is probably not what you want. Instead, copy the respective setting to the top-level of the [15:05:01] ConfigMap, directly below "data" && cannot patch "config-istio" with kind ConfigMap: Internal error occurred: failed calling webhook "config.webhook.istio.networking.internal.knative.dev": failed to call webhook: Post "https://net-istio-webhook.knative-serving.svc:443/config-validation?timeout=10s": context deadline exceeded [15:06:20] hangon, I misread the rollback not having happened, but the statement about there not being a diff is still true [15:07:24] so all the pods are now being created [15:07:28] The new pods are still running (and have been for 5m, no restarts, taking a look at the logs [15:07:56] Image: docker-registry.discovery.wmnet/knative-serving-activator:1.7.2-2 [15:08:21] and they have the new img in theory [15:09:19] we should start to seriously think about upgrading knative [15:09:27] maybe we can couple it with the new k8s version [15:09:38] yeah, agreed [15:09:39] there are a ton of weird bugs corrected over the years [15:09:45] our version is a little brittle [15:10:13] Not seeing any errors in any pods in the knative-serving NS [15:10:14] ok so in theory now it should be good, one thing that I'd like to try is to deploy a config 
change to staging and see if the webhook issues some alert [15:10:29] in theory it shouldn't [15:10:34] in practice, let's double check [15:10:51] this may be only a pain point to get rid of _example, as one off [15:11:29] What kind of config change did you have in mind? [15:11:47] anything, just to add a setting to a config-map basically [15:11:55] I want to make sure that the _example trap is forgotten [15:12:27] I need to step afk for an errand, we can check later / tomorrow in case [15:12:44] ack, I'll try something in staging, see whate xplodes :) [15:14:03] I edited data/container-concurrency-target-percentage from 84 to 85 [15:14:12] er 85->84 [15:14:20] No errors [15:16:00] only log messages are of the lease edit kind we have seen before [15:54:29] I did some digging on the hf/gemma deployment issues (pasting because wikibugs isn't working ) https://phabricator.wikimedia.org/T369359#9974140 [16:12:33] Logging off folks, cu tomorrow o/ [16:18:58] \o [16:36:05] (03PS1) 10Bartosz Dziewoński: Use stable andExpr() / orExpr() methods [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1053736 [16:38:22] (03PS2) 10Bartosz Dziewoński: Use stable andExpr() / orExpr() methods [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1053736 [16:48:49] (03CR) 10CI reject: [V:04-1] Use stable andExpr() / orExpr() methods [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1053736 (owner: 10Bartosz Dziewoński) [16:57:31] ack for the concurrency change! I'll file tomorrow a change for another configmap to test the helmfile road as well, just-in-case-tm :D [16:57:52] but it looks good, we'll have to put a little more pain for eqiad but after that we should be done/good [17:01:38] (03PS3) 10Bartosz Dziewoński: Use stable andExpr() / orExpr() methods [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1053736 [18:53:25] (03CR) 10Kevin Bazira: "Great job Santhosh! 
I tested the fastapi based rec-api locally and most of the functionality you kept works just like the flask rec-api. I" [research/recommendation-api] - 10https://gerrit.wikimedia.org/r/1052445 (https://phabricator.wikimedia.org/T369484) (owner: 10Santhosh) [19:31:44] FIRING: LiftWingServiceErrorRate: ... [19:31:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=arwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate [19:36:45] (03CR) 10Umherirrender: [C:03+2] Use stable andExpr() / orExpr() methods [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1053736 (owner: 10Bartosz Dziewoński) [20:03:35] (03Merged) 10jenkins-bot: Use stable andExpr() / orExpr() methods [extensions/ORES] - 10https://gerrit.wikimedia.org/r/1053736 (owner: 10Bartosz Dziewoński) [20:16:13] 06Machine-Learning-Team: Fix API Gateway examples for Javascript - https://phabricator.wikimedia.org/T369865 (10Isaac) 03NEW [20:31:44] RESOLVED: LiftWingServiceErrorRate: ... [20:31:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-damaging&var-backend=arwiki-damaging-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
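As a footnote to the `_example` discussion earlier in the log (around 14:51 to 15:16): the one-off edit described there, dropping only the `_example` subkey while keeping `data:` in place, can also be expressed as a JSON patch instead of an interactive `kubectl edit`. This is only a sketch; the configmap name comes from the pasted error, and the actual cluster command is shown commented out:

```shell
# Sketch of a non-interactive version of the _example removal discussed
# earlier. A JSON-patch "remove" op drops only data._example and leaves
# the rest of the ConfigMap untouched.
PATCH='[{"op":"remove","path":"/data/_example"}]'
echo "$PATCH"
# Applied per affected map (config-autoscaler, config-defaults, ...) as:
#   kubectl -n knative-serving patch configmap config-autoscaler \
#     --type=json -p "$PATCH"
```

Unlike deleting the whole `data:` stanza (which the log shows also upset the webhook), this keeps `data:` present, just without the `_example` key.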