[08:26:42] morning folks
[08:31:48] Hello!
[08:36:55] hey folks, dcops fixed ml-serve2002 (faulty DIMM), I am reimaging again now
[08:37:04] hopefully this time it'll work
[08:37:19] Thank you, sir o/
[08:50:22] isaranto: If you have time today we can check the failing pod for edit-check together. Otherwise I can focus on the kserve batcher
[08:51:39] isaranto: FYI I tested it on a single pod after the `maxreplicas = 1` change, and it was crashing even with 2 users
[08:52:17] sure, will ping you later
[08:52:40] did you check the logs and events?
[08:56:42] I cannot connect to deploy
[08:56:53] hmm
[08:57:18] https://www.irccloud.com/pastebin/T4KEskcz/
[08:57:29] probably because of the switchover (?)
[08:58:10] yeah, could be
[08:58:25] you should remove the host from known_hosts and it will be ok
[08:58:53] the host changed -> you have saved a key for the old host -> your ssh connection fails
[09:06:07] so just delete line 12 from known_hosts
[09:07:24] SSH fingerprints for deploy1003 are at https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy1003.eqiad.wmnet
[09:07:35] to double-check/trust after removing
[09:08:06] $ dig deployment.eqiad.wmnet +short
[09:08:06] deploy1003.eqiad.wmnet.
[09:08:06] 10.64.16.93
[09:08:22] (after the switchover, as Ilias pointed out)
[09:08:36] thnx folks
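A hedged sketch of the known_hosts cleanup described above, using ssh-keygen instead of hand-editing line numbers (the hostnames are the ones from the discussion; which entries you actually have cached depends on whether you connect via the alias or the canonical name):
```
# Remove any cached keys for the deployment alias and the new canonical host.
ssh-keygen -R deployment.eqiad.wmnet
ssh-keygen -R deploy1003.eqiad.wmnet
# Confirm which host the alias resolves to after the switchover.
dig deployment.eqiad.wmnet +short
# On the next connection, verify the offered key against
# https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy1003.eqiad.wmnet
ssh deployment.eqiad.wmnet
```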
[09:30:25] isaranto: I am not sure if I understand this:
[09:30:25] ```
[09:30:25] Readiness probe failed: Get "http://10.194.61.118:15020/app-health/queue-proxy/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[09:30:25] 4m36s Warning Unhealthy pod/edit-check-predictor-00012-deployment-566f9f87cf-sptcm Readiness probe failed: HTTP probe failed with statuscode: 503
[09:30:25] 5m14s Warning Unhealthy pod/edit-check-predictor-00012-deployment-566f9f87cf-sptcm Readiness probe failed: HTTP probe failed with statuscode: 500
[09:30:26] 84s Warning InternalError revision/edit-check-predictor-00012 failed to update deployment "edit-check-predictor-00012-deployment": Operation cannot be fulfilled on deployments.apps "edit-check-predictor-00012-deployment": the object has been modified; please apply your changes to the latest version and try again
[09:30:26] 6m22s Warning InternalError serverlessservice/edit-check-predictor-00012 failed to update public K8s Endpoints: Operation cannot be fulfilled on endpoints "edit-check-predictor-00012": the object has been modified; please apply your changes to the latest version and try again
[09:30:27] 6m23s Warning InternalError revision/edit-check-predictor-00012 failed to update PA "edit-check-predictor-00012": Operation cannot be fulfilled on podautoscalers.autoscaling.internal.knative.dev "edit-check-predictor-00012": the object has been modified; please apply your changes to the latest version and try again
[09:30:27] ```
[09:30:54] "the object has been modified; please apply your changes to the latest version and try again"
[09:32:29] georgekyz: the isvc is "wrapped" in some other services, like envoy/istio for the HTTP proxy part and the Knative queue proxy, which basically buffers requests
[09:33:00] the model server's python code is probably failing and returning 500s to them
[09:33:16] (unless you are testing something Knative-specific etc.)
[09:33:29] This is kinda strange because I see only 200s in the logs
[09:33:42] https://www.irccloud.com/pastebin/TGlEceAR/
[09:33:56] if you do `kubectl describe pod edit-check-xxxxx` and scroll up a bit you will see the reason: it seems it ran out of memory (OOMKilled)
[09:33:56] ```
[09:33:56] Containers:
[09:33:56]   kserve-container:
[09:33:56]     Container ID:  containerd://ff26c474a0bea83ca567502b137a14da765450d1016ab8df928fac886ef3913c
[09:33:57]     Image:         docker-registry.discovery.wmnet/wikimedia/machinelearning-liftwing-inference-services-edit-check@sha256:64bbed61be75bf6d659f7d494c2303b0945e572a04c61d5808f3fe4dab0449b1
[09:33:57]     Image ID:      docker-registry.discovery.wmnet/wikimedia/machinelearning-liftwing-inference-services-edit-check@sha256:64bbed61be75bf6d659f7d494c2303b0945e572a04c61d5808f3fe4dab0449b1
[09:33:58]     Port:          8080/TCP
[09:33:58]     Host Port:     0/TCP
[09:33:59]     State:         Waiting
[09:33:59]       Reason:      CrashLoopBackOff
[09:34:00]     Last State:    Terminated
[09:34:00]       Reason:      OOMKilled
[09:34:01] ```
[09:34:09] there you go, yes :D
[09:35:02] so it was scaling horizontally before due to memory issues, right?
[09:35:26] I will try now to set more memory and test it again.
[09:35:27] Thnx for your help folks
[09:37:52] I'm not sure, I can't validate the OOM issue from the resource usage https://grafana.wikimedia.org/goto/Yd1HwqhNg?orgId=1
[09:38:10] unless it was instant and for some reason not captured in grafana (not sure if this happens)
[09:39:37] re: autoscaling, you can see what it was doing in the "autoscaler" row of https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&var-cluster=codfw%20prometheus%2Fk8s-mlstaging&var-knative_namespace=knative-serving&var-revisions_namespace=experimental
[09:39:52] I doubt that we have autoscaling for ml-staging:experimental though
[09:41:19] I just set 8 gigs of RAM and it works smoothly
[09:41:27] autoscaling was enabled (maxreplicas was 3) but we disabled it yesterday. georgekyz: an increase in rps triggers autoscaling, not memory usage
[09:42:24] the current status is `maxreplicas=1` and with 8 gigs it ran smoothly
[09:42:33] I will run heavier tests and see
[09:42:34] that is a good sign :)
[09:42:47] okok I'll shut up then :)
[09:43:06] but it shouldn't fail with 4Gi so we can investigate
[09:43:17] elukey: noooo please don't <3
[09:43:32] I killed the tests
[09:44:27] but I do not get it... if you check grafana it is using only 1.4G of RAM
[09:44:38] the same as before... but now it is not failing
[09:44:49] it never exceeded the 4G limit
[09:46:15] (ml-serve2002 up and running)
[09:46:43] elukey: Can I run the tests now?
[09:47:23] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10660949 (elukey) Worked perfectly, thanks a lot!
[09:47:25] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: DIMM B1 issues for ml-serve2002 - https://phabricator.wikimedia.org/T389472#10660950 (elukey) Open→Resolved
[09:47:38] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10660956 (elukey)
[09:48:00] georgekyz: yes yes, I am working on prod, don't worry, I am updating the chan for awareness, don't mind me :)
[09:48:56] always mind the SREs 🤣
[09:58:40] Machine-Learning-Team, Data-Platform-SRE (2025.03.22 - 2025.04.11): Move Lab machines into analytics net for DL access and switch to homedirs on Ceph - https://phabricator.wikimedia.org/T380279#10661061 (Gehel)
[10:36:31] Morning!
[10:39:34] \o Tobias!
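For reference, a hedged sketch of how the OOMKilled verdict above can be confirmed from the CLI without Grafana. The pod name is taken from the events pasted above; the `experimental` namespace and the `kserve-container` name are assumptions based on this discussion:
```
# Last termination reason of the kserve container (expected to show OOMKilled here).
kubectl -n experimental get pod edit-check-predictor-00012-deployment-566f9f87cf-sptcm \
  -o jsonpath='{.status.containerStatuses[?(@.name=="kserve-container")].lastState.terminated.reason}'
# Recent warning events in the namespace, newest last; OOM kills and probe failures show up here.
kubectl -n experimental get events --field-selector type=Warning --sort-by=.lastTimestamp | tail -n 20
```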
[10:54:10] klausman: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1130098 let me know what you think
[10:55:18] short version: since requests are made from the browser it will be difficult to implement authentication
[11:03:37] georgekyz: please wait before deploying your change as I'm running load tests on ref-need on ml-staging
[11:04:27] I will not deploy anything
[11:04:45] I am running locust with LoadTestShape
[11:05:04] the locust UI is super handy
[11:11:31] (moving ml-serve2003 to containerd)
[11:16:41] ack
[11:23:45] I am finally seeing good results on reference-need \o/
[11:24:23] will share all the results with multiprocessing later and probably deploy this change
[11:25:36] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10661393 (elukey) Something worth noting is that on all the ml-serve hosts I reimaged now (3 hosts), the PXE boot got stuck here: ` Booting from BRCM MBA Slot 010...
[11:31:05] for now I just want to report 0 throttling!
[11:42:24] 🥳
[11:49:39] Nice!
[11:51:51] isaranto: +1'd your apigw change. I wish using JWTs safely from an interactive browser was doable, but oh well.
[11:52:59] klausman: I merged it, whenever you have time please deploy it, thanks!
[11:53:06] will do
[12:02:29] 2003 up and running with containerd
[12:02:53] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10661489 (elukey)
[12:18:38] * isaranto afk lunch
[12:46:22] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394#10661683 (VRiley-WMF)
[13:22:37] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Increased latencies in reference-quality models (ref-need) - https://phabricator.wikimedia.org/T387019#10661843 (isarantopoulos) I tried multiprocessing and ran some load tests under heavy load (50 concurrent users), a scenario under which we are cu...
[13:22:45] there is the summary --^
[13:23:20] I will proceed to deploy this folks, and will monitor it over the next hours/days. If it doesn't improve things I'll just revert
[13:25:16] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Load test the peacock edit check service - https://phabricator.wikimedia.org/T388817#10661856 (gkyziridis) **Update:** - There was an OOM issue in the pod which is fixed by increasing the Memory limit from `4Gi` to `8Gi` in [[ https...
[13:42:13] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394#10661893 (VRiley-WMF)
[13:42:56] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394#10661899 (VRiley-WMF) @klausman This has been completed and the drives have been added. Is there anything additional we may need to do on our end?
[13:46:30] georgekyz: can you run the same test for 1, 2, 5, 10 users?
[13:47:10] since we got the requirements this will give us a better idea about the server's latencies under the expected traffic (+some bursts)
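A hedged sketch of what that 1/2/5/10-user matrix could look like from the locust CLI (the locustfile path, target host, and run time below are placeholders, not the team's actual invocation):
```
# Run the same scenario headlessly at each concurrency level, saving CSV stats per run.
for users in 1 2 5 10; do
  locust -f locust/edit_check.py --headless \
    --users "$users" --spawn-rate 1 --run-time 5m \
    --host https://inference-staging.svc.codfw.wmnet:30443 \
    --csv "edit_check_${users}u"
done
```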
[13:47:38] Machine-Learning-Team, DC-Ops, ops-eqiad, SRE: Q2:install SSD (hot swap additions) to ml-lab100[12] - https://phabricator.wikimedia.org/T381394#10661916 (klausman) Open→Resolved a: klausman >>! In T381394#10661893, @VRiley-WMF wrote: > @klausman This has been completed and the driv...
[13:48:22] who can review this please? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1130089
[13:49:30] on it
[13:49:41] thanks!
[13:54:35] isaranto: your apigw quota change has been pushed to staging and codfw, letting the latter soak before I do eqiad
[13:55:28] thank you
[13:56:49] sorry I sent you an old patch to review - the ref-need patch hasn't been updated as I'm having an issue connecting to gerrit
[13:58:07] klausman: can you take a look again please?
[14:02:18] sure
[14:03:59] done!
[14:17:47] (CR) AikoChou: "Thanks for working on this, Kevin! I have a few suggestions on naming and code structure." [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1129755 (https://phabricator.wikimedia.org/T326179) (owner: Kevin Bazira)
[14:20:53] (CR) AikoChou: [V:+2 C:+2] locust: add util for fetching recent change revisions [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/1113755 (owner: AikoChou)
[14:21:48] Lift-Wing, Machine-Learning-Team, EditCheck, Patch-For-Review: Load test the peacock edit check service - https://phabricator.wikimedia.org/T388817#10662074 (gkyziridis) **Multiple Locust tests edit-check on GPU using LoadTestShape** This test is closer to a real scenario based on the discussion we...
[14:22:26] isaranto:
[14:22:38] isaranto: https://phabricator.wikimedia.org/T388817#10662074
[14:22:39] georgekyz: shall I deploy this as well? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1129285
[14:22:45] I am deploying on staging and prod
[14:22:56] and prod ???
[14:23:28] isaranto: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1129285 this is only for staging. Merge if you want and deploy
[14:24:14] sry, misunderstanding: I am deploying reference-need in staging and prod so I can deploy the edit check on staging with it
[14:24:20] awesome load test results
[14:24:25] :tata
[14:24:30] :tada
[14:24:34] 🎉
[14:24:41] 🥳
[14:25:25] \o/
[14:33:07] deployed in staging, deploying ref-need to prod 🤞
[14:37:11] nice :) if you are ok folks I'd reimage ml-serve2004 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1130131
[14:41:20] sure, go ahead!
[14:42:09] elukey: thanks for the debugging help - we have no more cpu throttling https://phabricator.wikimedia.org/T387019#10661843
[14:44:01] \o/
[14:45:15] isaranto: mmm, something that I don't get though is how important the throttling was
[14:45:36] from the images it seems that it was on the order of 10ms over several seconds (~14s) of user CPU time
[14:45:47] you mean that it was just a couple of ms?
[14:46:31] yes, normally it is not great if the corresponding user CPU time is comparable, say 100ms (so the throttling ratio is 1/10)
[14:46:51] in this case user time is ~14s, so a tiny bit of throttling may be expected/tolerated
[14:46:55] at least in my head
[14:47:16] but from your tests the latency dropped in staging?
[14:47:38] (trying to get it but I don't have a ton of context)
[14:49:13] we had a little throttling but it never dropped. This also probably means that we could increase the cpu cores to mitigate the issue, but even in this case we saw the same throttling + preprocess increased a ton
[14:50:08] okok, then it is very sensitive to that
[14:50:13] good work :)
[14:50:48] preprocess seems to have dropped to milliseconds from being in the seconds range all the time https://grafana.wikimedia.org/goto/5NGVKe2NR?orgId=1
[14:51:27] super, I didn't expect such a big impact
[14:51:29] predict becomes a bit unstable though, which is expected as we are now using half the CPUs we had before
[14:51:50] well wait a sec, you also added multi-processing
[14:51:55] for that drop I mean
[14:52:00] it is not only throttling
[14:52:20] I'm thinking we should increase cpu per container and decrease max replicas
[14:53:07] could be worth testing, yes
[14:53:43] the blast radius if a pod goes down is bigger, so a good compromise is needed
[14:54:06] oh yes you are right
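On the question of validating throttling outside Grafana, a hedged sketch of checking the cgroup counters directly (namespace and pod name are placeholders; the container name and the cgroup v1 path are assumptions, on cgroup v2 the file is /sys/fs/cgroup/cpu.stat with nr_throttled and throttled_usec):
```
# Read the CPU controller stats inside the running container.
kubectl -n <namespace> exec <pod-name> -c kserve-container -- \
  cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
# nr_throttled growing quickly relative to nr_periods means the container keeps hitting its CPU limit.
```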
[14:55:32] folks I will log out for some hours. I will check your messages afterwards.
[14:56:10] no worries, all good here. have a nice weekend George o/
[14:56:16] o/
[14:58:41] 5xxs have disappeared https://grafana.wikimedia.org/goto/F1VLF62Ng?orgId=1
[14:59:56] isaranto: apigw change all deployed
[15:00:19] great, I tested it and it works!
[15:35:17] isaranto: ...but it just caused a page for unclear reasons, so I created a revert: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1130144
[15:36:49] I +1'd it, do you have a link to the alert?
[15:37:31] https://alerts.wikimedia.org/?q=alertname%3DGatewayBackendErrorsHigh&q=envoy_cluster_name%3Drate_limit_cluster
[15:38:36] ok, thanks!
[15:40:25] and deployed in eqiad. I'll also deploy the revert to staging and codfw, for consistency
[15:42:08] ack, thank you
[15:43:52] going afk folks, have a nice weekend
[16:17:23] bye Ilias! have a nice long weekend :)
[16:38:32] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10662884 (elukey) For some weird reason, after the d-i ml-serve2004 errors out with: ` Booting from Hard drive C: GRUB loading.. Welcome to GRUB! error: disk...
[16:38:45] hey folks, for some reason ml-serve2004 doesn't reimage cleanly
[16:38:50] it is cordoned, I'll check on monday
[16:40:23] roger!
[16:42:42] Machine-Learning-Team, Patch-For-Review: Migrate all Lift Wing k8s workers to Bookworm and containerd - https://phabricator.wikimedia.org/T387854#10662914 (elukey)
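For context on "it is cordoned": a hedged sketch of taking a worker out of rotation around a reimage (the fully-qualified node name is an assumption; drain flags depend on what is running on the node):
```
# Stop the scheduler from placing new pods on the node, then evict what is already there.
kubectl cordon ml-serve2004.codfw.wmnet
kubectl drain ml-serve2004.codfw.wmnet --ignore-daemonsets --delete-emptydir-data
# After a successful reimage and rejoin, put it back into rotation.
kubectl uncordon ml-serve2004.codfw.wmnet
```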