[06:52:52] Good morning!
[07:16:15] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 13Patch-For-Review, 07User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#9811765 (10Urbanecm_WMF) As far as I can see, this task asks for a "stealth" (or "dark mode") deployment of add a...
[08:23:25] 06Machine-Learning-Team, 10Add-Link, 06Growth-Team, 13Patch-For-Review, 07User-notice: Deploy "add a link" to 18th round of wikis (en.wp and de.wp) - https://phabricator.wikimedia.org/T308144#9811861 (10Urbanecm_WMF) As far as I can see, this task asks for a "stealth" (or "dark mode") deployment of add a...
[09:22:45] 06Machine-Learning-Team, 06Language-Team, 07Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9811939 (10Pginer-WMF) I created a ticket focused on the changes to the endpoints in Content and Section Translation: {T365347}
[11:20:13] * isaranto afk lunch!
[12:19:44] FIRING: LiftWingServiceErrorRate: ...
[12:19:50] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=ruwiki-goodfaith-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[12:32:34] Hello ruwiki-goodfaith!
[12:46:29] o/
[12:54:31] 06Machine-Learning-Team, 06serviceops, 07Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9812372 (10elukey) @akosiaris o/ anything against the overall plan (namely copying the packages to bookworm-wikimedia) and/or concerns about `containerd`?
[12:57:54] hey Luca
[13:03:02] so the above alert is still firing.
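For context on the firing alert above: a hedged sketch of what a rule like LiftWingServiceErrorRate could look like, built on Istio's standard `istio_requests_total` metric (which carries a `response_code` label). This is an illustrative guess, not the actual Wikimedia alert definition; the threshold, window, and grouping labels are all assumptions.

```yaml
# Sketch only: the real rule lives in Wikimedia's alerting config and may differ.
groups:
  - name: liftwing
    rules:
      - alert: LiftWingServiceErrorRate
        # Fraction of responses that are neither 2xx, 3xx, nor 400,
        # matching the alert text "non 2/3/400 error code responses".
        expr: >
          sum by (destination_service_name) (
            rate(istio_requests_total{response_code!~"2..|3..|400"}[5m])
          )
          /
          sum by (destination_service_name) (
            rate(istio_requests_total[5m])
          )
          > 0.05
        for: 5m
```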
p75 latencies went to ~10s or above and CPU peaked https://grafana-rw.wikimedia.org/d/-D2KNUEGk/kubernetes-pod-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-pod=ruwiki-goodfaith-predictor-default-00023-deployment-57b8585nph2&var-container=All&from=now-3h&to=now
[13:03:54] although CPU has dropped again, we saw that no additional replica was created and latencies are still high
[13:08:46] Good morning all
[13:08:54] Yeah I’ve been noticing that alarm
[13:16:49] isaranto: mmm it seems that we didn't really see an increase in rps for ruwiki-goodfaith in eqiad
[13:16:59] at least from what I can see, this is why the instances were not added
[13:17:31] I am wondering if we have another use case of big queries, and/or some client from ores-legacy doing batch requests
[13:17:48] we can do as we did for damaging and raise the min instances
[13:21:20] Morning Chris!
[13:21:54] elukey: ok, I'm looking into it and will also create a patch to increase min instances to 2
[13:22:59] isaranto: I can bump the instances manually to 2 in the meantime
[13:24:49] done, raised to two
[13:24:52] That would be great! Thanks!
[13:24:58] let's see if that fixes it, or if we need to jump to 4
[13:34:44] RESOLVED: LiftWingServiceErrorRate: ...
[13:34:44] LiftWing service has a high rate of non 2/3/400 error code responses - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#LiftWingServiceErrorRate - https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=eqiad%20prometheus/k8s-mlserve&var-namespace=revscoring-editquality-goodfaith&var-backend=ruwiki-goodfaith-predictor-default.%2A - https://alerts.wikimedia.org/?q=alertname%3DLiftWingServiceErrorRate
[14:10:50] 06Machine-Learning-Team, 06DC-Ops, 10ops-codfw, 06SRE: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#9812571 (10Jhancock.wm) I rotated B1 to B2 to see if the error moves with it.
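The manual bump discussed above ("I can bump the instances manually to 2") could look like the following, assuming the Lift Wing model servers are KServe InferenceServices (the namespace and object name are inferred from the alert labels; the log doesn't show the exact command that was used):

```shell
# Sketch only: assumes a KServe InferenceService named ruwiki-goodfaith.
# minReplicas keeps that many pods warm even when the autoscaler sees low rps.
kubectl -n revscoring-editquality-goodfaith \
  patch inferenceservice ruwiki-goodfaith --type merge \
  -p '{"spec": {"predictor": {"minReplicas": 2}}}'
```

This kind of manual patch is a stopgap; without a matching change in deployment-charts it would be reverted on the next deploy.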
After booting, not getting any errors. Can we repeat it to see if the error comes back? If i...
[15:37:46] kevinbazira: o/
[15:38:17] I noticed https://phabricator.wikimedia.org/T308164, just to double check - we currently have 5 pods running in prod (for each DC), are those enough for the new use case?
[15:38:51] I'd ask in https://phabricator.wikimedia.org/T365347 if they have a rough estimate of the traffic that we'll receive, just to be sure
[15:39:00] maybe we could do some load test and then tune the pods accordingly
[15:39:38] elukey: o/
[15:40:43] okok I am going to ask about the rough traffic estimate so we can prepare accordingly
[15:41:01] 10Lift-Wing, 06Machine-Learning-Team: GPU errors in hf image in ml-staging - https://phabricator.wikimedia.org/T362984#9812842 (10elukey) This task is pending T365253; once solved we could reimage ml-staging2001 to Bookworm and re-test :)
[15:47:05] super
[15:51:25] 06Machine-Learning-Team, 13Patch-For-Review: Update Pytorch base image to 2.3.0 - https://phabricator.wikimedia.org/T365166#9812918 (10isarantopoulos) Unfortunately the pytorch package seems to get bigger and bigger with each release. Same for ROCm. | Pytorch version | ROCm version | raw image size (GB) | Comp...
[16:04:51] 06Machine-Learning-Team, 06Language-Team, 07Epic: Migrate Content Translation Recommendation API to Lift Wing - https://phabricator.wikimedia.org/T308164#9812963 (10kevinbazira) Hi @Pginer-WMF, do you have an estimate of the expected traffic the Content and Section Translation features will generate on LiftW...
[16:05:12] elukey: --^
[16:07:51] super
[16:08:03] so we'll be more prepared when they switch etc..
[16:08:17] usually not scaling up enough is the first problem that comes up :D
[16:10:40] sure sure, thank you for bringing this up.
[16:10:41] I'll keep you posted on their response.
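The load test suggested above ("maybe we could do some load test and then tune the pods accordingly") can be sketched in a few lines of Python with only the standard library. The endpoint URL and payload here are placeholders, not the real Lift Wing recommendation API; the percentile helper uses the simple nearest-rank method.

```python
import concurrent.futures
import json
import time
import urllib.request

# Hypothetical endpoint: the real Lift Wing URL and payload shape will differ.
URL = "https://inference.example.org/v1/models/recommendation:predict"

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def one_request(payload):
    """Send one POST and return its wall-clock latency in seconds."""
    start = time.monotonic()
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30):
        pass
    return time.monotonic() - start

def load_test(payload, total=200, concurrency=10):
    """Fire `total` requests with `concurrency` workers; report p50/p75/p95."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: one_request(payload), range(total)))
    return {p: percentile(latencies, p) for p in (50, 75, 95)}

if __name__ == "__main__":
    print(load_test({"source": "en", "target": "es", "count": 12}))
```

Comparing the reported p75 against the ~10s spikes seen in the alert would give a rough idea of how many pods the expected traffic needs.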
[16:15:24] I created the patch for the min replicas for ruwiki-goodfaith https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1034114
[16:15:34] I'm going afk folks, have a nice evening/rest of day
[16:21:28] * elukey afk! o/
[17:00:09] emmmiaaaaaillllllllls
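The Gerrit patch above makes the min-replicas bump declarative in deployment-charts. Its rendered effect would be roughly the following, assuming the chart produces a KServe InferenceService; the field names are sketched from the upstream KServe v1beta1 API, not from the actual chart templates.

```yaml
# Sketch of the rendered object, not the chart values themselves.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: ruwiki-goodfaith
  namespace: revscoring-editquality-goodfaith
spec:
  predictor:
    # Keep two pods warm so bursts (or big/batch queries that don't move
    # rps much) don't have to wait for a cold scale-up from one replica.
    minReplicas: 2
```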